TDSS YARN Integration Guide
  • Introduction
  • Prerequisites
  • TDSS Hadoop Artefacts setup
  • Important YARN entries in dss.properties
  • DB Entries
  • Update the API war file

Introduction



This guide briefly explains the requirements for integrating TDSS with an existing Hadoop cluster, for three reasons:

  1. Push the artefacts TDSS needs to run a Spark job, and fetch their metadata.

  2. Have a bucket in HDFS for storing all processed / unprocessed data (the TDSS Repository).

  3. Utilise the Resource Manager to launch Spark jobs that act on the above data (reason 2).

Below are examples of the APIs we call to achieve the three steps above:

  1. For pushing artefacts and getting metadata (file size and timestamp) we use the WebHDFS API (a Python sketch of these calls appears at the end of this page),

    1. Format: http://<HOST>:<PORT>/webhdfs/v1/file_path?op=<SOME_COMMAND>

      where SOME_COMMAND = GETFILESTATUS gets metadata (such as the modification time and size of the artefact) and SOME_COMMAND = CREATE pushes the file

  2. The TDSS Spark job will access these artefacts (and other repository items) via the Hadoop NameNode (see the PySpark sketch at the end of this page).

    1. Format: hdfs://<NAME_NODE_HOST>:<PORT>/file_path

    2. For example, to access the transformation assembly via the NameNode we use a URL like

      1. hdfs://11.0.0.226:54310/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar

  3. The TDSS application will launch Spark jobs to process data in Hadoop via the YARN Resource Manager REST API (see the sketch at the end of this page):

    1. POST http://RM_HOST:PORT/ws/v1/cluster/apps/new-application (to get an application id for submitting a job)

    2. POST http://RM_HOST:PORT/ws/v1/cluster/apps/ (to submit the job, with JSON params)

    3. GET http://RM_HOST:PORT/ws/v1/cluster/apps/<APPID> (to get the status of an existing job by application id)

For example, to get metadata for the transformation assembly we use a WebHDFS URL like
http://11.0.0.226:50070/webhdfs/v1/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar?op=GETFILESTATUS
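
For illustration only, a minimal Python sketch of the two WebHDFS calls above, assuming the `requests` library is available and using the example host, port and artefact path from this page (adjust them for your cluster; the two-step redirect is the standard WebHDFS CREATE flow):

```python
import requests

# Assumed values taken from the example above; adjust for your cluster.
WEBHDFS = "http://11.0.0.226:50070/webhdfs/v1"
ARTEFACT = "/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar"

# 1. GETFILESTATUS: fetch metadata (length, modification time) for an artefact.
status = requests.get(f"{WEBHDFS}{ARTEFACT}", params={"op": "GETFILESTATUS"})
info = status.json()["FileStatus"]
print("size:", info["length"], "modified:", info["modificationTime"])

# 2. CREATE: push a local artefact. WebHDFS uses a two-step redirect:
#    the NameNode answers 307 with a DataNode URL, then we PUT the bytes there.
create = requests.put(
    f"{WEBHDFS}{ARTEFACT}",
    params={"op": "CREATE", "overwrite": "true"},
    allow_redirects=False,
)
datanode_url = create.headers["Location"]
with open("transformation-assembly-0.1.1.jar", "rb") as fh:
    upload = requests.put(datanode_url, data=fh)
print("upload HTTP status:", upload.status_code)  # 201 Created on success
```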
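The TDSS Spark job itself ships as a pre-built assembly jar; purely to illustrate the NameNode URL format above, the small PySpark sketch below reads a hypothetical repository item through that URL (the input path is an assumption, not an actual TDSS path):

```python
from pyspark.sql import SparkSession

# Assumed NameNode coordinates from the example above.
NAME_NODE = "hdfs://11.0.0.226:54310"

spark = SparkSession.builder.appName("tdss-namenode-access-sketch").getOrCreate()

# Any repository item can be addressed through the NameNode URL;
# the file below is a hypothetical item in the TDSS Repository bucket.
df = spark.read.text(f"{NAME_NODE}/tookitaki/tdss/repository/sample_input.txt")
df.show(5, truncate=False)

spark.stop()
```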
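Likewise, a minimal sketch of the three Resource Manager calls using Python `requests`. The submission body is heavily simplified and only shows the shape of the request: a real Spark-on-YARN submission also carries the actual launch command, local resources (such as the assembly jar addressed via the NameNode URL), environment variables and memory/vcore settings. RM_HOST and the default web port 8088 are placeholders:

```python
import requests

# Placeholder ResourceManager address; substitute your RM_HOST:PORT.
RM = "http://RM_HOST:8088"

# 1. POST .../apps/new-application -> returns an application id for the submission.
new_app = requests.post(f"{RM}/ws/v1/cluster/apps/new-application").json()
app_id = new_app["application-id"]

# 2. POST .../apps/ with the submission JSON. This body is a trimmed
#    illustration, not the actual TDSS payload.
submission = {
    "application-id": app_id,
    "application-name": "tdss-spark-job",
    "application-type": "SPARK",
    "am-container-spec": {
        "commands": {"command": "echo placeholder-launch-command"}
    },
}
requests.post(f"{RM}/ws/v1/cluster/apps/", json=submission)

# 3. GET .../apps/<APPID> -> poll the state of the submitted job.
report = requests.get(f"{RM}/ws/v1/cluster/apps/{app_id}").json()
print(report["app"]["state"], report["app"].get("finalStatus"))
```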