Introduction

This guide briefly explains the requirements for integrating with an existing Hadoop cluster, which we need for three reasons:

  1. Push the artefacts that TDSS needs to run a Spark job, and retrieve their metadata.

  2. Provide a directory in HDFS for storing all processed and unprocessed data (the TDSS repository).

  3. Use the YARN Resource Manager to launch a Spark job that acts on the data above (reason 2).

Below are examples of the APIs we call to achieve the three steps above:

  1. For pushing artefacts and getting their metadata (file size and timestamp) we use the WebHDFS API (see the first sketch after this list):

    1. Format: http://<HOST>:<PORT>/webhdfs/v1/file_path?op=<SOME_COMMAND>

      where SOME_COMMAND = GETFILESTATUS gets metadata (such as the timestamp and size of the artefact) and SOME_COMMAND = CREATE pushes the file.

  2. The TDSS Spark job application will access these artefacts (and other repository items) via the Hadoop Name Node (see the PySpark sketch after this list).

    1. Format: hdfs://<NAME_NODE_HOST>:<PORT>/file_path

    2. For example, to access the transformation assembly via the name node, we use a name node URL like:

      1. hdfs://11.0.0.226:54310/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar

  3. Our TDSS application will launch Spark jobs to process data in Hadoop via the Hadoop YARN Resource Manager (see the Resource Manager sketch after this list).

    1. POST http://<RM_HOST>:<PORT>/ws/v1/cluster/apps/new-application (to get an application id for submitting a job)

    2. POST http://<RM_HOST>:<PORT>/ws/v1/cluster/apps (to submit a job, with JSON parameters)

    3. GET http://<RM_HOST>:<PORT>/ws/v1/cluster/apps/<APPID> (to get the status of an existing job by application id)
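
The sketch below shows the two WebHDFS calls from step 1, using Python and the `requests` library. The WebHDFS port (50070), the `user.name` parameter, and the local file name are assumptions for illustration, not values from this guide.

```python
import requests

WEBHDFS = "http://11.0.0.226:50070/webhdfs/v1"  # assumed WebHDFS host:port
HDFS_PATH = "/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar"
USER = "hdfs"  # assumed HDFS user

# GETFILESTATUS: fetch metadata such as size and modification time.
status = requests.get(f"{WEBHDFS}{HDFS_PATH}",
                      params={"op": "GETFILESTATUS", "user.name": USER})
info = status.json()["FileStatus"]
print("size:", info["length"], "modified:", info["modificationTime"])

# CREATE: push an artefact. WebHDFS first answers with a 307 redirect to a
# datanode; the file bytes are then PUT to that redirected location.
init = requests.put(f"{WEBHDFS}{HDFS_PATH}",
                    params={"op": "CREATE", "user.name": USER, "overwrite": "true"},
                    allow_redirects=False)
datanode_url = init.headers["Location"]
with open("transformation-assembly-0.1.1.jar", "rb") as fh:  # assumed local file
    upload = requests.put(datanode_url, data=fh)
upload.raise_for_status()
```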
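
The next sketch shows how a Spark job can resolve repository items through the name node URI format from step 2, written here in PySpark. The repository input path is an assumption; only the assembly-jar path comes from the example above.

```python
from pyspark.sql import SparkSession

NAME_NODE = "hdfs://11.0.0.226:54310"

spark = (
    SparkSession.builder
    .appName("tdss-namenode-access")
    # The transformation assembly is loaded from HDFS via the name node URL.
    .config("spark.jars",
            f"{NAME_NODE}/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar")
    .getOrCreate()
)

# Repository items are read with the same hdfs://<NAME_NODE_HOST>:<PORT>/ prefix.
df = spark.read.text(f"{NAME_NODE}/tookitaki/tdss/repository/input")  # assumed path
print(df.count())

spark.stop()
```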
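
Finally, a sketch of the three Resource Manager REST calls from step 3. The RM port (8088), queue name, application name, and launch command are assumptions; a real Spark submission body carries many more fields (resources, local resources, environment, and so on).

```python
import requests

RM = "http://11.0.0.226:8088/ws/v1/cluster"  # assumed Resource Manager host:port

# (1) Ask the RM for a fresh application id.
app_id = requests.post(f"{RM}/apps/new-application").json()["application-id"]

# (2) Submit the job under that id with a JSON payload.
payload = {
    "application-id": app_id,
    "application-name": "tdss-spark-job",  # assumed name
    "application-type": "SPARK",
    "queue": "default",                    # assumed queue
    "am-container-spec": {
        "commands": {
            "command": "spark-submit ..."  # launch command elided; illustrative only
        }
    },
}
submit = requests.post(f"{RM}/apps", json=payload)
submit.raise_for_status()

# (3) Poll the application status by id.
status = requests.get(f"{RM}/apps/{app_id}").json()["app"]
print(status["state"], status["finalStatus"])
```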
