Introduction

This guide briefly explains the requirements for integrating with an existing Hadoop cluster, which we need for three reasons:

  1. Push the artefacts that TDSS needs to run a Spark job, and retrieve their metadata.

  2. Provide a directory in HDFS for storing all processed and unprocessed data (the TDSS repository).

  3. Use the YARN Resource Manager to launch a Spark job that acts on the data above (reason 2).

Below are examples of the APIs we call to achieve the three steps above:

  1. For pushing artefacts and getting their metadata (file size and timestamp) we use the WebHDFS API (see the first sketch after this list):

    1. Format: http://<HOST>:<PORT>/webhdfs/v1/file_path?op=<SOME_COMMAND>

      where SOME_COMMAND = GETFILESTATUS gets metadata (such as the timestamp and size of the artefact) and SOME_COMMAND = CREATE pushes the file.

  2. The TDSS Spark job application will access these artefacts (and other repository items) via the Hadoop Name Node (see the PySpark sketch after this list).

    1. Format: hdfs://<NAME_NODE_HOST>:<PORT>/file_path

    2. For example, to access the transformation assembly via the name node, we use a name node URL like:

      1. hdfs://11.0.0.226:54310/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar

  3. Our TDSS application will launch Spark jobs to process data in Hadoop via the Hadoop YARN Resource Manager (see the Resource Manager sketch after this list).

    1. POST http://<RM_HOST>:<PORT>/ws/v1/cluster/apps/new-application (to get an application id for submitting a job)

    2. POST http://<RM_HOST>:<PORT>/ws/v1/cluster/apps (to submit a job, with JSON parameters)

    3. GET http://<RM_HOST>:<PORT>/ws/v1/cluster/apps/<APPID> (to get the status of an existing job by application id)
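
The sketch below shows the two WebHDFS calls from step 1, using Python and the `requests` library. The WebHDFS port (50070), the `user.name` parameter, and the local file name are assumptions for illustration, not values from this guide.

```python
import requests

WEBHDFS = "http://11.0.0.226:50070/webhdfs/v1"  # assumed WebHDFS host:port
HDFS_PATH = "/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar"
USER = "hdfs"  # assumed HDFS user

# GETFILESTATUS: fetch metadata such as size and modification time.
status = requests.get(f"{WEBHDFS}{HDFS_PATH}",
                      params={"op": "GETFILESTATUS", "user.name": USER})
info = status.json()["FileStatus"]
print("size:", info["length"], "modified:", info["modificationTime"])

# CREATE: push an artefact. WebHDFS first answers with a 307 redirect to a
# datanode; the file bytes are then PUT to that redirected location.
init = requests.put(f"{WEBHDFS}{HDFS_PATH}",
                    params={"op": "CREATE", "user.name": USER, "overwrite": "true"},
                    allow_redirects=False)
datanode_url = init.headers["Location"]
with open("transformation-assembly-0.1.1.jar", "rb") as fh:  # assumed local file
    upload = requests.put(datanode_url, data=fh)
upload.raise_for_status()
```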
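
The next sketch shows how a Spark job can resolve repository items through the name node URI format from step 2, written here in PySpark. The repository input path is an assumption; only the assembly-jar path comes from the example above.

```python
from pyspark.sql import SparkSession

NAME_NODE = "hdfs://11.0.0.226:54310"

spark = (
    SparkSession.builder
    .appName("tdss-namenode-access")
    # The transformation assembly is loaded from HDFS via the name node URL.
    .config("spark.jars",
            f"{NAME_NODE}/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar")
    .getOrCreate()
)

# Repository items are read with the same hdfs://<NAME_NODE_HOST>:<PORT>/ prefix.
df = spark.read.text(f"{NAME_NODE}/tookitaki/tdss/repository/input")  # assumed path
print(df.count())

spark.stop()
```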
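
Finally, a sketch of the three Resource Manager REST calls from step 3. The RM port (8088), queue name, application name, and launch command are assumptions; a real Spark submission body carries many more fields (resources, local resources, environment, and so on).

```python
import requests

RM = "http://11.0.0.226:8088/ws/v1/cluster"  # assumed Resource Manager host:port

# (1) Ask the RM for a fresh application id.
app_id = requests.post(f"{RM}/apps/new-application").json()["application-id"]

# (2) Submit the job under that id with a JSON payload.
payload = {
    "application-id": app_id,
    "application-name": "tdss-spark-job",  # assumed name
    "application-type": "SPARK",
    "queue": "default",                    # assumed queue
    "am-container-spec": {
        "commands": {
            "command": "spark-submit ..."  # launch command elided; illustrative only
        }
    },
}
submit = requests.post(f"{RM}/apps", json=payload)
submit.raise_for_status()

# (3) Poll the application status by id.
status = requests.get(f"{RM}/apps/{app_id}").json()["app"]
print(status["state"], status["finalStatus"])
```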
