Introduction
This guide briefly explains the requirements for integrating TDSS with an existing Hadoop cluster, which serves three purposes:
1. Push the artefacts required by TDSS to run a Spark job and retrieve their metadata.
2. Have a bucket in HDFS for storing all processed / unprocessed data (the TDSS Repository).
3. Utilise the YARN Resource Manager to launch a Spark job that acts on the data above (purpose 2).
Below are examples of the APIs we call to achieve the three steps above:
For pushing artefacts and getting metadata (file sizes and timestamps) we use the WebHDFS API,
format: http://<HOST>:<PORT>/webhdfs/v1/file_path?op=<SOME_COMMAND>
where SOME_COMMAND = GETFILESTATUS gets metadata information (such as the modification time and size of the artefact) and SOME_COMMAND = CREATE pushes the file.
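A minimal Python sketch of both calls is shown below, using the requests library; the WebHDFS port (50070), the user name, and the local artefact name are placeholders rather than values from an actual cluster.

# Minimal WebHDFS sketch (port, user and local artefact name are assumed placeholders).
import requests

WEBHDFS = "http://11.0.0.226:50070/webhdfs/v1"   # NameNode HTTP address (port assumed)
USER = "tookitaki"                               # HDFS user name (assumed)

# GETFILESTATUS: fetch metadata (length, modification time) for an artefact.
status = requests.get(
    f"{WEBHDFS}/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar",
    params={"op": "GETFILESTATUS", "user.name": USER},
).json()["FileStatus"]
print(status["length"], status["modificationTime"])

# CREATE: push a local artefact. WebHDFS first answers with a 307 redirect to a
# DataNode; the file content is then PUT to that redirected location.
init = requests.put(
    f"{WEBHDFS}/tookitaki/tdss/lib/my-artefact.jar",   # hypothetical target path
    params={"op": "CREATE", "user.name": USER, "overwrite": "true"},
    allow_redirects=False,
)
with open("my-artefact.jar", "rb") as f:
    requests.put(init.headers["Location"], data=f)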
The TDSS Spark job application will access these artefacts (and other repository items) via the Hadoop Name Node.
Format: hdfs://<NAME_NODE_HOST>:<PORT>/file_path
For example, to access the transformation assembly via the name node we use a name node URL like
hdfs://11.0.0.226:54310/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar
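As an illustration of how a job can resolve repository items through the name node, here is a short PySpark sketch; the SparkSession setup, the repository sub-path, and the Parquet format are assumptions, and only the name node host:port comes from the example URL above.

# Sketch: reading TDSS repository data through the name node URI
# (repository sub-path and file format are assumed).
from pyspark.sql import SparkSession

NAME_NODE = "hdfs://11.0.0.226:54310"

spark = SparkSession.builder.appName("tdss-repo-read").getOrCreate()

# Any repository item is addressed with the same hdfs:// scheme, e.g. an
# (assumed) unprocessed-data directory inside the TDSS Repository.
df = spark.read.parquet(f"{NAME_NODE}/tookitaki/tdss/repository/unprocessed")
df.show(5)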
Our TDSS application will launch Spark jobs to process data in Hadoop via the Hadoop YARN Resource Manager REST API:
POST (to get an application id for submitting the job)
POST (to submit the job, with JSON params)
GET (to get the status of an existing job by application id)
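Below is a hedged Python sketch of this three-call flow against the standard YARN ResourceManager REST endpoints (/ws/v1/cluster/apps/new-application, /ws/v1/cluster/apps, /ws/v1/cluster/apps/{appid}/state); the ResourceManager address and the submission payload are simplified placeholders, and a real TDSS submission would carry the full Spark container spec.

# Sketch of the YARN ResourceManager REST flow (RM host/port and payload are
# assumed placeholders; a real submission needs a complete am-container-spec).
import requests

RM = "http://11.0.0.226:8088/ws/v1/cluster"   # ResourceManager web address (assumed)

# 1. POST to obtain a fresh application id for the job about to be submitted.
app = requests.post(f"{RM}/apps/new-application").json()
app_id = app["application-id"]

# 2. POST the submission JSON; the spec below is trimmed to the essentials.
submission = {
    "application-id": app_id,
    "application-name": "tdss-spark-job",
    "application-type": "SPARK",
    "am-container-spec": {
        "commands": {"command": "spark-submit ..."},  # placeholder launch command
    },
}
requests.post(f"{RM}/apps", json=submission)

# 3. GET the application state by its id to track the running job.
state = requests.get(f"{RM}/apps/{app_id}/state").json()
print(state["state"])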