# Introduction

This guide briefly explains the requirements for integrating TDSS with an existing Hadoop cluster. The integration is needed for three reasons:

1. Push the artefacts that TDSS needs to run a Spark job, and get their metadata.
2. Have a bucket in HDFS for storing all processed and unprocessed data (the TDSS Repository).
3. Use the Resource Manager to launch a Spark job that acts on the above data (reason 2).

Below are examples of the APIs we call to achieve the three steps above:

1. For pushing artefacts and getting metadata (file sizes and timestamps), we use the WebHDFS API.
   1. Format: http\://\<HOST>:\<PORT>/webhdfs/v1/file\_path?op=\<SOME\_COMMAND>

      where SOME\_COMMAND = GETFILESTATUS returns metadata (such as the time and size of the artefact) and SOME\_COMMAND = CREATE pushes the file.
   2. For example, to get metadata for the transformation assembly we use a WebHDFS URL like
      1. [http://11.0.0.226:50070/webhdfs/v1/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar?op=GETFILESTATUS](http://11.0.0.226:50070/webhdfs/v1/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar?op=GETFILESTATUS)
2. The TDSS Spark job application accesses these artefacts (and other repository items) via the Hadoop Name Node.
   1. Format: hdfs\://\<NAME\_NODE\_HOST>:\<PORT>/file\_path
   2. For example, to access the transformation assembly via the Name Node we use a URL like
      1. hdfs\://11.0.0.226:54310/tookitaki/tdss/lib/transformation-assembly-0.1.1.jar
3. The TDSS application launches Spark jobs to process data in Hadoop via the Hadoop YARN Resource Manager.
   1. POST http\://\<RM\_HOST>:\<PORT>/ws/v1/cluster/apps/new-application (to get an application ID for submitting a job)
   2. POST http\://\<RM\_HOST>:\<PORT>/ws/v1/cluster/apps/ (to submit a job, with JSON parameters)
   3. GET http\://\<RM\_HOST>:\<PORT>/ws/v1/cluster/apps/\<APPID> (to get the status of an existing job by application ID)
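
The WebHDFS and Name Node URL formats above can be sketched in Python as follows. This is a minimal illustration, not the TDSS implementation: the helper names `webhdfs_url`, `hdfs_uri`, and `get_file_status` are hypothetical, and the host, ports, and path are taken from the examples in this guide. It assumes the cluster allows unauthenticated WebHDFS access.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host: str, port: int, file_path: str, op: str) -> str:
    """Build http://<HOST>:<PORT>/webhdfs/v1/<file_path>?op=<OP> (hypothetical helper)."""
    path = quote(file_path.strip("/"))  # keeps "/" separators, escapes unsafe characters
    return f"http://{host}:{port}/webhdfs/v1/{path}?{urlencode({'op': op})}"


def hdfs_uri(name_node_host: str, port: int, file_path: str) -> str:
    """Build hdfs://<NAME_NODE_HOST>:<PORT>/<file_path> for the same artefact."""
    return f"hdfs://{name_node_host}:{port}/{file_path.strip('/')}"


def get_file_status(host: str, port: int, file_path: str, timeout: float = 10.0) -> dict:
    """Call GETFILESTATUS and return the FileStatus object from the JSON response.

    Assumption: no authentication is required; a secured cluster would need
    extra parameters or headers that are not shown here.
    """
    url = webhdfs_url(host, port, file_path, "GETFILESTATUS")
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["FileStatus"]


if __name__ == "__main__":
    jar = "tookitaki/tdss/lib/transformation-assembly-0.1.1.jar"
    print(webhdfs_url("11.0.0.226", 50070, jar, "GETFILESTATUS"))
    print(hdfs_uri("11.0.0.226", 54310, jar))
```

The returned `FileStatus` object carries fields such as `length` and `modificationTime`, which correspond to the size and timestamp metadata mentioned above. Pushing a file with `op=CREATE` is an HTTP PUT that WebHDFS redirects to a DataNode, so it needs a second request that this sketch leaves out.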

