
# TDSS Hadoop Artefacts setup

Broadly, the setup consists of the following steps. Files shown in **bold** are provided as part of the release.

* Create the following directories under the directory allocated to tdss
  * /\<SOME-ROOT-DIR>/tdss/lib
  * /\<SOME-ROOT-DIR>/tdss/spark
  * /\<SOME-ROOT-DIR>/tdss/udf
* Push the relevant files to each directory
  * /\<SOME-ROOT-DIR>/tdss/lib
    * **analyzer-assembly**
    * **transformation-assembly**
    * **connectors-assembly**
  * /\<SOME-ROOT-DIR>/tdss/spark
    * spark-1.6-hadoop2.6-assembly.jar
    * hive-site.xml
    * datanucleus-jars
    * spark-yarn.properties
  * /\<SOME-ROOT-DIR>/tdss/udf
    * Egg files (pyhocon-0.3.38-py2.7.egg, pyparsing-1.5.6-py2.7.egg, JayDeBeApi-1.1.1-py2.7.egg, jpype-py2.7.egg, numpy-1.15.0-py2.7.egg, futures-3.2.0-py2.7.egg, scipy-0.19.1-py2.7.egg, scikit\_learn-0.19.1-py2.7.egg, Pillow-3.3.3-py2.7.egg, six-1.10.0-py2.7.egg, statsmodels-0.8.0-py2.7.egg, pandas-0.19.2-py2.7.egg, pytz-2006p-py2.7.egg, setuptools-38.5.0-py2.7.egg)
    * pyspark zip files (pyspark.zip, py4j-0.9-src.zip)
    * udf script py files (udf\_driver.py and **driver\_wrapper.py**)
    * **custom.conf**
* Get the metadata for these artefacts (the application uses it in the dss.properties file). These properties can be generated with the **get\_hdfs\_files\_timestamp\_size.sh** script, after specifying the correct webhdfs root context.
* Copy the properties generated in the previous step into the dss.properties file on the app server (where Tomcat is running).
* Gather the latest version of these files from the tdss build release:
  * analyzers-assembly
  * transformation-assembly
  * connectors-assembly
  * driver\_wrapper.py
  * udf\_driver.py
  * custom.conf
* Make sure custom.conf has correct entries for envVariables.sqlDatabase, db host, port, name, user, and password. Otherwise, Spark jobs running on the cluster will not be able to connect to the tdss database (postgres or mysql).
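
The internals of **get\_hdfs\_files\_timestamp\_size.sh** are not shown in this guide. As a rough sketch only, assuming the metadata is read through the standard WebHDFS REST API, the timestamp and size of one artefact could be obtained from a `GETFILESTATUS` call as below. The host, path, and sample response values are made up for illustration, and the helper names `file_status_url` and `timestamp_and_size` are hypothetical.

```python
import json
from urllib.parse import quote, urlencode


def file_status_url(webhdfs_host, hdfs_path, user=None):
    """Build a WebHDFS GETFILESTATUS URL for one artefact."""
    params = {"op": "GETFILESTATUS"}
    if user:
        params["user.name"] = user
    return "{}/webhdfs/v1{}?{}".format(
        webhdfs_host.rstrip("/"), quote(hdfs_path), urlencode(params)
    )


def timestamp_and_size(response_body):
    """Return (modificationTime in ms since epoch, length in bytes)."""
    status = json.loads(response_body)["FileStatus"]
    return status["modificationTime"], status["length"]


# Hypothetical host and artefact path, for illustration only.
url = file_status_url(
    "http://namenode.example.com:50070",
    "/apps/tdss/udf/custom.conf",
    user="ec2-user",
)
print(url)

# A response shaped like the WebHDFS FileStatus JSON object (values made up).
sample = '{"FileStatus": {"length": 2048, "modificationTime": 1530000000000, "type": "FILE"}}'
print(timestamp_and_size(sample))
```

This only builds the request URL and parses a canned response; it does not contact a cluster. The property names expected in dss.properties are whatever the release script emits, so the script output remains the source of truth.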

## Create tdss directories in hadoop

Extract hdp\_prep.tar.gz, which contains a few .sh files. (**Important**: use the current custom.conf file, with appropriate entries for the postgres db host, user, password, etc.)

* The first step is to specify the correct base hadoop directory and webhdfs host in **init\_tdss\_dir.sh**.
  * BASE\_PATH=**\<BASE HADOOP DIR>**
  * MISC\_SUFFIX="**\&user.name=ec2-user**"   # Optional, specify your user name for hadoop dir access
  * WEBHDFS\_HOST=<http://hklpadhas02.hk.standardchartered.com:50070>
  * That's it. Only these three entries are needed in **init\_tdss\_dir.sh**.
* Run ./create\_tdss\_dirs.sh   # This will create the 3 dirs listed at the beginning of this chapter under \<BASE PATH>/.
* Run ./push\_tdss\_files.sh     # This will push assembly libs
* Run ./push\_spark\_files.sh   # This will push spark files
* Run ./push\_udf\_files.sh       # This will push udf files
* Run ./get\_hdfs\_files\_timestamp\_size.sh   # This will generate timestamp and size properties for all artefacts. Copy and paste these into dss.properties on the application server.
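
To illustrate how the three entries in **init\_tdss\_dir.sh** fit together, the sketch below builds the WebHDFS `MKDIRS` request URLs for the three tdss directories. It is an assumption that the scripts talk to the WebHDFS REST API in this way and that \<BASE PATH> plays the role of \<SOME-ROOT-DIR>; the host and base path values are placeholders, and `mkdirs_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Placeholder values standing in for the three entries in init_tdss_dir.sh.
BASE_PATH = "/apps"                                  # hypothetical base hadoop dir
MISC_SUFFIX = "&user.name=ec2-user"                  # optional user for dir access
WEBHDFS_HOST = "http://namenode.example.com:50070"   # hypothetical webhdfs host


def mkdirs_url(subdir):
    """Build the WebHDFS MKDIRS URL for <BASE_PATH>/tdss/<subdir>."""
    path = "{}/tdss/{}".format(BASE_PATH.rstrip("/"), subdir)
    return "{}/webhdfs/v1{}?op=MKDIRS{}".format(
        WEBHDFS_HOST.rstrip("/"), quote(path), MISC_SUFFIX
    )


# One URL per directory described at the start of this chapter.
mkdir_urls = [mkdirs_url(d) for d in ("lib", "spark", "udf")]
for u in mkdir_urls:
    # WebHDFS expects MKDIRS to be sent as an HTTP PUT request.
    print("PUT", u)
```

The sketch only prints the requests; sending them (for example with an HTTP client using the PUT method) is left to the release scripts.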


