TDSS Hadoop Artefacts setup
Broadly, there are these steps. The files called out below will be provided as part of the release.
Create directories under the tdss-allocated root (a CLI sketch follows the list):
/<SOME-ROOT-DIR>/tdss/lib
/<SOME-ROOT-DIR>/tdss/spark
/<SOME-ROOT-DIR>/tdss/udf
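These can also be created (or verified) by hand with the HDFS CLI; a minimal sketch, assuming your user has write access to <SOME-ROOT-DIR> (the release's create_tdss_dirs.sh, described below, does this for you):
# Sketch: create the three tdss directories manually
hdfs dfs -mkdir -p /<SOME-ROOT-DIR>/tdss/lib
hdfs dfs -mkdir -p /<SOME-ROOT-DIR>/tdss/spark
hdfs dfs -mkdir -p /<SOME-ROOT-DIR>/tdss/udf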
Push the relevant files to their directories:
/<SOME-ROOT-DIR>/tdss/lib
analyzers-assembly
transformation-assembly
connectors-assembly
/<SOME-ROOT-DIR>/tdss/spark
spark-1.6-hadoop2.6-assembly.jar
hive-site.xml
datanucleus-jars
spark-yarn.properties
/<SOME-ROOT-DIR>/tdss/udf
Egg files (pyhocon-0.3.38-py2.7.egg, pyparsing-1.5.6-py2.7.egg, JayDeBeApi-1.1.1-py2.7.egg, jpype-py2.7.egg, numpy-1.15.0-py2.7.egg, futures-3.2.0-py2.7.egg, scipy-0.19.1-py2.7.egg, scikit_learn-0.19.1-py2.7.egg, Pillow-3.3.3-py2.7.egg, six-1.10.0-py2.7.egg, statsmodels-0.8.0-py2.7.egg, pandas-0.19.2-py2.7.egg, pytz-2006p-py2.7.egg, setuptools-38.5.0-py2.7.egg); see the sketch after this list for how these are typically shipped to executors
pyspark zip files (pyspark.zip, py4j-0.9-src.zip)
UDF script .py files (udf_driver.py and driver_wrapper.py)
custom.conf
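For context, the egg and zip files above are what Spark ships to the Python executors; a minimal illustrative sketch of how a PySpark job on Spark 1.6 would reference them (the actual tdss submission command and dependency list may differ):
# Illustrative only: shipping the udf dependencies with a PySpark job
spark-submit \
  --master yarn-cluster \
  --py-files "hdfs:///<SOME-ROOT-DIR>/tdss/udf/pyspark.zip,hdfs:///<SOME-ROOT-DIR>/tdss/udf/py4j-0.9-src.zip,hdfs:///<SOME-ROOT-DIR>/tdss/udf/pyhocon-0.3.38-py2.7.egg" \
  udf_driver.py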
Get the metadata for these artefacts (the application will use it in the dss.properties file). These properties can be generated with the get_hdfs_files_timestamp_size.sh script by specifying the correct webhdfs root context.
The generated properties should be copied into the dss.properties file on the app server (where Tomcat is running).
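As background, the script reads this metadata over webhdfs; a minimal sketch of the underlying call (the host, port 50070, and the .jar path are illustrative assumptions):
# Sketch: fetch timestamp and size of one artefact over webhdfs
curl -s "http://<WEBHDFS-HOST>:50070/webhdfs/v1/<SOME-ROOT-DIR>/tdss/lib/analyzers-assembly.jar?op=GETFILESTATUS"
# The JSON reply contains "modificationTime" (epoch millis) and "length" (bytes),
# which become the timestamp/size properties pasted into dss.properties.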
The latest versions of these files should be gathered from the tdss build release:
analyzers-assembly
transformation-assembly
connectors-assembly
driver_wrapper.py
udf_driver.py
custom.conf
Make sure custom.conf has proper entries for envVariables.sqlDatabase and the DB host, port, name, user, and password; otherwise Spark jobs running on the cluster will not be able to connect to the tdss database (Postgres or MySQL).
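A minimal sketch of those entries in HOCON (custom.conf's format, given the bundled pyhocon egg); only envVariables.sqlDatabase is named in this guide, so the remaining key names are illustrative; check the custom.conf shipped with the release for the exact keys:
envVariables {
  sqlDatabase = postgres      # or mysql
  dbHost = "<DB-HOST>"        # illustrative key names from here down
  dbPort = 5432
  dbName = "<TDSS-DB-NAME>"
  dbUser = "<DB-USER>"
  dbPassword = "<DB-PASSWORD>"
}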
Create the tdss directories in Hadoop
Extract hdp_prep.tar.gz; it contains a few .sh files. (Important: use the current custom.conf file, with the appropriate entries for the Postgres DB host, user, password, etc.)
The first step is to specify the correct base Hadoop directory and webhdfs host in init_tdss_dir.sh:
BASE_PATH=<BASE HADOOP DIR>
MISC_SUFFIX="&user.name=ec2-user" # Optional, specify your user name for hadoop dir access
That's it. Only three entries are needed in init_tdss_dir.sh: the base path, the webhdfs host, and the optional user-name suffix.
Run ./create_tdss_dirs.sh # This will create the three directories (lib, spark, udf) under the <BASE_PATH> specified above.
Run ./push_tdss_files.sh # This will push the assembly libs
Run ./push_spark_files.sh # This will push the Spark files
Run ./push_udf_files.sh # This will push the UDF files
Run ./get_hdfs_files_timestamp_size.sh # This will generate timestamp and size properties for all artefacts; copy and paste them into dss.properties on the application server.
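For reference, these scripts talk to webhdfs under the hood; a minimal sketch of the kind of calls involved (exact paths and parameters come from init_tdss_dir.sh, and the two-step CREATE is the standard webhdfs upload protocol):
# Sketch: create a directory over webhdfs
curl -X PUT "http://<WEBHDFS-HOST>:50070/webhdfs/v1/<BASE_PATH>/tdss/lib?op=MKDIRS&user.name=ec2-user"
# Sketch: push a file; step 1 returns a 307 redirect to a datanode
curl -i -X PUT "http://<WEBHDFS-HOST>:50070/webhdfs/v1/<BASE_PATH>/tdss/lib/analyzers-assembly.jar?op=CREATE&overwrite=true&user.name=ec2-user"
# Step 2: PUT the file bytes to the Location URL from step 1
curl -i -X PUT -T analyzers-assembly.jar "<LOCATION-URL-FROM-STEP-1>"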