TDSS YARN Integration Guide

TDSS Hadoop Artefacts setup

Broadly, there are these steps. Files mentioned in bold will be provided as part of the release.

  • Create the following directories within the tdss-allocated area

    • /<SOME-ROOT-DIR>/tdss/lib

    • /<SOME-ROOT-DIR>/tdss/spark

    • /<SOME-ROOT-DIR>/tdss/udf

  • Push the relevant files to the corresponding directories (the sketch after this list shows the equivalent hdfs dfs commands)

    • /<SOME-ROOT-DIR>/tdss/lib

      • analyzer-assembly

      • transformation-assembly

      • connectors-assembly

    • /<SOME-ROOT-DIR>/tdss/spark

      • spark-1.6-hadoop2.6-assembly.jar

      • hive-site.xml

      • datanucleus-jars

      • spark-yarn.properties

    • /<SOME-ROOT-DIR>/tdss/udf

      • Egg files (pyhocon-0.3.38-py2.7.egg, pyparsing-1.5.6-py2.7.egg, JayDeBeApi-1.1.1-py2.7.egg, jpype-py2.7.egg, numpy-1.15.0-py2.7.egg, futures-3.2.0-py2.7.egg, scipy-0.19.1-py2.7.egg, scikit_learn-0.19.1-py2.7.egg, Pillow-3.3.3-py2.7.egg, six-1.10.0-py2.7.egg, statsmodels-0.8.0-py2.7.egg, pandas-0.19.2-py2.7.egg, pytz-2006p-py2.7.egg, setuptools-38.5.0-py2.7.egg)

      • pyspark zip files (pyspark.zip, py4j-0.9-src.zip)

      • UDF script .py files (udf_driver.py and driver_wrapper.py)

      • custom.conf

  • Get the metadata (timestamp and size) for these artefacts; the application reads it from dss.properties. These properties can be generated using the get_hdfs_files_timestamp_size.sh script by specifying the correct WebHDFS root context.

  • The properties generated in the metadata step should be copied into the dss.properties file on the application server (where Tomcat is running).

  • The latest versions of these files should be taken from the release of the tdss build:

    • analyzers-assembly

    • transformation-assembly

    • connectors-assembly

    • driver_wrapper.py

    • udf_driver.py

    • custom.conf

  • Make sure custom.conf has proper entries for envVariables.sqlDatabase (database host, port, name, user, and password); otherwise, Spark jobs running on the cluster will not be able to connect to the tdss database (Postgres or MySQL). An illustrative sketch follows.
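For illustration only, a database block in custom.conf might look like the sketch below. custom.conf appears to use HOCON syntax (pyhocon is among the shipped eggs), and envVariables.sqlDatabase is the key named above, but the nested key names and all values here are assumptions; match them to the custom.conf shipped with your release and to your environment.

    envVariables {
      sqlDatabase {
        # Assumed key names -- verify against the custom.conf provided in the release
        host = "tdss-db.example.com"   # tdss database host
        port = 5432                    # 5432 for Postgres, 3306 for MySQL
        name = "tdss"                  # database name
        user = "tdss_user"
        password = "********"
      }
    }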
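For reference, the directory creation and file pushes described above amount roughly to the HDFS CLI commands below. This is only a sketch: it assumes /<SOME-ROOT-DIR> is the tdss base directory and that the release files (whose exact names and extensions come from the tdss build) sit in the current working directory. The scripts described in the next section remain the supported way to do this.

    hdfs dfs -mkdir -p /<SOME-ROOT-DIR>/tdss/lib /<SOME-ROOT-DIR>/tdss/spark /<SOME-ROOT-DIR>/tdss/udf
    hdfs dfs -put analyzer-assembly*.jar transformation-assembly*.jar connectors-assembly*.jar /<SOME-ROOT-DIR>/tdss/lib/
    hdfs dfs -put spark-1.6-hadoop2.6-assembly.jar hive-site.xml datanucleus-*.jar spark-yarn.properties /<SOME-ROOT-DIR>/tdss/spark/
    hdfs dfs -put *.egg pyspark.zip py4j-0.9-src.zip udf_driver.py driver_wrapper.py custom.conf /<SOME-ROOT-DIR>/tdss/udf/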

Create tdss directories in hadoop

Extract hdp_prep.tar.gz; it contains a few .sh files. (Important: please use the current custom.conf file, with appropriate entries for the Postgres DB host, user, password, etc.)

  • The first step is to specify the correct base Hadoop directory and WebHDFS host in init_tdss_dir.sh.

    • BASE_PATH=<BASE HADOOP DIR>

    • WEBHDFS_HOST=http://hklpadhas02.hk.standardchartered.com:50070

    • MISC_SUFFIX="&user.name=ec2-user" # Optional, specify your user name for hadoop dir access

    • That's it. Only these three entries are needed in init_tdss_dir.sh.

  • Run ./create_tdss_dirs.sh # This will create the 3 directories under the <BASE PATH>/ specified in init_tdss_dir.sh above.

  • Run ./push_tdss_files.sh # This will push assembly libs

  • Run ./push_spark_files.sh # This will push spark files

  • Run ./push_udf_files.sh # This will push udf files

  • Run ./get_hdfs_files_timestamp_size.sh # This will generate timestamp and size properties for all artefacts. These need to be copied and pasted into dss.properties on the application server.
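For context, the timestamp and size values come from WebHDFS file-status information. The snippet below is only an illustration of where those numbers originate, not the script itself: a standard WebHDFS LISTSTATUS call against one of the tdss directories returns a modificationTime and length for each file, and these are the values that end up as properties in dss.properties (the exact property key names are produced by the script and are not reproduced here).

    # Illustration only: WEBHDFS_HOST and user.name as configured in init_tdss_dir.sh
    curl "http://hklpadhas02.hk.standardchartered.com:50070/webhdfs/v1/<BASE HADOOP DIR>/tdss/lib?op=LISTSTATUS&user.name=ec2-user"
    # Each FileStatus entry in the JSON response includes, among other fields:
    #   "modificationTime": <epoch millis>   <- timestamp recorded in dss.properties
    #   "length": <size in bytes>            <- size recorded in dss.properties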

