Prerequisites
There are a few requirements for achieving the items mentioned in the previous chapter. We submit Spark jobs to the YARN resource manager via a JSON input, which needs some information about:
artifacts present in Hadoop
some environment information within the Hadoop cluster
All nodes in the cluster should have Java 1.8 (see the preflight sketch after this list for one way to verify this and the next two items).
For running Python libraries via TDSS, all nodes in the cluster should have Python 2.7.
We use Python egg packages such as scikit_learn-0.19.1-py2.7-linux-x86_64.egg, which are built for a specific platform, so ideally every node of the cluster should be based on RHEL 7.
When we get access to a multitenant HDP cluster, we will presumably be assigned a specific "QUEUE" for requesting YARN based resources. We need to know the name of that queue.
It is good to use a spark-assembly that matches the Hadoop version. For example, if the cluster is set up with Hadoop 2.6, use a Spark assembly built with hadoop-2.6 libraries, such as spark-assembly-1.6.1-hadoop2.6.0.jar.
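Since these checks have to hold on every node, it can help to run a small preflight script on each host before submitting jobs. This is a minimal sketch using only the Python standard library; it must be run with the node's default python for the 2.7 check to be meaningful, and the expected versions come from the list above.

import platform
import subprocess
import sys

def run(cmd):
    # Run a command, returning combined stdout/stderr, or "" on failure.
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode()
    except (OSError, subprocess.CalledProcessError):
        return ""

java_version = run(["java", "-version"])  # java prints its version to stderr
print("java 1.8 present:", "1.8." in java_version)

print("python 2.7 is default:", sys.version_info[:2] == (2, 7))

print("platform:", platform.platform())  # should report an el7 (RHEL 7) build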
There are two types of Spark jobs we submit to YARN:
Pure Java / Scala based: For Spark jobs that run pure JVM (Java or Scala) code, our JSON input requires only one piece of host information, JAVA_HOME (the value below is a placeholder):
{
  "JAVA_HOME": "/path/to/java-1.8"
}
PySpark / Python based: For Python based jobs, all the nodes of the Hadoop cluster should have Python 2.7. If the default python on each node points to python2.7, then we are good to go.
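By analogy with the Java/Scala case, a Python based job would presumably also need to tell the cluster which interpreter and which egg packages to use. The fragment below is an assumption for illustration, not the documented schema; only the egg file name comes from this page.

# Hypothetical PySpark-specific host information; field names are assumptions.
pyspark_host_info = {
    "PYTHON_HOME": "/usr/bin/python2.7",  # default python expected on every node
    "pyFiles": ["scikit_learn-0.19.1-py2.7-linux-x86_64.egg"],  # eggs shipped with the job
}
print(pyspark_host_info)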
Sample JSON request made to the YARN resource manager for submitting a Spark job.
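The original sample request is not reproduced on this page, so the sketch below only illustrates what such a request body could look like, assembled from the pieces discussed above (JAVA_HOME, the assigned queue, and the matching Spark assembly). Every field name here is an assumption, not the actual schema expected by our submission service.

import json

# Hypothetical request body: field names are assumptions; only the
# assembly jar name and the Java/queue requirements come from this page.
request = {
    "appName": "sample-spark-job",                            # assumption
    "queue": "REPLACE_WITH_ASSIGNED_QUEUE",                   # queue assigned on the multitenant HDP cluster
    "JAVA_HOME": "/path/to/java-1.8",                         # Java 1.8 home on the cluster nodes
    "sparkAssembly": "spark-assembly-1.6.1-hadoop2.6.0.jar",  # built against the Hadoop 2.6 libraries
    "applicationJar": "/path/to/our-spark-job.jar",           # assumption: the job artifact in Hadoop
    "mainClass": "com.example.OurSparkJob",                   # assumption
}

print(json.dumps(request, indent=2))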