Deploying Spark
We will look at the fundamental differences between the different types of cluster managers.
Where to Deploy Your Clusters to Run Spark Applications
There are two high-level options for deploying a Spark cluster: on premises or in the public cloud.
On-Premises Cluster Deployments
- On-premises clusters are sometimes a reasonable option for companies that already have their own data centres. You get full control over the hardware, but it also becomes challenging to handle clusters of dynamic size, and data analytics workloads are often elastic.
- Second, many on-premises systems are built around their own storage system, such as the Hadoop file system or a scalable key-value store like Cassandra, which requires a lot of extra effort to provide replication and disaster recovery.
- All of Spark's cluster managers allow multiple concurrent applications, but YARN and Mesos have better support for dynamic sharing and additionally support non-Spark workloads.
Spark in the Cloud
- Public cloud providers such as AWS, Azure, and GCP offer fully managed machines that can be provisioned dynamically to handle elastic workloads, which makes them well suited for Spark.
- You should prefer cloud object storage such as AWS S3, Azure Blob Storage, or GCS, and spin up resources dynamically for each workload, effectively decoupling storage and compute.
- Another advantage is that public clouds include low-cost, geo-replicated storage that makes it easier to manage large amounts of data.
Cluster Managers
Spark supports the following cluster managers: standalone clusters, Hadoop YARN, and Mesos.
Standalone Clusters
- A lightweight platform built specifically for Apache Spark workloads. Using it, we can run multiple Spark applications on the same cluster, and it provides simple interfaces for scaling workloads.
- The main disadvantage of using it is that you can only run Spark.
Starting a standalone cluster
- It requires provisioning the machines first.
- To start the cluster manually by hand, first launch the master process:
$SPARK_HOME/sbin/start-master.sh
This runs the master at a spark://HOST:PORT URI. You can also find the master's web UI, which is http://master-ip-address:8080 by default.
- Start the worker nodes by logging into each machine and pointing them at the master:
$SPARK_HOME/sbin/start-slave.sh <master-spark-URI>
Starting a cluster using a script
You can configure cluster launch scripts that can automate the launch of standalone clusters. To do this, create a file called conf/slaves in your Spark directory that will contain the hostnames of all the machines on which you intend to start Spark workers, one per line. If this file does not exist, everything will launch locally. When you go to actually start the cluster, the master machine will access each of the worker machines via Secure Shell (SSH).
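For example, a minimal conf/slaves file (the hostnames below are placeholders) simply lists one worker host per line:
worker-node-1.example.com
worker-node-2.example.com
worker-node-3.example.com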
After we set up this file, we can start and stop the cluster using simple scripts in sbin:
$SPARK_HOME/sbin/start-master.sh    # starts a master instance on the machine the script is executed on
$SPARK_HOME/sbin/start-slaves.sh    # starts a worker instance on each machine listed in conf/slaves
$SPARK_HOME/sbin/start-slave.sh     # starts a worker instance on the machine the script is executed on
$SPARK_HOME/sbin/start-all.sh       # starts both a master and the workers from conf/slaves
$SPARK_HOME/sbin/stop-master.sh     # stops the master started via start-master.sh
$SPARK_HOME/sbin/stop-slaves.sh     # stops all worker instances on the machines in conf/slaves
$SPARK_HOME/sbin/stop-all.sh        # stops both the master and the workers
Standalone cluster configurations
- These configurations help control everything from what happens to old files from terminated applications on each worker to the worker's core and memory resources; a few examples are sketched below.
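A minimal sketch of a few such settings in $SPARK_HOME/conf/spark-env.sh (the values are illustrative only):
# resources each worker offers to Spark applications
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=24g
# periodically clean up old application work directories on the worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"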
Submitting applications
- We can submit applications using the spark:// URI of the master, either on the master node itself or from another machine, using spark-submit; a minimal example follows.
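A minimal sketch of such a submission, assuming the standalone master's default port of 7077 (the class name and JAR path are placeholders):
$SPARK_HOME/bin/spark-submit \
  --master spark://master-ip-address:7077 \
  --class com.example.MyApp \
  /path/to/my-app.jar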
Spark on YARN
- Hadoop YARN is a framework for job scheduling and cluster resource management. Even though Spark is often misclassified as part of the Hadoop ecosystem, in reality Spark has little to do with Hadoop.
- Spark natively supports YARN and doesn't require Hadoop itself.
- You can run Spark jobs on Hadoop's YARN by specifying the master as yarn in the spark-submit command-line arguments. Just like with standalone mode, there are multiple knobs to tune the cluster according to what you would like it to do.
Submitting Applications
When submitting applications to YARN, the core difference from other deployments is that --master becomes yarn, as opposed to the master node IP as it is in standalone mode. Instead, Spark finds the YARN configuration files using the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR.
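A minimal sketch of such a submission (the class name and JAR path are placeholders, and the Hadoop/YARN configuration directory is assumed to be set as described in the next section):
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar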
Configuring Spark on YARN Applications
Hadoop Configurations
If you plan to read and write from HDFS using Spark, you need to include two Hadoop configuration files on Spark's classpath: hdfs-site.xml, which provides default behaviors for the HDFS client; and core-site.xml, which sets the default file system name. The location of these configuration files varies across Hadoop versions, but a common location is inside of /etc/hadoop/conf.
To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files, or set it as an environment variable when you go to spark-submit your application.
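For example, assuming the configuration files live in /etc/hadoop/conf (a common but not universal location):
# in $SPARK_HOME/conf/spark-env.sh
export HADOOP_CONF_DIR=/etc/hadoop/conf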
Application properties for YARN
There are a number of Hadoop-related configurations that come up which largely don't have much to do with Spark itself, but rather with running or securing YARN in a way that influences how Spark runs; a few of the more common Spark-side knobs are sketched below.
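As a sketch, a few commonly tuned YARN-related Spark properties (the values are illustrative and the queue name is hypothetical) can be set in spark-defaults.conf or passed with --conf to spark-submit:
# illustrative YARN-related properties
spark.yarn.queue            default
spark.executor.instances    10
spark.executor.memory       4g
spark.executor.cores        2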
Spark on Mesos
- Mesos was created by many of the original authors of Spark.
- Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
- Mesos intends to be a datacenter-scale cluster manager that manages not just short-lived applications like Spark jobs, but also long-running applications such as web applications and other resource interfaces.
- Mesos is the heaviest-weight cluster manager; you might choose it only if your organization already has a large-scale deployment of Mesos, but it makes for a good cluster manager nonetheless.
- Fine-grained mode has been deprecated; Spark on Mesos now exclusively supports coarse-grained mode, meaning that each Spark executor runs as a single Mesos task.
Submitting applications
- You should favor cluster mode when using Mesos. Client mode requires a lot of extra configuration in spark-env.sh to work with Mesos.
Now, when starting an application against the cluster:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.master("mesos://HOST:5050")
.appName("my app")
.config("spark.executor.uri", "<path to spark-2.2.0.tar.gz uploaded above>")
.getOrCreate()
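For cluster mode, a rough sketch (host names, ports, and paths are placeholders; the dispatcher's default port is assumed to be 7077) is to start the MesosClusterDispatcher once and then submit against it:
# start the cluster-mode dispatcher, pointing it at the Mesos master
$SPARK_HOME/sbin/start-mesos-dispatcher.sh --master mesos://HOST:5050
# submit in cluster mode against the dispatcher; the JAR must be reachable by the Mesos agents
$SPARK_HOME/bin/spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  http://host/reachable/by/agents/my-app.jar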
Application Scheduling
Each Spark application runs its own independent set of executor processes. Cluster managers provide scheduling across Spark applications. Within each Spark application, multiple jobs may run concurrently if they were submitted by different threads; Spark includes a fair scheduler to schedule resources within each application.
If multiple users need to share your cluster and run different Spark applications, there are several ways to manage resources: you can statically partition resources manually, or you can turn on dynamic allocation to let applications scale up and down based on their current number of pending tasks.
NOTE: to use dynamic allocation, enable the following options: spark.dynamicAllocation.enabled and spark.shuffle.service.enabled (see the example below).
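For example (a minimal sketch; these settings can live in spark-defaults.conf or be passed with --conf to spark-submit):
# in $SPARK_HOME/conf/spark-defaults.conf
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true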
Misc. Considerations
- Number and type of applications: YARN is great for HDFS-based applications but less so elsewhere, because compute and storage are tightly coupled. Mesos improves on this a bit conceptually and supports multiple types of applications, but it requires pre-provisioning machines and buy-in at a much larger scale. Spark standalone is the best choice for running only Spark workloads; it is lightweight and relatively simple to manage.
- Spark versions: if you want to manage multiple Spark versions, you will need to handle that yourself, as there is no easy solution.
- Logging: this is more "out of the box" with YARN or Mesos and might need some tweaking if you are using standalone mode.
- Maintaining a metastore: to maintain metadata about your stored datasets, such as a table catalog, maintaining an Apache Hive metastore may be worth doing to facilitate more productive, cross-application referencing of the same datasets.
- Spark's external shuffle service: typically Spark stores shuffle blocks on a local disk on the node running the executor. An external shuffle service stores shuffle blocks so that they remain available to all executors, meaning we can arbitrarily kill executors and still have their shuffle outputs available.