Deploying Spark
We will look at the fundamental differences between the different types of cluster managers.
Where to Deploy Your Clusters to Run Spark Applications
There are two high-level options for deploying a Spark cluster: on premises or in the public cloud.
On-Premises Cluster Deployments
- On-premises clusters are sometimes a reasonable option for companies that already have their own data centres. You get full control over the hardware, but it also becomes challenging to handle clusters of dynamic size, and data analytics workloads are often elastic.
- Second, many on-premises systems are built around their own storage system, such as the Hadoop file system or a scalable key-value store like Cassandra, which requires a lot of extra effort to provide replication and disaster recovery.
- All of Spark's cluster managers allow multiple concurrent applications, but YARN and Mesos have better support for dynamic sharing and additionally support non-Spark workloads.
Spark in the Cloud
- Public cloud providers such as AWS, Azure, and GCP offer fully managed machines that can be provisioned dynamically to handle elastic workloads, which makes them well suited for Spark.
- You should prefer cloud object storage such as AWS S3, Azure Blob Storage, or GCS, and spin up resources dynamically for each workload, effectively decoupling storage and compute.
- Another advantage is that public clouds include low-cost, geo-replicated storage that makes it easier to manage large amounts of data.
Cluster Managers
Spark supports the following cluster managers: standalone clusters, Hadoop YARN, and Mesos.
Standalone Clusters
- A lightweight platform built specifically for Apache Spark workloads. Using it, we can run multiple Spark applications on the same cluster, and it provides simple interfaces for scaling workloads.
- The main disadvantage of using it is that you can only run Spark.
Starting a standalone cluster
- It requires provisioning the machines first.
- To start the cluster manually by hand, first launch the master process:
$SPARK_HOME/sbin/start-master.sh
This runs the master at a spark://HOST:PORT URI. You can also find the master's web UI, which is http://master-ip-address:8080 by default.
- Start the worker nodes by logging into each machine and pointing them at the master:
$SPARK_HOME/sbin/start-slave.sh <master-spark-URI>
Starting a cluster using a script
You can configure cluster launch scripts that can automate the launch of standalone clusters. To do this, create a file called conf/slaves in your Spark directory that will contain the hostnames of all the machines on which you intend to start Spark workers, one per line. If this file does not exist, everything will launch locally. When you go to actually start the cluster, the master machine will access each of the worker machines via Secure Shell (SSH).
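For example, a minimal conf/slaves file (the hostnames below are placeholders) simply lists one worker host per line:
worker-node-1.example.com
worker-node-2.example.com
worker-node-3.example.com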
After we set up this file, we can start and stop the cluster using simple scripts in sbin:
$SPARK_HOME/sbin/start-master.sh    # starts a master instance on the machine the script is executed on
$SPARK_HOME/sbin/start-slaves.sh    # starts a worker instance on each machine listed in conf/slaves
$SPARK_HOME/sbin/start-slave.sh     # starts a worker instance on the machine the script is executed on
$SPARK_HOME/sbin/start-all.sh       # starts both a master and the workers from conf/slaves
$SPARK_HOME/sbin/stop-master.sh     # stops the master started via start-master.sh
$SPARK_HOME/sbin/stop-slaves.sh     # stops all worker instances on the machines in conf/slaves
$SPARK_HOME/sbin/stop-all.sh        # stops both the master and the workers
Standalone cluster configurations
- These configurations help control everything from what happens to old files from terminated applications on each worker to the worker's core and memory resources; a few examples are sketched below.
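A minimal sketch of a few such settings in $SPARK_HOME/conf/spark-env.sh (the values are illustrative only):
# resources each worker offers to Spark applications
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=24g
# periodically clean up old application work directories on the worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"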
Submitting applications
- We can submit applications using the spark:// URI of the master, either on the master node itself or from another machine, using spark-submit; a minimal example follows.
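A minimal sketch of such a submission, assuming the standalone master's default port of 7077 (the class name and JAR path are placeholders):
$SPARK_HOME/bin/spark-submit \
  --master spark://master-ip-address:7077 \
  --class com.example.MyApp \
  /path/to/my-app.jar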
Spark on YARN
- Hadoop YARN is a framework for job scheduling and cluster resource management. Even though Spark is often misclassified as part of the Hadoop ecosystem, in reality Spark has little to do with Hadoop.
- Spark natively supports YARN and doesn't require Hadoop itself.
- You can run Spark jobs on Hadoop's YARN by specifying the master as yarn in the spark-submit command-line arguments. Just like with standalone mode, there are multiple knobs to tune the cluster according to what you would like it to do.
Submitting Applications
When submitting applications to YARN, the core difference from other deployments is that --master becomes yarn, as opposed to the master node IP as it is in standalone mode. Instead, Spark finds the YARN configuration files using the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR.
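A minimal sketch of such a submission (the class name and JAR path are placeholders, and the Hadoop/YARN configuration directory is assumed to be set as described in the next section):
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar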
Configuring Spark on YARN Applications
Hadoop Configurations
If you plan to read and write from HDFS using Spark, you need to include two Hadoop configuration files on Spark's classpath: hdfs-site.xml, which provides default behaviors for the HDFS client; and core-site.xml, which sets the default file system name. The location of these configuration files varies across Hadoop versions, but a common location is inside of /etc/hadoop/conf.
To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files, or set it as an environment variable when you go to spark-submit your application.
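For example, assuming the configuration files live in /etc/hadoop/conf (a common but not universal location):
# in $SPARK_HOME/conf/spark-env.sh
export HADOOP_CONF_DIR=/etc/hadoop/conf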
Application properties for YARN
There are a number of Hadoop-related configurations that come up which largely don't have much to do with Spark itself, but rather with running or securing YARN in a way that influences how Spark runs; a few of the more common Spark-side knobs are sketched below.
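As a sketch, a few commonly tuned YARN-related Spark properties (the values are illustrative and the queue name is hypothetical) can be set in spark-defaults.conf or passed with --conf to spark-submit:
# illustrative YARN-related properties
spark.yarn.queue            default
spark.executor.instances    10
spark.executor.memory       4g
spark.executor.cores        2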
Spark on Mesos
- Mesos was created by many of the original authors of Spark.
- Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
- Mesos intends to be a datacenter-scale cluster manager that manages not just short-lived applications like Spark jobs, but also long-running applications such as web applications and other resource interfaces.
- Mesos is the heaviest-weight cluster manager; you might choose it only if your organization already has a large-scale deployment of Mesos, but it makes for a good cluster manager nonetheless.
- Fine-grained mode has been deprecated; Spark on Mesos now exclusively supports coarse-grained mode, meaning that each Spark executor runs as a single Mesos task.
Submitting applications
- You should favor cluster mode when using Mesos. Client mode requires a lot of extra configuration in spark-env.sh to work with Mesos.
Now, when starting an application against the cluster:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.master("mesos://HOST:5050")
.appName("my app")
.config("spark.executor.uri", "<path to spark-2.2.0.tar.gz uploaded above>")
.getOrCreate()
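For cluster mode, a rough sketch (host names, ports, and paths are placeholders; the dispatcher's default port is assumed to be 7077) is to start the MesosClusterDispatcher once and then submit against it:
# start the cluster-mode dispatcher, pointing it at the Mesos master
$SPARK_HOME/sbin/start-mesos-dispatcher.sh --master mesos://HOST:5050
# submit in cluster mode against the dispatcher; the JAR must be reachable by the Mesos agents
$SPARK_HOME/bin/spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  http://host/reachable/by/agents/my-app.jar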
Application Scheduling
Each Spark application runs its own independent set of executor processes. Cluster managers provide scheduling across Spark applications. Within each Spark application, multiple jobs may run concurrently if they were submitted by different threads; Spark includes a fair scheduler to schedule resources within each application.
If multiple users need to share your cluster and run different Spark applications, there are several ways to manage resources: you can statically partition resources manually, or you can turn on dynamic allocation to let applications scale up and down based on their current number of pending tasks.
NOTE: to use dynamic allocation, enable the following options: spark.dynamicAllocation.enabled and spark.shuffle.service.enabled (see the example below).
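For example (a minimal sketch; these settings can live in spark-defaults.conf or be passed with --conf to spark-submit):
# in $SPARK_HOME/conf/spark-defaults.conf
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true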
Misc. Considerations
- Number and type of applications: YARN is great for HDFS-based applications but less so elsewhere, because compute and storage are tightly coupled. Mesos improves on this a bit conceptually and supports multiple types of applications, but it requires pre-provisioning machines and buy-in at a much larger scale. Spark standalone is the best choice for running only Spark workloads; it is lightweight and relatively simple to manage.
- Spark versions: if you want to manage multiple Spark versions, you will need to handle that yourself, as there is no easy solution.
- Logging: this is more "out of the box" with YARN or Mesos and might need some tweaking if you are using standalone mode.
- Maintaining a metastore: to maintain metadata about your stored datasets, such as a table catalog, maintaining an Apache Hive metastore may be worth doing to facilitate more productive, cross-application referencing of the same datasets.
- Spark's external shuffle service: typically Spark stores shuffle blocks on a local disk on the node running the executor. An external shuffle service stores shuffle blocks so that they remain available to all executors, meaning we can arbitrarily kill executors and still have their shuffle outputs available.