Deep Learning

Because deep learning is still a new field, many of the newest tools are implemented in external libraries.

What is Deep Learning?

A neural network is a graph of nodes with weights and activation functions.
Nodes are organized into layers that are stacked on top of one another. Each layer is connected, either partially or completely, to the previous layer in the network.
The goal is to train the network to associate certain inputs with certain outputs by tuning the weights associated with each connection and the values of each node in the network.
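As a rough illustration of the description above, the sketch below passes an input through two fully connected layers in plain Python. The weights, biases, and the choice of a sigmoid activation are assumptions made for the example, not values from any trained model; training would adjust them.

```python
import math


def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases, activation):
    """Compute one fully connected layer.

    weights[j][i] is the weight on the connection from input i to node j.
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


def forward(inputs, layers):
    """Pass the inputs through each layer in order."""
    values = inputs
    for weights, biases in layers:
        values = dense_layer(values, weights, biases, sigmoid)
    return values


# Hypothetical network: 2 inputs, one hidden layer of 2 nodes, 1 output node.
network = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer (weights, biases)
    ([[1.0, -1.0]], [0.0]),                     # output layer (weights, biases)
]

prediction = forward([1.0, 1.0], network)
```

Stacking more `(weights, biases)` pairs in `network` is what makes the network deeper.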

Deep learning, or deep neural networks, stacks many of these layers together into various architectures. The concept has existed for decades and has waxed and waned in popularity for various machine learning problems.
Traditional machine learning techniques often cannot continue to improve as more data is added; their performance hits a ceiling. Deep learning can benefit from enormous amounts of data and information.
The most popular tools are TensorFlow, MXNet, Keras, and PyTorch.

Ways of Using Deep Learning in Spark

Inference: The simplest way to use deep learning is to take a pretrained model and apply it to large datasets in parallel using Spark. Using PySpark, we could simply call a framework such as TensorFlow or PyTorch in a map function to get distributed inference.
Featurization and Transfer Learning: The next level of complexity is to use an existing model as a featurizer instead of taking its final output. Many deep learning models learn useful feature representations in their lower layers as they get trained for an end-to-end task.
- This method is called transfer learning, and generally involves modifying the last few layers of a pretrained model and retraining them with the data of interest. Transfer learning is especially useful if you do not have a large amount of training data: training a full-blown network from scratch requires a dataset of hundreds of thousands of images, like ImageNet, to avoid overfitting.
Model Training: Spark can also be used to train a new deep learning model from scratch. There are two common methods here.
- First, you can use a Spark cluster to parallelize the training of a single model over multiple servers, communicating updates between them.
- Alternatively, some libraries let the user train multiple instances of similar models in parallel to try various model architectures and hyperparameters, accelerating the model search and tuning process.
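The inference pattern described above can be sketched as a per-partition function. This is a minimal sketch under stated assumptions: `load_model` is a hypothetical stand-in that returns a plain Python function so the example runs without TensorFlow or PyTorch, and the scoring rule inside it is made up for illustration.

```python
def load_model():
    """Hypothetical stand-in for loading a pretrained model.

    In practice this would deserialize a TensorFlow or PyTorch model; here it
    returns a plain function so the sketch runs without either framework.
    """
    def model(features):
        # Made-up scoring rule, used only for illustration.
        return sum(features) > 1.0
    return model


def predict_partition(rows):
    """Score one partition of rows, loading the model once per partition."""
    model = load_model()
    for features in rows:
        yield (features, model(features))


# Locally, a partition is just an iterator of rows.
sample_partition = [[0.2, 0.3], [0.9, 0.8]]
local_results = list(predict_partition(iter(sample_partition)))
```

On a cluster, the same function could be passed to `rdd.mapPartitions(predict_partition)`, so that each task loads the model once per partition rather than once per row.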

Deep Learning Libraries

MLlib Neural Network Support

Spark’s MLlib currently has native support for a single deep learning algorithm: the multilayer perceptron classifier, implemented in the ml.classification.MultilayerPerceptronClassifier class. This class is limited to training relatively shallow networks containing fully connected layers with the sigmoid activation function and an output layer with a softmax activation function.

This class is most useful for training the last few layers of a classification model when using transfer learning on top of an existing deep learning–based featurizer.
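A minimal sketch of how this class might be used from PySpark follows. The helper names `layer_sizes` and `train_mlp`, the column names, and the layer sizes are assumptions for illustration; `train_mlp` expects a DataFrame that already has a vector `features` column (for example, produced by a featurizer) and a numeric `label` column.

```python
def layer_sizes(num_features, hidden_sizes, num_classes):
    """Build the `layers` list: input size, hidden layer sizes, output size."""
    return [num_features, *hidden_sizes, num_classes]


def train_mlp(train_df, layers):
    """Fit a multilayer perceptron on a DataFrame with `features` and `label` columns."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    mlp = MultilayerPerceptronClassifier(
        featuresCol="features",
        labelCol="label",
        layers=layers,  # first entry must match the feature vector length
        maxIter=100,
        seed=42,
    )
    return mlp.fit(train_df)


# Hypothetical sizes: 128 features from a featurizer, one hidden layer, 3 classes.
layers = layer_sizes(128, [32], 3)
```

Calling `train_mlp(train_df, layers)` returns a fitted model whose `transform` method adds a `prediction` column to a DataFrame.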

TensorFrames

TensorFrames is an inference and transfer learning-oriented library that makes it easy to pass data between Spark DataFrames and TensorFlow.

BigDL

BigDL is a distributed deep learning framework for Apache Spark primarily developed by Intel. It aims to support distributed training of large models as well as fast application of these models for inference.

TensorFlowOnSpark

TensorFlowOnSpark is a widely used library that can train TensorFlow models in a parallel fashion on Spark clusters. TensorFlow includes some foundations for distributed training, but it does not come with a cluster manager or a distributed I/O layer out of the box, so it still needs to rely on a cluster manager for managing the hardware and data communications. TensorFlowOnSpark launches TensorFlow’s existing distributed mode inside a Spark job, and automatically feeds data from Spark RDDs or DataFrames into the TensorFlow job.

DeepLearning4J

DeepLearning4j is an open-source, distributed deep learning project in Java and Scala that provides both single-node and distributed training options.

Deep Learning Pipelines

Deep Learning Pipelines is an open source package from Databricks that integrates deep learning functionality into Spark’s ML Pipelines API. The package builds on existing deep learning frameworks (TensorFlow and Keras at the time of writing), but focuses on two goals:

- Incorporating these frameworks into standard Spark APIs (such as ML Pipelines and Spark SQL) to make them very easy to use
- Distributing all computation by default