RegressionπŸ”—

  • Regression is a logical extension of classification: it is the act of predicting a real number (a continuous variable) from a set of features (represented as numbers).
  • Regression can be much harder because there are infinitely many possible output values; rather than requiring an exact match, we optimize some error metric between the true value and the predicted value.
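
To make the error-metric idea concrete, here is a minimal plain-Python sketch (with made-up numbers) of the root mean squared error that most of the models below try to minimize:

import math

y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical true values
y_pred = [2.8, 5.4, 2.0, 6.5]   # hypothetical model predictions

# RMSE: square the per-example errors, average them, take the square root
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
print(rmse)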

Use CasesπŸ”—

  • predicting movie turnout from initial responses such as trailer views, social shares, etc.
  • predicting company revenue from a set of business parameters
  • predicting crop yield based on weather

Regression Models in MLlibπŸ”—

Model ScalabilityπŸ”—

Model | Number of features | Training examples
Linear regression | 1 to 10 million | No limit
Generalized linear regression | 4,096 | No limit
Isotonic regression | N/A | Millions
Decision trees | 1,000s | No limit
Random forest | 10,000s | No limit
Gradient-boosted trees | 1,000s | No limit
Survival regression | 1 to 10 million | No limit

Data used in examples:

df = spark.read.load("/data/regression")
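
If that sample dataset is not available, a toy DataFrame with the same shape (a "features" vector column and a "label" column) can be built by hand; the values below are made up purely so the later snippets have something to run against:

from pyspark.ml.linalg import Vectors

# tiny hand-made stand-in for /data/regression
df = spark.createDataFrame([
    (Vectors.dense(1.0, 0.1, -1.0), 1.0),
    (Vectors.dense(2.0, 1.1, 1.0), 2.0),
    (Vectors.dense(3.0, 10.1, 3.0), 3.0),
], ["features", "label"])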

Linear RegressionπŸ”—

  • Linear regression assumes that the output is a linear combination of the input features plus some amount of Gaussian (normally distributed) error.
  • The model hyperparameters and training parameters are the same as for logistic regression in the previous module; refer to the documentation for the full list.
from pyspark.ml.regression import LinearRegression
lr = LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
print(lr.explainParams())
lrModel = lr.fit(df)

Training SummaryπŸ”—

The residuals are the differences between the labels in the input data and the values predicted by the model. The objective history shows how the training loss evolves at each iteration. The root mean squared error is a measure of how well our line is fitting the data, determined by looking at the distance between each predicted value and the actual value. R-squared is a measure of the proportion of the variance of the predicted variable that is captured by the model.

summary = lrModel.summary
summary.residuals.show()
print(summary.totalIterations)
print(summary.objectiveHistory)
print(summary.rootMeanSquaredError)
print(summary.r2)
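
Beyond the summary, the fitted model itself exposes the learned coefficients and intercept, and transform produces a "prediction" column (Spark's default output column name):

# inspect the fitted line and generate predictions on the training data
print(lrModel.coefficients)
print(lrModel.intercept)
lrModel.transform(df).select("features", "label", "prediction").show(5)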

Generalized Linear RegressionπŸ”—

  • Standard linear regression is actually part of a family of algorithms called generalized linear regression.
  • Spark has two implementations of this algorithm: one optimized for a large number of features (the standard LinearRegression above), and a more general one that supports more algorithms but does not scale to a large number of features (currently limited to 4,096).
  • The generalized form lets you select your own noise distribution (family) and a link function that specifies the relationship between the linear predictor and the mean of the distribution function:
Family | Response type | Supported links
Gaussian | Continuous | Identity*, Log, Inverse
Binomial | Binary | Logit*, Probit, CLogLog
Poisson | Count | Log*, Identity, Sqrt
Gamma | Continuous | Inverse*, Identity, Log
(* denotes the default link for the family)

Model HyperparametersπŸ”—

  • fitIntercept and regParam
  • family: a description of the error distribution to be used.
  • link: the name of the link function, which specifies the relationship between the linear predictor and the mean of the distribution function.
  • solver: the algorithm to be used for optimization; currently only irls (iteratively reweighted least squares) is supported.
  • variancePower: the power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and the mean of the distribution. Supported values: 0 (default) and [1, Infinity).
  • linkPower: the index in the power link function for the Tweedie family (see the sketch below).
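
As a rough sketch of the Tweedie case (parameter values here are illustrative only): when family is set to "tweedie", the link is controlled through linkPower rather than link:

from pyspark.ml.regression import GeneralizedLinearRegression
# variancePower controls the Tweedie variance function; linkPower = 0
# corresponds to the log link
tweedie = GeneralizedLinearRegression()\
  .setFamily("tweedie")\
  .setVariancePower(1.5)\
  .setLinkPower(0.0)\
  .setMaxIter(10)
tweedieModel = tweedie.fit(df)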

Training ParametersπŸ”—

  • same as logistic regression.

Prediction ParametersπŸ”—

  • linkPredictionCol: A column name that will hold the output of our link function for each prediction.
from pyspark.ml.regression import GeneralizedLinearRegression
glr = GeneralizedLinearRegression()\
  .setFamily("gaussian")\
  .setLink("identity")\
  .setMaxIter(10)\
  .setRegParam(0.3)\
  .setLinkPredictionCol("linkOut")
print(glr.explainParams())
glrModel = glr.fit(df)

Training SummaryπŸ”—

  • common success metrics:

    • R squared: the coefficient of determination; a measure of fit.
    • Residuals: the difference between the label and the predicted value.
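
A short sketch of pulling these out of the fitted model (note that, unlike the plain linear regression summary, residuals() here is a method; the exact fields available vary a bit by Spark version):

glrSummary = glrModel.summary
glrSummary.residuals().show(5)   # label minus predicted value
print(glrSummary.deviance)
print(glrSummary.aic)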

Decision TreesπŸ”—

  • Decision trees applied to regression work similarly to classification, with the only difference being that they output a single number per leaf node instead of a label.
  • Simply put, rather than training coefficients to model a function, decision tree regression creates a tree to predict numerical outputs.
  • This is of significant consequence because, unlike generalized linear regression, decision trees can predict nonlinear functions of the input data. This also creates a significant risk of overfitting the data, so we need to be careful when tuning and evaluating these models.

Model HyperparametersπŸ”—

  • same as decision trees in classification, except:
  • impurity: the metric used to decide whether the model should split at a particular leaf node with a particular value or keep it as is. The only metric currently supported for regression trees is β€œvariance.”

Training ParametersπŸ”—

  • same as classification

ExampleπŸ”—

from pyspark.ml.regression import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
print(dtr.explainParams())
dtrModel = dtr.fit(df)
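
A quick way to sanity-check the fitted tree (and spot overfitting) is to look at its size and structure:

# featureImportances shows which inputs drive the splits; toDebugString
# prints the full tree, and depth/numNodes indicate how complex it grew
print(dtrModel.featureImportances)
print(dtrModel.depth, dtrModel.numNodes)
print(dtrModel.toDebugString)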

Random Forests and Gradient Boosted TreesπŸ”—

  • Both follow the same basic concept as the decision tree: rather than training one tree, we train many trees and combine their predictions.
  • In the random forest model, many de-correlated trees are trained and their predictions averaged. With gradient-boosted trees, each tree makes a weighted prediction.
  • Both have the same model hyperparameters and training parameters as their classification counterparts, except for the impurity measure.
  • The training parameters also support checkpointInterval.
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
rf = RandomForestRegressor()
print(rf.explainParams())
rfModel = rf.fit(df)
gbt = GBTRegressor()
print(gbt.explainParams())
gbtModel = gbt.fit(df)
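
A sketch of the knobs most commonly tuned for these ensembles (the values below are illustrative, not recommendations):

# numTrees / maxIter control ensemble size, maxDepth controls tree size,
# and stepSize is the GBT learning rate
rf2 = RandomForestRegressor().setNumTrees(50).setMaxDepth(5)
gbt2 = GBTRegressor().setMaxIter(50).setMaxDepth(5).setStepSize(0.1)
rf2Model = rf2.fit(df)
print(rf2Model.featureImportances)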

Advanced MethodsπŸ”—

Survival Regression (Accelerated Failure Time)πŸ”—

  • Statisticians use survival analysis to understand the survival rate of individuals, typically in controlled experiments.
  • Spark implements the accelerated failure time (AFT) model, which, rather than describing the actual survival time, models the log of the survival time. This variation of survival regression is implemented in Spark because the better-known Cox proportional hazards model is semi-parametric and does not scale well to large datasets.
  • By contrast, accelerated failure time does scale, because each instance (row) contributes to the resulting model independently. Accelerated failure time has different assumptions than the Cox model, so one is not necessarily a drop-in replacement for the other. See L. J. Wei's paper on the accelerated failure time model for more information. A minimal sketch on made-up data follows below.
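
AFT expects a "censor" column (1.0 = event observed, 0.0 = censored) alongside the label, so the generic df used above does not apply directly; the dataset below is hand-made purely for illustration:

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import AFTSurvivalRegression

# label = survival time, censor = event indicator
survivalDf = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226)),
], ["label", "censor", "features"])

aft = AFTSurvivalRegression()
aftModel = aft.fit(survivalDf)
print(aftModel.coefficients)
aftModel.transform(survivalDf).show()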

Isotonic RegressionπŸ”—

  • Isotonic regression fits a piecewise linear function that is monotonically increasing; it can never decrease.
  • If your data generally goes up and to the right in a given plot, this is an appropriate model; see the sketch below.
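
A minimal sketch on made-up, roughly increasing data (featureIndex can be used to pick which element of the features vector the monotonic fit uses):

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import IsotonicRegression

isoDf = spark.createDataFrame([
    (0.0, Vectors.dense(0.0)),
    (0.5, Vectors.dense(1.0)),
    (0.4, Vectors.dense(2.0)),
    (0.8, Vectors.dense(3.0)),
    (1.0, Vectors.dense(4.0)),
], ["label", "features"])

iso = IsotonicRegression()
isoModel = iso.fit(isoDf)
print(isoModel.boundaries)   # x-values where the piecewise fit changes slope
print(isoModel.predictions)  # fitted y-values at those boundaries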

Evaluators and Automating Model TuningπŸ”—

  • Regression has the same core model tuning functionality as classification: we specify an evaluator, pick a metric to optimize for, and then train a pipeline to perform the parameter tuning on our behalf.
  • RegressionEvaluator allows us to optimize for a number of common regression success metrics. Just like the classification evaluator, RegressionEvaluator expects two columns, a column representing the prediction and another representing the true label. The supported metrics to optimize for are the root mean squared error (β€œrmse”), the mean squared error (β€œmse”), the r2 metric (β€œr2”), and the mean absolute error (β€œmae”).
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
glr = GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
pipeline = Pipeline().setStages([glr])
params = ParamGridBuilder().addGrid(glr.regParam, [0, 0.5, 1]).build()
evaluator = RegressionEvaluator()\
  .setMetricName("rmse")\
  .setPredictionCol("prediction")\
  .setLabelCol("label")
cv = CrossValidator()\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)\
  .setEstimatorParamMaps(params)\
  .setNumFolds(2) # should always be 3 or more but this dataset is small
model = cv.fit(df)
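
Once fitted, the CrossValidatorModel exposes the averaged metric for each parameter combination and the best pipeline it found:

print(model.avgMetrics)              # mean RMSE per ParamMap, in the same order as params
bestGlr = model.bestModel.stages[-1] # the fitted GLR inside the winning pipeline
print(bestGlr.coefficients)
print(bestGlr.intercept)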

MetricsπŸ”—

We can also access a number of regression metrics via the RegressionMetrics object from the older RDD-based pyspark.mllib API, which expects an RDD of (prediction, label) pairs.

from pyspark.mllib.evaluation import RegressionMetrics
out = model.transform(df)\
  .select("prediction", "label").rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = RegressionMetrics(out)
print("MSE: " + str(metrics.meanSquaredError))
print("RMSE: " + str(metrics.rootMeanSquaredError))
print("R-squared: " + str(metrics.r2))
print("MAE: " + str(metrics.meanAbsoluteError))
print("Explained variance: " + str(metrics.explainedVariance))