RegressionπŸ”—

  • Regression is a logical extension of classification: it is the act of predicting a real number (a continuous variable) from a set of features (represented as numbers).
  • Regression can be much harder because there are infinitely many possible output values; rather than requiring an exact match, we optimize some error metric between the true value and the predicted value.
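
To make the error-metric idea concrete, here is a minimal plain-Python sketch (with made-up numbers) of the root mean squared error that most of the models below try to minimize:

import math

y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical true values
y_pred = [2.8, 5.4, 2.0, 6.5]   # hypothetical model predictions

# RMSE: square the per-example errors, average them, take the square root
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
print(rmse)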

Use CasesπŸ”—

  • predicting movie turnout from initial responses such as trailer views, social shares, etc.
  • predicting company revenue from a set of business parameters
  • predicting crop yield based on weather

Regression Models in MLlibπŸ”—

Model ScalabilityπŸ”—

Model | Number of features | Training examples
Linear regression | 1 to 10 million | No limit
Generalized linear regression | 4,096 | No limit
Isotonic regression | N/A | Millions
Decision trees | 1,000s | No limit
Random forest | 10,000s | No limit
Gradient-boosted trees | 1,000s | No limit
Survival regression | 1 to 10 million | No limit

Data used in examples:

df = spark.read.load("/data/regression")
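
If that sample dataset is not available, a toy DataFrame with the same shape (a "features" vector column and a "label" column) can be built by hand; the values below are made up purely so the later snippets have something to run against:

from pyspark.ml.linalg import Vectors

# tiny hand-made stand-in for /data/regression
df = spark.createDataFrame([
    (Vectors.dense(1.0, 0.1, -1.0), 1.0),
    (Vectors.dense(2.0, 1.1, 1.0), 2.0),
    (Vectors.dense(3.0, 10.1, 3.0), 3.0),
], ["features", "label"])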

Linear RegressionπŸ”—

  • Linear regression assumes that the output is a linear combination of the input features plus some amount of Gaussian (normally distributed) error.
  • The model hyperparameters and training parameters are the same as for logistic regression in the previous module; refer to the documentation for the full list.
from pyspark.ml.regression import LinearRegression
lr = LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
print(lr.explainParams())
lrModel = lr.fit(df)

Training SummaryπŸ”—

The residuals are the differences between the labels in the input data and the values predicted by the model. The objective history shows how the training loss evolves at each iteration. The root mean squared error is a measure of how well our line is fitting the data, determined by looking at the distance between each predicted value and the actual value. R-squared is a measure of the proportion of the variance of the predicted variable that is captured by the model.

summary = lrModel.summary
summary.residuals.show()
print(summary.totalIterations)
print(summary.objectiveHistory)
print(summary.rootMeanSquaredError)
print(summary.r2)
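
Beyond the summary, the fitted model itself exposes the learned coefficients and intercept, and transform produces a "prediction" column (Spark's default output column name):

# inspect the fitted line and generate predictions on the training data
print(lrModel.coefficients)
print(lrModel.intercept)
lrModel.transform(df).select("features", "label", "prediction").show(5)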

Generalized Linear RegressionπŸ”—

  • Standard linear regression is actually part of a family of algorithms called generalized linear regression.
  • Spark has two implementations of this algorithm: one optimized for a large number of features (the standard LinearRegression above), and a more general one that supports more algorithms but does not scale to a large number of features (currently limited to 4,096).
  • The generalized form lets you select your own noise distribution (family) and a link function that specifies the relationship between the linear predictor and the mean of the distribution function:
Family | Response type | Supported links
Gaussian | Continuous | Identity*, Log, Inverse
Binomial | Binary | Logit*, Probit, CLogLog
Poisson | Count | Log*, Identity, Sqrt
Gamma | Continuous | Inverse*, Identity, Log
(* denotes the default link for the family)

Model HyperparametersπŸ”—

  • fitIntercept and regParam
  • family: a description of the error distribution to be used.
  • link: the name of the link function, which specifies the relationship between the linear predictor and the mean of the distribution function.
  • solver: the algorithm to be used for optimization; currently only irls (iteratively reweighted least squares) is supported.
  • variancePower: the power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and the mean of the distribution. Supported values: 0 (default) and [1, Infinity).
  • linkPower: the index in the power link function for the Tweedie family (see the sketch below).
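
As a rough sketch of the Tweedie case (parameter values here are illustrative only): when family is set to "tweedie", the link is controlled through linkPower rather than link:

from pyspark.ml.regression import GeneralizedLinearRegression
# variancePower controls the Tweedie variance function; linkPower = 0
# corresponds to the log link
tweedie = GeneralizedLinearRegression()\
  .setFamily("tweedie")\
  .setVariancePower(1.5)\
  .setLinkPower(0.0)\
  .setMaxIter(10)
tweedieModel = tweedie.fit(df)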

Training ParametersπŸ”—

  • same as logistic regression.

Prediction ParametersπŸ”—

  • linkPredictionCol: A column name that will hold the output of our link function for each prediction.
from pyspark.ml.regression import GeneralizedLinearRegression
glr = GeneralizedLinearRegression()\
  .setFamily("gaussian")\
  .setLink("identity")\
  .setMaxIter(10)\
  .setRegParam(0.3)\
  .setLinkPredictionCol("linkOut")
print(glr.explainParams())
glrModel = glr.fit(df)

Training SummaryπŸ”—

  • common success metrics:

    • R squared: the coefficient of determination; a measure of fit.
    • Residuals: the difference between the label and the predicted value.
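
A short sketch of pulling these out of the fitted model (note that, unlike the plain linear regression summary, residuals() here is a method; the exact fields available vary a bit by Spark version):

glrSummary = glrModel.summary
glrSummary.residuals().show(5)   # label minus predicted value
print(glrSummary.deviance)
print(glrSummary.aic)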

Decision TreesπŸ”—

  • Decision trees applied to regression work similarly to classification, with the only difference being that they output a single number per leaf node instead of a label.
  • Simply put, rather than training coefficients to model a function, decision tree regression creates a tree to predict numerical outputs.
  • This is of significant consequence because, unlike generalized linear regression, decision trees can predict nonlinear functions of the input data. This also creates a significant risk of overfitting the data, so we need to be careful when tuning and evaluating these models.

Model HyperparametersπŸ”—

  • same as decision trees in classification, except:
  • impurity: the metric used to decide whether the model should split at a particular leaf node with a particular value or keep it as is. The only metric currently supported for regression trees is β€œvariance.”

Training ParametersπŸ”—

  • same as classification

ExampleπŸ”—

from pyspark.ml.regression import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
print(dtr.explainParams())
dtrModel = dtr.fit(df)
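
A quick way to sanity-check the fitted tree (and spot overfitting) is to look at its size and structure:

# featureImportances shows which inputs drive the splits; toDebugString
# prints the full tree, and depth/numNodes indicate how complex it grew
print(dtrModel.featureImportances)
print(dtrModel.depth, dtrModel.numNodes)
print(dtrModel.toDebugString)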

Random Forests and Gradient Boosted TreesπŸ”—

  • Both follow the same basic concept as the decision tree: rather than training one tree, we train many trees and combine their predictions.
  • In the random forest model, many de-correlated trees are trained and their predictions averaged. With gradient-boosted trees, each tree makes a weighted prediction.
  • Both have the same model hyperparameters and training parameters as their classification counterparts, except for the impurity measure.
  • The training parameters also support checkpointInterval.
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
rf = RandomForestRegressor()
print(rf.explainParams())
rfModel = rf.fit(df)
gbt = GBTRegressor()
print(gbt.explainParams())
gbtModel = gbt.fit(df)
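
A sketch of the knobs most commonly tuned for these ensembles (the values below are illustrative, not recommendations):

# numTrees / maxIter control ensemble size, maxDepth controls tree size,
# and stepSize is the GBT learning rate
rf2 = RandomForestRegressor().setNumTrees(50).setMaxDepth(5)
gbt2 = GBTRegressor().setMaxIter(50).setMaxDepth(5).setStepSize(0.1)
rf2Model = rf2.fit(df)
print(rf2Model.featureImportances)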

Advanced MethodsπŸ”—

Survival Regression (Accelerated Failure Time)πŸ”—

  • Statisticians use survival analysis to understand the survival rate of individuals, typically in controlled experiments.
  • Spark implements the accelerated failure time (AFT) model, which, rather than describing the actual survival time, models the log of the survival time. This variation of survival regression is implemented in Spark because the better-known Cox proportional hazards model is semi-parametric and does not scale well to large datasets.
  • By contrast, accelerated failure time does scale, because each instance (row) contributes to the resulting model independently. Accelerated failure time has different assumptions than the Cox model, so one is not necessarily a drop-in replacement for the other. See L. J. Wei's paper on the accelerated failure time model for more information. A minimal sketch on made-up data follows below.
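
AFT expects a "censor" column (1.0 = event observed, 0.0 = censored) alongside the label, so the generic df used above does not apply directly; the dataset below is hand-made purely for illustration:

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import AFTSurvivalRegression

# label = survival time, censor = event indicator
survivalDf = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226)),
], ["label", "censor", "features"])

aft = AFTSurvivalRegression()
aftModel = aft.fit(survivalDf)
print(aftModel.coefficients)
aftModel.transform(survivalDf).show()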

Isotonic RegressionπŸ”—

  • Isotonic regression fits a piecewise linear function that is monotonically increasing; it can never decrease.
  • If your data generally goes up and to the right in a given plot, this is an appropriate model; see the sketch below.
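
A minimal sketch on made-up, roughly increasing data (featureIndex can be used to pick which element of the features vector the monotonic fit uses):

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import IsotonicRegression

isoDf = spark.createDataFrame([
    (0.0, Vectors.dense(0.0)),
    (0.5, Vectors.dense(1.0)),
    (0.4, Vectors.dense(2.0)),
    (0.8, Vectors.dense(3.0)),
    (1.0, Vectors.dense(4.0)),
], ["label", "features"])

iso = IsotonicRegression()
isoModel = iso.fit(isoDf)
print(isoModel.boundaries)   # x-values where the piecewise fit changes slope
print(isoModel.predictions)  # fitted y-values at those boundaries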

Evaluators and Automating Model TuningπŸ”—

  • Regression has the same core model tuning functionality as classification: we specify an evaluator, pick a metric to optimize for, and then train a pipeline to perform the parameter tuning on our behalf.
  • RegressionEvaluator allows us to optimize for a number of common regression success metrics. Just like the classification evaluator, RegressionEvaluator expects two columns, a column representing the prediction and another representing the true label. The supported metrics to optimize for are the root mean squared error (β€œrmse”), the mean squared error (β€œmse”), the r2 metric (β€œr2”), and the mean absolute error (β€œmae”).
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
glr = GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
pipeline = Pipeline().setStages([glr])
params = ParamGridBuilder().addGrid(glr.regParam, [0, 0.5, 1]).build()
evaluator = RegressionEvaluator()\
  .setMetricName("rmse")\
  .setPredictionCol("prediction")\
  .setLabelCol("label")
cv = CrossValidator()\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)\
  .setEstimatorParamMaps(params)\
  .setNumFolds(2) # should always be 3 or more but this dataset is small
model = cv.fit(df)
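
Once fitted, the CrossValidatorModel exposes the averaged metric for each parameter combination and the best pipeline it found:

print(model.avgMetrics)              # mean RMSE per ParamMap, in the same order as params
bestGlr = model.bestModel.stages[-1] # the fitted GLR inside the winning pipeline
print(bestGlr.coefficients)
print(bestGlr.intercept)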

MetricsπŸ”—

We can also access a number of regression metrics via the RegressionMetrics object from the older RDD-based pyspark.mllib API, which expects an RDD of (prediction, label) pairs.

from pyspark.mllib.evaluation import RegressionMetrics
out = model.transform(df)\
  .select("prediction", "label").rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = RegressionMetrics(out)
print("MSE: " + str(metrics.meanSquaredError))
print("RMSE: " + str(metrics.rootMeanSquaredError))
print("R-squared: " + str(metrics.r2))
print("MAE: " + str(metrics.meanAbsoluteError))
print("Explained variance: " + str(metrics.explainedVariance))