Regression
- Regression is a logical extension of classification: the act of predicting a real number (a continuous variable) from a set of features (represented as numbers).
- Regression can be much harder because there are infinitely many possible output values; we try to optimize some error metric between the true value and the predicted value.
Use Cases
- predicting movie turnout from initial responses like trailer views, social shares, etc.
- predicting company revenue from a set of business parameters
- predicting crop yield based on weather
Regression Models in MLlib
- Linear regression
- Generalized linear regression
- Decision tree regression
- Random forest regression
- Gradient-boosted tree regression
- Survival regression
- Isotonic regression
- Factorization machines regressor
Model Scalability
Model | Number of features | Training examples |
---|---|---|
Linear regression | 1 to 10 million | No limit |
Generalized linear regression | 4,096 | No limit |
Isotonic regression | N/A | Millions |
Decision trees | 1,000s | No limit |
Random forest | 10,000s | No limit |
Gradient-boosted trees | 1,000s | No limit |
Survival regression | 1 to 10 million | No limit |
Data used in examples: the snippets below all fit against a DataFrame named df with a features vector column and a label column.
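As a minimal, hypothetical stand-in (assuming an existing SparkSession named spark; the values below are made up purely so the examples run), df could be built like this:
from pyspark.ml.linalg import Vectors
# tiny placeholder DataFrame; replace with your own data source (e.g. spark.read.load(...))
df = spark.createDataFrame([
    (Vectors.dense(1.0, 0.1, -1.0), 1.0),
    (Vectors.dense(2.0, 1.1, 1.0), 2.0),
    (Vectors.dense(3.0, 10.1, 3.0), 3.0)
], ["features", "label"])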
Linear Regression
- Linear regression assumes that a linear combination of your input features (each feature multiplied by a weight and summed) results in the output label, along with some amount of Gaussian error.
- Model hyperparameters and training parameters are the same as in the previous module; refer to the documentation for details.
from pyspark.ml.regression import LinearRegression
lr = LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
print(lr.explainParams())
lrModel = lr.fit(df)
Training Summary
The residuals are the differences between each predicted value and the actual value in the data. The objective history shows how our training is going at every iteration. The root mean squared error is a measure of how well our line is fitting the data, determined by looking at the distance between each predicted value and the actual value. The R-squared value is a measure of the proportion of the variance of the predicted variable that is captured by the model.
summary = lrModel.summary
summary.residuals.show()
print(summary.totalIterations)
print(summary.objectiveHistory)
print(summary.rootMeanSquaredError)
print(summary.r2)
Generalized Linear Regression
- Standard linear regression is actually part of a family of algorithms called generalized linear regression.
- Spark has two implementations of this algorithm: one is optimized for a large set of features, and the other is more general and supports more algorithms, but doesn't scale to a large number of features.
- The generalized form allows you to select your own noise distribution and supports a link function that specifies the relationship between the linear predictor and the mean of the distribution function.
Family | Response type | Supported links |
---|---|---|
Gaussian | Continuous | Identity*, Log, Inverse |
Binomial | Binary | Logit*, Probit, CLogLog |
Poisson | Count | Log*, Identity, Sqrt |
Gamma | Continuous | Inverse*, Identity, Log |
Model Hyperparameters
- fitIntercept and regParam : same as linear regression.
- family : description of the error distribution to be used in the model.
- link : name of the link function, which specifies the relationship between the linear predictor and the mean of the distribution function.
- solver : algorithm to be used for optimization; currently only irls (iteratively reweighted least squares) is supported.
- variancePower : the power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and mean of the distribution. Supported values are 0 (the default) and [1, Infinity).
- linkPower : the index in the power link function for the Tweedie family (both Tweedie parameters are shown in the sketch after this list).
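As a hedged illustration of the two Tweedie-related parameters (the variancePower of 1.5 and linkPower of 0.0 are arbitrary choices for demonstration, not recommendations):
from pyspark.ml.regression import GeneralizedLinearRegression
# with the tweedie family, variancePower/linkPower take the place of an explicit link name
tweedie = GeneralizedLinearRegression()\
  .setFamily("tweedie")\
  .setVariancePower(1.5)\
  .setLinkPower(0.0)
tweedieModel = tweedie.fit(df)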
Training Parameters
- same as logistic regression.
Prediction Parameters
- linkPredictionCol : a column name that will hold the output of our link function for each prediction.
from pyspark.ml.regression import GeneralizedLinearRegression
glr = GeneralizedLinearRegression()\
.setFamily("gaussian")\
.setLink("identity")\
.setMaxIter(10)\
.setRegParam(0.3)\
.setLinkPredictionCol("linkOut")
print(glr.explainParams())
glrModel = glr.fit(df)
Training Summary
Common success metrics:
- R squared : the coefficient of determination; a measure of fit.
- residuals : the difference between the label and the predicted value.
Decision Trees
- Decision trees applied to regression work similarly to classification; the only difference is that they output a single number per leaf node instead of a label.
- Simply put, rather than training coefficients to model a function, decision tree regression creates a tree to predict numerical outputs.
- This is of significant consequence because unlike generalized linear regression, we can predict nonlinear functions in the input data. This also creates a significant risk of overfitting the data, so we need to be careful when tuning and evaluating these models.
Model Hyperparameters
- same as decision trees in classification, except:
- impurity : represents the metric (information gain) for whether or not the model should split at a particular leaf node with a particular value or keep it as is. The only metric currently supported for regression trees is "variance".
Training Parameters
- same as classification
Example
from pyspark.ml.regression import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
print(dtr.explainParams())
dtrModel = dtr.fit(df)
Random Forests and Gradient-Boosted Trees
- Both follow the same basic concept as the decision tree: rather than training one tree, we train many trees.
- In the random forest model, many de-correlated trees are trained and then averaged. With gradient-boosted trees, each tree makes a weighted prediction.
- Both have the same model hyperparameters and training parameters as their classification counterparts, except for the impurity measure.
- Training parameters also support checkpointInterval.
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
rf = RandomForestRegressor()
print(rf.explainParams())
rfModel = rf.fit(df)
gbt = GBTRegressor()
print(gbt.explainParams())
gbtModel = gbt.fit(df)
Advanced Methods
Survival Regression (Accelerated Failure Time)
- Statisticians use survival analysis to understand the survival rate of individuals, typically in controlled experiments.
- Spark implements the accelerated failure time (AFT) model, which, rather than describing the actual survival time, models the log of the survival time. This variation of survival regression is implemented in Spark because the better-known Cox proportional hazards model is semi-parametric and does not scale well to large datasets.
- By contrast, accelerated failure time does scale, because each instance (row) contributes to the resulting model independently. Accelerated failure time has different assumptions than the Cox survival model, so one is not necessarily a drop-in replacement for the other. See L. J. Wei's paper on accelerated failure time for more information.
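Spark exposes this model as AFTSurvivalRegression. Below is a minimal hedged sketch; the tiny inline dataset and the quantile settings are made up purely for illustration (censor = 1.0 means the event was observed, 0.0 means the row was censored):
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import AFTSurvivalRegression
# fabricated survival data: label is the observed time, censor marks whether the event occurred
training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))
], ["label", "censor", "features"])
aft = AFTSurvivalRegression(quantileProbabilities=[0.3, 0.6], quantilesCol="quantiles")
aftModel = aft.fit(training)
aftModel.transform(training).show(truncate=False)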
Isotonic Regression
- defines a piecewise linear function that is monotonically increasing.
- if your data is going up and to the right in a given plot, this is an appropriate model.
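A minimal hedged sketch using Spark's IsotonicRegression; the single-feature inline data here is fabricated just to show the API shape:
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import IsotonicRegression
# made-up data where the label roughly increases with the feature
data = spark.createDataFrame([
    (0.0, Vectors.dense(0.0)),
    (0.5, Vectors.dense(1.0)),
    (0.3, Vectors.dense(2.0)),
    (0.8, Vectors.dense(3.0)),
    (1.0, Vectors.dense(4.0))
], ["label", "features"])
iso = IsotonicRegression()  # defaults to featuresCol="features", labelCol="label"
isoModel = iso.fit(data)
isoModel.transform(data).show()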
Evaluators and Automating Model Tuning
- Regression has the same core model tuning functionality as classification: we specify an evaluator, pick a metric to optimize for, and then train a pipeline to perform that parameter tuning on our behalf.
RegressionEvaluator allows us to optimize for a number of common regression success metrics. Just like the classification evaluator, RegressionEvaluator expects two columns: one representing the prediction and another representing the true label. The supported metrics to optimize for are the root mean squared error ("rmse"), the mean squared error ("mse"), the r2 metric ("r2"), and the mean absolute error ("mae").
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
glr = GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
pipeline = Pipeline().setStages([glr])
params = ParamGridBuilder().addGrid(glr.regParam, [0, 0.5, 1]).build()
evaluator = RegressionEvaluator()\
.setMetricName("rmse")\
.setPredictionCol("prediction")\
.setLabelCol("label")
cv = CrossValidator()\
.setEstimator(pipeline)\
.setEvaluator(evaluator)\
.setEstimatorParamMaps(params)\
.setNumFolds(2) # should always be 3 or more but this dataset is small
model = cv.fit(df)
Metrics
We can also access a number of regression metrics via the RegressionMetrics object.
from pyspark.mllib.evaluation import RegressionMetrics
out = model.transform(df)\
.select("prediction", "label").rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = RegressionMetrics(out)
print("MSE: " + str(metrics.meanSquaredError))
print("RMSE: " + str(metrics.rootMeanSquaredError))
print("R-squared: " + str(metrics.r2))
print("MAE: " + str(metrics.meanAbsoluteError))
print("Explained variance: " + str(metrics.explainedVariance))