Recommendation
- By observing people's explicit or implicit preferences, we can predict what a user may like based on similarity among users.
Use Cases
- Movie recommendations by Netflix, Amazon, HBO, etc.
- Course Recommendations
In Spark, the Alternating Least Squares (ALS) algorithm uses a technique called collaborative filtering, which makes recommendations based only on which items users interacted with in the past.
It doesn't require any additional features about the users or the items. Spark supports several ALS variants.
Collaborative Filtering with Alternating Least Squares
- ALS finds a k-dimensional feature vector for each user and item such that the dot product of the two approximates the user's rating for that item (see the objective sketched after this list).
- It only requires an input dataset of existing ratings between user-item pairs, with three columns: a user ID column, an item ID column, and a rating column.
- NOTE: the rating can be explicit (a numerical rating we aim to predict directly) or implicit (based on the strength of user interactions).
- ALS suffers from the cold start problem common to recommendation systems. https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)
- In terms of scaling, it scales to millions and even billions of ratings.
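A sketch of the underlying objective may help (this is the standard explicit-feedback ALS formulation; the notation is mine, not from these notes). With a user vector $u_u \in \mathbb{R}^k$ and an item vector $v_i \in \mathbb{R}^k$, ALS minimizes

$$\min_{U, V} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - u_u^\top v_i \right)^2 + \lambda \left( \lVert u_u \rVert^2 + \lVert v_i \rVert^2 \right)$$

where $r_{ui}$ is the known rating and $\lambda$ plays the role of regParam. Holding the item vectors fixed turns this into an ordinary least-squares problem for the user vectors (and vice versa), so the algorithm simply alternates between the two, which is what makes it easy to parallelize.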
Model Hyperparameters
rank
(default 10): dimension of the feature vectors learned for users and items. This is tuned through experimentation; a higher rank lets the model fit the data more closely, at the risk of overfitting.
alpha
: when training on implicit feedback, alpha sets a baseline confidence for preference. Default is 1; it should be tuned through experimentation.
regParam
: controls regularization to prevent overfitting. Default 0.1.
implicitPrefs
: Boolean value; whether to train on implicit data or not. Default is false.
nonnegative
: Boolean value; configures the model to place a non-negativity constraint on the least-squares problem and return only non-negative feature vectors. Default is false.
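A minimal sketch of how these map onto the ALS estimator in pyspark.ml (the specific values and column names are illustrative assumptions, not recommendations):
from pyspark.ml.recommendation import ALS
# hypothetical implicit-feedback configuration; tune rank, regParam,
# and alpha through experimentation as noted above
als_implicit = ALS(
    rank=20,              # dimension of the learned feature vectors
    regParam=0.1,         # regularization strength
    implicitPrefs=True,   # treat the rating column as implicit feedback
    alpha=40.0,           # baseline confidence for implicit preferences
    nonnegative=True,     # constrain feature vectors to be non-negative
    userCol="userId", itemCol="movieId", ratingCol="rating")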
Training Parameters
The groups of data that are distributed around the cluster are called blocks. Determining how much data to place in each block can have a significant impact on the time it takes to train the algorithm.
A good rule of thumb is to aim for approximately one to five million ratings per block.
numUserBlocks
: determines how many blocks to split the users into. Default 10.
numItemBlocks
: determines how many blocks to split the items into. Default 10.
maxIter
: total number of iterations over the data before stopping. Changing this probably won't change results a ton. Default 10.
checkpointInterval
: checkpointing allows you to save model state during training to recover more quickly from node failures. You can set a checkpoint directory using SparkContext.setCheckpointDir.
seed
: specifying a random seed can help you replicate your results.
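In code, these look like the following (a sketch: the block counts, checkpoint path, and seed are assumptions for illustration, and an existing SparkSession named spark is assumed):
from pyspark.ml.recommendation import ALS
spark.sparkContext.setCheckpointDir("/tmp/als_checkpoints")  # hypothetical path
als_large = ALS(
    numUserBlocks=20,      # aim for roughly 1-5 million ratings per block
    numItemBlocks=20,
    maxIter=10,            # iterations over the data
    checkpointInterval=2,  # save model state every 2 iterations
    seed=42,               # fixed seed so results can be replicated
    userCol="userId", itemCol="movieId", ratingCol="rating")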
Prediction Parameters
- defines how a trained model should actually make predictions
- There is one parameter: the cold start strategy (coldStartStrategy), which determines how the model should predict for users or items that did not appear in the training set.
- The cold start challenge commonly arises when you're serving a model in production and new users and/or items have no ratings history, so the model has no recommendation to make. It can also occur when using simple random splits as in Spark's CrossValidator or TrainValidationSplit, where it is very common to encounter users and/or items in the evaluation set that are not in the training set.
- By default, Spark will assign NaN prediction values when it encounters a user and/or item that is not present in the model. This can be useful because you can design your overall system to fall back to some default recommendation when a new user or item enters the system. However, it is undesirable during training because it ruins your evaluator's ability to properly measure the success of your model, making model selection impossible. Spark allows users to set the coldStartStrategy parameter to drop in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will be valid. drop and nan (the default) are the only currently supported cold-start strategies, as sketched below.
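For instance (a minimal sketch; the column names are carried over from the example below):
from pyspark.ml.recommendation import ALS
# drop NaN predictions during training/evaluation so metrics stay valid
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
# in production you might switch back, so unseen users surface as NaN
# and your serving layer can fall back to a default recommendation
als.setColdStartStrategy("nan")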
Example
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
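# each line of the ratings file has the form userId::movieId::rating::timestamp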
ratings = spark.read.text("/data/sample_movielens_ratings.txt")\
.rdd.toDF()\
.selectExpr("split(value , '::') as col")\
.selectExpr(
"cast(col[0] as int) as userId",
"cast(col[1] as int) as movieId",
"cast(col[2] as float) as rating",
"cast(col[3] as long) as timestamp")
training, test = ratings.randomSplit([0.8, 0.2])
als = ALS()\
.setMaxIter(5)\
.setRegParam(0.01)\
.setUserCol("userId")\
.setItemCol("movieId")\
.setRatingCol("rating")
print(als.explainParams())
alsModel = als.fit(training)
predictions = alsModel.transform(test)
# output the top k recommendations for each user or movie
alsModel.recommendForAllUsers(10)\
.selectExpr("userId", "explode(recommendations)").show()
alsModel.recommendForAllItems(10)\
.selectExpr("movieId", "explode(recommendations)").show()
Evaluators for Recommendation
- When using the cold-start strategy above, we can set up an automatic model evaluator to work with ALS.
- The recommendation problem is really just a kind of regression problem: since we are predicting ratings for given users, we want to minimize the total difference between our users' ratings and the true values.
- We can use the same RegressionEvaluator and place it in a pipeline to automate the training process (a sketch follows the snippet below). When doing so, we should set the cold-start strategy to drop instead of nan, and then switch back to nan when it comes time to actually make predictions in production systems.
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator()\
.setMetricName("rmse")\
.setLabelCol("rating")\
.setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = %f" % rmse)
Metrics
- Recommendation results can be measured using both standard regression metrics and some recommendation-specific metrics.
Regression Metrics
from pyspark.mllib.evaluation import RegressionMetrics
# RegressionMetrics expects an RDD of (prediction, observation) pairs
regComparison = predictions.select("prediction", "rating")\
  .rdd.map(lambda x: (x[0], x[1]))
metrics = RegressionMetrics(regComparison)
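The resulting object exposes the usual regression measures, for example:
print("RMSE = %f" % metrics.rootMeanSquaredError)
print("MAE = %f" % metrics.meanAbsoluteError)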
Ranking Metrics
A RankingMetrics allows us to compare our recommendations with an actual set of ratings (or preferences) expressed by a given user. RankingMetrics does not focus on the value of the rank, but rather on whether or not our algorithm recommends an already ranked item again to a user. This does require some data preparation on our part.
from pyspark.mllib.evaluation import RankingMetrics, RegressionMetrics
from pyspark.sql.functions import col, expr
perUserActual = predictions\
.where("rating > 2.5")\
.groupBy("userId")\
.agg(expr("collect_set(movieId) as movies"))
- At this point, we have a collection of users, along with a truth set of previously ranked movies for each user. Now we will get our top 10 recommendations from our algorithm on a per-user basis.
- We will then check whether the top 10 recommendations show up in our truth set or not.
perUserPredictions = predictions\
.orderBy(col("userId"), col("prediction").desc())\
.groupBy("userId")\
.agg(expr("collect_list(movieId) as movies"))
- Now we have two DataFrames: one with the truth set of movies for each user, and another with our top-ranked predictions per user. We can pass them into the RankingMetrics object.
# RankingMetrics expects (predicted ranking, ground-truth set) pairs
perUserActualvPred = perUserActual.join(perUserPredictions, ["userId"]).rdd\
  .map(lambda row: (row[2][:15], row[1]))
ranks = RankingMetrics(perUserActualvPred)
# now observe the metrics
ranks.meanAveragePrecision
ranks.precisionAt(5)
Frequent Pattern Mining
- Frequent pattern mining, sometimes referred to as market basket analysis, looks at raw data and finds association rules.
- For instance, given a large number of transactions, it might identify that users who buy hot dogs almost always purchase hot dog buns.
- This technique can be applied in the recommendation context, especially when people are filling shopping carts (either online or offline).
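Spark ships an FP-growth implementation in pyspark.ml.fpm; here is a minimal sketch on made-up basket data:
from pyspark.ml.fpm import FPGrowth
# toy transactions; each row is one shopping basket
baskets = spark.createDataFrame([
    (0, ["hot dogs", "hot dog buns", "mustard"]),
    (1, ["hot dogs", "hot dog buns"]),
    (2, ["hot dogs", "soda"]),
], ["id", "items"])
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(baskets)
model.freqItemsets.show()      # frequent itemsets with their frequencies
model.associationRules.show()  # rules such as hot dogs => hot dog buns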