Recommendation
- By observing people's explicit or implicit preferences, we can predict what a user may like based on similarity among users.
Use Cases
- Movie recommendations by Netflix, Amazon, HBO, etc.
- Course Recommendations
In Spark, the Alternating Least Squares (ALS) algorithm uses a technique called collaborative filtering, which makes recommendations based only on which items users interacted with in the past.
It doesn't require any additional features about the users or the items. Spark supports several ALS variants.
Collaborative Filtering with Alternating Least Squares
- ALS finds a k-dimensional feature vector for each user and item such that the dot product of the two approximates the user's rating for that item (see the objective sketched after this list).
- It only requires an input dataset of existing ratings between user-item pairs, with three columns: a user ID column, an item ID column, and a rating column.
- NOTE: the rating can be explicit (a numerical rating we aim to predict directly) or implicit (based on the strength of user interactions).
- ALS suffers from the cold start problem common to recommendation systems. https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)
- In terms of scaling, it scales to millions and even billions of ratings.
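A sketch of the underlying objective may help (this is the standard explicit-feedback ALS formulation; the notation is mine, not from these notes). With a user vector $u_u \in \mathbb{R}^k$ and an item vector $v_i \in \mathbb{R}^k$, ALS minimizes

$$\min_{U, V} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - u_u^\top v_i \right)^2 + \lambda \left( \lVert u_u \rVert^2 + \lVert v_i \rVert^2 \right)$$

where $r_{ui}$ is the known rating and $\lambda$ plays the role of regParam. Holding the item vectors fixed turns this into an ordinary least-squares problem for the user vectors (and vice versa), so the algorithm simply alternates between the two, which is what makes it easy to parallelize.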
Model Hyperparameters
rank
(default 10): dimension of the feature vectors learned for users and items. This is tuned through experimentation; a higher rank lets the model fit the data more closely, at the risk of overfitting.
alpha
: when training on implicit feedback, alpha sets a baseline confidence for preference. Default is 1; it should be tuned through experimentation.
regParam
: controls regularization to prevent overfitting. Default 0.1.
implicitPrefs
: Boolean value; whether to train on implicit data or not. Default is false.
nonnegative
: Boolean value; configures the model to place a non-negativity constraint on the least-squares problem and return only non-negative feature vectors. Default is false.
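A minimal sketch of how these map onto the ALS estimator in pyspark.ml (the specific values and column names are illustrative assumptions, not recommendations):
from pyspark.ml.recommendation import ALS
# hypothetical implicit-feedback configuration; tune rank, regParam,
# and alpha through experimentation as noted above
als_implicit = ALS(
    rank=20,              # dimension of the learned feature vectors
    regParam=0.1,         # regularization strength
    implicitPrefs=True,   # treat the rating column as implicit feedback
    alpha=40.0,           # baseline confidence for implicit preferences
    nonnegative=True,     # constrain feature vectors to be non-negative
    userCol="userId", itemCol="movieId", ratingCol="rating")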
Training Parameters
The groups of data that are distributed around the cluster are called blocks. Determining how much data to place in each block can have a significant impact on the time it takes to train the algorithm.
A good rule of thumb is to aim for approximately one to five million ratings per block.
numUserBlocks
: determines how many blocks to split the users into. Default 10.
numItemBlocks
: determines how many blocks to split the items into. Default 10.
maxIter
: total number of iterations over the data before stopping. Changing this probably won't change results a ton. Default 10.
checkpointInterval
: checkpointing allows you to save model state during training to recover more quickly from node failures. You can set a checkpoint directory using SparkContext.setCheckpointDir.
seed
: specifying a random seed can help you replicate your results.
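In code, these look like the following (a sketch: the block counts, checkpoint path, and seed are assumptions for illustration, and an existing SparkSession named spark is assumed):
from pyspark.ml.recommendation import ALS
spark.sparkContext.setCheckpointDir("/tmp/als_checkpoints")  # hypothetical path
als_large = ALS(
    numUserBlocks=20,      # aim for roughly 1-5 million ratings per block
    numItemBlocks=20,
    maxIter=10,            # iterations over the data
    checkpointInterval=2,  # save model state every 2 iterations
    seed=42,               # fixed seed so results can be replicated
    userCol="userId", itemCol="movieId", ratingCol="rating")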
Prediction Parameters
- defines how a trained model should actually make predictions
- There is one parameter: the cold start strategy (coldStartStrategy), which determines how the model should predict for users or items that did not appear in the training set.
- The cold start challenge commonly arises when you're serving a model in production and new users and/or items have no ratings history, so the model has no recommendation to make. It can also occur when using simple random splits as in Spark's CrossValidator or TrainValidationSplit, where it is very common to encounter users and/or items in the evaluation set that are not in the training set.
- By default, Spark will assign NaN prediction values when it encounters a user and/or item that is not present in the model. This can be useful because you can design your overall system to fall back to some default recommendation when a new user or item enters the system. However, it is undesirable during training because it ruins your evaluator's ability to properly measure the success of your model, making model selection impossible. Spark allows users to set the coldStartStrategy parameter to drop in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will be valid. drop and nan (the default) are the only currently supported cold-start strategies, as sketched below.
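For instance (a minimal sketch; the column names are carried over from the example below):
from pyspark.ml.recommendation import ALS
# drop NaN predictions during training/evaluation so metrics stay valid
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
# in production you might switch back, so unseen users surface as NaN
# and your serving layer can fall back to a default recommendation
als.setColdStartStrategy("nan")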
Example
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
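# each line of the ratings file has the form userId::movieId::rating::timestamp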
ratings = spark.read.text("/data/sample_movielens_ratings.txt")\
.rdd.toDF()\
.selectExpr("split(value , '::') as col")\
.selectExpr(
"cast(col[0] as int) as userId",
"cast(col[1] as int) as movieId",
"cast(col[2] as float) as rating",
"cast(col[3] as long) as timestamp")
training, test = ratings.randomSplit([0.8, 0.2])
als = ALS()\
.setMaxIter(5)\
.setRegParam(0.01)\
.setUserCol("userId")\
.setItemCol("movieId")\
.setRatingCol("rating")
print(als.explainParams())
alsModel = als.fit(training)
predictions = alsModel.transform(test)
# output the top k recommendations for each user or movie
alsModel.recommendForAllUsers(10)\
.selectExpr("userId", "explode(recommendations)").show()
alsModel.recommendForAllItems(10)\
.selectExpr("movieId", "explode(recommendations)").show()
Evaluators for Recommendation
- When using the cold-start strategy above, we can set up an automatic model evaluator to work with ALS.
- The recommendation problem is really just a kind of regression problem: since we are predicting ratings for given users, we want to minimize the total difference between our users' ratings and the true values.
- We can use the same RegressionEvaluator and place it in a pipeline to automate the training process (a sketch follows the snippet below). When doing so, we should set the cold-start strategy to drop instead of nan, and then switch back to nan when it comes time to actually make predictions in production systems.
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator()\
.setMetricName("rmse")\
.setLabelCol("rating")\
.setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = %f" % rmse)
Metrics
- Recommendation results can be measured using both standard regression metrics and some recommendation-specific metrics.
Regression Metrics
from pyspark.mllib.evaluation import RegressionMetrics
# RegressionMetrics expects an RDD of (prediction, observation) pairs
regComparison = predictions.select("prediction", "rating")\
  .rdd.map(lambda x: (x[0], x[1]))
metrics = RegressionMetrics(regComparison)
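The resulting object exposes the usual regression measures, for example:
print("RMSE = %f" % metrics.rootMeanSquaredError)
print("MAE = %f" % metrics.meanAbsoluteError)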
Ranking Metrics
A RankingMetrics allows us to compare our recommendations with an actual set of ratings (or preferences) expressed by a given user. RankingMetrics does not focus on the value of the rank, but rather on whether or not our algorithm recommends an already ranked item again to a user. This does require some data preparation on our part.
from pyspark.mllib.evaluation import RankingMetrics, RegressionMetrics
from pyspark.sql.functions import col, expr
perUserActual = predictions\
.where("rating > 2.5")\
.groupBy("userId")\
.agg(expr("collect_set(movieId) as movies"))
- At this point, we have a collection of users, along with a truth set of previously ranked movies for each user. Now we will get our top 10 recommendations from our algorithm on a per-user basis.
- We will then check whether the top 10 recommendations show up in our truth set or not.
perUserPredictions = predictions\
.orderBy(col("userId"), col("prediction").desc())\
.groupBy("userId")\
.agg(expr("collect_list(movieId) as movies"))
- Now we have two DataFrames: one with the truth set of movies for each user, and another with our top-ranked predictions per user. We can pass them into the RankingMetrics object.
# RankingMetrics expects (predicted ranking, ground-truth set) pairs
perUserActualvPred = perUserActual.join(perUserPredictions, ["userId"]).rdd\
  .map(lambda row: (row[2][:15], row[1]))
ranks = RankingMetrics(perUserActualvPred)
# now observe the metrics
ranks.meanAveragePrecision
ranks.precisionAt(5)
Frequent Pattern Mining
- Frequent pattern mining, sometimes referred to as market basket analysis, looks at raw data and finds association rules.
- For instance, given a large number of transactions, it might identify that users who buy hot dogs almost always purchase hot dog buns.
- This technique can be applied in the recommendation context, especially when people are filling shopping carts (either online or offline).
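Spark ships an FP-growth implementation in pyspark.ml.fpm; here is a minimal sketch on made-up basket data:
from pyspark.ml.fpm import FPGrowth
# toy transactions; each row is one shopping basket
baskets = spark.createDataFrame([
    (0, ["hot dogs", "hot dog buns", "mustard"]),
    (1, ["hot dogs", "hot dog buns"]),
    (2, ["hot dogs", "soda"]),
], ["id", "items"])
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(baskets)
model.freqItemsets.show()      # frequent itemsets with their frequencies
model.associationRules.show()  # rules such as hot dogs => hot dog buns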