Unsupervised Learning
- Unsupervised learning is used less often than supervised learning because it is usually a little harder to apply and to measure success. These challenges can become exacerbated at scale.
- For instance, clustering in high-dimensional space can create odd clusters simply because of the properties of high-dimensional spaces, something referred to as the curse of dimensionality. https://en.wikipedia.org/wiki/Curse_of_dimensionality
- As dimensionality increases, the features become sparser and the data noisier.
- A model can then home in on the noise instead of the true structure.
- The goal of unsupervised learning is to discover patterns or derive a concise representation of the underlying structure of a given dataset.
Use Cases
- Finding anomalies in data
- Topic modeling, to discover the topics and themes underlying a body of text
Model Scalability
Model | Statistical recommendation | Computation limits | Training examples |
---|---|---|---|
k-means | 50 to 100 maximum | Features x clusters < 10 million | No limit |
Bisecting k-means | 50 to 100 maximum | Features x clusters < 10 million | No limit |
GMM | 50 to 100 maximum | Features x clusters < 10 million | No limit |
LDA | An interpretable number | 1,000s of topics | No limit |
Data we will operate on
from pyspark.ml.feature import VectorAssembler

# assemble the numeric columns into a single "features" vector column
va = VectorAssembler()\
  .setInputCols(["Quantity", "UnitPrice"])\
  .setOutputCol("features")

# load a small sample of the retail data and append the features column
sales = va.transform(spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .limit(50)
  .coalesce(1)
  .where("Description IS NOT NULL"))

sales.cache()
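To sanity-check the assembled data, it can help to peek at a few rows (a quick inspection step, not part of the original example):

sales.show(5, truncate=False)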
k-means
- k-means is one of the most popular clustering algorithms. You choose the number of clusters, k, and the algorithm initializes k centroids. Each data point is then assigned to its nearest centroid by Euclidean distance; once all points are assigned, each centroid is recomputed as the center of the points assigned to it, and the process repeats for a finite number of iterations until convergence. A minimal sketch of this loop appears after this list.
- It is often a good idea to perform multiple runs of k-means starting from different initializations. Choosing the right value of k is important.
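For intuition, here is a minimal single-machine sketch of the loop described above (illustrative only; it assumes NumPy and is not Spark's implementation, which runs distributed and defaults to k-means|| initialization):

import numpy as np

def kmeans(points, k, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Spark's KMeans performs the same assign/update loop, but over a distributed DataFrame of feature vectors.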
Model Hyperparameters
- k: the number of clusters that you would like to end up with
Training Parameters
- initMode: the algorithm that determines the starting points of the centroids. Supported options are random and k-means|| (the default). The latter is a parallelized variant of the k-means++ method.
- initSteps: the number of steps for the k-means|| initialization mode. Must be greater than 0; the default is 2.
- maxIter: the total number of iterations over the data before stopping. Changing this probably won't affect the outcome much; the default is 20.
- tol: the threshold by which changes in centroids show that the model is optimized enough to stop iterating; the default is 0.0001.
Example
from pyspark.ml.clustering import KMeans
km = KMeans().setK(5)
print(km.explainParams())
kmModel = km.fit(sales)
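The training parameters above map to setters on the KMeans estimator. A sketch with the defaults spelled out explicitly (illustrative; the values shown are just the defaults described above):

from pyspark.ml.clustering import KMeans

km = KMeans()\
  .setK(5)\
  .setInitMode("k-means||")\
  .setInitSteps(2)\
  .setMaxIter(20)\
  .setTol(0.0001)
kmModel = km.fit(sales)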
k-Means Metrics Summary
- k-means includes a summary class that we can use to evaluate our model.
- It includes information about the clusters created and their relative sizes. We usually try to minimize the within-cluster sum of squared errors, subject to the given number of clusters k.
summary = kmModel.summary
print(summary.clusterSizes) # number of points in each cluster
kmModel.computeCost(sales) # within-cluster sum of squared errors
centers = kmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
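Newer Spark versions (2.3+) also provide ClusteringEvaluator, which scores a clustering with the silhouette metric and is the recommended replacement for computeCost in recent releases. A sketch, assuming that class is available in your Spark version:

from pyspark.ml.evaluation import ClusteringEvaluator

# silhouette is in [-1, 1]; higher means better-separated clusters
predictions = kmModel.transform(sales)
evaluator = ClusteringEvaluator() # defaults: featuresCol="features", predictionCol="prediction"
print(evaluator.evaluate(predictions))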
Bisecting k-means
- A variant of k-means. The key difference is that instead of clustering points bottom-up into a bunch of different groups, this is a top-down clustering method: it starts with a single group containing all the data and then repeatedly splits it into smaller groups until it ends up with k clusters.
Model Hyperparameters
- k: the number of clusters to end up with
Training Parameters
- minDivisibleClusterSize: the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster. The default is 1, meaning that there must be at least one point in each cluster.
- maxIter: same as above
from pyspark.ml.clustering import BisectingKMeans
bkm = BisectingKMeans().setK(5).setMaxIter(5)
bkmModel = bkm.fit(sales)
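If you want to control the splitting behavior described above, minDivisibleClusterSize has a matching setter. A sketch (the value 1.0 shown here is just the default):

from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans()\
  .setK(5)\
  .setMaxIter(5)\
  .setMinDivisibleClusterSize(1.0)
bkmModel = bkm.fit(sales)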
Bisecting k-means Summary
# in Python
summary = bkmModel.summary
print(summary.clusterSizes) # number of points in each cluster
bkmModel.computeCost(sales) # within-cluster sum of squared errors
centers = bkmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Gaussian Mixture Models
- GMMs are another popular clustering algorithm. A GMM assumes that each cluster is generated from a Gaussian distribution, so points near the center of a cluster are highly probable, while points toward the edge of the cluster are less likely.
Model Hyperparameters
- k: the number of clusters
Training Parameters
- maxIter: same as above
- tol: same as above
Example
from pyspark.ml.clustering import GaussianMixture
gmm = GaussianMixture().setK(5)
print(gmm.explainParams())
model = gmm.fit(sales)
Gaussian Mixture Model Summary
summary = model.summary
print(model.weights) # mixing weight of each Gaussian
model.gaussiansDF.show() # mean and covariance of each Gaussian
summary.cluster.show() # cluster assignment for each point
summary.clusterSizes # number of points per cluster
summary.probability.show() # per-cluster membership probabilities for each point
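The fitted model can also score rows directly; transform adds a hard assignment plus the per-cluster probabilities. A small sketch, reusing the sales DataFrame from above:

# soft cluster assignments for each row
model.transform(sales)\
  .select("features", "prediction", "probability")\
  .show(5, truncate=False)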
Latent Dirichlet Allocation
- Latent Dirichlet Allocation (LDA) is a hierarchical clustering model typically used to perform topic modeling on text documents. LDA tries to extract high-level topics from a series of documents, along with the keywords associated with those topics. It then interprets each document as having a variable number of contributions from multiple input topics.
- There are two implementations of LDA:
  - Online LDA
  - Expectation maximization (generally better suited to a larger input vocabulary)
- To input our text data into LDA, we're going to have to convert it into a numeric format. You can use the CountVectorizer to achieve this.
Model Hyperparameters
- k: the total number of topics to infer from the data
- docConcentration: the concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). This is the parameter of a Dirichlet distribution, where larger values mean more smoothing (more regularization). If not set by the user, docConcentration is set automatically. If set to the singleton vector [alpha], then alpha is replicated to a vector of length k during fitting. Otherwise, the docConcentration vector must have length k.
- topicConcentration: the concentration parameter (commonly named "beta" or "eta") for the prior placed on a topic's distribution over terms. This is the parameter of a symmetric Dirichlet distribution. If not set by the user, topicConcentration is set automatically.
Training Parameters
- maxIter: same as above
- optimizer: determines whether to use EM or online training optimization. The default is online.
- learningDecay: the learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. The default is 0.51 and it only applies to the online optimizer.
- learningOffset: a (positive) learning parameter that downweights early iterations. Larger values make early iterations count less. The default is 1,024.0 and it only applies to the online optimizer.
- optimizeDocConcentration: indicates whether docConcentration (the Dirichlet parameter for the document-topic distribution) will be optimized during training. The default is true and it only applies to the online optimizer.
- subsamplingRate: the fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in the range (0, 1]. The default is 0.05 and it only applies to the online optimizer.
- seed: random seed, for reproducibility
- checkpointInterval: how often to checkpoint during training
Prediction Parameters
- topicDistributionCol: the column that will hold the output of the topic-mixture distribution for each document
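These parameters correspond to setters on the LDA estimator. A sketch that spells the defaults out explicitly (illustrative only; the values shown are just the defaults discussed above):

from pyspark.ml.clustering import LDA

lda = LDA()\
  .setK(10)\
  .setOptimizer("online")\
  .setLearningDecay(0.51)\
  .setLearningOffset(1024.0)\
  .setSubsamplingRate(0.05)\
  .setOptimizeDocConcentration(True)\
  .setTopicDistributionCol("topicDistribution")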
Example
# prepare data
from pyspark.ml.feature import Tokenizer, CountVectorizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.drop("features"))
cv = CountVectorizer()\
.setInputCol("DescOut")\
.setOutputCol("features")\
.setVocabSize(500)\
.setMinTF(0)\
.setMinDF(0)\
.setBinary(True)
cvFitted = cv.fit(tokenized)
prepped = cvFitted.transform(tokenized)
# apply LDA
from pyspark.ml.clustering import LDA
lda = LDA().setK(10).setMaxIter(5)
print(lda.explainParams())
model = lda.fit(prepped)
# describe the topics the model discovered
model.describeTopics(3).show()
cvFitted.vocabulary
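describeTopics reports term indices rather than words, so to read the topics you can map those indices back through the CountVectorizer vocabulary. A small sketch, reusing model and cvFitted from above:

# map each topic's term indices back to actual vocabulary words
vocab = cvFitted.vocabulary
for row in model.describeTopics(3).collect():
    terms = [vocab[i] for i in row.termIndices]
    print(row.topic, terms)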