Unsupervised Learning
- Unsupervised learning is used less often than supervised learning because it is usually a little harder to apply and to measure success. These challenges can become exacerbated at scale.
- For instance, clustering in high-dimensional space can create odd clusters simply because of the properties of high-dimensional spaces, something referred to as the curse of dimensionality. https://en.wikipedia.org/wiki/Curse_of_dimensionality
- As dimensionality increases, the features become sparser and the data noisier.
- A model can then home in on the noise instead of the true structure.
- The goal of unsupervised learning is to discover patterns or derive a concise representation of the underlying structure of a given dataset.
Use Cases
- Finding anomalies in data
- Topic modeling, to discover the topics and themes underlying a body of text
Model Scalability
Model | Statistical recommendation | Computation limits | Training examples |
---|---|---|---|
k-means | 50 to 100 maximum | Features x clusters < 10 million | No limit |
Bisecting k-means | 50 to 100 maximum | Features x clusters < 10 million | No limit |
GMM | 50 to 100 maximum | Features x clusters < 10 million | No limit |
LDA | An interpretable number | 1,000s of topics | No limit |
Data we will operate on
from pyspark.ml.feature import VectorAssembler

# assemble the numeric columns into a single "features" vector column
va = VectorAssembler()\
  .setInputCols(["Quantity", "UnitPrice"])\
  .setOutputCol("features")

# load a small sample of the retail data and append the features column
sales = va.transform(spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .limit(50)
  .coalesce(1)
  .where("Description IS NOT NULL"))

sales.cache()
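To sanity-check the assembled data, it can help to peek at a few rows (a quick inspection step, not part of the original example):

sales.show(5, truncate=False)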
k-means
- k-means is one of the most popular clustering algorithms. You choose the number of clusters, k, and the algorithm initializes k centroids. Each data point is then assigned to its nearest centroid by Euclidean distance; once all points are assigned, each centroid is recomputed as the center of the points assigned to it, and the process repeats for a finite number of iterations until convergence. A minimal sketch of this loop appears after this list.
- It is often a good idea to perform multiple runs of k-means starting from different initializations. Choosing the right value of k is important.
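For intuition, here is a minimal single-machine sketch of the loop described above (illustrative only; it assumes NumPy and is not Spark's implementation, which runs distributed and defaults to k-means|| initialization):

import numpy as np

def kmeans(points, k, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Spark's KMeans performs the same assign/update loop, but over a distributed DataFrame of feature vectors.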
Model Hyperparameters
- k: the number of clusters that you would like to end up with
Training Parameters
- initMode: the algorithm that determines the starting points of the centroids. Supported options are random and k-means|| (the default). The latter is a parallelized variant of the k-means++ method.
- initSteps: the number of steps for the k-means|| initialization mode. Must be greater than 0; the default is 2.
- maxIter: the total number of iterations over the data before stopping. Changing this probably won't affect the outcome much; the default is 20.
- tol: the threshold by which changes in centroids show that the model is optimized enough to stop iterating; the default is 0.0001.
Example
from pyspark.ml.clustering import KMeans
km = KMeans().setK(5)
print(km.explainParams())
kmModel = km.fit(sales)
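The training parameters above map to setters on the KMeans estimator. A sketch with the defaults spelled out explicitly (illustrative; the values shown are just the defaults described above):

from pyspark.ml.clustering import KMeans

km = KMeans()\
  .setK(5)\
  .setInitMode("k-means||")\
  .setInitSteps(2)\
  .setMaxIter(20)\
  .setTol(0.0001)
kmModel = km.fit(sales)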
k-Means Metrics Summary
- k-means includes a summary class that we can use to evaluate our model.
- It includes information about the clusters created and their relative sizes. We usually try to minimize the within-cluster sum of squared errors, subject to the given number of clusters k.
summary = kmModel.summary
print(summary.clusterSizes) # number of points in each cluster
kmModel.computeCost(sales) # within-cluster sum of squared errors
centers = kmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
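Newer Spark versions (2.3+) also provide ClusteringEvaluator, which scores a clustering with the silhouette metric and is the recommended replacement for computeCost in recent releases. A sketch, assuming that class is available in your Spark version:

from pyspark.ml.evaluation import ClusteringEvaluator

# silhouette is in [-1, 1]; higher means better-separated clusters
predictions = kmModel.transform(sales)
evaluator = ClusteringEvaluator() # defaults: featuresCol="features", predictionCol="prediction"
print(evaluator.evaluate(predictions))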
Bisecting k-means
- A variant of k-means. The key difference is that instead of clustering points bottom-up into a bunch of different groups, this is a top-down clustering method: it starts with a single group containing all the data and then repeatedly splits it into smaller groups until it ends up with k clusters.
Model Hyperparameters
- k: the number of clusters to end up with
Training Parameters
- minDivisibleClusterSize: the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster. The default is 1, meaning that there must be at least one point in each cluster.
- maxIter: same as above
from pyspark.ml.clustering import BisectingKMeans
bkm = BisectingKMeans().setK(5).setMaxIter(5)
bkmModel = bkm.fit(sales)
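If you want to control the splitting behavior described above, minDivisibleClusterSize has a matching setter. A sketch (the value 1.0 shown here is just the default):

from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans()\
  .setK(5)\
  .setMaxIter(5)\
  .setMinDivisibleClusterSize(1.0)
bkmModel = bkm.fit(sales)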
Bisecting k-means Summary
# in Python
summary = bkmModel.summary
print(summary.clusterSizes) # number of points in each cluster
bkmModel.computeCost(sales) # within-cluster sum of squared errors
centers = bkmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Gaussian Mixture Models
- GMMs are another popular clustering algorithm. A GMM assumes that each cluster is generated from a Gaussian distribution, so points near the center of a cluster are highly probable, while points toward the edge of the cluster are less likely.
Model Hyperparameters
- k: the number of clusters
Training Parameters
- maxIter: same as above
- tol: same as above
Example
from pyspark.ml.clustering import GaussianMixture
gmm = GaussianMixture().setK(5)
print(gmm.explainParams())
model = gmm.fit(sales)
Gaussian Mixture Model Summary
summary = model.summary
print(model.weights) # mixing weight of each Gaussian
model.gaussiansDF.show() # mean and covariance of each Gaussian
summary.cluster.show() # cluster assignment for each point
summary.clusterSizes # number of points per cluster
summary.probability.show() # per-cluster membership probabilities for each point
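The fitted model can also score rows directly; transform adds a hard assignment plus the per-cluster probabilities. A small sketch, reusing the sales DataFrame from above:

# soft cluster assignments for each row
model.transform(sales)\
  .select("features", "prediction", "probability")\
  .show(5, truncate=False)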
Latent Dirichlet Allocation
- Latent Dirichlet Allocation (LDA) is a hierarchical clustering model typically used to perform topic modeling on text documents. LDA tries to extract high-level topics from a series of documents, along with the keywords associated with those topics. It then interprets each document as having a variable number of contributions from multiple input topics.
- There are two implementations of LDA:
  - Online LDA
  - Expectation maximization (generally better suited to a larger input vocabulary)
- To input our text data into LDA, we're going to have to convert it into a numeric format. You can use the CountVectorizer to achieve this.
Model Hyperparameters
- k: the total number of topics to infer from the data
- docConcentration: the concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). This is the parameter of a Dirichlet distribution, where larger values mean more smoothing (more regularization). If not set by the user, docConcentration is set automatically. If set to the singleton vector [alpha], then alpha is replicated to a vector of length k during fitting. Otherwise, the docConcentration vector must have length k.
- topicConcentration: the concentration parameter (commonly named "beta" or "eta") for the prior placed on a topic's distribution over terms. This is the parameter of a symmetric Dirichlet distribution. If not set by the user, topicConcentration is set automatically.
Training Parameters
- maxIter: same as above
- optimizer: determines whether to use EM or online training optimization. The default is online.
- learningDecay: the learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. The default is 0.51 and it only applies to the online optimizer.
- learningOffset: a (positive) learning parameter that downweights early iterations. Larger values make early iterations count less. The default is 1,024.0 and it only applies to the online optimizer.
- optimizeDocConcentration: indicates whether docConcentration (the Dirichlet parameter for the document-topic distribution) will be optimized during training. The default is true and it only applies to the online optimizer.
- subsamplingRate: the fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in the range (0, 1]. The default is 0.05 and it only applies to the online optimizer.
- seed: random seed, for reproducibility
- checkpointInterval: how often to checkpoint during training
Prediction Parameters
- topicDistributionCol: the column that will hold the output of the topic-mixture distribution for each document
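These parameters correspond to setters on the LDA estimator. A sketch that spells the defaults out explicitly (illustrative only; the values shown are just the defaults discussed above):

from pyspark.ml.clustering import LDA

lda = LDA()\
  .setK(10)\
  .setOptimizer("online")\
  .setLearningDecay(0.51)\
  .setLearningOffset(1024.0)\
  .setSubsamplingRate(0.05)\
  .setOptimizeDocConcentration(True)\
  .setTopicDistributionCol("topicDistribution")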
Example
# prepare data
from pyspark.ml.feature import Tokenizer, CountVectorizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.drop("features"))
cv = CountVectorizer()\
.setInputCol("DescOut")\
.setOutputCol("features")\
.setVocabSize(500)\
.setMinTF(0)\
.setMinDF(0)\
.setBinary(True)
cvFitted = cv.fit(tokenized)
prepped = cvFitted.transform(tokenized)
# apply LDA
from pyspark.ml.clustering import LDA
lda = LDA().setK(10).setMaxIter(5)
print(lda.explainParams())
model = lda.fit(prepped)
# describe the topics the model discovered
model.describeTopics(3).show()
cvFitted.vocabulary
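describeTopics reports term indices rather than words, so to read the topics you can map those indices back through the CountVectorizer vocabulary. A small sketch, reusing model and cvFitted from above:

# map each topic's term indices back to actual vocabulary words
vocab = cvFitted.vocabulary
for row in model.describeTopics(3).collect():
    terms = [vocab[i] for i in row.termIndices]
    print(row.topic, terms)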