
Unsupervised Learning

  • Unsupervised learning is used less often because it is usually a little harder to apply and to measure success. These challenges can become exacerbated at scale.
  • For instance, clustering in high-dimensional space can create odd clusters simply because of the properties of high-dimensional spaces, something referred to as the curse of dimensionality. https://en.wikipedia.org/wiki/Curse_of_dimensionality
  • As dimensionality increases, the feature space becomes sparser and the data noisier (see the sketch below).
  • A model can then home in on the noise instead of the true structure.
  • Unsupervised learning tries to discover patterns in, or derive a concise representation of, the underlying structure of a given dataset.
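A minimal illustration (plain NumPy, not Spark) of why clustering in very high dimensions gets hard: as the number of dimensions grows, the gap between the nearest and farthest neighbour shrinks, so Euclidean proximity carries less information.

import numpy as np

rng = np.random.default_rng(42)
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))                          # 1,000 random points in d dimensions
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances from the first point
    print(d, dists.min() / dists.max())                     # ratio creeps toward 1 as d grows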

Use Cases

  • Finding anomalies in data
  • Topic modelling, to discover the topics (and their context) underlying a body of text

Model Scalability

| Model             | Statistical recommendation | Computation limits               | Training examples |
|-------------------|----------------------------|----------------------------------|-------------------|
| k-means           | 50 to 100 maximum          | Features x clusters < 10 million | No limit          |
| Bisecting k-means | 50 to 100 maximum          | Features x clusters < 10 million | No limit          |
| GMM               | 50 to 100 maximum          | Features x clusters < 10 million | No limit          |
| LDA               | An interpretable number    | 1,000s of topics                 | No limit          |
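A small hedged helper (not part of MLlib; the function name is made up for illustration) to sanity-check the computation-limit guideline from the table before launching a large clustering job:

def within_compute_limit(num_features, num_clusters, limit=10_000_000):
    """Return True if features x clusters stays under the recommended limit."""
    return num_features * num_clusters < limit

print(within_compute_limit(num_features=2, num_clusters=5))  # True for the small retail example below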

Data we will operate on

from pyspark.ml.feature import VectorAssembler
va = VectorAssembler()\
  .setInputCols(["Quantity", "UnitPrice"])\
  .setOutputCol("features")

sales = va.transform(spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .limit(50)
  .coalesce(1)
  .where("Description IS NOT NULL"))

sales.cache()
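A quick look at the assembled training data (assuming the CSVs contain the Description, Quantity, and UnitPrice columns referenced above):

sales.select("Description", "Quantity", "UnitPrice", "features").show(5, truncate=False)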

k-means

  • One of the most popular clustering algorithms. We start with k cluster centers (centroids), and each data point is assigned to the centroid it is closest to by Euclidean distance.
  • Once all points have been assigned to a centroid, each centroid is recomputed as the center of its assigned points. We repeat this process for a finite number of iterations or until convergence.
  • It is often a good idea to perform multiple runs of k-means starting from different initializations. Choosing the right value for k is important (see the sketch below).
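A hedged sketch of comparing a few candidate values of k on the sales DataFrame prepared above, using the within-cluster sum of squared errors reported by computeCost (the same method used in the summary section below); the candidate values are arbitrary:

from pyspark.ml.clustering import KMeans

for k in [2, 5, 10, 20]:
    candidate = KMeans().setK(k).setSeed(1).fit(sales)
    print(k, candidate.computeCost(sales))  # within-cluster sum of squared errors; look for an "elbow"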

Model Hyperparameters

  • k : number of clusters that you would like to end up with.

Training Parameters

  • initMode : the algorithm that determines the starting positions of the centroids. Supported options are random and k-means|| (the default). The latter is a parallelized variant of the k-means++ initialization method.
  • initSteps : the number of steps for the k-means|| initialization mode. Must be greater than 0; the default is 2.
  • maxIter : the total number of iterations over the data before stopping. Changing this probably doesn't affect the outcome too much. The default is 20.
  • tol : the threshold by which changes in centroids show that we have optimized the model enough to stop iterating. The default is 0.0001.

Example

from pyspark.ml.clustering import KMeans
km = KMeans().setK(5)
print(km.explainParams())
kmModel = km.fit(sales)
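The example above only sets k; the training parameters from the previous list can be set the same way. A hedged sketch (kmTuned is a hypothetical name, and the values shown match the defaults listed above):

kmTuned = KMeans()\
  .setK(5)\
  .setInitMode("k-means||")\
  .setInitSteps(2)\
  .setMaxIter(20)\
  .setTol(1e-4)
kmTunedModel = kmTuned.fit(sales)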

k-Means Metrics Summary

  • k-means includes a summary class that we can use to evaluate our model.
  • It includes information about the clusters created and their relative sizes. We usually try to minimize the within-cluster sum of squared errors, subject to the given number of clusters k.
summary = kmModel.summary
print(summary.clusterSizes) # number of points
kmModel.computeCost(sales)
centers = kmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
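Note that computeCost is deprecated in newer Spark versions; since Spark 2.3, ClusteringEvaluator offers a silhouette score instead. A hedged sketch using the fitted model above:

from pyspark.ml.evaluation import ClusteringEvaluator

predictions = kmModel.transform(sales)  # adds a "prediction" column
evaluator = ClusteringEvaluator()       # defaults: featuresCol="features", predictionCol="prediction"
print(evaluator.evaluate(predictions))  # silhouette; closer to 1.0 means better-separated clusters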

Bisecting k-means

  • A variant of k-means. The difference is that instead of building clusters bottom-up by assigning points to a bunch of different groups, this is a top-down clustering method.
  • It starts with a single group and then splits the data into smaller and smaller groups until it ends up with k clusters.

Model Hyperparameters

  • k : number of clusters to end up with

Training Parameters

  • minDivisibleClusterSize : the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) required for a cluster to be divisible. The default is 1.0, meaning there must be at least one point in each cluster.
  • maxIter : same as above
from pyspark.ml.clustering import BisectingKMeans
bkm = BisectingKMeans().setK(5).setMaxIter(5)
bkmModel = bkm.fit(sales)

Bisecting k-means Summary

summary = bkmModel.summary
print(summary.clusterSizes) # number of points
bkmModel.computeCost(sales)
centers = bkmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Gaussian Mixture Models

  • Gaussian mixture models (GMMs) are another popular clustering algorithm. A GMM assumes that each cluster generates data from its own Gaussian distribution, so a cluster should be less likely to have data at its edges and much more likely to have data near its center.

Model Hyperparameters

  • k : number of clusters

Training Parameters

  • maxIter
  • tol

Example

from pyspark.ml.clustering import GaussianMixture
gmm = GaussianMixture().setK(5)
print(gmm.explainParams())
model = gmm.fit(sales)
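Unlike k-means, a GMM makes soft assignments. A hedged sketch of inspecting them (prediction and probability are the default output column names):

model.transform(sales).select("prediction", "probability").show(5, truncate=False)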

Gaussian Mixture Model Summary

summary = model.summary
print(model.weights)
model.gaussiansDF.show()
summary.cluster.show()
summary.clusterSizes
summary.probability.show()

Latent Dirichlet Allocation

  • Latent Dirichlet Allocation (LDA) is a hierarchical clustering model typically used to perform topic modelling on text documents. LDA tries to extract high-level topics from a series of documents, along with the keywords associated with those topics. It then interprets each document as having a variable number of contributions from multiple input topics.
  • There are two implementations of LDA:
    • Online LDA
    • Expectation Maximization (used when there is a large input vocabulary)
  • To input our text data into LDA, we have to convert it into a numeric format. You can use the CountVectorizer to achieve this.

Model Hyperparameters

  • k : the number of topics (clusters) to extract.
  • docConcentration
    • Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
    • If not set by the user, then docConcentration is set automatically. If set to a singleton vector [alpha], then alpha is replicated to a vector of length k during fitting. Otherwise, the docConcentration vector must have length k.
  • topicConcentration : The concentration parameter (commonly named "beta" or "eta") for the prior placed on a topic's distribution over terms. This is the parameter to a symmetric Dirichlet distribution. If not set by the user, then topicConcentration is set automatically (see the sketch after this list).
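A hedged sketch of setting both priors explicitly; a singleton docConcentration value is replicated to length k as described above, and 0.1 is an arbitrary illustrative value:

from pyspark.ml.clustering import LDA

lda = LDA()\
  .setK(10)\
  .setDocConcentration([0.1])\
  .setTopicConcentration(0.1)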

Training Parameters

  • maxIter
  • optimizer : Determines whether to use EM or online optimization for training. The default is online (see the sketch after this list).
  • learningDecay : Learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. The default is 0.51 and only applies to the online optimizer.
  • learningOffset : A (positive) learning parameter that downweights early iterations. Larger values make early iterations count less. The default is 1,024.0 and only applies to the online optimizer.
  • optimizeDocConcentration : Indicates whether the docConcentration (Dirichlet parameter for the document-topic distribution) will be optimized during training. The default is true and only applies to the online optimizer.
  • subsamplingRate : The fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1]. The default is 0.05 and only applies to the online optimizer.
  • seed
  • checkpointInterval
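A hedged sketch of configuring the online optimizer with the parameters above (the values shown match the documented defaults):

from pyspark.ml.clustering import LDA

lda = LDA()\
  .setK(10)\
  .setOptimizer("online")\
  .setLearningDecay(0.51)\
  .setLearningOffset(1024.0)\
  .setOptimizeDocConcentration(True)\
  .setSubsamplingRate(0.05)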

Prediction Parameters

  • topicDistributionCol : The column that will hold the output of the topic mixture distribution for each document.

Example

# prepare data
from pyspark.ml.feature import Tokenizer, CountVectorizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.drop("features"))
cv = CountVectorizer()\
  .setInputCol("DescOut")\
  .setOutputCol("features")\
  .setVocabSize(500)\
  .setMinTF(0)\
  .setMinDF(0)\
  .setBinary(True)
cvFitted = cv.fit(tokenized)
prepped = cvFitted.transform(tokenized)

# apply LDA
from pyspark.ml.clustering import LDA
lda = LDA().setK(10).setMaxIter(5)
print(lda.explainParams())
model = lda.fit(prepped)

# describing models
model.describeTopics(3).show()
cvFitted.vocabulary
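A hedged follow-up to the example: map the term indices from describeTopics() back to words using the CountVectorizer vocabulary, and inspect each document's topic mixture (topicDistributionCol defaults to "topicDistribution"):

vocab = cvFitted.vocabulary
for row in model.describeTopics(3).collect():
    print(row.topic, [vocab[i] for i in row.termIndices])

model.transform(prepped).select("topicDistribution").show(3, truncate=False)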