Prof. Frenzel
6 min read · Dec 26, 2022


Dear friends!

Have you ever had a dataset with so many points that it was hard to make sense of it all? Data clustering may be the solution you’re looking for. By grouping similar data points together, clustering algorithms help to uncover the underlying structure of your data and reveal patterns and relationships that may have been hidden before. 👋Melissa Summers and 👋I will take a first look at what data clustering is, the four main types, and how they can be used to analyze and understand complex datasets.

What is Clustering?

Clustering is a technique for analyzing and understanding datasets that come with no predefined labels or categories. The absence of labels can make it challenging to conduct Exploratory Data Analysis (EDA) and to draw conclusions about the relationships within the data. Clustering is a useful tool in these situations because it lets us group similar data points together, uncover patterns, and structure the data for the next steps in our analytical process. Clustering algorithms are a type of unsupervised machine-learning technique: they identify patterns and group similar data points automatically, without human-provided labels. There are many different clustering algorithms, but the following four are among the most commonly used.

Hierarchical Clustering

Hierarchical clustering is a type of clustering algorithm that groups data points based on their similarity by building a hierarchy of clusters. This can be done bottom-up (agglomerative: every point starts as its own cluster and the closest clusters are merged step by step) or top-down (divisive: one all-encompassing cluster is successively split until each data point stands alone). The approach is very useful for identifying clusters of varied shapes and for visualizing the relationships between data points, but it may not be as efficient as other clustering methods for large datasets. The main output is a dendrogram, a tree diagram that depicts every nested clustering found in the dataset. This richness can be either advantageous or disadvantageous, depending on the context. The full hierarchy can take a significant amount of time to compute and may be excessively detailed for some analyses; on the other hand, it is exactly what you want in situations where a high level of granularity is desired.
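
To make this concrete, here is a minimal sketch of agglomerative hierarchical clustering in Python using SciPy. The two synthetic blobs are made up purely for illustration, and Ward linkage is just one of several linkage choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical data: two well-separated 2-D blobs (illustration only)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

# Build the full hierarchy bottom-up (Ward linkage merges the closest clusters)
Z = linkage(X, method="ward")

# The dendrogram depicts every nested clustering in the hierarchy
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()

# Cut the tree into a flat clustering, e.g. at two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```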

Centroid-based Clustering

Centroid-based clustering organizes the data into non-hierarchical clusters by iteratively measuring the distance between each data point and the nearest cluster center, or centroid. The most popular tool for centroid-based clustering is the K-means algorithm, which categorizes each data point into precisely one cluster. The algorithm requires that the user input a number of clusters, “K,” for the data to be segmented into. For example, say you specify K = 3: the computer places three initial centroids (typically at random) and assigns each data point to its closest centroid. Each centroid is then recalculated as the mean of the data points assigned to it, and the points are reassigned. This iteration continues until the assignments stop shifting and every data point is fixed within its respective cluster. However, due to the model’s focus on the centroid, the borders around a cluster can have trouble dealing with outliers. Further, the assignment is all-or-nothing: if a data point is located close to a border and could plausibly belong to two clusters, there is no way to see this ambiguity in the output.
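
A minimal K-means sketch with scikit-learn is shown below; the three synthetic blobs and the choice of K = 3 are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three loose 2-D blobs (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.6, size=(50, 2))
    for center in ([0, 0], [6, 0], [3, 5])
])

# K must be chosen up front; K-means then alternates between assigning
# each point to its nearest centroid and recomputing each centroid as
# the mean of its assigned points, until the assignments stop changing.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])      # hard assignment: exactly one cluster per point
print(kmeans.cluster_centers_)  # final centroid positions
```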

Expectation-maximization (EM) Clustering

This type of clustering expands on the centroid-based model we just discussed, but handles borderline points more gracefully by allowing a data point to belong to more than one cluster at once, each with a probability. The main tool used for this algorithm is the Gaussian Mixture Model (GMM), which outputs a probability distribution over the clusters for every data point. This output lets the user see how each cluster is (approximately) normally distributed as well as where the clusters overlap, which makes it the better tool to kickstart in-depth analysis, as it raises questions about the interrelatedness of the data. EM clustering is also particularly well-suited for datasets with hidden or missing data, as it can estimate the missing values during the clustering process. However, problems may arise if the dataset does not follow a Gaussian (normal) distribution, and, as with K-means, the user must specify the number of clusters in advance.
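
The soft assignments are easy to see with scikit-learn’s GaussianMixture; the two overlapping blobs below are synthetic and the two-component choice is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two overlapping Gaussian blobs (illustration only)
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[2.5, 0.0], scale=1.0, size=(100, 2)),
])

# Fit a two-component Gaussian Mixture Model via expectation-maximization
gmm = GaussianMixture(n_components=2, random_state=1).fit(X)

# Unlike K-means, every point gets a probability of belonging to each
# cluster, which makes the overlap between clusters visible.
probs = gmm.predict_proba(X)
print(probs[:5])  # a border point might show something like [0.55, 0.45]
```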

Density-based Clustering

Density-based clustering is a favorite among data scientists because it discovers the number of clusters automatically and is robust to noise. The main idea of this model is to divide the dataset into clusters based on the ε parameter, known as the “neighborhood” distance, together with a minimum number of neighbors (often called minPts). Simply put, if enough data points lie within the circular radius ε of one another, they form a dense region and are grouped into a cluster; a data point too far from any dense region is classified as noise. The main tool used for this model is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). For all its strengths, if the clusters in the data vary widely in density, identifying a single ε parameter becomes tricky and can lead to poor results.
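
A minimal DBSCAN sketch with scikit-learn follows; the blob-plus-noise data and the parameter values eps=0.5 and min_samples=5 are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense blobs plus scattered noise (illustration only)
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=6, size=(10, 2)),  # sparse noise points
])

# eps is the neighborhood radius; min_samples is the minimum number of
# neighbors a point needs to anchor a dense region
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# The number of clusters is discovered from the data; noise is labeled -1
print(set(db.labels_))
```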

👣Example

To illustrate clustering, think about how companies use personalized ads to enhance their marketing strategies. A clothing retailer could use clustering algorithms to group customers into different segments. To do so, the retailer would first collect data on the characteristics of its customers, such as their age, gender, location, income, and purchasing history. It might also gather data on customers’ preferences and interests, such as their favorite brands or types of clothing.

Next, the retailer would use a clustering algorithm to analyze this data and identify groups of customers with similar characteristics. For example, the algorithm might identify the following three clusters of customers:

  • Cluster 1: Young, fashion-conscious women who live in urban areas and have high incomes. These customers frequently purchase high-end clothing and accessories from designer brands, and are interested in trendy, fashionable styles.
  • Cluster 2: Older, budget-conscious men who live in suburban areas and have moderate incomes. These customers mostly purchase affordable, functional clothing, and are interested in practical, comfortable styles.
  • Cluster 3: Younger, budget-conscious women who live in rural areas and have low incomes. These customers mostly purchase affordable clothing, and are interested in practical, comfortable styles.

The retailer could then use this information to develop targeted marketing campaigns and personalized recommendations for each cluster. For instance, the marketing team might send promotional emails or ads featuring fashionable, high-end clothing and accessories to the young, fashion-conscious women in cluster 1, and send emails or ads featuring practical, affordable clothing to the older, budget-conscious men in cluster 2 and the younger, budget-conscious women in cluster 3.

This approach can be more effective than a one-size-fits-all marketing strategy, as it allows the retailer to tailor their efforts to the specific needs and preferences of each customer cluster.
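
To tie the example together, here is a sketch of what such a segmentation might look like in code. The customer table, its column names, and the feature choices are entirely hypothetical, and a real pipeline would also need to encode categorical attributes such as gender and location before clustering:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer data; values and column names are made up
customers = pd.DataFrame({
    "age":             [24, 58, 31, 62, 22, 45],
    "annual_income":   [95_000, 48_000, 27_000, 52_000, 24_000, 88_000],
    "avg_order_value": [180, 45, 30, 50, 28, 160],
})

# Scale the features so income does not dominate the distance calculation
X = StandardScaler().fit_transform(customers)

# Segment into three clusters, mirroring the three segments above
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(customers)
```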

Clustering algorithms are a powerful tool for analyzing and understanding complex datasets, but they are not meant to be a single answer to your questions. Rather, they act as a catalyst for more insightful questions about your data and are a great first step when beginning EDA on unlabelled datasets. Choosing the right algorithm to cluster your data is half the battle, and this depends on the context of your research and the type of data you are presented with. Ultimately, you will end up using many different clustering algorithms on your EDA journey to leverage each of their advantages, as all of them can lead you to new insights.
