January 24, 2019 by Giuseppe Vettigli

A Visual Introduction to Gap Statistics

We have previously seen how to implement K-Means. However, the results of this algorithm depend strongly on the choice of the parameter K. In this post, we will see how to use Gap Statistics to pick K. The main idea of the methodology is to compare the cluster inertia obtained on the data we want to cluster with the inertia obtained on a reference dataset. The optimal choice of K is the value k for which the gap between the two results is largest. To illustrate this idea, let’s pick a uniformly distributed set of points as the reference dataset and see the result of K-Means as K increases:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans


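# reference dataset: 100 points drawn uniformly from the unit square
reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1, 6):
    # cluster the reference points into k groups
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(reference)
    # one panel per value of k, points colored by assigned cluster
    plt.subplot(1, 5, k)
    plt.scatter(reference[:, 0], reference[:, 1], c=a)
    plt.xlabel('k=' + str(k))
plt.tight_layout()
plt.show()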
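
The comparison described above, between inertia on the data and inertia on a uniform reference, can be sketched in a few lines. This is only a minimal illustration under some assumptions that are not taken from the post: the helper name `inertia_gaps` is hypothetical, inertia is read from scikit-learn's `inertia_` attribute, the gap is taken as a difference of logarithms, the reference points are drawn uniformly over the bounding box of the data, and the test data comes from `make_blobs` with three centers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_gaps(data, max_k=5, seed=0):
    """Hypothetical helper: gap between reference and data inertia for k = 1..max_k."""
    rng = np.random.default_rng(seed)
    # Assumption: the reference set is uniform over the bounding box of the data
    lo, hi = data.min(axis=0), data.max(axis=0)
    reference = rng.uniform(lo, hi, size=data.shape)

    gaps = []
    for k in range(1, max_k + 1):
        # Fit K-Means separately on the reference set and on the real data
        ref_fit = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(reference)
        data_fit = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        # Assumption: the gap is measured on a log scale
        gaps.append(np.log(ref_fit.inertia_) - np.log(data_fit.inertia_))
    return np.array(gaps)


# Synthetic data with three blobs, used only to exercise the helper
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
gaps = inertia_gaps(X)
best_k = int(np.argmax(gaps)) + 1  # k values start at 1
print("gaps:", gaps)
print("k with the largest gap:", best_k)
```

A single random reference set makes the gap noisy, so in practice one would likely average the reference inertia over several draws before comparing.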
