K-means

K-means in a nutshell:
assign points to their nearest cluster centre, then refit the cluster centres.

Both steps can be derived from optimizing a single objective function:

$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} \varepsilon_{ik} \, \|x_i - \mu_k\|^2$$

Binary indicator:

$$\varepsilon_{ik} = \begin{cases} 1 & \text{if } x_i \text{ belongs to cluster } k \\ 0 & \text{otherwise} \end{cases}$$

Basically $\varepsilon_{ik} = 1$ if $k = \arg\min_j \|x_i - \mu_j\|^2$ and $\varepsilon_{ij} = 0$ for $j \neq k$.

With the centres held fixed, finding the $\varepsilon_{ik}$ that minimize $J$ is easy:

$$\varepsilon_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_i - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$
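A minimal NumPy sketch of this assignment step (the names `X`, `mu`, and `e_step` are my own, not from the source):

```python
import numpy as np

def e_step(X, mu):
    """Assign each point to its nearest centre.

    X  : (n, d) array of data points
    mu : (K, d) array of current cluster centres
    Returns an (n,) array of cluster indices (the argmin over k).
    """
    # Squared distance from every point to every centre, shape (n, K)
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```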

Of course, the next step is to update $\mu_k$. Intuitively, the best place for $\mu_k$ would be in the middle of all the points assigned to cluster $k$.

Enough of intuition, lad. Here's your mathematical jargon:

For a fixed $k$:

$$J_k = \sum_{i=1}^{n} \varepsilon_{ik} \, \|x_i - \mu_k\|^2, \qquad \frac{\partial J_k}{\partial \mu_k} = -2 \sum_{i=1}^{n} \varepsilon_{ik} \, (x_i - \mu_k) = 0$$

Solving this gives literally just the mean of all the points belonging to cluster $k$, which is the refitting step:

$$\mu_k = \frac{\sum_{i=1}^{n} \varepsilon_{ik} \, x_i}{\sum_{i=1}^{n} \varepsilon_{ik}}$$
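And a matching sketch of the refitting step, under the same assumed names:

```python
import numpy as np

def m_step(X, labels, K):
    """Refit each centre to the mean of the points currently assigned to it.

    X      : (n, d) array of data points
    labels : (n,) cluster index of each point (from the assignment step)
    K      : number of clusters
    """
    mu = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[labels == k]
        if len(members) > 0:   # skip empty clusters; see the Q&A below on dead centres
            mu[k] = members.mean(axis=0)
    return mu
```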

Q&A

Q: Why initialize centres first?

A: If we instead used completely random centres, a centre could end up with no points assigned to it at all; then $\sum_i \varepsilon_{ik} = 0$ and the update for $\mu_k$ is undefined, a dead centre.
An unlucky assignment can cause this too, but very rarely. Either way, every data point always has a closest centre during the assignment step, so the algorithm keeps going.
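One common way to handle a dead centre in code (a sketch, not the only option) is to re-seed it at a random data point rather than dividing by zero:

```python
import numpy as np

def refit_with_reseed(X, labels, mu, rng=None):
    """Refit centres; if a cluster is empty, re-seed its centre at a random data point."""
    rng = rng or np.random.default_rng()
    new_mu = mu.copy()
    for k in range(len(mu)):
        members = X[labels == k]
        if len(members) > 0:
            new_mu[k] = members.mean(axis=0)
        else:
            # Dead centre: no points assigned, so the mean is undefined.
            new_mu[k] = X[rng.integers(len(X))]
    return new_mu
```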

E-step: Expectation step (the assignment step)
M-step: Maximization step (the refitting step)

$$\text{Error} = \begin{cases} 1 & \text{if some } x_n \text{ gets a new } \min_i \|x_n - \mu_i\|^2 \text{ (i.e., changes cluster)} \\ 0 & \text{otherwise} \end{cases}$$

We keep alternating E and M steps while this error is 1, and stop once no point changes its nearest centre.

But these E-M steps can converge to a local minimum. One fix is to do several random restarts and keep the run with the lowest cost $J$, as in the sketch below. A better fix is the K-means++ initialization (K++), described next.
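A self-contained sketch of the full loop with a stop-when-no-assignment-changes check and random restarts (all names here are my own):

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """One run of K-means from random initial centres; returns (centres, labels, cost)."""
    rng = rng or np.random.default_rng()
    X = np.asarray(X, dtype=float)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init centres at K distinct data points
    labels = np.full(len(X), -1)
    for _ in range(n_iters):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)               # E-step: assignment
        if np.array_equal(new_labels, labels):          # no point changed cluster: converged
            break
        labels = new_labels
        for k in range(K):                              # M-step: refitting
            members = X[labels == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
    cost = ((X - mu[labels]) ** 2).sum()                # the objective J
    return mu, labels, cost

def kmeans_restarts(X, K, n_restarts=10):
    """Run K-means several times and keep the run with the lowest cost J."""
    runs = [kmeans(X, K) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```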

K++

Pick first centre randomly.

Then, letting $D(x)$ be the distance from $x$ to its nearest already-chosen centre (initially $D(x, c_1)$), choose the next centre to be point $x_n$ with probability proportional to $D(x_n)^2$.
What this essentially does is favour points that are far away from the already chosen centres, making it more likely that the next centre lands in a new cluster.
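A sketch of that seeding rule, with the squared distances used as unnormalized sampling weights (names are my own):

```python
import numpy as np

def kmeanspp_init(X, K, rng=None):
    """K-means++ seeding: first centre uniform at random; each further centre is drawn
    with probability proportional to D(x)^2, the squared distance from x to the
    nearest centre chosen so far."""
    rng = rng or np.random.default_rng()
    X = np.asarray(X, dtype=float)
    centres = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # D(x)^2 for every point: squared distance to its nearest chosen centre
        d2 = ((X[:, None, :] - np.array(centres)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)
```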

Use Cases

Lossy Compression

Figure: Pasted image 20250922160036.png (image taken from Pattern Recognition and Machine Learning)

We apply K-means in the following manner:
For each of the $N$ data points, we only store the identity $k$ of the cluster to which it is assigned.
We also store the values of the $K$ cluster centres $\mu_k$.

This is called a vector quantization framework; the $\mu_k$ are the code-book vectors.
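As a small sketch of encoding and decoding with the code-book (here `codebook` is the $(K, d)$ array of centres $\mu_k$ from a K-means run; the helper names are my own):

```python
import numpy as np

def quantize(pixels, codebook):
    """Encode each pixel as the index of its nearest code-book vector mu_k."""
    d2 = ((pixels[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)            # (N,) indices: this is what gets transmitted

def reconstruct(indices, codebook):
    """Decode: replace each index by its code-book vector (the lossy reconstruction)."""
    return codebook[indices]
```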

Example, image compression:
For an image of $N$ pixels stored at $t$-bit precision per pixel, transmitting the raw image needs $N \times t$ bits.

So in our case of $\{R, G, B\}$ with 8-bit precision, we would need $24N$ bits.
Now suppose we first run K-means on the image data, and then instead of transmitting the original pixel intensity vectors we transmit the identity of the nearest centre $\mu_k$. Because there are $K$ such vectors, this requires $\log_2 K$ bits per pixel (rounded up to the nearest integer). We must also transmit the $K$ code-book vectors $\mu_k$, which requires $24K$ bits, so the total number of bits required to transmit the image is $24K + N \lceil \log_2 K \rceil$.
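As a quick worked check of that formula, assuming (as in the PRML example) an image of $240 \times 180 = 43{,}200$ pixels:

```python
import math

N = 240 * 180                   # number of pixels; the raw image needs 24 * N = 1,036,800 bits
for K in (2, 3, 10):
    bits = 24 * K + N * math.ceil(math.log2(K))
    print(K, bits)              # 43,248 / 86,472 / 173,040 bits respectively
```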

Medoid

Medoid: not the mean, but an actual point in the dataset. For a cluster $S$, the medoid is

$$\arg\min_{c \in S} \sum_{x \in S} D(x, c)$$

where both the candidate centre $c$ and the points $x$ lie in $S$, and $D$ can be any distance.
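A sketch of computing the medoid of a cluster under an arbitrary distance function (names are my own):

```python
import numpy as np

def medoid(S, D):
    """Return the point of S that minimizes the summed distance D to all points of S.

    S : (m, d) array of the points in one cluster
    D : callable D(x, y) -> float, any distance function
    """
    costs = [sum(D(c, x) for x in S) for c in S]
    return S[int(np.argmin(costs))]
```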

Limitations

K-means makes hard assignments and describes each cluster only by its mean, so it struggles with overlapping or non-spherical clusters. To kind of tackle this, we use Mixture Models.