K-means
K-means in a nutshell:
assign each point to its nearest centre, then update each centre to the mean of the points assigned to it, and repeat until the assignments stop changing.
Both steps come from optimizing a single objective function (the distortion measure):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

Binary indicator: $r_{nk} \in \{0, 1\}$.
Basically $r_{nk} = 1$ if point $\mathbf{x}_n$ is assigned to cluster $k$, and $0$ otherwise.
Finding the optimal $r_{nk}$ with the centres held fixed is the assignment step: each point simply picks its closest centre.
Of course, the next step is to update the centres $\boldsymbol{\mu}_k$ themselves.
Enough of intuition, lad. Here's your mathematical jargon:
For the fixed $\boldsymbol{\mu}_k$, $J$ is minimized by

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2, \\ 0 & \text{otherwise.} \end{cases}$$

For fixed $r_{nk}$, setting $\partial J / \partial \boldsymbol{\mu}_k = 0$ gives

$$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}.$$

Literally just the mean of all the points belonging to cluster $k$.
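Here's a minimal NumPy sketch of these two updates (the function names are my own, just for illustration):

```python
import numpy as np

def assign_step(X, centres):
    # r_nk in code: label each point with the index of its nearest centre.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def update_step(X, labels, K):
    # mu_k: mean of the points currently assigned to cluster k.
    # (An empty cluster would make this ill-defined -- see the Q&A below.)
    return np.stack([X[labels == k].mean(axis=0) for k in range(K)])
```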
Q&A
Q: Why initialize centres first?
A: The assignment step needs centres to measure distances to, so we must pick centres before assigning points. With an unlucky choice of initial centres, a centre can end up with no points assigned to it at all, but this is rare; either way, every data point always has a closest centre during the assignment step, so the algorithm keeps going.
E-step: re-assign each point to its nearest centre (minimize $J$ over the $r_{nk}$).
M-step: recompute each centre as the mean of its cluster (minimize $J$ over the $\boldsymbol{\mu}_k$).
But these E-M steps can get stuck in a local minimum of $J$. So we do several random starts on the initial centres $\boldsymbol{\mu}_k$ and keep the run with the lowest $J$; see the sketch below.
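Continuing the sketch above, a full loop with random restarts might look like this (`n_iters` and `n_restarts` are illustrative knobs, not canonical names):

```python
def kmeans(X, K, n_iters=100, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best_J, best = np.inf, None
    for _ in range(n_restarts):
        # random start: K distinct data points as the initial centres
        centres = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            labels = assign_step(X, centres)      # E-step
            centres = update_step(X, labels, K)   # M-step
        labels = assign_step(X, centres)          # final assignment
        J = ((X - centres[labels]) ** 2).sum()    # distortion of this run
        if J < best_J:
            best_J, best = J, (centres, labels)
    return best
```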
K-means++
Pick first centre randomly.
Then each remaining centre is sampled from the data points with probability proportional to $D(\mathbf{x})^2$, the squared distance from $\mathbf{x}$ to the nearest centre chosen so far:

$$P(\mathbf{x}) = \frac{D(\mathbf{x})^2}{\sum_{\mathbf{x}'} D(\mathbf{x}')^2}.$$
What this essentially does is favour points that are far away from the already chosen centres, making it more likely to pick a centre in a new cluster.
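A sketch of this seeding rule, reusing the NumPy setup from earlier (again just illustrative code):

```python
def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centres = [X[rng.integers(len(X))]]              # first centre: uniform
    for _ in range(K - 1):
        C = np.array(centres)
        # D(x)^2: squared distance to the closest centre chosen so far
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx = rng.choice(len(X), p=d2 / d2.sum())    # P(x) proportional to D(x)^2
        centres.append(X[idx])
    return np.array(centres)
```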
Use Cases
Lossy Compression

Figure: image taken from Pattern Recognition and Machine Learning.
We apply K-means in the following manner:
For each of the $N$ data points, we store only the identity $k$ of the cluster to which it is assigned.
We also store the values of the $K$ cluster centres $\boldsymbol{\mu}_k$, which needs far less data as long as $K \ll N$.
It's called a vector quantization framework, and the $\boldsymbol{\mu}_k$ are called code-book vectors.
Example: image segmentation and compression:
For an image of $N$ pixels, each pixel is an $\{R,G,B\}$ intensity vector, so we run K-means on these $N$ points in 3-dimensional colour space.
So in our case of $\{R,G,B\}$ with 8-bit precision, we would need $3 \times 8 = 24$ bits per pixel, i.e., $24N$ bits to transmit the whole image directly.
Now suppose we first run K-means on the image data, and then instead of transmitting the original pixel intensity vectors we transmit the identity of the nearest centre $\boldsymbol{\mu}_k$ for each pixel. Because there are $K$ such vectors, this costs $\log_2 K$ bits per pixel; adding the $24K$ bits needed to transmit the centres themselves gives roughly $N \log_2 K + 24K$ bits in total, far less than $24N$ when $K \ll N$.
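A quick back-of-envelope check of these counts (the 240x180 image size is, I believe, the one used in PRML's example; any $N$ works):

```python
import math

def original_bits(N):
    return 24 * N                                  # 24 bits per pixel

def compressed_bits(N, K):
    # ceil(log2 K) bits per pixel label, plus 24 bits per centre
    return N * math.ceil(math.log2(K)) + 24 * K

N = 240 * 180
print(original_bits(N))        # 1,036,800 bits
print(compressed_bits(N, 2))   #    43,248 bits
print(compressed_bits(N, 10))  #   173,040 bits
```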
Medoid
Medoid: (not mean, an actual point in the dataset)
Used when we have a general dissimilarity measure between points, in which case the mean may not even be well defined; the medoid is also more robust to outliers. This variant is known as K-medoids.
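A sketch of the medoid computation for one cluster (the L1 dissimilarity here is an arbitrary choice for illustration):

```python
import numpy as np

def medoid(points, dissim=lambda a, b: np.abs(a - b).sum()):
    # Pick the actual data point with the smallest total dissimilarity
    # to every other point in the cluster.
    costs = [sum(dissim(p, q) for q in points) for p in points]
    return points[int(np.argmin(costs))]
```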
Limitations
- Assumes spherical clusters
- Hard assignments
- Sensitive to outliers
 
To kind of tackle these, we use Mixture Models.