The Problem

Given the number of rooms and the area (in square feet) of a dwelling, figure out whether it’s an apartment, a house, or a flat.

As always, we’re starting with the most contrived possible problem in order to learn the basics. The problem description gives us the features to look at: number of rooms and area in square feet. Since this is a supervised learning problem, we can also assume we’ll be given a handful of labeled examples of apartments, houses, and flats.

What “k-nearest-neighbor” Means

I think the best way to teach the kNN algorithm is to simply define what the phrase “k-nearest-neighbor” actually means.

Here’s a table of the example data we’re given for this problem:

Rooms  Area (sq ft)  Type
1      350           apartment
2      300           apartment
3      300           apartment
4      250           apartment
4      500           apartment
4      400           apartment
5      450           apartment
7      850           house
7      900           house
7      1200          house
8      1500          house
9      1300          house
8      1240          house
10     1700          house
9      1000          house
1      800           flat
3      900           flat
2      700           flat
1      900           flat
2      1150          flat
1      1000          flat
2      1200          flat
1      1300          flat
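
To work with this data in code, one option is to write the table as an array of plain objects. This is only a sketch: the names `examples`, `rooms`, `area`, `type`, and `countByType` are my own choices for illustration, not something the problem prescribes.

```javascript
// The example data from the table above, one object per row.
// Field names (rooms, area, type) are illustrative assumptions.
const examples = [
  { rooms: 1, area: 350, type: 'apartment' },
  { rooms: 2, area: 300, type: 'apartment' },
  { rooms: 3, area: 300, type: 'apartment' },
  { rooms: 4, area: 250, type: 'apartment' },
  { rooms: 4, area: 500, type: 'apartment' },
  { rooms: 4, area: 400, type: 'apartment' },
  { rooms: 5, area: 450, type: 'apartment' },
  { rooms: 7, area: 850, type: 'house' },
  { rooms: 7, area: 900, type: 'house' },
  { rooms: 7, area: 1200, type: 'house' },
  { rooms: 8, area: 1500, type: 'house' },
  { rooms: 9, area: 1300, type: 'house' },
  { rooms: 8, area: 1240, type: 'house' },
  { rooms: 10, area: 1700, type: 'house' },
  { rooms: 9, area: 1000, type: 'house' },
  { rooms: 1, area: 800, type: 'flat' },
  { rooms: 3, area: 900, type: 'flat' },
  { rooms: 2, area: 700, type: 'flat' },
  { rooms: 1, area: 900, type: 'flat' },
  { rooms: 2, area: 1150, type: 'flat' },
  { rooms: 1, area: 1000, type: 'flat' },
  { rooms: 2, area: 1200, type: 'flat' },
  { rooms: 1, area: 1300, type: 'flat' },
];

// Count how many labeled examples we have of each type.
function countByType(points) {
  const counts = {};
  for (const p of points) {
    counts[p.type] = (counts[p.type] || 0) + 1;
  }
  return counts;
}

console.log(countByType(examples));
// prints { apartment: 7, house: 8, flat: 8 }
```

Keeping each row as an object means the same field names can serve as the x and y coordinates when the points are plotted.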

We’re going to plot the above as points on a graph in two dimensions, using number of rooms as the x-axis and the area as the y-axis.

When we inevitably run into a new, unlabeled data point (a “mystery point”), we’ll put that on the graph too. Then we’ll pick a number (called “k”) and find the k points on the graph closest to our mystery point. If the majority of those k points are flats, then we’ll guess that our mystery point is a flat.

That’s what k-nearest-neighbor means. “If the 3 (or 5 or 10, or ‘k’) nearest neighbors to the mystery point are two apartments and one house, then the mystery point is an apartment.”

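Here’s the (simplified) procedure:

1. Plot every labeled example on the graph, with number of rooms on the x-axis and area on the y-axis.
2. Add the mystery point to the same graph.
3. Pick a value for k.
4. Find the k labeled points closest to the mystery point.
5. Guess that the mystery point has whichever type the majority of those k neighbors have.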
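
The idea described so far (put the points on a graph, find the k closest to the mystery point, take a majority vote) could be sketched in JavaScript roughly as follows. Treat it as a minimal illustration, not the finished implementation: the names `labeled`, `distance`, and `classify` are hypothetical, and it assumes plain straight-line distance on the raw room and area values. Because the areas are much larger numbers than the room counts, area dominates the distance in this simplified version.

```javascript
// Labeled examples as [rooms, area, type] rows, copied from the table above.
const rows = [
  [1, 350, 'apartment'], [2, 300, 'apartment'], [3, 300, 'apartment'],
  [4, 250, 'apartment'], [4, 500, 'apartment'], [4, 400, 'apartment'],
  [5, 450, 'apartment'],
  [7, 850, 'house'], [7, 900, 'house'], [7, 1200, 'house'], [8, 1500, 'house'],
  [9, 1300, 'house'], [8, 1240, 'house'], [10, 1700, 'house'], [9, 1000, 'house'],
  [1, 800, 'flat'], [3, 900, 'flat'], [2, 700, 'flat'], [1, 900, 'flat'],
  [2, 1150, 'flat'], [1, 1000, 'flat'], [2, 1200, 'flat'], [1, 1300, 'flat'],
];
const labeled = rows.map(([rooms, area, type]) => ({ rooms, area, type }));

// Straight-line (Euclidean) distance between two points on the graph.
// Assumption: raw, unscaled values are used for both axes.
function distance(a, b) {
  return Math.hypot(a.rooms - b.rooms, a.area - b.area);
}

// Guess the type of `mystery` from its k nearest labeled neighbors.
function classify(points, mystery, k) {
  // Measure the distance to every labeled point and keep the k closest.
  const nearest = points
    .map(p => ({ type: p.type, dist: distance(p, mystery) }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k);

  // Tally one vote per neighbor.
  const votes = {};
  for (const n of nearest) {
    votes[n.type] = (votes[n.type] || 0) + 1;
  }

  // Pick the type with the most votes. On a tie, the type whose first
  // vote came from the closer neighbor wins, because keys are visited
  // in insertion order and only a strictly larger count replaces `best`.
  let best = null;
  for (const type of Object.keys(votes)) {
    if (best === null || votes[type] > votes[best]) {
      best = type;
    }
  }
  return best;
}

console.log(classify(labeled, { rooms: 3, area: 320 }, 3));  // prints apartment
console.log(classify(labeled, { rooms: 8, area: 1400 }, 3)); // prints house
```

In the second call the three nearest neighbors are two houses and one flat, so the vote goes to “house”, exactly the kind of two-to-one majority described above.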

If you’re having trouble visualizing this, please take a quick break to scroll down to the bottom of the page and run the JSFiddle. That should illustrate the concept. Then come back up here and continue reading!