# k-Nearest Neighbor

In machine learning, one of the best ways to learn more about your data is to classify it with what you already know. You can group data together based on similar characteristics, and because you already know those characteristics, you can classify the data using supervised machine learning. A very common supervised machine learning algorithm for multiclass classification is k-Nearest Neighbor (k-NN). This is an instance-based machine learning algorithm, also called lazy learning. With lazy learning, the bulk of the computation happens right before you want to classify your data. The learning doesn't happen continuously; instead, you run all your computation in one big instance. In a sense, you're saving up all your energy for one big splash. k-NN compares something you don't know to what you already have, so in some ways you're immediately rewarded for the size and quality of your training data. The downside is that this takes a lot of computational power, so it can be difficult to use k-NN on very large data sets.
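To make the "lazy" part concrete, here's a minimal sketch in Python, using a hypothetical LazyClassifier class and a single nearest neighbor to keep it short (the fuller k-neighbor vote appears later in this section):

```python
class LazyClassifier:
    """Sketch of a lazy learner: 'training' just stores the examples."""

    def fit(self, examples):
        # No computation happens here; the data is simply kept around.
        self.examples = examples  # list of (features, label) pairs

    def predict(self, point):
        # The bulk of the work happens now, at classification time:
        # the unknown point is compared against every stored example.
        # Squared distance is enough here, since it ranks neighbors
        # the same way the true Euclidean distance does.
        closest = min(
            self.examples,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
        )
        return closest[1]
```

Calling fit is nearly free; every call to predict pays the full cost of scanning the training data, which is exactly why very large data sets become a problem.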

Think of it this way. For a veterinarian, one of the most difficult jobs was classifying the breed of each new dog. There are hundreds of known dog breeds. Not only that, dogs aren't that close-minded about who they breed with, so you have a lot of mixed breeds. Each time a new dog came in, the veterinarian would hold it up next to several of the dogs that were already classified. Then they'd look at some of the characteristics, maybe the shape of their face or the color of their hair. In a sense, they were trying to classify the unknown dog by looking for its nearest neighbor. Another way to look at it is that they were trying to minimize the distance between the unknown dog and the known breeds. If the characteristics were closely matched, then there was a very short distance between the unknown dog and its nearest neighbor.

Minimizing the distance is a key part of k-Nearest Neighbor. The closer you are to your nearest neighbors, the more likely your classification is to be accurate. The most common way to measure this is through something called Euclidean distance, a mathematical formula that measures the distance between your different data points.
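For two points $p$ and $q$ described by $n$ characteristics, the Euclidean distance is the square root of the summed squared differences:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

With just two characteristics, this reduces to the familiar Pythagorean theorem: the straight-line distance between two points on a chart.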

Now, imagine you had millions of dogs and you wanted to classify them based on their breed. To start out, you might want to pick two key characteristics that will help you classify dogs that share the same breed. These are often called predictors. So let's use their weight and the length of their hair. Now let's take these two characteristics and put them on an X-Y axis diagram: the length of their hair along the Y-axis and their weight along the X-axis. Let's take 1,000 classified dogs from our training set and put them on the graph based on their weight and hair length. Now let's take our unknown dog and put it on the same chart.
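Here's a minimal sketch of how you might build that chart with matplotlib. All of the weights, hair lengths, and breed labels below are invented for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical training data: weight in kg, hair length in cm, per breed.
# All of these numbers and labels are made up for illustration.
dogs = {
    "Husky":    [(25.0, 5.5), (27.5, 6.0), (24.0, 5.0)],
    "Shepherd": [(30.0, 4.0), (32.0, 4.5)],
}
unknown = (26.0, 5.2)  # weight, hair length; breed unknown

# Weight goes on the X-axis, hair length on the Y-axis.
for breed, points in dogs.items():
    weights, hairs = zip(*points)
    plt.scatter(weights, hairs, label=breed)
plt.scatter(*unknown, marker="x", s=100, color="black", label="Unknown dog")
plt.xlabel("Weight (kg)")
plt.ylabel("Hair length (cm)")
plt.legend()
plt.show()
```

Each breed gets its own scatter call so the legend picks up one entry per breed.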

You can see that it isn't matched with another dog exactly, but it has a bunch of close neighbors. Let's say we use a K of five. That means we'd want to put a circle around our unclassified dog and its five closest neighbors. You can see that if the distances to those other dogs are shorter, you'll probably get a much more accurate classification. Now let's look at the five closest neighbors. You'll see that three of them are Huskies and two of them are Shepherds, so you can be somewhat confident classifying your unknown dog as a Husky.
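Putting it all together, here's a minimal sketch of the full procedure in plain Python, reusing the invented dog data from above. The euclidean and classify helpers are illustrative names, not a library API:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify(unknown, training_data, k=5):
    """Label an unknown point by majority vote of its k nearest neighbors."""
    # Sort every classified example by its distance to the unknown point.
    neighbors = sorted(training_data, key=lambda item: euclidean(unknown, item[0]))
    # Keep only the k closest and count their breed labels.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# The same hypothetical dogs as before: (weight, hair length) -> breed.
training_data = [
    ((25.0, 5.5), "Husky"),
    ((27.5, 6.0), "Husky"),
    ((24.0, 5.0), "Husky"),
    ((30.0, 4.0), "Shepherd"),
    ((32.0, 4.5), "Shepherd"),
]

print(classify((26.0, 5.2), training_data, k=5))  # -> "Husky" (3 of 5 votes)
```

Note how this matches the lazy-learning idea from earlier: there's no training step at all, just a list of stored examples that gets scanned at prediction time.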

k-Nearest Neighbor is a very common and powerful machine learning algorithm. That's because it can do more than just sort dogs. In fact, it's commonly used in finance to look for the best stocks and even predict future performance.