In machine learning, one of the best ways to learn more about your data is to classify it with what you already know. You can group data together based on shared characteristics, and because you already know those characteristics, you can classify the data with supervised machine learning. A very common supervised algorithm for multiclass classification is k-Nearest Neighbor (k-NN). It's an instance-based algorithm, also called lazy learning. With lazy learning, the bulk of the computation happens at the moment you ask it to classify something. The learning doesn't happen gradually during a training phase; instead, you run all the computation in one burst. In a sense, you're saving up all your energy for one big splash. k-NN compares something you don't know to what you already have, so you're immediately rewarded for the size and quality of your training data. The downside is that this takes a lot of computational power, so it can be difficult to use k-NN on very large data sets.
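To make the "lazy" part concrete, here is a minimal sketch of the idea (the class name, the made-up dog measurements, and the choice to use only the single nearest neighbor are all just for illustration, not any particular library's implementation):

```python
# "Lazy learning": fitting just memorizes the training data, and all of the
# distance comparisons are deferred until you ask for a prediction.
class LazyNearestNeighbor:
    def fit(self, X, y):
        self.X = X              # no model is built here...
        self.y = y              # ...the training examples are simply stored
        return self

    def predict(self, point):
        # all the work happens now, which is why k-NN gets expensive
        # as the training set grows (k is fixed at 1 here for brevity)
        distances = [sum((a - b) ** 2 for a, b in zip(row, point)) ** 0.5
                     for row in self.X]
        nearest = min(range(len(distances)), key=distances.__getitem__)
        return self.y[nearest]

model = LazyNearestNeighbor().fit([[24.0, 5.0], [32.0, 4.0]], ["Husky", "Shepherd"])
print(model.predict([25.0, 5.5]))   # -> Husky
```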
Think of it this way: for a veterinarian, one of the most difficult jobs is classifying the breed of each new dog. There are hundreds of known dog breeds. Not only that, dogs aren't particularly picky about who they breed with, so there are plenty of mixed breeds. Each time a new dog comes in, the veterinarian compares it with several dogs that have already been classified, looking at characteristics such as the shape of the face or the color of the coat. In a sense, the goal is to classify the unknown dog by finding its nearest neighbors. Another way to look at it is that you're trying to minimize the distance between the unknown dog and the known breeds: if the characteristics are closely matched, the distance between the unknown dog and its nearest neighbor is very short. Minimizing that distance is a key part of k-Nearest Neighbor. The closer you are to your nearest neighbors, the more likely your classification is to be accurate. The most common way to measure this is Euclidean distance, a simple formula for the straight-line distance between two data points: take the difference along each characteristic, square it, add the squares together, and take the square root.
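Here is what that looks like in a short sketch, using two made-up dogs described by weight in kilograms and hair length in centimeters:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical dogs described as (weight in kg, hair length in cm)
unknown_dog = (24.0, 5.5)
known_dog = (26.0, 6.0)
print(euclidean_distance(unknown_dog, known_dog))  # ~2.06
```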
Now, imagine you had millions of dogs and you wanted to classify them by breed. To start out, you might pick two key characteristics that help you group dogs that share the same breed. These are often called predictors. Let's use their weight and the length of their hair. Now take these two characteristics and put them on an X-Y diagram, with hair length along the Y-axis and weight along the X-axis. Take 1,000 classified dogs from the training set and place them on the graph according to their weight and hair length. Now put the unknown dog on the same chart. You can see that it doesn't sit exactly on top of another dog, but it has a bunch of close neighbors.
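If you'd like to draw that chart yourself, here is a quick sketch; the weights and hair lengths are randomly generated stand-ins for real breed data, and the unknown dog's measurements are invented:

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up data standing in for 1,000 classified dogs
rng = np.random.default_rng(42)
weights = rng.normal(27, 4, size=1000)        # X-axis: weight in kg
hair_lengths = rng.normal(5, 1.5, size=1000)  # Y-axis: hair length in cm

plt.scatter(weights, hair_lengths, s=8, alpha=0.4, label="classified dogs")
plt.scatter([25.0], [5.5], color="red", marker="x", s=100, label="unknown dog")
plt.xlabel("weight (kg)")
plt.ylabel("hair length (cm)")
plt.legend()
plt.show()
```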
Let's say we use a k of five. That means we draw a circle around the unclassified dog and its five closest neighbors. The shorter the distance to those neighboring dogs, the more accurate the classification is likely to be. Now look at the five closest neighbors: three of them are Huskies and two of them are Shepherds, so you can be reasonably confident classifying your unknown dog as a Husky.
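Putting the whole walkthrough together, here is a sketch using scikit-learn's KNeighborsClassifier. The five-row training set is made up purely for illustration and stands in for the 1,000 classified dogs, but it reproduces the three-to-two vote described above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny, made-up training set: each row is (weight in kg, hair length in cm).
X_train = np.array([
    [24.0, 5.0], [26.5, 6.0], [23.0, 5.5],   # Huskies
    [30.0, 4.0], [32.5, 4.5],                # Shepherds
])
y_train = ["Husky", "Husky", "Husky", "Shepherd", "Shepherd"]

# k = 5: the unknown dog is compared with its five closest neighbors
model = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
model.fit(X_train, y_train)          # lazy learning: fit just stores the data

unknown_dog = np.array([[25.0, 5.5]])
print(model.predict(unknown_dog))    # -> ['Husky'] by a 3-to-2 majority vote
```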
k-Nearest Neighbor is a very common and powerful machine learning algorithm, and it can do much more than sort dogs. In fact, it's commonly used in finance to screen stocks and even to predict future performance.
