Machine Learning 101: K-Nearest Neighbors in Python (Classification)

by mark · Published 13 May 2020 · Updated 13 May 2020

K-Nearest Neighbors is a double-edged algorithm that can be used for both Classification and Regression problems. While its intuition is quite simple, it can yield impressive results, but there are a few drawbacks, hence double-edged.

Classification vs Regression

Machine Learning essentially deals with two kinds of problems:

Classification: predicting a class, for example whether a user is male or female (the two classes) given their history of purchased items.
Regression: predicting a value, for example the price (the value) of a used car given the model, the age, the kilometers on the odometer.

It is important to remember that Machine Learning is no magic, ML algorithms are still algorithms: multiple inputs, one output. The most important difference between a traditional algorithm and an ML one is the “experience” the ML algorithm gains during the training phase.

In Classification problems the algorithm tries to predict the class the entry will fall into, it may be two classes (such as the example above, male versus female) or more than two classes. The former is often called Binary Classification the latter is referred to as Multiclass Classification.

In Regression there is no class to predict, instead there is a scale and the algorithm tries to predict the value on that scale. In the example above the price is the sought value.

K-Nearest Neighbors (Classification) in Python

The intuition behind K-Nearest Neighbors is quite simple: if the observation is close to others observations (with a minimum of K) it must be of the same kind.

Imagine a square where three stalls are performing different magic tricks, there are illusionists, fire-dancers and tamers.

Now imagine a worker for one of the three stalls that has just come and he’s late, he goes in the center of the square and finds its stall then takes a step.

You can use KNN to determine which stall the worker will go to by calculating the distance between the worker and all the other workers. If there are enough colleagues (let’s say 2) within his range and they’re the closest it will mean he probably has seen its colleagues and he belongs there. Now this seems probably obvious since he has taken a step towards the stall, but the real difference here is the distance between the worker and each other worker in the square.

By taking a step towards his colleagues the worker has decreased the distance between himself and its colleagues and has increased the distance between himself and the other groups.

You might also have noticed an important thing about KNN, k is chosen by you and the right number for K actually depends on the data.

The distance is therefore essential to determine whether a new observation will befall a certain category or not. Historically the Euclidean Distance has been used while nowadays the Minkowski Distance is used (it is a generalization of both Euclidean and Manhattan distances). Here’s the formula to calculate the Euclidean Distance between p and q in an n-dimensional space:

$$d\left(p,q\right) = \sqrt{\sum _{i=1}^{n} \left(q_{i}-p_{i}\right)^2}$$

K-Nearest Neighbors using fish (classification problem)

The following notebook uses the Fish market dataset available here, it is free and released under the GPL2 license. This dataset includes a number of species of fish and for each fish some measurements such as weight and height.

(Basic) Explanatory Data Analysis

Every good ML algorithm should start with an in-depth Explanatory Data Analysis (EDA). In the EDA you should always try to explore the data as much as possible, through exploration it is possible to infer basic features of the data, from those basic inferences you can start developing a basic intuition. From there you can start formulating hypotheses and implement the algorithm you see fit.

As the purpose of this notebook is to illustrate K-Nearest Neighbors applied to a Classification Problem, the performed EDA will outline just the basic features of the dataset.

Firstly, let's import basic utilities:

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(101) # This is needed so that if you run this notebook again you will get the same results

Let's now read the csv file (keep in mind it should be in the same folder as the notebook!):

In [2]:

df = pd.read_csv('Fish.csv')

Let's take a look at the first rows of the dataset. It is important to do this in order to get a basic understanding.

In [3]:

df.head()

Out[3]:

	Species	Weight	Length1	Length2	Length3	Height	Width
0	Bream	242.0	23.2	25.4	30.0	11.5200	4.0200
1	Bream	290.0	24.0	26.3	31.2	12.4800	4.3056
2	Bream	340.0	23.9	26.5	31.1	12.3778	4.6961
3	Bream	363.0	26.3	29.0	33.5	12.7300	4.4555
4	Bream	430.0	26.5	29.0	34.0	12.4440	5.1340

Now let's take a closer look to the dataset to get important statistical indicators such as the mean and standard deviation

In [4]:

df.describe()

Out[4]:

	Weight	Length1	Length2	Length3	Height	Width
count	159.000000	159.000000	159.000000	159.000000	159.000000	159.000000
mean	398.326415	26.247170	28.415723	31.227044	8.970994	4.417486
std	357.978317	9.996441	10.716328	11.610246	4.286208	1.685804
min	0.000000	7.500000	8.400000	8.800000	1.728400	1.047600
25%	120.000000	19.050000	21.000000	23.150000	5.944800	3.385650
50%	273.000000	25.200000	27.300000	29.400000	7.786000	4.248500
75%	650.000000	32.700000	35.500000	39.650000	12.365900	5.584500
max	1650.000000	59.000000	63.400000	68.000000	18.957000	8.142000

Let's now plot each numerical feature against each other, in order to get a clear distinction use a palette with high contrast (Viridis) and use the species as color.

In [5]:

sns.pairplot(df, hue='Species', palette='viridis')

Out[5]:

<seaborn.axisgrid.PairGrid at 0x1db797e48c8>

As you can see there is quite a strong pattern between: Length1 and Length2, Length2 and Length3, Length3 and Length1.

In the next step we will try to predict whether the fish is a Perch or a Bream, so let's get a better look at each feature for each species.

In [6]:

fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(20, 10), sharey=False)
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[0][0])
sns.boxplot(data=df, y='Width', x='Species', ax=axes[0][1])
sns.boxplot(data=df, y='Height', x='Species', ax=axes[0][2])
sns.boxplot(data=df, y='Length1', x='Species', ax=axes[1][0])
sns.boxplot(data=df, y='Length2', x='Species', ax=axes[1][1])
sns.boxplot(data=df, y='Length3', x='Species', ax=axes[1][2])

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x1db7b1b3948>

As you can see Length1, Length2 and Length3 follow essentially the same pattern across each species (hence they contain almost the same information). On the other hand each species has a set of Width, Height and Weight.

As the dataset is quite small it may be worth knowing how many observations there are for each fish species:

In [7]:

sns.countplot(data=df, x='Species')

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x1db7c008248>

K-Nearest Neighbours (binary classification): Let's predict whether the fish is a Perch or a Bream

In this step a KNN model to predict the fish species will be created. In order to make it simpler for the model to make predictions we will restrict the domain of species to Perch and Bream species. The model will be able to tell a Perch apart from a Bream, but any other fish it will have no clue.

The following lines allow us to create a X variable containing all the features except "Species", and a y variable containing just the species. Both variables will only include observations about Perch and Bream species.

In [8]:

X = df[(df['Species'] == 'Perch') | (df['Species'] == 'Bream')].drop('Species', axis=1)
y = df[(df['Species'] == 'Perch') | (df['Species'] == 'Bream')]['Species'].replace(['Bream', 'Perch'], [0, 1])

Split the dataset in two parts: train and test. This is needed to calculate the accuracy (and many other metrics) of the model. We will use the train part during the training, and the test part during the evaluation. The model will not see the test part during its training.

In [9]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)

Without delving too much into mathematics, using KNN involves using a metric to calculate the distance between a point and its neighbours, this metric is usually the Euclidean Distance. In order for the features to be equally important, each feature should be standardized. You can standardize any dataset by doing:

$$\mathcal{z}=\frac{x-\mu}{\sigma}$$

Where $\mu$ is the mean average and $\sigma$ is the standard deviation of each particular column. You can work the math for each column and assign them using pandas features, or you can leverage sklearn StandardScaler. The StandardScaler will first need to be fitted (calculate $\mu$ and $\sigma$). Then the scaler will be used to scale X_train and X_test. You don't need to scale the y variable (it also only has boolean values in it (0, 1). Another thing you don't want to do is fit the scaler on train and test data, by doing this you will throw off the parameters and "leak" information (you will inherently tell bits of how the X_test looks like). Once the scaler has been fitted to the data, the data can be transformed using the formula above.

In [10]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Import the model and instantiate it:

In [11]:

from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(2, p=2)

Now let's train the model:

In [12]:

knc.fit(X_train, y_train)

Out[12]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                     weights='uniform')

In order to review our model we will use two handy functions: the classification report (which sums up useful statistics, and the confusion matrix.

In [13]:

from sklearn.metrics import classification_report, confusion_matrix, f1_score

In [14]:

print(classification_report(y_test, knc.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        17

    accuracy                           1.00        31
   macro avg       1.00      1.00      1.00        31
weighted avg       1.00      1.00      1.00        31

While discussing about the matrics deserves a whole article by itself, it is easy to observe the model is a perfect fit, achieving 100% precision and recall. The "support" column tells you how many observation support the class (0, 1). Let's now take a look at the confusion matrix:

In [15]:

print(confusion_matrix(y_test, knc.predict(X_test)))

[[14  0]
 [ 0 17]]

While the classification report allows for a quick review of the model performance, the confusion matrix takes a bit more to be understood.

The upper left number represents true positives (they are 1, classified as 1)
The upper right number represents false positives (they are 0, classified as 1)
The lower left number represents false negatives (they are 1, classified as 0)
The lower right number represents true negatives (they are 0, classified as 0)

Conclusion: It's prefect!

Although the model can only predict whether the fish is a Perch or a Bream, it seems to be working perfectly, but is it really? First of all as you can saw during the EDA, Breams and Perches are essentially different fishes. You should also keep in mind the input of the model (weight, length1, length2, length3, width and height) is essentially shared among ALL fishes in the world. Imagine you find a fish whose features are essentially the same of a Bream but it is not a Bream. Well, the model would classify it as a Bream, making an error. Since the model trained on Bream and Perch species only (albeit with a very small dataset), you should be sure you're trying to classify one of the two.

K-Nearest Neighbors (multiclass problem): Let's predict the species

This problem closely resembles the last one, yet we're now trying to predict the species. There are seven species in this dataset and there aren't many observations for some of them.

In [16]:

X = df.drop(['Species'], axis=1)
y = df['Species']

In [17]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)

In [18]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [19]:

from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(2, p=2)

In [20]:

knc.fit(X_train, y_train)

Out[20]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                     weights='uniform')

Let's now evaluate the model:

In [21]:

from sklearn.metrics import classification_report

In [22]:

print(classification_report(y_test, knc.predict(X_test), zero_division=0))

              precision    recall  f1-score   support

       Bream       0.90      1.00      0.95         9
      Parkki       1.00      0.33      0.50         6
       Perch       0.79      0.95      0.86        20
        Pike       1.00      1.00      1.00         5
       Roach       0.38      0.43      0.40         7
       Smelt       1.00      0.80      0.89         5
   Whitefish       0.00      0.00      0.00         1

    accuracy                           0.79        53
   macro avg       0.72      0.64      0.66        53
weighted avg       0.80      0.79      0.77        53

Where's the confusion matrix? If you print the confusion matrix for a multiclass problem you will see it is actually quite difficult to interpret its results, stick to the classification report for easier-to-understand values. For the curious ones:

In [23]:

print(confusion_matrix(y_test, knc.predict(X_test)))

[[ 9  0  0  0  0  0  0]
 [ 1  2  0  0  3  0  0]
 [ 0  0 19  0  1  0  0]
 [ 0  0  0  5  0  0  0]
 [ 0  0  4  0  3  0  0]
 [ 0  0  1  0  0  4  0]
 [ 0  0  0  0  1  0  0]]

Conclusion: Not so accurate now?

As the previous model achieved 100% accuracy and recall, the second one achieved an overall accuracy of 0.79. As you can see as the number of species grow and the features overlap, the model struggles to distinguish between fish species with some species such as Roach and Whitefish being mispredicted all of the times. Also Perches and Breams that were previously classified perfectly are now not as certain as before. The size of the dataset, the presence of outliers, the weaknesses of the model all show up together in the last report.

Conclusion

KNN is a great algorithm that can yield impressive results but its complexity quickly grows as the number of dimensions and observations grows. It is by many considered the go-to ML algorithm, but it should not be used blindly. In the notebook shown above the features show a linear pattern and that should be an indicator that KNN will not perform well. Since there is a strong linear correlation between many features, other techniques such as Linear Regression or Logistic Regression should come to mind first.

Image courtesy of mark | marksei

Author
Recent Posts

mark

The IT guy with a slight look of boredom in his eyes. Freelancer. Current interests: Kubernetes, Tensorflow, shiny new things.

Cookie	Duration	Description
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_gtag_UA_60468161_1	past	Set by Google to distinguish users.
_ga_DR9SCJ09BV	2 years	This cookie is installed by Google Analytics.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
__gads	1 year 24 days	The __gads cookie, set by Google, is stored under DoubleClick domain and tracks the number of times users see an advert, measures the success of the campaign and calculates its revenue. This cookie can only be read from the domain they are set on and will not track any data while browsing through other sites.

Cookie	Duration	Description
edgebucket	session	Reddit sets this cookie to save the information about a log-on Reddit user, for the purpose of advertisement recommendations and updating the content.
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
test_cookie	14 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
csv	2 years	No description available.
GoogleAdServingTest	session	No description
wp_api	past	No description
wp_api_sec	past	No description
_pk_id.1.95fa	1 year 27 days	No description
_pk_ses.1.95fa	29 minutes	No description
__smSessionId	9 hours	No description available.
__smToken	1 year	This cookie is set by the Sumo. This cookie is used for verifying whether the user is logged in or not.

Machine Learning 101: K-Nearest Neighbors in Python (Classification)

Classification vs Regression

K-Nearest Neighbors (Classification) in Python

K-Nearest Neighbors using fish (classification problem)

(Basic) Explanatory Data Analysis

K-Nearest Neighbours (binary classification): Let's predict whether the fish is a Perch or a Bream

Conclusion: It's prefect!

K-Nearest Neighbors (multiclass problem): Let's predict the species

Conclusion: Not so accurate now?

Conclusion

You may also like...

Leave a ReplyCancel reply

Recent Posts

Recent Comments

Categories

Latest tutorials

Machine Learning 101: K-Nearest Neighbors in Python (Classification)

Classification vs Regression

K-Nearest Neighbors (Classification) in Python

K-Nearest Neighbors using fish (classification problem)

(Basic) Explanatory Data Analysis

K-Nearest Neighbours (binary classification): Let's predict whether the fish is a Perch or a Bream

Conclusion: It's prefect!

K-Nearest Neighbors (multiclass problem): Let's predict the species

Conclusion: Not so accurate now?

Conclusion

Related posts:

You may also like...

Machine Learning 101: Linear Regression in Python

Machine Learning 101: Outliers introduction

Machine Learning 101: Evaluating regression models, MAE, MSE, RMSE, R-squared explained

Leave a ReplyCancel reply

Recent Posts

Recent Comments

Categories

Latest tutorials