Machine Learning 101: Outliers introduction

Building a Machine Learning model may be getting easier by the day, but there's a rule of thumb: garbage input equals garbage predictions. Outliers are observations that differ significantly from the rest of the data. Outliers in your data will hinder your models, so let's discover what they are, how to detect them, and how to remove them.

A first encounter with Outliers

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Wikipedia, the free encyclopedia

As mentioned, an outlier is somewhat of an anomaly in the data. Most real-world data that you will encounter during your journey in data science and machine learning will include some outliers. Many Machine Learning models struggle with outliers; models that are not sensitive to outliers are called robust models.

It is therefore essential for any data scientist to know what an outlier looks like, how to detect and remove outliers, and most importantly why this should be done. In the following notebook I will show you a common method used to detect (and remove) outliers. This method is based on quartiles and the Interquartile Range (IQR), it is integrated into most plotting libraries, and it is known as "Tukey's Fences".

Outliers in fish

The following notebook uses the Fish market dataset available here, it is free and released under the GPL2 license. This dataset includes a number of species of fish and for each fish some measurements such as weight and height.

(Basic) Exploratory Data Analysis

Every good ML project should start with an in-depth Exploratory Data Analysis (EDA). During EDA you should always try to explore the data as much as possible: through exploration you can infer basic features of the data, and from those inferences you can start developing a basic intuition. From there you can formulate hypotheses and implement the algorithm you see fit.

Since the purpose of this notebook is to illustrate what outliers are, it will only cover this topic.

Firstly, let's import basic utilities:

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Some font improvements
plt.rc('font', size=14)
plt.rc('axes', titlesize=14)
plt.rc('axes', labelsize=14)

Let's now read the CSV file (keep in mind it should be in the same folder as the notebook!):

In [2]:
df = pd.read_csv('Fish.csv')

Let's take a look at the first rows of the dataset. It is important to do this in order to get a basic understanding.

In [3]:
df.head()
Out[3]:
Species Weight Length1 Length2 Length3 Height Width
0 Bream 242.0 23.2 25.4 30.0 11.5200 4.0200
1 Bream 290.0 24.0 26.3 31.2 12.4800 4.3056
2 Bream 340.0 23.9 26.5 31.1 12.3778 4.6961
3 Bream 363.0 26.3 29.0 33.5 12.7300 4.4555
4 Bream 430.0 26.5 29.0 34.0 12.4440 5.1340

Now let's take a closer look at the dataset to get important statistical indicators such as the mean and standard deviation:

In [4]:
df.describe()
Out[4]:
Weight Length1 Length2 Length3 Height Width
count 159.000000 159.000000 159.000000 159.000000 159.000000 159.000000
mean 398.326415 26.247170 28.415723 31.227044 8.970994 4.417486
std 357.978317 9.996441 10.716328 11.610246 4.286208 1.685804
min 0.000000 7.500000 8.400000 8.800000 1.728400 1.047600
25% 120.000000 19.050000 21.000000 23.150000 5.944800 3.385650
50% 273.000000 25.200000 27.300000 29.400000 7.786000 4.248500
75% 650.000000 32.700000 35.500000 39.650000 12.365900 5.584500
max 1650.000000 59.000000 63.400000 68.000000 18.957000 8.142000
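The summary above already hints at a problem: the minimum Weight is 0, which is impossible for a real fish. A quick filter surfaces such rows; the sketch below uses a tiny stand-in frame with made-up values, but on the real data you would run the same filter on the DataFrame read from Fish.csv.

```python
import pandas as pd

# Hypothetical stand-in frame mirroring the dataset's Species/Weight columns
df = pd.DataFrame({'Species': ['Bream', 'Roach', 'Roach'],
                   'Weight': [242.0, 0.0, 290.0]})

# Rows with an impossible weight of zero deserve a closer look before modelling
zero_weight = df[df['Weight'] == 0]
```

We will meet this zero-weight fish again later, when Tukey's Fences flag it among the Roach species.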

Let's now plot each numerical feature against every other; to get a clear distinction, use a high-contrast palette (Viridis) and color by species.

In [5]:
sns.pairplot(df, hue='Species', palette='viridis')
Out[5]:
<seaborn.axisgrid.PairGrid at 0x1619e2375c8>

Do you notice something strange? Is there some observation that catches your eye?

Outliers detection using matplotlib/seaborn

Boxplots offer immediate visual feedback to isolate outliers. Outliers are plotted as little diamonds. Can you spot something in the following plots?

In [6]:
fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(15, 18), sharey=False)
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[0][0])
sns.boxplot(data=df, y='Width', x='Species', ax=axes[0][1])
sns.boxplot(data=df, y='Height', x='Species', ax=axes[1][0])
sns.boxplot(data=df, y='Length1', x='Species', ax=axes[1][1])
sns.boxplot(data=df, y='Length2', x='Species', ax=axes[2][0])
sns.boxplot(data=df, y='Length3', x='Species', ax=axes[2][1])
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1619fa0d248>

Outliers detection using Tukey's Fences

$$ [Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1)] $$

Tukey's Fences define a convenient interval for inliers (non-outliers). Everything outside the two extremes should be considered an outlier. $Q_1$ is the first quartile, $Q_3$ is the third quartile, and $Q_3-Q_1$ is the Inter-Quartile Range (IQR). $k$ is an arbitrary constant: Tukey proposed 1.5 for outliers, and 3 for "far" outliers. This is the same method matplotlib and seaborn use, with $k=1.5$, when drawing boxplots.
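To make the arithmetic concrete, here is the computation on a small toy sample of weights (hypothetical values, not taken from the dataset), with $k = 1.5$:

```python
import numpy as np

# Hypothetical toy sample of fish weights
weights = np.array([120.0, 150.0, 200.0, 273.0, 300.0, 650.0, 1650.0])

q1, q3 = np.percentile(weights, [25, 75])        # first and third quartiles
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # Tukey's Fences
inliers = weights[(weights >= lower) & (weights <= upper)]
```

Here $Q_1 = 175$, $Q_3 = 475$, so the fences are $[-275, 925]$ and the extreme weight 1650 falls outside them.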

Let's define the following function to achieve the desired behavior.

In [7]:
def outliers(df, index, k=1.5):
    """Return the rows of df whose value in column `index` falls outside Tukey's Fences."""
    q1 = df[index].quantile(0.25)
    q3 = df[index].quantile(0.75)
    iqr = q3 - q1
    # Keep only rows below the lower fence or above the upper fence
    return df[(df[index] < q1 - k*iqr) | (df[index] > q3 + k*iqr)]

Let's now take a look at Roach weights.

In [8]:
sns.boxplot(data=df[df['Species'] == 'Roach'], y='Weight', width=0.2)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1619fe85188>
In [9]:
outliers(df[df['Species'] == 'Roach'], 'Weight')
Out[9]:
Species Weight Length1 Length2 Length3 Height Width
40 Roach 0.0 19.0 20.5 22.8 6.4752 3.3516
52 Roach 290.0 24.0 26.0 29.2 8.8768 4.4968
54 Roach 390.0 29.5 31.7 35.0 9.4850 5.3550

As you can see, there are three outliers among the Roach species when you consider their weight. The first one doesn't even have a weight, so it's an obvious outlier. But what happens if we take a look at the weight across all the species?

In [10]:
outliers(df, 'Weight')
Out[10]:
Species Weight Length1 Length2 Length3 Height Width
142 Pike 1600.0 56.0 60.0 64.0 9.600 6.144
143 Pike 1550.0 56.0 60.0 64.0 9.600 6.144
144 Pike 1650.0 59.0 63.4 68.0 10.812 7.480

The three Roach outliers are no longer detected: considering all the species, they now fall within the two Tukey's Fences. So are they actually outliers? The answer to this question isn't easy. The fact is that you must always look at the data; if there were some kind of magic algorithm that could delete all outliers without failing, it would be at the very base of every ML pipeline, right? You must not overdo it when discarding observations: your goal should not be to eliminate the diamonds from each and every boxplot you can think of.

Let's take a look at the other dimensions and their outliers.

In [11]:
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), sharey=False)
sns.boxplot(data=df.drop('Weight', axis=1), ax=axes[0])
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[1])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x161a0426e48>

Let's take a look once again at the Roach species, this time at their Length2 feature:

In [12]:
outliers(df[df['Species'] == 'Roach'], 'Length2')
Out[12]:
Species Weight Length1 Length2 Length3 Height Width
35 Roach 40.0 12.9 14.1 16.2 4.1472 2.268
54 Roach 390.0 29.5 31.7 35.0 9.4850 5.355

As you can see, the fish that had no weight associated is not here; however, the fish with a weight of 390 is. That is a further indication that the observation is indeed an outlier.
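This idea of cross-checking a candidate across several features can be automated. The helper below is a hypothetical sketch (not part of the notebook's own functions): it counts, for each row, how many numeric columns flag it under Tukey's Fences, so that rows flagged by several features stand out as stronger outlier candidates.

```python
import pandas as pd

def outlier_votes(df, columns, k=1.5):
    """Count, per row, how many of `columns` flag it under Tukey's Fences."""
    votes = pd.Series(0, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        votes += ((df[col] < q1 - k*iqr) | (df[col] > q3 + k*iqr)).astype(int)
    return votes

# Toy frame with made-up values: the last row is extreme in both columns
toy = pd.DataFrame({'Weight': [1.0, 2.0, 3.0, 4.0, 100.0],
                    'Length2': [1.0, 2.0, 3.0, 4.0, 200.0]})
votes = outlier_votes(toy, ['Weight', 'Length2'])
```

On the real data you could run `outlier_votes(df[df['Species'] == 'Roach'], ['Weight', 'Length2'])` to see which Roach rows are flagged by both features.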

Removing outliers

You can use the following function if you want to remove outliers based on a whole feature. It returns the entire DataFrame without the outliers.

In [13]:
def drop_outliers(df, index, k=1.5):
    return df.drop(outliers(df, index, k).index)
In [14]:
df = drop_outliers(df, 'Weight')

If you want to remove outliers based on both a class and a feature (such as the Weight of the Roach species) you will have to do something a bit more complicated:

In [15]:
df = df.drop(outliers(df[df['Species'] == 'Roach'], 'Weight').index)
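If you wanted to apply the same per-class cleaning to every species at once, a `groupby`-based variant would do it. The function below is a hypothetical generalisation, sketched here rather than taken from the notebook:

```python
import pandas as pd

def drop_outliers_per_class(df, cls, index, k=1.5):
    """Drop `index` outliers within every class of column `cls`, via Tukey's Fences."""
    def inside_fences(s):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return s.between(q1 - k*iqr, q3 + k*iqr)
    # transform() returns a boolean mask aligned with the original index
    keep = df.groupby(cls)[index].transform(inside_fences)
    return df[keep]

# Toy frame with made-up values: species 'A' has one extreme weight
toy = pd.DataFrame({'Species': ['A'] * 5 + ['B'] * 5,
                    'Weight': [10.0, 11.0, 12.0, 13.0, 100.0,
                               1.0, 2.0, 3.0, 4.0, 5.0]})
cleaned = drop_outliers_per_class(toy, 'Species', 'Weight')
```

With the real data, `drop_outliers_per_class(df, 'Species', 'Weight')` would replace the species-by-species calls above.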
In [16]:
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), sharey=False)
sns.boxplot(data=df.drop('Weight', axis=1), ax=axes[0])
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[1])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x161a05a0b08>

Conclusion

As you can see, removing outliers in this way has created new outliers in the Pike species when it comes to weight. Outliers can be harmful to Machine Learning models, but removing them without care may reduce your dataset considerably.
