Machine Learning 101: Logistic Regression in Python

by mark · Published 15 April 2020 · Updated 31 May 2020

Logistic Regression is one of the most basic ML algorithms. While it is a basic algorithm it isn’t always included in ML courses because of its weaknesses.

Classification vs Regression

Machine Learning essentially deals with two kinds of problems:

Classification: predicting a class, for example whether a user is male or female (the two classes) given their history of purchased items.
Regression: predicting a value, for example the price (the value) of a used car given the model, the age, the kilometers on the odometer.

It is important to remember that Machine Learning is no magic, ML algorithms are still algorithms: multiple inputs, one output. The most important difference between a traditional algorithm and an ML one is the “experience” the ML algorithm gains during the training phase.

In Classification problems the algorithm tries to predict the class the entry will fall into, it may be two classes (such as the example above, male versus female) or more than two classes. The former is often called Binary Classification the latter is referred to as Multiclass Classification.

In Regression there is no class to predict, instead there is a scale and the algorithm tries to predict the value on that scale. In the example above the price is the sought value.

Logistic Regression in Python

Explaining what Logistic Regression is without delving too much into mathematics is actually quite difficult. Know that this technique is pretty similar to Linear Regression when it comes to method, but the underlying function is not a line, it instead is the logistic function (often called sigmoid function):

$$f(x) = \frac{L}{1 + e^{-k(x-x_0)}}$$

Logistic regression using fish (classification problem)

The following notebook uses the Fish market dataset available here, it is free and released under the GPL2 license. This dataset includes a number of species of fish and for each fish some measurements such as weight and height.

(Basic) Explanatory Data Analysis

Every good ML algorithm should start with an in-depth Explanatory Data Analysis (EDA). In the EDA you should always try to explore the data as much as possible, through exploration it is possible to infer basic features of the data, from those basic inferences you can start developing a basic intuition. From there you can start formulating hypotheses and implement the algorithm you see fit.

As the purpose of this notebook is to illustrate Logistic Regression applied to a Classification Problem, the performed EDA will outline just the basic features of the dataset.

Firstly, let's import basic utilities:

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(101) # This is needed so that if you run this notebook again you will get the same results

Let's now read the csv file (keep in mind it should be in the same folder as the notebook!):

In [2]:

df = pd.read_csv('Fish.csv')

Let's take a look at the first rows of the dataset. It is important to do this in order to get a basic understanding.

In [3]:

df.head()

Out[3]:

	Species	Weight	Length1	Length2	Length3	Height	Width
0	Bream	242.0	23.2	25.4	30.0	11.5200	4.0200
1	Bream	290.0	24.0	26.3	31.2	12.4800	4.3056
2	Bream	340.0	23.9	26.5	31.1	12.3778	4.6961
3	Bream	363.0	26.3	29.0	33.5	12.7300	4.4555
4	Bream	430.0	26.5	29.0	34.0	12.4440	5.1340

Now let's take a closer look to the dataset to get important statistical indicators such as the mean and standard deviation

In [4]:

df.describe()

Out[4]:

	Weight	Length1	Length2	Length3	Height	Width
count	159.000000	159.000000	159.000000	159.000000	159.000000	159.000000
mean	398.326415	26.247170	28.415723	31.227044	8.970994	4.417486
std	357.978317	9.996441	10.716328	11.610246	4.286208	1.685804
min	0.000000	7.500000	8.400000	8.800000	1.728400	1.047600
25%	120.000000	19.050000	21.000000	23.150000	5.944800	3.385650
50%	273.000000	25.200000	27.300000	29.400000	7.786000	4.248500
75%	650.000000	32.700000	35.500000	39.650000	12.365900	5.584500
max	1650.000000	59.000000	63.400000	68.000000	18.957000	8.142000

Let's now plot each numerical feature against each other, in order to get a clear distinction use a palette with high contrast (Viridis) and use the species as color.

In [5]:

sns.pairplot(df, hue='Species', palette='viridis')

Out[5]:

<seaborn.axisgrid.PairGrid at 0x1e5dff1c748>

As you can see there is quite a strong pattern between: Length1 and Length2, Length2 and Length3, Length3 and Length1.

In the next step we will try to predict whether the fish is a Perch or a Bream, so let's get a better look at each feature for each species.

In [6]:

fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(20, 10), sharey=False)
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[0][0])
sns.boxplot(data=df, y='Width', x='Species', ax=axes[0][1])
sns.boxplot(data=df, y='Height', x='Species', ax=axes[0][2])
sns.boxplot(data=df, y='Length1', x='Species', ax=axes[1][0])
sns.boxplot(data=df, y='Length2', x='Species', ax=axes[1][1])
sns.boxplot(data=df, y='Length3', x='Species', ax=axes[1][2])

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x1e5e1870c88>

As you can see Length1, Length2 and Length3 follow essentially the same pattern across each species (hence they contain almost the same information). On the other hand each species has a set of Width, Height and Weight.

As the dataset is quite small it may be worth knowing how many observations there are for each fish species:

In [7]:

sns.countplot(data=df, x='Species')

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x1e5e1cbea88>

Logistic regression (binary classification): Let's predict whether the fish is a Perch or a Bream

In this step a Logistic Regression model to predict the fish species will be created. In order to make it simpler for the model to make predictions we will restrict the domain of species to Perch and Bream species. The model will be able to tell a Perch apart from a Bream, but any other fish it will have no clue.

The following lines allow us to create a X variable containing all the features except "Species", and a y variable containing just the species. Both variables will only include observations about Perch and Bream species.

In [8]:

X = df[(df['Species'] == 'Perch') | (df['Species'] == 'Bream')].drop('Species', axis=1)
y = df[(df['Species'] == 'Perch') | (df['Species'] == 'Bream')]['Species'].replace(['Bream', 'Perch'], [0, 1])

Split the dataset in two parts: train and test. This is needed to calculate the accuracy (and many other metrics) of the model. We will use the train part during the training, and the test part during the evaluation. The model will not see the test part during its training.

In [9]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)

Without delving too much into mathematics, using Logistic Regression involves using a method to solve the equations. In order to make said method converge (find the solution) faster, data is usually standardized. You can standardize any dataset by doing:

$$\mathcal{z}=\frac{x-\mu}{\sigma}$$

Where $\mu$ is the mean average and $\sigma$ is the standard deviation of each particular column. You can work the math for each column and assign them using pandas features, or you can leverage sklearn StandardScaler. The StandardScaler will first need to be fitted (calculate $\mu$ and $\sigma$). Then the scaler will be used to scale X_train and X_test. You don't need to scale the y variable (it also only has boolean values in it (0, 1). Another thing you don't want to do is fit the scaler on train and test data, by doing this you will throw off the parameters and "leak" information (you will inherently tell bits of how the X_test looks like). Once the scaler has been fitted to the data, the data can be transformed using the formula above.

In [10]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Import the model and instantiate it:

In [11]:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

Now let's train the model:

In [12]:

lr.fit(X_train, y_train)

Out[12]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In order to review our model we will use two handy functions: the classification report (which sums up useful statistics, and the confusion matrix.

In [13]:

from sklearn.metrics import classification_report, confusion_matrix

In [14]:

print(classification_report(y_test, lr.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        17

    accuracy                           1.00        31
   macro avg       1.00      1.00      1.00        31
weighted avg       1.00      1.00      1.00        31

While discussing about the matrics deserves a whole article by itself, it is easy to observe the model is a perfect fit, achieving 100% precision and recall. The "support" column tells you how many observation support the class (0, 1). Let's now take a look at the confusion matrix:

In [15]:

print(confusion_matrix(y_test, lr.predict(X_test)))

[[14  0]
 [ 0 17]]

While the classification report allows for a quick review of the model performance, the confusion matrix takes a bit more to be understood.

The upper left number represents true positives (they are 1, classified as 1)
The upper right number represents false positives (they are 0, classified as 1)
The lower left number represents false negatives (they are 1, classified as 0)
The lower right number represents true negatives (they are 0, classified as 0)

Conclusion: It's prefect!

Although the model can only predict whether the fish is a Perch or a Bream, it seems to be working perfectly, but is it really? First of all as you can saw during the EDA, Breams and Perches are essentially different fishes. You should also keep in mind the input of the model (weight, length1, length2, length3, width and height) is essentially shared among ALL fishes in the world. Imagine you find a fish whose features are essentially the same of a Bream but it is not a Bream. Well, the model would classify it as a Bream, making an error. Since the model trained on Bream and Perch species only (albeit with a very small dataset), you should be sure you're trying to classify one of the two.

Logistic Regression (multiclass problem): Let's predict the species

This problem closely resembles the last one, yet we're now trying to predict the species. There are seven species in this dataset and there aren't many observations for some of them.

In [16]:

X = df.drop(['Species'], axis=1)
y = df['Species']

In [17]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)

In [18]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [19]:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [20]:

lr.fit(X_train, y_train)

Out[20]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Let's now evaluate the model:

In [21]:

from sklearn.metrics import classification_report

In [22]:

print(classification_report(y_test, lr.predict(X_test), zero_division=0))

              precision    recall  f1-score   support

       Bream       0.90      1.00      0.95         9
      Parkki       1.00      0.67      0.80         6
       Perch       0.71      1.00      0.83        20
        Pike       1.00      1.00      1.00         5
       Roach       0.00      0.00      0.00         7
       Smelt       1.00      1.00      1.00         5
   Whitefish       0.00      0.00      0.00         1

    accuracy                           0.81        53
   macro avg       0.66      0.67      0.65        53
weighted avg       0.72      0.81      0.75        53

Where's the confusion matrix? If you print the confusion matrix for a multiclass problem you will see it is actually quite difficult to interpret its results, stick to the classification report for easier-to-understand values. For the curious ones:

In [23]:

print(confusion_matrix(y_test, lr.predict(X_test)))

[[ 9  0  0  0  0  0  0]
 [ 1  4  0  0  1  0  0]
 [ 0  0 20  0  0  0  0]
 [ 0  0  0  5  0  0  0]
 [ 0  0  7  0  0  0  0]
 [ 0  0  0  0  0  5  0]
 [ 0  0  1  0  0  0  0]]

Conclusion: Not so accurate now?

As the previous model achieved 100% accuracy and recall, the second one achieved an overall accuracy of 0.81. As you can see as the number of species grow and the features overlap, the model struggles to distinguish between fish species with some species such as Roach and Whitefish being mispredicted all of the times. Also Perches and Breams that were previously classified perfectly are now not as certain as before. The size of the dataset, the presence of outliers, the weaknesses of the model all show up together in the last report. Know that Logistic Regression may not be the best algorithm to use in this case.

Conclusion

You now know what Logistic Regression is about, how to use it and hopefully when you shouldn’t use it (mutliclass problems).

Image courtesy of mark | marksei

Author
Recent Posts

mark

The IT guy with a slight look of boredom in his eyes. Freelancer. Current interests: Kubernetes, Tensorflow, shiny new things.

Cookie	Duration	Description
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_gtag_UA_60468161_1	past	Set by Google to distinguish users.
_ga_DR9SCJ09BV	2 years	This cookie is installed by Google Analytics.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
__gads	1 year 24 days	The __gads cookie, set by Google, is stored under DoubleClick domain and tracks the number of times users see an advert, measures the success of the campaign and calculates its revenue. This cookie can only be read from the domain they are set on and will not track any data while browsing through other sites.

Cookie	Duration	Description
edgebucket	session	Reddit sets this cookie to save the information about a log-on Reddit user, for the purpose of advertisement recommendations and updating the content.
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
test_cookie	14 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
csv	2 years	No description available.
GoogleAdServingTest	session	No description
wp_api	past	No description
wp_api_sec	past	No description
_pk_id.1.95fa	1 year 27 days	No description
_pk_ses.1.95fa	29 minutes	No description
__smSessionId	9 hours	No description available.
__smToken	1 year	This cookie is set by the Sumo. This cookie is used for verifying whether the user is logged in or not.

Machine Learning 101: Logistic Regression in Python

Classification vs Regression

Logistic Regression in Python

Logistic regression using fish (classification problem)

(Basic) Explanatory Data Analysis

Logistic regression (binary classification): Let's predict whether the fish is a Perch or a Bream

Conclusion: It's prefect!

Logistic Regression (multiclass problem): Let's predict the species

Conclusion: Not so accurate now?

Conclusion

You may also like...

Leave a ReplyCancel reply

Recent Posts

Recent Comments

Categories

Latest tutorials

Machine Learning 101: Logistic Regression in Python

Classification vs Regression

Logistic Regression in Python

Logistic regression using fish (classification problem)

(Basic) Explanatory Data Analysis

Logistic regression (binary classification): Let's predict whether the fish is a Perch or a Bream

Conclusion: It's prefect!

Logistic Regression (multiclass problem): Let's predict the species

Conclusion: Not so accurate now?

Conclusion

Related posts:

You may also like...

Machine Learning 101: Outliers introduction

Machine Learning 101: Linear Regression in Python

Machine Learning 101: K-Nearest Neighbors in Python (Classification)

Leave a ReplyCancel reply

Recent Posts

Recent Comments

Categories

Latest tutorials