Data Science via VS Code. Part 4: Performing Logistic Regression on Target Data

<-- Part 3 is this way

Whew! Data is in, data is cleaned, the virtual environment is up, and we have executed more Python commands with a working tour of Data Wrangler.

In the last post, we were working toward a logistic regression on the classic Titanic dataset. We decided to explore Survival as the dependent variable (DV), as impacted by the independent variables (IVs) of Gender and Age, written as:

Probability (Pr) of Survival as a function of Gender and Age.

[Image - caption: "I'd throw it all in too if I had a calculator like that"]

I did mention a cringe factor in the simplicity of the statement above. So here is the actual breakdown:

[Image: the logistic regression model equation]
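For reference, here is the model written out in standard form (a reconstruction of the simple logistic regression equation, with our two variables substituted in):

$$\ln\left(\frac{\Pr(\text{Survived})}{1 - \Pr(\text{Survived})}\right) = \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{SexCode}$$

Exponentiating a coefficient, $e^{\beta_i}$, gives the odds ratio for that variable - exactly what we will calculate in the code below.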

All credit to Mark Bounthavong, who created both the simple logistic regression model (the type we are using in this example) and the multivariable model in R. The materials are all available on the page and the associated GitHub, and are well worth a read:

Logistic regression in R — Mark Bounthavong (mbounthavong.com)

The Goal

Professors appeased, so far we have sourced, loaded, transformed, and saved a dataframe. The frame has been cleaned, with the missing values removed. The entire workflow is captured in our Python notebook, running in a virtual environment.

From here, I would like to add the scripting required to complete the logistic regression in the Python notebook, before assessing the findings and documenting them in a Markdown cell in the same notebook.

This will then provide us with an end-to-end working example of a logistic regression with the historic dataset.

The Plan

To achieve our goal, here is a step-by-step plan.

In the terminal:

  1. Install the libraries.

In the Python notebook:

  1. Import the necessary libraries.

  2. Prepare the data: ensure dataframe_clean has the necessary columns (Survived, Age, SexCode).

  3. Split the data into features (X) and target (y).

  4. Train-test split to evaluate the model.

  5. Standardize the features if necessary.

  6. Fit the logistic regression model.

  7. Extract coefficients and calculate odds ratios.

  8. Evaluate the model using accuracy and other metrics.

  9. Document the results in Markdown.

Quick re-baseline steps:

  1. Reload VS Code.

  2. Reopen the folder (if needed - see Part 1 and Part 2 of the series).

  3. Load up the virtual environment (if needed - type .venv\Scripts\activate in the terminal).
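For example, in the VS Code terminal (a sketch assuming the Windows setup from the earlier parts; the project path below is a placeholder - use your own folder):

cd C:\path\to\your\project
.venv\Scripts\activate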

Run the Regression Plan

After a review of the scikit-learn documentation, I know I will need scikit-learn and NumPy.

Install them by using the following code in the terminal:

pip install scikit-learn numpy

Warning - this will pull the required dependencies down with it. For reference, here is scikit-learn's full dependency matrix (including build, docs, and test dependencies):

| Dependency | Minimum Version | Purpose |
| --- | --- | --- |
| numpy | 1.19.5 | build, install |
| scipy | 1.6.0 | build, install |
| joblib | 1.2.0 | install |
| threadpoolctl | 3.1.0 | install |
| cython | 3.0.10 | build |
| meson-python | 0.16.0 | build |
| matplotlib | 3.3.4 | benchmark, docs, examples, tests |
| scikit-image | 0.17.2 | docs, examples, tests |
| pandas | 1.1.5 | benchmark, docs, examples, tests |
| seaborn | 0.9.0 | docs, examples |
| memory_profiler | 0.57.0 | benchmark, docs |
| pytest | 7.1.2 | tests |
| pytest-cov | 2.9.0 | tests |
| ruff | 0.2.1 | tests |
| black | 24.3.0 | tests |
| mypy | 1.9 | tests |
| pyamg | 4.0.0 | tests |
| polars | 0.20.23 | docs, tests |
| pyarrow | 12.0.0 | tests |
| sphinx | 7.3.7 | docs |
| sphinx-copybutton | 0.5.2 | docs |
| sphinx-gallery | 0.16.0 | docs |
| numpydoc | 1.2.0 | docs, tests |
| Pillow | 7.1.2 | docs |
| pooch | 1.6.0 | docs, examples, tests |
| sphinx-prompt | 1.4.0 | docs |
| sphinxext-opengraph | 0.9.1 | docs |
| plotly | 5.14.0 | docs, examples |
| sphinxcontrib-sass | 0.3.4 | docs |
| sphinx-remove-toctrees | 1.0.0.post1 | docs |
| sphinx-design | 0.5.0 | docs |
| pydata-sphinx-theme | 0.15.3 | docs |
| conda-lock | 2.5.6 | maintenance |

This will bump the local folder size from ~250 MB to a new total of ~450 MB.

# Code Implementation

# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Prepare the data
# Assuming dataframe_clean is already defined and cleaned
# Ensure the dataframe has the necessary columns
assert 'Survived' in dataframe_clean.columns
assert 'Age' in dataframe_clean.columns
assert 'SexCode' in dataframe_clean.columns

# Step 3: Split the data into features (X) and target (y)
X = dataframe_clean[['Age', 'SexCode']]
y = dataframe_clean['Survived']

# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Standardize the features (optional but recommended)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 7: Extract coefficients and calculate odds ratios
coefficients = model.coef_[0]
odds_ratios = pd.Series(np.exp(coefficients))

# Step 8: Evaluate the model
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Print results
print(f"Coefficients: {coefficients}")
print(f"Odds Ratios: {odds_ratios}")
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")

Run the code - it should appear as the following, with the results displayed inline below the code block:

[Screenshot: notebook output showing the coefficients, odds ratios, accuracy, and classification report]

Interpret the results of the regression

Note the results at the bottom of the screenshot above - the coefficients are returned in the order of entry in step 3: Age, then SexCode. The odds ratios indexed 0 and 1 correspond to the Age and SexCode variables respectively.
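The odds ratios are simply the exponentials of the coefficients, so you can verify the reported figures by hand (coefficient values taken from the output above):

import numpy as np

print(np.exp(-0.15802868))  # Age odds ratio: ~0.853825
print(np.exp(1.19358149))   # SexCode odds ratio: ~3.298875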

The overall metrics for the model are presented in the bottom three rows of the output, with a model accuracy of 0.78: the model correctly predicts survival about 77.6% of the time (the top accuracy measure).
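As a further sanity check, you can ask the fitted model for survival probabilities for hypothetical passengers. A minimal sketch (the passenger values are made up, and the scaler fitted in step 5 must be reused):

# Hypothetical passengers: a 30-year-old male (SexCode=0) and a 30-year-old female (SexCode=1)
sample = pd.DataFrame({'Age': [30, 30], 'SexCode': [0, 1]})
sample_scaled = scaler.transform(sample)         # reuse the scaler fitted on the training data
print(model.predict_proba(sample_scaled)[:, 1])  # probability of class 1 (Survived)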

Document the results of the regression

If you have previously worked with a logistic regression output, the interpretation of the above results should be fairly straightforward. If you haven't, I have documented the findings below in a Markdown format that we can add to the notebook.

Add a Markdown block to the Python notebook

You can add our interpretation of the report using basic Markdown, an open text formatting standard.

[Screenshot: adding a Markdown cell in the notebook]

Add a Markdown box below the results by hovering your mouse over the bottom centre of the notebook (under the results output) and selecting + Markdown when it appears.

Add the results to the Markdown block

Copy the code below for a formatted results section.

## Interpretation of the Logistic Regression Results
### Coefficients

Age Coefficient: -0.15802868

This negative coefficient indicates that as age increases, the log-odds of survival decrease. In other words, older passengers are less likely to survive.

SexCode Coefficient: 1.19358149

This positive coefficient indicates that being coded as 1 (female) increases the log-odds of survival.

### Odds Ratios

Note: because the features were standardized in step 5, these ratios reflect a one-standard-deviation change in each variable rather than a raw one-unit change.

Age Odds Ratio: 0.853825

An odds ratio less than 1 (0.85) means that for each one-standard-deviation increase in age, the odds of survival decrease by approximately 15%.

SexCode Odds Ratio: 3.298875

An odds ratio greater than 1 (3.30) means that a higher SexCode (i.e. a female passenger) increases the odds of survival by approximately 230%.

### Model Accuracy
Accuracy: 0.7763157894736842

The model correctly predicts survival about 77.6% of the time.

### Classification Report

#### Precision, Recall, and F1-Score for Class 0 (Not Survived):
- Precision: 0.79 - When the model predicts not survived, it is correct 79% of the time.
- Recall: 0.84 - The model correctly identifies 84% of the actual not survived cases.
- F1-Score: 0.82 - The harmonic mean of precision and recall, indicating a good balance between the two.
#### Precision, Recall, and F1-Score for Class 1 (Survived):
- Precision: 0.75 - When the model predicts survived, it is correct 75% of the time.
- Recall: 0.68 - The model correctly identifies 68% of the actual survived cases.
- F1-Score: 0.72 - The harmonic mean of precision and recall, indicating a reasonable balance between the two.
#### Overall Metrics:
- Accuracy: 0.78 - The overall accuracy of the model.
- Macro Avg: Averages of precision, recall, and F1-score across both classes.
- Weighted Avg: Averages of precision, recall, and F1-score, weighted by the number of instances in each class.

### Summary
The logistic regression model shows that age negatively impacts the likelihood of survival, while the sex code positively impacts it. The model has a reasonable accuracy of 77.6%, with good precision and recall for both classes. The odds ratios provide a clear interpretation of how each feature affects the odds of survival.

When you have added the text, select Stop Editing Cell (the tick button, top right) or press the Escape key.

[Screenshot: the Markdown cell in edit mode]

The Markdown cell will then render the formatted text.

[Screenshot: the rendered Markdown cell]

Thanks for playing!

This was quite the process - I hope you found it as interesting as I did! We aimed to provide a baseline understanding and hands-on experience, in a hands-off blog-post way!

For ease of access going forward, and in the event you have an issue with your code, I've included a full extract of my final code in the section below.

Thanks again !

-Sam

Bonus round! 

Identify and document the packages in your Python virtual environment for reproducibility.

To ensure the results can be replicated, we need to capture the installed packages. We started with pandas, then installed scikit-learn and NumPy with their dependencies - where did we end up?

In the terminal type:

pip freeze > requirements.txt

A text file named 'requirements.txt' will be created in the terminal's current directory.

[Screenshot: requirements.txt in the Explorer pane]

Double click the file to open it. Wow! We have 36 packages installed!

asttokens==2.4.1
colorama==0.4.6
comm==0.2.2
debugpy==1.8.2
decorator==5.1.1
executing==2.0.1
ipykernel==6.29.5
ipython==8.26.0
jedi==0.19.1
joblib==1.4.2
jupyter_client==8.6.2
jupyter_core==5.7.2
matplotlib-inline==0.1.7
nest-asyncio==1.6.0
numpy==2.0.1
packaging==24.1
pandas==2.2.2
parso==0.8.4
platformdirs==4.2.2
prompt_toolkit==3.0.47
psutil==6.0.0
pure_eval==0.2.3
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
pywin32==306
pyzmq==26.0.3
scikit-learn==1.5.1
scipy==1.14.0
six==1.16.0
stack-data==0.6.3
threadpoolctl==3.5.0
tornado==6.4.1
traitlets==5.14.3
tzdata==2024.1
wcwidth==0.2.13

This is a really important file. Keeping it in the folder will allow you to replicate the tests in future, ensuring the environmental dependencies are met. You can share the file with others so that they can replicate your configuration and assist with troubleshooting, or use it as a cornerstone for standardised environments in Azure Machine Learning if you want to productionise your approach.
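To recreate the environment elsewhere, create and activate a fresh virtual environment and install from the file (a minimal sketch, assuming the same Windows setup used throughout this series):

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt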

 
