
Data Science via VS Code. Part 4: Performing Logistic Regression on Target Data

Samuel Parsons : Aug 19, 2024 1:35:23 PM


Whew! Data is in, data is cleaned, virtual environment is up, and we have executed more python commands with a working tour of Data Wrangler.

In the last post we were working toward a logistic regression over the classic titanic dataset. We decided to explore Survival as the dependent variable (DV), as impacted by the independent variables (IVs) of Gender and Age, written as:

Probability (Pr) of Survival as a function of Gender and Age.

Designer (8)

I’d throw it all in too if I had a calculator like that

I did mention a cringe factor for the script simplicity above. So here is the actual breakdown:

image-20240808-072942

All credit to Mark Bounthavong, who created both the simple logistic regression model (the type we are using in this example) and the multivariable model in R. The materials are all available on the page or the associated GitHub and are well worth a read:

Logistic regression in R — Mark Bounthavong (mbounthavong.com)
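
For reference, a logistic regression with these two predictors is conventionally written as below. This is the textbook form using this post's variable names; the β symbols are introduced here for illustration and are not copied from the credited material.

```latex
\begin{aligned}
p &= \Pr(\text{Survived} = 1 \mid \text{Age}, \text{SexCode}) \\
\log\!\left(\frac{p}{1 - p}\right) &= \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{SexCode} \\
p &= \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{SexCode})}} \\
\text{OR}_k &= e^{\beta_k}
\end{aligned}
```

The second line is the log-odds that the model fits, the third line turns it back into a probability, and the last line is the odds ratio for predictor k that appears in the results later in this post.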

The Goal

With the professors appeased, here is where we stand: we have sourced, loaded, transformed and saved a dataframe. The frame has been cleaned, with the missing values removed. The entire code is captured in our Python notebook, running in a virtual environment.

From here, I would like to add the scripting required to complete the logistic regression in the Python notebook, before assessing the findings and documenting them in a Markdown cell in the same notebook.

This will then provide us with an end-to-end working example of a logistic regression with the historic dataset.

The Plan

To achieve our goal, here is a step-by-step plan:

In the terminal:

Install the libraries.

In the python notebook:

Import necessary libraries.
Prepare the data: Ensure dataframe_clean has the necessary columns (Survived, Age, SexCode).
Split the data into features (X) and target (y).
Train-test split to evaluate the model.
Standardize the features if necessary.
Fit the logistic regression model.
Extract coefficients and calculate odds ratios.
Evaluate the model using accuracy and other metrics.
Document the results in Markdown.

Quick re-baseline steps:

reload VSCode
reopen the Folder (if needed - see part 1 and part 2 of the series)
load up the virtual environment (if needed - type .venv\scripts\activate - in the terminal)
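
If you are unsure whether the virtual environment is actually active, a quick check from a notebook cell can confirm it. This is a minimal sketch; the helper name `in_virtual_environment` is made up for this example.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtual_environment()}")
```

If the interpreter path does not sit inside the .venv folder, reselect the kernel in VS Code or reactivate the environment before installing anything.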

Run the Regression Plan

After a review of the SciKit Learn documentation, I know I will need SciKit Learn and numpy.

Install them by using the following code in the terminal:

pip install scikit-learn numpy

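Warning - this will pull down scikit-learn's install dependencies as well. The table lists the minimum versions scikit-learn declares for each purpose; pip only fetches the rows whose purpose includes "install" (numpy, scipy, joblib and threadpoolctl), while the remaining rows are needed only to build, test or document scikit-learn itself: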

Dependency	Minimum Version	Purpose
numpy	1.19.5	build, install
scipy	1.6.0	build, install
joblib	1.2.0	install
threadpoolctl	3.1.0	install
cython	3.0.10	build
meson-python	0.16.0	build
matplotlib	3.3.4	benchmark, docs, examples, tests
scikit-image	0.17.2	docs, examples, tests
pandas	1.1.5	benchmark, docs, examples, tests
seaborn	0.9.0	docs, examples
memory_profiler	0.57.0	benchmark, docs
pytest	7.1.2	tests
pytest-cov	2.9.0	tests
ruff	0.2.1	tests
black	24.3.0	tests
mypy	1.9	tests
pyamg	4.0.0	tests
polars	0.20.23	docs, tests
pyarrow	12.0.0	tests
sphinx	7.3.7	docs
sphinx-copybutton	0.5.2	docs
sphinx-gallery	0.16.0	docs
numpydoc	1.2.0	docs, tests
Pillow	7.1.2	docs
pooch	1.6.0	docs, examples, tests
sphinx-prompt	1.4.0	docs
sphinxext-opengraph	0.9.1	docs
plotly	5.14.0	docs, examples
sphinxcontrib-sass	0.3.4	docs
sphinx-remove-toctrees	1.0.0.post1	docs
sphinx-design	0.5.0	docs
pydata-sphinx-theme	0.15.3	docs
conda-lock	2.5.6	maintenance

This will bump the local folder size from ~250 MB to a new total of ~450 MB.
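
To see the footprint on your own machine, the folder size can be measured from Python. This is a rough sketch: the `.venv` path is assumed from the activate command used earlier, and `folder_size_mb` is a helper name invented for this example.

```python
from pathlib import Path


def folder_size_mb(folder: Path) -> float:
    """Sum the sizes of all regular files under `folder`, in megabytes."""
    total_bytes = sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())
    return total_bytes / (1024 * 1024)


venv_path = Path(".venv")  # assumed location of the virtual environment
if venv_path.exists():
    print(f"{venv_path} uses about {folder_size_mb(venv_path):.0f} MB")
else:
    print(f"No {venv_path} folder found in {Path.cwd()}")
```

Running it before and after the install shows how much the new packages added; your numbers will differ with platform and package versions.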

# Code Implementation

# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Prepare the data
# Assuming dataframe_clean is already defined and cleaned
# Ensure the dataframe has the necessary columns
assert 'Survived' in dataframe_clean.columns
assert 'Age' in dataframe_clean.columns
assert 'SexCode' in dataframe_clean.columns

# Step 3: Split the data into features (X) and target (y)
X = dataframe_clean[['Age', 'SexCode']]
y = dataframe_clean['Survived']

# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Standardize the features (optional but recommended)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 7: Extract coefficients and calculate odds ratios
coefficients = model.coef_[0]
odds_ratios = pd.Series(coefficients).apply(lambda x: np.exp(x))

# Step 8: Evaluate the model
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Print results
print(f"Coefficients: {coefficients}")
print(f"Odds Ratios: {odds_ratios}")
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")

Run the code - it should appear as the following - with the results displayed inline below the code block:

Interpret the results of the regression

Note the results at the bottom of the screenshot above - the Coefficients are returned in the order the columns were entered in Step 3 - Age, SexCode. The Odds Ratios indexed 0 and 1 are for the Age and SexCode variables respectively.
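
Because the printed odds ratios are only indexed 0 and 1, it can help to attach the feature names and see how each odds ratio follows from its coefficient. The sketch below uses only the two coefficient values reported in this post; the variable names are chosen for illustration.

```python
import math

# Coefficients as reported in this post's output, in the Step 3 column order.
feature_names = ["Age", "SexCode"]
coefficients = [-0.15802868, 1.19358149]

odds_ratios = {}
for name, coef in zip(feature_names, coefficients):
    odds_ratio = math.exp(coef)              # odds ratio = e^coefficient
    percent_change = (odds_ratio - 1) * 100  # change in odds per one-unit increase
    odds_ratios[name] = odds_ratio
    print(f"{name}: coefficient={coef:+.4f}, odds ratio={odds_ratio:.3f}, "
          f"change in odds={percent_change:+.0f}%")
```

Keep in mind that the model was fitted on standardized features (Step 5), so a "one-unit increase" here is one standard deviation of the feature.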

The overall metrics for the model are presented in the bottom three rows of the classification report, with a model accuracy of 0.78. On the held-out test set, the model predicts survival correctly about 77.6% of the time (the Accuracy value printed just above the report).
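
To make the report's columns less of a black box, here is how accuracy, precision, recall and F1 are derived from a confusion matrix. The four counts are illustrative only: they were picked so that the formulas reproduce the rounded figures reported in this post, and the actual confusion matrix is not shown here.

```python
# Illustrative counts only (not taken from the post's output):
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = 75, 14, 20, 43


def precision_recall_f1(true_hits: int, false_alarms: int, misses: int) -> tuple:
    """Compute precision, recall and F1 for one class."""
    precision = true_hits / (true_hits + false_alarms)
    recall = true_hits / (true_hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
# Class 1 (Survived): hits are TP, false alarms are FP, misses are FN.
survived = precision_recall_f1(tp, fp, fn)
# Class 0 (Not Survived): the roles swap, so hits are TN, false alarms are FN, misses are FP.
not_survived = precision_recall_f1(tn, fn, fp)

print(f"Accuracy: {accuracy:.4f}")
print("Survived     precision={:.2f} recall={:.2f} f1={:.2f}".format(*survived))
print("Not Survived precision={:.2f} recall={:.2f} f1={:.2f}".format(*not_survived))
```

In a notebook, `sklearn.metrics.confusion_matrix(y_test, y_pred)` returns the real counts for your own run.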

Document the results of the regression

If you have previously worked with a logistic regression output, the interpretation of the above results should be fairly straightforward. If you haven’t - I have documented the findings below in a markdown format that we can add to the notebook.

Add a Markdown block to the python notebook

You can add our interpretation of the report, using some basic markdown format, an open source text formatting standard.

image-20240808-071544

Add a markdown box below the results by hovering your mouse over the bottom centre of the notebook (under the results output) and selecting + Markdown when it appears.

Add the results to the Markdown block

Copy the Markdown below for a formatted results section.

## Interpretation of the Logistic Regression Results

### Coefficients

Note: the features were standardized before fitting, so "one unit" below means one standard deviation of the feature, not one year of age or the full step from SexCode 0 to SexCode 1.

Age Coefficient: -0.15802868
This negative coefficient indicates that as age increases, the log-odds of survival decrease. In other words, older passengers are less likely to survive.

SexCode Coefficient: 1.19358149
This positive coefficient indicates that being coded as 1 (female) increases the log-odds of survival.

### Odds Ratios

Age Odds Ratio: 0.853825
An odds ratio less than 1 (0.85) means that for each one-unit (one standard deviation) increase in age, the odds of survival decrease by approximately 15%.

SexCode Odds Ratio: 3.298875
An odds ratio greater than 1 (3.30) means that a one-unit (one standard deviation) increase in SexCode, i.e. moving toward the female code of 1, increases the odds of survival by approximately 230%.

### Model Accuracy

Accuracy: 0.7763157894736842
The model correctly predicts survival about 77.6% of the time on the test set.

### Classification Report

#### Precision, Recall, and F1-Score for Class 0 (Not Survived):

- Precision: 0.79
  - When the model predicts not survived, it is correct 79% of the time.
- Recall: 0.84
  - The model correctly identifies 84% of the actual not survived cases.
- F1-Score: 0.82
  - The harmonic mean of precision and recall, indicating a good balance between the two.

#### Precision, Recall, and F1-Score for Class 1 (Survived):

- Precision: 0.75
  - When the model predicts survived, it is correct 75% of the time.
- Recall: 0.68
  - The model correctly identifies 68% of the actual survived cases.
- F1-Score: 0.72
  - The harmonic mean of precision and recall, indicating a reasonable balance between the two.

#### Overall Metrics:

- Accuracy: 0.78
  - The overall accuracy of the model.
- Macro Avg: Averages of precision, recall, and F1-score across both classes.
- Weighted Avg: Averages of precision, recall, and F1-score, weighted by the number of instances in each class.

### Summary

The logistic regression model shows that age negatively impacts the likelihood of survival, while the sex code positively impacts it.
The model has a reasonable accuracy of 77.6%, with stronger precision and recall for the Not Survived class than for the Survived class. The odds ratios provide a clear interpretation of how each feature affects the odds of survival.

When you have added the text, select the stop editing cell (tick button - top right) or press the escape key.

image-20240808-071822

The Markdown cell will then display the format.

image-20240808-071857
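
One caveat on the documented odds ratios: the model was fitted on standardized features (Step 5), so each coefficient describes a change of one standard deviation rather than one year of age or the step between the two sex codes. The sketch below shows one way to convert back to original units by dividing each coefficient by the matching entry of the scaler's `scale_` attribute. It runs on synthetic data generated from a made-up model, NOT the Titanic dataframe, so the printed numbers are not results from this series.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dataframe_clean[['Age', 'SexCode']]; NOT the Titanic data.
rng = np.random.default_rng(42)
n_rows = 2000
age = rng.uniform(1, 70, size=n_rows)
sex_code = rng.integers(0, 2, size=n_rows)        # 0 or 1
true_logit = -0.5 - 0.02 * age + 2.0 * sex_code   # made-up generating model
survived = (rng.random(n_rows) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = np.column_stack([age, sex_code])
scaler = StandardScaler()
model = LogisticRegression().fit(scaler.fit_transform(X), survived)

# Coefficients from the scaled fit are per standard deviation of each feature.
per_sd_odds_ratios = np.exp(model.coef_[0])
# Dividing by the scaler's standard deviations gives coefficients per original unit.
per_unit_odds_ratios = np.exp(model.coef_[0] / scaler.scale_)

for name, per_sd, per_unit in zip(["Age", "SexCode"], per_sd_odds_ratios, per_unit_odds_ratios):
    print(f"{name}: odds ratio per SD={per_sd:.3f}, per original unit={per_unit:.3f}")
```

In the notebook, the same division applied to the fitted `model` and `scaler` would give the odds ratio per year of age and for SexCode 1 versus 0.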

Thanks for playing!

This was quite the process - I hope you found it as interesting as I did! We aimed to provide a baseline understanding and hands-on experience in a hands-off blog-post way!

For ease of access going forward / in the event you have an issue with your code I’ve included a full extract of my final code in the section below.

Thanks again!

-Sam

Bonus round!

Identify and document the packages in your Python virtual environment for reproducibility.

To ensure the results can be replicated, we need to capture the installed packages. We started with pandas. Then installed SciKit Learn and numpy with dependencies - where did we end up?

In the terminal type:

pip freeze > requirements.txt

A text file named ‘requirements.txt’ will be created in the terminal's current working directory (typically the base of the project folder that contains .venv).

image-20240808-072424

Double click the file to open it. Wow! We have 36 packages installed!

asttokens==2.4.1
colorama==0.4.6
comm==0.2.2
debugpy==1.8.2
decorator==5.1.1
executing==2.0.1
ipykernel==6.29.5
ipython==8.26.0
jedi==0.19.1
joblib==1.4.2
jupyter_client==8.6.2
jupyter_core==5.7.2
matplotlib-inline==0.1.7
nest-asyncio==1.6.0
numpy==2.0.1
packaging==24.1
pandas==2.2.2
parso==0.8.4
platformdirs==4.2.2
prompt_toolkit==3.0.47
psutil==6.0.0
pure_eval==0.2.3
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
pywin32==306
pyzmq==26.0.3
scikit-learn==1.5.1
scipy==1.14.0
six==1.16.0
stack-data==0.6.3
threadpoolctl==3.5.0
tornado==6.4.1
traitlets==5.14.3
tzdata==2024.1
wcwidth==0.2.13

This is a really important file - keeping it in the folder will allow you to replicate the tests in future, ensuring the environmental dependencies are met. You can share the file with others so that they can replicate your configuration and assist with troubleshooting, or use this as a cornerstone for standardised environments in Azure Machine Learning if you want to productionise your approach.
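
When you come back to the project later, or hand it to someone else, it is worth checking that the active environment still matches the pinned versions. The sketch below is one possible way to do that with the standard library; `parse_pins` and `find_mismatches` are helper names invented for this example, and the sample string holds two pins copied from the list above.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Turn 'name==version' lines into a {name: version} mapping."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


sample = "numpy==2.0.1\nscikit-learn==1.5.1\n"
print(find_mismatches(parse_pins(sample)))
```

An empty dictionary means every pinned package is installed at the expected version; to check the whole file, pass in the text of requirements.txt instead of the sample string.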
