Data Science via VS Code. Part 3: DataFrame with some basic exploratory tasks
Part 1: install, extensions, virtual env. Part 2: Initial Libraries and Data Import Whew! Data is in, virtual environment is up, and we have executed...
Samuel Parsons : Aug 19, 2024 1:35:23 PM
Whew! Data is in, data is cleaned, virtual environment is up, and we have executed more Python commands with a working tour of Data Wrangler.
In the last post we were working toward a logistic regression over the classic Titanic dataset. We decided to explore Survival as the dependent variable (DV), as impacted by the independent variables (IVs) of Gender and Age, written as:
Probability (Pr) of Survival as a function of Gender and Age.
I’d throw it all in too if I had a calculator like that
I did mention a cringe factor for the script simplicity above. So here is the actual breakdown:
All credit to Mark Bounthavong, who created both the simple logistic regression model (the type we are using in this example) and the multivariable model in R. The materials are all available on the page or the associated GitHub and are well worth a read:
Logistic regression in R — Mark Bounthavong (mbounthavong.com)
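For reference, a simple logistic regression with these two predictors is conventionally written on the log-odds scale as below. This is the generic textbook form rather than a copy of the linked material, and the symbols $\beta_0$, $\beta_1$ and $\beta_2$ are names assumed here for the intercept and the two slopes.

```latex
\log\left(\frac{\Pr(\text{Survived}=1)}{1-\Pr(\text{Survived}=1)}\right)
  = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{SexCode}
```

Exponentiating a slope, for example $e^{\beta_1}$, gives the odds ratio associated with a one-unit increase in that predictor.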
With the professors appeased, here is where we stand: so far we have sourced, loaded, transformed and saved a dataframe. The frame has been cleaned, with the missing values removed. The entire code is captured in our Python notebook, running in a virtual environment.
From here, I would like to add the scripting required to complete the logistic regression in the Python notebook, before assessing the findings and documenting them in a Markdown cell in the same notebook.
This will then provide us with an end-to-end working example of a logistic regression with the historic dataset.
To achieve our goal, here is a step-by-step plan:
In the terminal:
Install scikit-learn and numpy.
In the python notebook:
Import necessary libraries.
Prepare the data: Ensure dataframe_clean has the necessary columns (Survived, Age, SexCode).
Split the data into features (X) and target (y).
Train-test split to evaluate the model.
Standardize the features if necessary.
Fit the logistic regression model.
Extract coefficients and calculate odds ratios.
Evaluate the model using accuracy and other metrics.
Document the results in Markdown.
Quick re-baseline steps:
Reload VS Code
Reopen the folder (if needed - see part 1 and part 2 of the series)
Load up the virtual environment (if needed - type .venv\Scripts\activate in the terminal)
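If you are unsure whether the notebook kernel is actually using the virtual environment, a quick check from a notebook cell can help. This is a minimal sketch using only the standard library; the function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python install.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

If the second line prints False, select the .venv interpreter as the notebook kernel before continuing.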
After a review of the scikit-learn documentation, I know I will need scikit-learn and numpy.
Install them by running the following command in the terminal:
pip install scikit-learn numpy
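Warning - this will also pull down scikit-learn's dependencies. The table below, from the scikit-learn documentation, lists all of them; pip only downloads the rows whose purpose includes install (numpy, scipy, joblib and threadpoolctl), and the rest are only needed to build, test or document scikit-learn itself: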
Dependency | Minimum Version | Purpose |
---|---|---|
numpy | 1.19.5 | build, install |
scipy | 1.6.0 | build, install |
joblib | 1.2.0 | install |
threadpoolctl | 3.1.0 | install |
cython | 3.0.10 | build |
meson-python | 0.16.0 | build |
matplotlib | 3.3.4 | benchmark, docs, examples, tests |
scikit-image | 0.17.2 | docs, examples, tests |
pandas | 1.1.5 | benchmark, docs, examples, tests |
seaborn | 0.9.0 | docs, examples |
memory_profiler | 0.57.0 | benchmark, docs |
pytest | 7.1.2 | tests |
pytest-cov | 2.9.0 | tests |
ruff | 0.2.1 | tests |
black | 24.3.0 | tests |
mypy | 1.9 | tests |
pyamg | 4.0.0 | tests |
polars | 0.20.23 | docs, tests |
pyarrow | 12.0.0 | tests |
sphinx | 7.3.7 | docs |
sphinx-copybutton | 0.5.2 | docs |
sphinx-gallery | 0.16.0 | docs |
numpydoc | 1.2.0 | docs, tests |
Pillow | 7.1.2 | docs |
pooch | 1.6.0 | docs, examples, tests |
sphinx-prompt | 1.4.0 | docs |
sphinxext-opengraph | 0.9.1 | docs |
plotly | 5.14.0 | docs, examples |
sphinxcontrib-sass | 0.3.4 | docs |
sphinx-remove-toctrees | 1.0.0.post1 | docs |
sphinx-design | 0.5.0 | docs |
pydata-sphinx-theme | 0.15.3 | docs |
conda-lock | 2.5.6 | maintenance |
This will bump the local folder size from ~250 MB to a new total of ~450 MB.
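To confirm what actually landed in the environment, you can ask Python for the installed versions. This is a minimal sketch using the standard-library importlib.metadata module; the helper name installed_versions is an assumption made for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# scikit-learn plus the dependencies marked "install" in the table above.
for name, version in installed_versions(
    ["scikit-learn", "numpy", "scipy", "joblib", "threadpoolctl"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Run it in a notebook cell that uses the .venv kernel, otherwise it reports on a different Python installation.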
# Code Implementation
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
# Step 2: Prepare the data
# Assuming dataframe_clean is already defined and cleaned
# Ensure the dataframe has the necessary columns
assert 'Survived' in dataframe_clean.columns
assert 'Age' in dataframe_clean.columns
assert 'SexCode' in dataframe_clean.columns
# Step 3: Split the data into features (X) and target (y)
X = dataframe_clean[['Age', 'SexCode']]
y = dataframe_clean['Survived']
# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Standardize the features (optional but recommended)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 6: Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Step 7: Extract coefficients and calculate odds ratios
coefficients = model.coef_[0]
odds_ratios = pd.Series(coefficients).apply(lambda x: np.exp(x))
# Step 8: Evaluate the model
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
# Print results
print(f"Coefficients: {coefficients}")
print(f"Odds Ratios: {odds_ratios}")
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")
Run the code. The notebook should look like the following, with the results displayed inline below the code block:
Note the results at the bottom of the screenshot above - the Coefficients are returned in the order the features were entered in Step 3: Age, then SexCode. The Odds Ratios are labelled 0 and 1, which correspond to the Age and SexCode variables respectively.
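The link between the two rows of output is simple arithmetic: each odds ratio is e raised to the matching coefficient. The sketch below redoes that calculation with the coefficient values quoted in this post. Because Step 5 standardized the features, those values describe a one-standard-deviation change; the helper per_unit_odds_ratio and the value assumed_age_std are hypothetical additions that show how a per-unit figure could be recovered.

```python
import math

# Coefficients as reported in the notebook output, in Step 3 order.
coefficients = {"Age": -0.15802868, "SexCode": 1.19358149}

for feature, coef in coefficients.items():
    odds_ratio = math.exp(coef)          # odds ratio = e^coefficient
    pct_change = (odds_ratio - 1) * 100  # percentage change in the odds
    print(f"{feature}: odds ratio {odds_ratio:.4f} ({pct_change:+.1f}% odds)")


def per_unit_odds_ratio(scaled_coef: float, feature_std: float) -> float:
    """Convert a coefficient fitted on a standardized feature to a per-unit odds ratio."""
    # StandardScaler divides each feature by its standard deviation (scaler.scale_),
    # so the per-unit coefficient is the scaled coefficient divided by that value.
    return math.exp(scaled_coef / feature_std)


# HYPOTHETICAL standard deviation of Age, for illustration only.
assumed_age_std = 14.0
per_year = per_unit_odds_ratio(coefficients["Age"], assumed_age_std)
print(f"Per-year odds ratio for Age (assumed std): {per_year:.4f}")
```

In the notebook itself, the real standard deviations are available as scaler.scale_ after Step 5, so there is no need to assume a value.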
The overall metrics for the model are presented on the bottom 3 rows of the output. The accuracy row shows 0.78: the model correctly predicts survival for about 77.6% of the passengers in the test set.
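If the columns of the classification report are unfamiliar, it can help to see how they fall out of four counts. The sketch below uses HYPOTHETICAL confusion-matrix counts, chosen only because they reproduce the rounded figures in the report; the post does not show the real confusion matrix, so treat the counts as an illustration rather than the actual result.

```python
# HYPOTHETICAL counts, picked so the arithmetic matches the rounded report values.
# tn/fp/fn/tp = true negatives, false positives, false negatives, true positives,
# where "positive" means class 1 (Survived).
tn, fp, fn, tp = 75, 14, 20, 43


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
# For class 0, a "positive" is a not-survived prediction, so the roles swap.
class_0 = precision_recall_f1(true_pos=tn, false_pos=fn, false_neg=fp)
class_1 = precision_recall_f1(true_pos=tp, false_pos=fp, false_neg=fn)

print(f"Accuracy: {accuracy:.4f}")
print("Class 0 (precision, recall, F1):", [round(v, 2) for v in class_0])
print("Class 1 (precision, recall, F1):", [round(v, 2) for v in class_1])
```

For the real counts, sklearn.metrics.confusion_matrix(y_test, y_pred) can be run in the notebook after Step 8.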
If you have previously worked with a logistic regression output, the interpretation of the above results should be fairly straightforward. If you haven’t - I have documented the findings below in a Markdown format that we can add to the notebook.
You can add our interpretation of the report using some basic Markdown, an open source text formatting standard.
Add a Markdown cell below the results by hovering your mouse over the bottom centre of the notebook (under the results output) and selecting + Markdown when it appears.
Copy the text below for a formatted results section.
## Interpretation of the Logistic Regression Results
### Coefficients
Age Coefficient: -0.15802868
This negative coefficient indicates that as age increases, the log-odds of survival decrease. In other words, older passengers are less likely to survive.
SexCode Coefficient: 1.19358149
This positive coefficient indicates that being coded as 1 (female) increases the log-odds of survival.
### Odds Ratios
Age Odds Ratio: 0.853825
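An odds ratio less than 1 (0.85) means the odds of survival decrease by approximately 15% as age increases. Because the features were standardized in Step 5, this is the change per one-standard-deviation increase in age, not per year.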
SexCode Odds Ratio: 3.298875
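An odds ratio greater than 1 (3.30) means that being coded as 1 (i.e. a female passenger) increases the odds of survival. Because SexCode was also standardized in Step 5, the increase of approximately 230% is per one standard deviation of SexCode rather than a direct female-versus-male comparison.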
### Model Accuracy
Accuracy: 0.7763157894736842
The model correctly predicts survival about 77.6% of the time.
### Classification Report
#### Precision, Recall, and F1-Score for Class 0 (Not Survived):
- Precision: 0.79 - When the model predicts not survived, it is correct 79% of the time.
- Recall: 0.84 - The model correctly identifies 84% of the actual not survived cases.
- F1-Score: 0.82 - The harmonic mean of precision and recall, indicating a good balance between the two.
#### Precision, Recall, and F1-Score for Class 1 (Survived):
- Precision: 0.75 - When the model predicts survived, it is correct 75% of the time.
- Recall: 0.68 - The model correctly identifies 68% of the actual survived cases.
- F1-Score: 0.72 - The harmonic mean of precision and recall, indicating a reasonable balance between the two.
#### Overall Metrics:
- Accuracy: 0.78 - The overall accuracy of the model.
- Macro Avg: Averages of precision, recall, and F1-score across both classes.
- Weighted Avg: Averages of precision, recall, and F1-score, weighted by the number of instances in each class.
### Summary
The logistic regression model shows that age negatively impacts the likelihood of survival, while the sex code positively impacts it. The model has a reasonable accuracy of 77.6%, with good precision and recall for both classes. The odds ratios provide a clear interpretation of how each feature affects the odds of survival.
When you have added the text, select Stop Editing Cell (the tick button at the top right of the cell) or press the Escape key.
The Markdown cell will then display the formatted text.
This was quite the process - I hope you found it as interesting as I did! We aimed to provide a baseline understanding and hands-on experience in a hands-off blog-post way!
For ease of access going forward, or in the event you have an issue with your code, I’ve included a full extract of my final code in the section below.
Thanks again!
-Sam
Identify and document the packages in your Python virtual environment for reproducibility.
To ensure the results can be replicated, we need to capture the installed packages. We started with pandas, then installed scikit-learn and numpy with their dependencies - where did we end up?
In the terminal type:
pip freeze > requirements.txt
A text file named ‘requirements.txt’ will be created in the terminal's current working directory (the base of the project folder that holds .venv, if you have not changed directory).
Double click the file to open it. Wow! We have 36 packages installed!
asttokens==2.4.1
colorama==0.4.6
comm==0.2.2
debugpy==1.8.2
decorator==5.1.1
executing==2.0.1
ipykernel==6.29.5
ipython==8.26.0
jedi==0.19.1
joblib==1.4.2
jupyter_client==8.6.2
jupyter_core==5.7.2
matplotlib-inline==0.1.7
nest-asyncio==1.6.0
numpy==2.0.1
packaging==24.1
pandas==2.2.2
parso==0.8.4
platformdirs==4.2.2
prompt_toolkit==3.0.47
psutil==6.0.0
pure_eval==0.2.3
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
pywin32==306
pyzmq==26.0.3
scikit-learn==1.5.1
scipy==1.14.0
six==1.16.0
stack-data==0.6.3
threadpoolctl==3.5.0
tornado==6.4.1
traitlets==5.14.3
tzdata==2024.1
wcwidth==0.2.13
This is a really important file - keeping it in the folder will allow you to replicate the tests in future, ensuring the environmental dependencies are met. You can share the file with others so that they can replicate your configuration and assist with troubleshooting, or use it as a cornerstone for standardised environments in Azure Machine Learning if you want to productionise your approach.
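As a quick sanity check on the file, the pinned versions can be read back in Python. The sketch below is a minimal parser that only handles exact name==version pins; the function name parse_requirements and the assumption that requirements.txt sits in the working folder are made up for this example.

```python
from pathlib import Path


def parse_requirements(text: str) -> dict:
    """Parse 'name==version' lines into a {name: version} dictionary."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "==" not in line:
            continue  # this sketch only handles exact pins
        name, version = line.split("==", 1)
        pins[name] = version
    return pins


requirements_file = Path("requirements.txt")  # assumed location: the working folder
if requirements_file.exists():
    raw = requirements_file.read_bytes()
    # Some Windows shells write redirected output as UTF-16; otherwise assume UTF-8.
    encoding = "utf-16" if raw.startswith((b"\xff\xfe", b"\xfe\xff")) else "utf-8-sig"
    pins = parse_requirements(raw.decode(encoding))
    print(f"{len(pins)} pinned packages")
    print("scikit-learn:", pins.get("scikit-learn"))
```

Someone receiving the file can recreate the same set of packages in a fresh virtual environment with pip install -r requirements.txt.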