Telco Customer Churn Analysis

In this project, I analyzed customer data from a telecommunications company to understand the factors that contributed to customer churn. I followed a data science workflow that included data cleaning, exploratory data analysis, feature engineering, predictive modeling using machine learning, and the extraction of actionable business insights.

The steps that were taken are summarized below:

  • I explored the Telco Customer Churn dataset to understand the distribution of variables and identify potential issues such as missing or inconsistent values.
  • I performed exploratory data analysis (EDA) to uncover trends and patterns associated with customer churn.
  • I prepared and transformed the data for modeling, including encoding categorical variables and handling missing data.
  • I built an XGBoost gradient boosting classifier to predict customer churn and evaluated it on a stratified 20% hold-out test set.
  • I evaluated model performance using accuracy, precision, recall, F1-score, and ROC-AUC, and interpreted the most important features driving churn using XGBoost feature importances and SHAP values.
  • Finally, I summarized the main findings and provided business recommendations based on the results.

The goal of this analysis was to help the business identify at-risk customers and suggest strategies to improve customer retention using data-driven insights and predictive modeling.

Data Loading and Preparation

I began the analysis by loading the Telco Customer Churn dataset into a pandas DataFrame. I inspected the first few records to understand the structure of the data, checked for missing values or obvious inconsistencies, and performed initial data cleaning where necessary. Data types were reviewed and converted as needed to ensure proper handling in the subsequent analysis.

In [56]:
# Import essential libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('Telco_customer_churn.csv')

# Inspect the first few rows
df.head()
Out[56]:
CustomerID Count Country State City Zip Code Lat Long Latitude Longitude Gender ... Contract Paperless Billing Payment Method Monthly Charges Total Charges Churn Label Churn Value Churn Score CLTV Churn Reason
0 3668-QPYBK 1 United States California Los Angeles 90003 33.964131, -118.272783 33.964131 -118.272783 Male ... Month-to-month Yes Mailed check 53.85 108.15 Yes 1 86 3239 Competitor made better offer
1 9237-HQITU 1 United States California Los Angeles 90005 34.059281, -118.30742 34.059281 -118.307420 Female ... Month-to-month Yes Electronic check 70.70 151.65 Yes 1 67 2701 Moved
2 9305-CDSKC 1 United States California Los Angeles 90006 34.048013, -118.293953 34.048013 -118.293953 Female ... Month-to-month Yes Electronic check 99.65 820.5 Yes 1 86 5372 Moved
3 7892-POOKP 1 United States California Los Angeles 90010 34.062125, -118.315709 34.062125 -118.315709 Female ... Month-to-month Yes Electronic check 104.80 3046.05 Yes 1 84 5003 Moved
4 0280-XJGEX 1 United States California Los Angeles 90015 34.039224, -118.266293 34.039224 -118.266293 Male ... Month-to-month Yes Bank transfer (automatic) 103.70 5036.3 Yes 1 89 5340 Competitor had better devices

5 rows × 33 columns

In [57]:
# Check data info and missing values
df.info()

# Show summary statistics for numerical columns
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 non-null   object 
 17  Online Security    7043 non-null   object 
 18  Online Backup      7043 non-null   object 
 19  Device Protection  7043 non-null   object 
 20  Tech Support       7043 non-null   object 
 21  Streaming TV       7043 non-null   object 
 22  Streaming Movies   7043 non-null   object 
 23  Contract           7043 non-null   object 
 24  Paperless Billing  7043 non-null   object 
 25  Payment Method     7043 non-null   object 
 26  Monthly Charges    7043 non-null   float64
 27  Total Charges      7043 non-null   object 
 28  Churn Label        7043 non-null   object 
 29  Churn Value        7043 non-null   int64  
 30  Churn Score        7043 non-null   int64  
 31  CLTV               7043 non-null   int64  
 32  Churn Reason       1869 non-null   object 
dtypes: float64(3), int64(6), object(24)
memory usage: 1.8+ MB
Out[57]:
Count Zip Code Latitude Longitude Tenure Months Monthly Charges Churn Value Churn Score CLTV
count 7043.0 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000
mean 1.0 93521.964646 36.282441 -119.798880 32.371149 64.761692 0.265370 58.699418 4400.295755
std 0.0 1865.794555 2.455723 2.157889 24.559481 30.090047 0.441561 21.525131 1183.057152
min 1.0 90001.000000 32.555828 -124.301372 0.000000 18.250000 0.000000 5.000000 2003.000000
25% 1.0 92102.000000 34.030915 -121.815412 9.000000 35.500000 0.000000 40.000000 3469.000000
50% 1.0 93552.000000 36.391777 -119.730885 29.000000 70.350000 0.000000 61.000000 4527.000000
75% 1.0 95351.000000 38.224869 -118.043237 55.000000 89.850000 1.000000 75.000000 5380.500000
max 1.0 96161.000000 41.962127 -114.192901 72.000000 118.750000 1.000000 100.000000 6500.000000

Data Cleaning and Preparation

After loading the data, I inspected the dataset for missing values, incorrect data types, and inconsistencies. I converted Total Charges from text to numeric (which turned 11 unparseable entries into missing values), cast the Yes/No and multi-level service columns to the category dtype, and filled the missing Churn Reason values, which belong to customers who had not churned, with 'Not Churned'. The 11 missing Total Charges values were left as NaN; XGBoost handles missing values natively.

In [58]:
# Check for missing values
df.isnull().sum()
Out[58]:
CustomerID              0
Count                   0
Country                 0
State                   0
City                    0
Zip Code                0
Lat Long                0
Latitude                0
Longitude               0
Gender                  0
Senior Citizen          0
Partner                 0
Dependents              0
Tenure Months           0
Phone Service           0
Multiple Lines          0
Internet Service        0
Online Security         0
Online Backup           0
Device Protection       0
Tech Support            0
Streaming TV            0
Streaming Movies        0
Contract                0
Paperless Billing       0
Payment Method          0
Monthly Charges         0
Total Charges           0
Churn Label             0
Churn Value             0
Churn Score             0
CLTV                    0
Churn Reason         5174
dtype: int64
In [59]:
# Review column data types
df.dtypes
Out[59]:
CustomerID            object
Count                  int64
Country               object
State                 object
City                  object
Zip Code               int64
Lat Long              object
Latitude             float64
Longitude            float64
Gender                object
Senior Citizen        object
Partner               object
Dependents            object
Tenure Months          int64
Phone Service         object
Multiple Lines        object
Internet Service      object
Online Security       object
Online Backup         object
Device Protection     object
Tech Support          object
Streaming TV          object
Streaming Movies      object
Contract              object
Paperless Billing     object
Payment Method        object
Monthly Charges      float64
Total Charges         object
Churn Label           object
Churn Value            int64
Churn Score            int64
CLTV                   int64
Churn Reason          object
dtype: object
In [60]:
# Convert 'Total Charges' to numeric, if not already
df['Total Charges'] = pd.to_numeric(df['Total Charges'], errors='coerce')
In [61]:
# Columns to convert to the pandas 'category' dtype
cat_columns = [
    'Gender', 'Senior Citizen', 'Partner', 'Dependents',
    'Phone Service', 'Multiple Lines', 'Internet Service',
    'Online Security', 'Online Backup', 'Device Protection',
    'Tech Support', 'Streaming TV', 'Streaming Movies',
    'Contract', 'Paperless Billing', 'Payment Method',
    'Churn Label'
]

for col in cat_columns:
    df[col] = df[col].astype('category')
In [62]:
# Check missing 'Churn Reason'
df['Churn Reason'].isnull().sum()
Out[62]:
np.int64(5174)
In [63]:
# Fill missing 'Churn Reason' (customers who did not churn) with 'Not Churned'
df['Churn Reason'] = df['Churn Reason'].fillna('Not Churned')
In [64]:
# Confirm the changes
df.info()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   CustomerID         7043 non-null   object  
 1   Count              7043 non-null   int64   
 2   Country            7043 non-null   object  
 3   State              7043 non-null   object  
 4   City               7043 non-null   object  
 5   Zip Code           7043 non-null   int64   
 6   Lat Long           7043 non-null   object  
 7   Latitude           7043 non-null   float64 
 8   Longitude          7043 non-null   float64 
 9   Gender             7043 non-null   category
 10  Senior Citizen     7043 non-null   category
 11  Partner            7043 non-null   category
 12  Dependents         7043 non-null   category
 13  Tenure Months      7043 non-null   int64   
 14  Phone Service      7043 non-null   category
 15  Multiple Lines     7043 non-null   category
 16  Internet Service   7043 non-null   category
 17  Online Security    7043 non-null   category
 18  Online Backup      7043 non-null   category
 19  Device Protection  7043 non-null   category
 20  Tech Support       7043 non-null   category
 21  Streaming TV       7043 non-null   category
 22  Streaming Movies   7043 non-null   category
 23  Contract           7043 non-null   category
 24  Paperless Billing  7043 non-null   category
 25  Payment Method     7043 non-null   category
 26  Monthly Charges    7043 non-null   float64 
 27  Total Charges      7032 non-null   float64 
 28  Churn Label        7043 non-null   category
 29  Churn Value        7043 non-null   int64   
 30  Churn Score        7043 non-null   int64   
 31  CLTV               7043 non-null   int64   
 32  Churn Reason       7043 non-null   object  
dtypes: category(17), float64(4), int64(6), object(6)
memory usage: 999.6+ KB
Out[64]:
CustomerID            0
Count                 0
Country               0
State                 0
City                  0
Zip Code              0
Lat Long              0
Latitude              0
Longitude             0
Gender                0
Senior Citizen        0
Partner               0
Dependents            0
Tenure Months         0
Phone Service         0
Multiple Lines        0
Internet Service      0
Online Security       0
Online Backup         0
Device Protection     0
Tech Support          0
Streaming TV          0
Streaming Movies      0
Contract              0
Paperless Billing     0
Payment Method        0
Monthly Charges       0
Total Charges        11
Churn Label           0
Churn Value           0
Churn Score           0
CLTV                  0
Churn Reason          0
dtype: int64
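
As a minimal sketch of how the missing Total Charges values could be traced and handled, the toy frame below (invented for illustration, not taken from the dataset) mimics a text column with one non-numeric entry. The zero-fill rule for zero-tenure customers is an assumption about what such rows mean; it is not something this notebook verified or applied.

```python
import pandas as pd

# Toy stand-in for the real data; the blank string mimics a non-numeric entry.
toy = pd.DataFrame({
    "Tenure Months": [0, 5, 12],
    "Monthly Charges": [20.0, 50.0, 80.0],
    "Total Charges": [" ", "250.0", "960.0"],
})

# Coerce to numeric and remember which rows were turned into NaN by the coercion.
converted = pd.to_numeric(toy["Total Charges"], errors="coerce")
coerced_mask = converted.isna() & toy["Total Charges"].notna()
toy["Total Charges"] = converted

# Hypothetical policy (an assumption): a customer with zero tenure
# has not been billed yet, so the total is set to 0.
toy.loc[coerced_mask & (toy["Tenure Months"] == 0), "Total Charges"] = 0.0

n_coerced = int(coerced_mask.sum())
n_missing = int(toy["Total Charges"].isna().sum())
print(f"coerced: {n_coerced}, still missing: {n_missing}")
```

Inspecting the coerced rows before choosing a fill rule keeps the decision explicit; leaving them as NaN, as this notebook does, is also workable for XGBoost.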

Exploratory Data Analysis (EDA)

In this section, I performed a comprehensive exploratory data analysis to uncover patterns and relationships in the Telco customer data. I visualized the distribution of the target variable (churn), examined how churn relates to contract type, payment method, service add-ons, and customer demographics, and analyzed geographic patterns by city (as all data are from California). For features with multiple categories, I used stacked barplots to better compare churned and retained customers.

Additionally, I explored interactions between key numeric variables—such as monthly charges and tenure—using bivariate plots, and I generated correlation heatmaps to identify relationships among numerical features. This detailed EDA informed later feature selection and provided actionable insights into what drives customer churn.

In [65]:
import matplotlib.pyplot as plt
import seaborn as sns

# Overall churn distribution
sns.countplot(x='Churn Label', hue='Churn Label', data=df, palette='Set2', legend=False)
plt.title('Customer Churn Distribution')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.show()

# Show the churn rate numerically
churn_rate = df['Churn Label'].value_counts(normalize=True).rename_axis('Churn').reset_index(name='Proportion')
print(churn_rate)
[Figure: Customer Churn Distribution]
  Churn  Proportion
0    No     0.73463
1   Yes     0.26537
In [66]:
plt.figure(figsize=(8,5))
sns.countplot(x='Contract', hue='Churn Label', data=df, palette='Set1')
plt.title('Churn by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Count')
plt.legend(title='Churn')
plt.show()

[Figure: Churn by Contract Type]
In [67]:
plt.figure(figsize=(10,5))
sns.countplot(y='Payment Method', hue='Churn Label', data=df, palette='Set3')
plt.title('Churn by Payment Method')
plt.xlabel('Count')
plt.ylabel('Payment Method')
plt.legend(title='Churn')
plt.show()
[Figure: Churn by Payment Method]
In [68]:
plt.figure(figsize=(6,4))
sns.boxplot(x='Churn Label', y='Monthly Charges', data=df)
plt.title('Monthly Charges by Churn')
plt.xlabel('Churn')
plt.ylabel('Monthly Charges')
plt.show()
[Figure: Monthly Charges by Churn]
In [69]:
plt.figure(figsize=(6,4))
sns.histplot(data=df, x='Tenure Months', hue='Churn Label', bins=30, kde=True, element='step', stat='density')
plt.title('Tenure Distribution by Churn')
plt.xlabel('Tenure (Months)')
plt.ylabel('Density')
plt.show()
[Figure: Tenure Distribution by Churn]
In [70]:
plt.figure(figsize=(5,3))
sns.countplot(x='Senior Citizen', hue='Churn Label', data=df, palette='Accent')
plt.title('Churn by Senior Citizen Status')
plt.xlabel('Senior Citizen')
plt.ylabel('Count')
plt.legend(title='Churn')
plt.show()

# Numeric
senior_churn = df.groupby('Senior Citizen', observed=True)['Churn Value'].mean() * 100

print(senior_churn)
[Figure: Churn by Senior Citizen Status]
Senior Citizen
No     23.606168
Yes    41.681261
Name: Churn Value, dtype: float64
In [71]:
service_features = [
    'Online Security', 'Tech Support', 'Streaming TV', 
    'Streaming Movies', 'Device Protection', 'Online Backup', 'Multiple Lines'
]

for feature in service_features:
    plt.figure(figsize=(6,3))
    sns.countplot(x=feature, hue='Churn Label', data=df, palette='coolwarm')
    plt.title(f'Churn by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.legend(title='Churn')
    plt.tight_layout()
    plt.show()
[Figures: Churn by Online Security, Tech Support, Streaming TV, Streaming Movies, Device Protection, Online Backup, and Multiple Lines]
In [72]:
# Top 10 cities by number of customers
top_cities = df['City'].value_counts().head(10).index
plt.figure(figsize=(10,5))
sns.countplot(y='City', hue='Churn Label', data=df[df['City'].isin(top_cities)], palette='Set2', order=top_cities)
plt.title('Churn by City (California)')
plt.xlabel('Count')
plt.ylabel('City')
plt.legend(title='Churn')
plt.tight_layout()
plt.show()
[Figure: Churn by City (California)]
In [73]:
# Function to plot stacked bar chart for any categorical feature
def stacked_barplot(feature):
    cross = pd.crosstab(df[feature], df['Churn Label'])
    cross_norm = cross.div(cross.sum(axis=1), axis=0)
    cross_norm.plot(kind='bar', stacked=True, figsize=(7,4), colormap='tab20c')
    plt.title(f'Stacked Barplot of {feature} by Churn')
    plt.xlabel(feature)
    plt.ylabel('Proportion')
    plt.legend(title='Churn', loc='upper right')
    plt.tight_layout()
    plt.show()

# Example usage
for col in ['Internet Service', 'Streaming TV', 'Payment Method']:
    stacked_barplot(col)
[Figures: Stacked Barplot of Internet Service, Streaming TV, and Payment Method by Churn]
In [74]:
plt.figure(figsize=(10,7))
numeric_cols = ['Tenure Months', 'Monthly Charges', 'Total Charges', 'Churn Value', 'Churn Score', 'CLTV']
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap (Numeric Features)')
plt.show()
[Figure: Correlation Heatmap (Numeric Features)]

Correlation Heatmap Insights

The correlation heatmap for numeric features shows that tenure months and total charges are strongly correlated, which is logical since longer-tenure customers accumulate higher total charges. There’s also a negative correlation between tenure and churn value, confirming that newer customers are more likely to churn. Monthly charges are positively correlated with churn, but the relationship is weaker than for tenure. These insights reinforce the importance of focusing retention efforts on new customers and monitoring those with higher monthly spending.
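
The signs described above can be read off directly by extracting the target column of the correlation matrix. The sketch below uses a small invented frame (not the real data) in which short-tenure customers churn and pay more per month, so the expected signs are known in advance.

```python
import pandas as pd

# Invented toy data: churners have short tenure and higher monthly charges.
toy = pd.DataFrame({
    "Tenure Months": [1, 2, 3, 40, 50, 60],
    "Monthly Charges": [90, 85, 95, 40, 30, 50],
    "Churn Value": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each numeric feature with the binary target,
# sorted from most negative to most positive.
target_corr = (
    toy.corr()["Churn Value"]
    .drop("Churn Value")
    .sort_values()
)
print(target_corr)
```

On the real DataFrame the same expression applied to `df[numeric_cols]` would list each feature's correlation with Churn Value without having to read it off the heatmap.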

In [75]:
sns.pairplot(df[numeric_cols + ['Churn Label']], hue='Churn Label', corner=True)
plt.suptitle('Pairplot of Key Numeric Features by Churn', y=1.02)
plt.show()
[Figure: Pairplot of Key Numeric Features by Churn]

Pairplot: Churn Patterns in Numeric Features

The pairplot allows us to visually explore interactions among numeric variables and churn status. Churned customers (orange) are densely concentrated among those with low tenure months and lower total charges, but often have higher monthly charges. The distribution of CLTV (customer lifetime value) is also lower for churned customers. This visualization clearly illustrates the profiles of at-risk customers: they tend to be newer to the service and are not yet delivering high lifetime value, despite sometimes paying more per month.

In [76]:
plt.figure(figsize=(8,5))
sns.scatterplot(
    data=df, 
    x='Tenure Months', 
    y='Monthly Charges', 
    hue='Churn Label', 
    alpha=0.5, 
    palette='coolwarm'
)
plt.title('Monthly Charges vs. Tenure Months by Churn')
plt.xlabel('Tenure (Months)')
plt.ylabel('Monthly Charges')
plt.legend(title='Churn')
plt.tight_layout()
plt.show()
[Figure: Monthly Charges vs. Tenure Months by Churn]
In [77]:
plt.figure(figsize=(10,7))
sns.scatterplot(
    x='Longitude', y='Latitude', 
    data=df, hue='Churn Label', 
    alpha=0.5, palette='coolwarm'
)
plt.title('Customer Locations by Churn - California state')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Churn')
plt.show()
[Figure: Customer Locations by Churn - California state]

Summary of EDA Findings

The exploratory data analysis revealed several important trends in the California Telco customer dataset:

  • Churn is concentrated among customers with month-to-month contracts, higher monthly charges, and shorter tenures. Customers who remain for longer periods, or who commit to longer contracts, are much less likely to churn.
  • Service add-ons matter: Customers without add-on services such as online security, tech support, or streaming options are more likely to leave. Upselling additional services may improve retention.
  • Payment method and demographic factors (e.g., senior citizens) also showed some association with churn, suggesting that personalized engagement strategies may be effective for certain segments.
  • City-level counts for the ten largest cities showed where churn is concentrated in absolute terms, which could help the business target regional retention campaigns.
  • Correlations between tenure, total charges, and churn further supported the importance of customer lifecycle management.
  • Bivariate visualizations of key features (e.g., monthly charges vs. tenure) highlighted distinct profiles of at-risk customers.

Overall, the EDA provided a rich understanding of churn drivers and offered clear guidance for subsequent feature engineering and predictive modeling efforts.
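
Several of the findings above come from count plots, which show volumes rather than rates. A minimal sketch of a helper that tabulates the churn rate per category is shown below; `churn_rate_by` is a hypothetical name and the toy frame is invented for illustration.

```python
import pandas as pd

def churn_rate_by(frame, feature, target="Churn Value"):
    """Return churn rate (%) and group size per level of `feature`, highest rate first."""
    summary = (
        frame.groupby(feature, observed=True)[target]
        .agg(churn_rate="mean", customers="count")
    )
    summary["churn_rate"] = (summary["churn_rate"] * 100).round(1)
    return summary.sort_values("churn_rate", ascending=False)

# Invented toy data: 2 of 4 month-to-month customers churn, 1 of 4 two-year customers.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn Value": [1, 1, 0, 0, 0, 0, 0, 1],
})
rates = churn_rate_by(toy, "Contract")
print(rates)
```

Reporting the group size next to the rate helps avoid over-reading a high churn rate in a category with very few customers, such as a small city.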

Feature Engineering and Data Preparation

After completing the exploratory analysis, I prepared the dataset for machine learning. This involved selecting relevant features, one-hot encoding the categorical variables, and standardizing Tenure Months, Monthly Charges, and Total Charges. I excluded columns unlikely to help prediction (such as customer IDs and geographic fields) along with the target-derived Churn Label and Churn Reason columns, and created the input matrix (X) and target vector (y). Churn Score and CLTV remained in X as unscaled integer columns.

In [78]:
drop_cols = [
    'CustomerID', 'Country', 'State', 'City', 'Zip Code', 'Lat Long',
    'Latitude', 'Longitude', 'Churn Label', 'Churn Reason', 'Count'
]

X = df.drop(columns=drop_cols + ['Churn Value'])
y = df['Churn Value']

print(X.columns)  # Inspect the remaining features
Index(['Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Tenure Months',
       'Phone Service', 'Multiple Lines', 'Internet Service',
       'Online Security', 'Online Backup', 'Device Protection', 'Tech Support',
       'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing',
       'Payment Method', 'Monthly Charges', 'Total Charges', 'Churn Score',
       'CLTV'],
      dtype='object')
In [79]:
# Numeric columns to standardize
numeric_features = ['Tenure Months', 'Monthly Charges', 'Total Charges']
numeric_features = [col for col in numeric_features if col in X.columns]

# All remaining columns are passed to get_dummies; the integer columns
# 'Churn Score' and 'CLTV' are not encoded there and pass through unchanged
categorical_features = [col for col in X.columns if col not in numeric_features]

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)
Numeric features: ['Tenure Months', 'Monthly Charges', 'Total Charges']
Categorical features: ['Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method', 'Churn Score', 'CLTV']
In [80]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_numeric = pd.DataFrame(
    scaler.fit_transform(X[numeric_features]),
    columns=numeric_features,
    index=X.index
)
In [81]:
X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
In [82]:
X_prepared = pd.concat([X_numeric, X_categorical], axis=1)

print(X_prepared.shape)
print(X_prepared.columns[:10])  # Show first 10 columns to check
(7043, 32)
Index(['Tenure Months', 'Monthly Charges', 'Total Charges', 'Churn Score',
       'CLTV', 'Gender_Male', 'Senior Citizen_Yes', 'Partner_Yes',
       'Dependents_Yes', 'Phone Service_Yes'],
      dtype='object')
In [83]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_prepared, y, test_size=0.2, random_state=42, stratify=y
)
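
In the cells above, the scaler is fitted on the full dataset before the train/test split, so test-set statistics influence the scaling. One way to avoid this, sketched below on synthetic data, is to wrap preprocessing and the model in a scikit-learn `Pipeline` that is fitted on the training split only. The toy columns, the labelling rule, and the use of `GradientBoostingClassifier` as a stand-in for XGBoost are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the prepared feature table.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Tenure Months": rng.integers(0, 73, n),
    "Monthly Charges": rng.uniform(18, 119, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
})
y_toy = (toy["Tenure Months"] < 24).astype(int)  # invented labelling rule

numeric = ["Tenure Months", "Monthly Charges"]
categorical = ["Contract"]

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([
    ("prep", preprocess),
    ("clf", GradientBoostingClassifier(random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
model.fit(X_tr, y_tr)  # scaler and encoder see the training rows only
test_accuracy = model.score(X_te, y_te)

fitted_scaler = model.named_steps["prep"].named_transformers_["num"]
print("test accuracy:", round(test_accuracy, 3))
```

Because tree ensembles are insensitive to monotonic feature scaling, this change would probably not move the reported metrics much, but it keeps the evaluation protocol clean and makes the same object reusable for cross-validation.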

Predictive Modeling with XGBoost

With the dataset prepared, I implemented a machine learning model using XGBoost—a robust, high-performing algorithm well-suited to structured tabular data. I trained the model to predict customer churn based on the engineered features, evaluated its performance on a held-out test set, and analyzed feature importance to interpret key drivers of churn.

In [85]:
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay

# Train XGBoost Classifier
xgb_clf = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)
xgb_clf.fit(X_train, y_train)

# Predict on test set
y_pred = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# ROC AUC
auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC Score:", round(auc, 3))

# ROC Curve
RocCurveDisplay.from_estimator(xgb_clf, X_test, y_test)
plt.title('XGBoost ROC Curve')
plt.show()
              precision    recall  f1-score   support

           0       0.96      0.95      0.95      1035
           1       0.87      0.89      0.88       374

    accuracy                           0.93      1409
   macro avg       0.91      0.92      0.92      1409
weighted avg       0.93      0.93      0.93      1409

Confusion Matrix:
[[984  51]
 [ 43 331]]
ROC-AUC Score: 0.983
[Figure: XGBoost ROC Curve]
In [86]:
importances = xgb_clf.feature_importances_
features = X_train.columns

fi_df = pd.DataFrame({'feature': features, 'importance': importances}).sort_values('importance', ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x='importance', y='feature', hue='feature', data=fi_df.head(15), palette='Blues_r', legend=False)
plt.title('Top 15 Feature Importances (XGBoost)')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
[Figure: Top 15 Feature Importances (XGBoost)]

XGBoost Feature Importance

XGBoost’s built-in feature importance highlights Churn Score, contract type, and fiber optic internet service as the strongest predictors of churn. Importance scores do not indicate direction; the EDA and the SHAP analysis indicate that customers with two-year contracts or without fiber optic internet are less likely to churn. Other meaningful features include the presence of dependents, electronic check payment method, monthly charges, and tenure months.

These results confirm that both customer demographics and service characteristics play a critical role in churn risk, providing actionable levers for retention strategy.
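
Churn Score ranks first here, and it appears to be a precomputed churn prediction supplied with the dataset rather than a raw customer attribute (an assumption worth confirming against the data dictionary). A simple sanity check is to refit the model without such columns and compare ROC-AUC. The sketch below shows the idea on synthetic data; `auc_with_and_without` is a hypothetical helper, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_with_and_without(X, y, suspect_cols, random_state=42):
    """Fit the same model with and without `suspect_cols`; return test ROC-AUC for both."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    kept = [c for c in X.columns if c not in suspect_cols]
    results = {}
    for label, cols in [("all features", list(X.columns)), ("without suspects", kept)]:
        clf = GradientBoostingClassifier(random_state=random_state)
        clf.fit(X_tr[cols], y_tr)
        results[label] = roc_auc_score(y_te, clf.predict_proba(X_te[cols])[:, 1])
    return results

# Synthetic data: 'Churn Score' is built from the label, 'Tenure Months' is pure noise.
rng = np.random.default_rng(0)
n = 400
y_toy = pd.Series(rng.integers(0, 2, n))
X_toy = pd.DataFrame({
    "Tenure Months": rng.integers(0, 73, n),
    "Churn Score": y_toy * 40 + rng.integers(20, 61, n),
})

aucs = auc_with_and_without(X_toy, y_toy, ["Churn Score"])
print(aucs)
```

A large drop after removing the suspect columns would suggest that the headline metrics depend mostly on the precomputed score; a small drop would indicate that the remaining customer attributes carry the signal on their own.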

Model Evaluation and Insights

After training the XGBoost classifier, I evaluated its performance on the test set:

1. Classification Report & Key Metrics

  • Accuracy: The model achieved a high accuracy of 93% on the test set.
  • Precision: For the churn class (1), precision was 0.87—meaning 87% of customers predicted to churn actually did.
  • Recall: Recall for churn was 0.89, indicating the model correctly identified 89% of true churners. This is important for retention, as missed churners (false negatives) represent lost opportunities.
  • F1-score: The F1-score for churn was 0.88, demonstrating a balanced trade-off between precision and recall.

2. Confusion Matrix

  • True Negatives (TN): 984 non-churners were correctly classified.
  • False Positives (FP): 51 non-churners were incorrectly predicted as churners.
  • False Negatives (FN): 43 churners were missed by the model.
  • True Positives (TP): 331 churners were correctly identified.
  • The model managed to capture most churners while keeping false alarms low, providing actionable accuracy for business use.
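
The reported accuracy, precision, recall, and F1-score follow directly from the four confusion-matrix counts above. The short check below recomputes them from those counts alone.

```python
# Confusion-matrix counts reported for the test set (churn = positive class).
tn, fp, fn, tp = 984, 51, 43, 331

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted churners who churned
recall = tp / (tp + fn)                      # share of actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded values match the classification report: 0.93, 0.87, 0.89, and 0.88.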

3. ROC-AUC Score

  • The ROC-AUC score was 0.983, indicating the model is exceptionally strong at ranking customers by churn risk, even across different thresholds.

4. Feature Importance Analysis

  • Alongside Churn Score, which ranked first in the feature importance plot, the top drivers of churn prediction included:
    • Tenure Months: Customers with shorter tenure were much more likely to churn, confirming EDA insights.
    • Contract Type: Month-to-month contract holders had higher churn risk.
    • Service Add-ons: Lack of online security, tech support, and streaming services contributed to churn likelihood.
    • Monthly Charges & Payment Method: Higher charges and specific payment types (e.g., electronic check) were associated with increased churn.

5. Recommendations

  • Target short-tenure and month-to-month customers with special onboarding and retention offers.
  • Encourage adoption of add-on services (e.g., online security, tech support) as these are associated with higher retention.
  • Focus retention efforts on high-risk segments identified by the model (e.g., those with high monthly charges or certain payment methods).
  • Use the model's probability outputs to trigger retention campaigns before the customer churns.
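
Turning probability outputs into a campaign trigger requires choosing a cutoff. One possible approach, sketched below, picks the highest threshold that still catches a required share of churners. `threshold_for_recall` is a hypothetical helper, and the labels and scores are invented for illustration.

```python
import numpy as np

def threshold_for_recall(y_true, y_score, min_recall=0.9):
    """Return the highest score threshold whose recall is at least `min_recall`."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = (y_true == 1).sum()
    # Walk the candidate thresholds from strictest to most lenient.
    for t in np.unique(y_score)[::-1]:
        recall = ((y_score >= t) & (y_true == 1)).sum() / n_pos
        if recall >= min_recall:
            return float(t)
    return float(y_score.min())

# Invented example: four churners (label 1) and four retained customers.
labels = np.array([1, 1, 1, 0, 0, 1, 0, 0])
scores = np.array([0.95, 0.80, 0.60, 0.55, 0.30, 0.40, 0.20, 0.10])

cutoff = threshold_for_recall(labels, scores, min_recall=0.75)
flagged = scores >= cutoff
print("cutoff:", cutoff, "customers flagged:", int(flagged.sum()))
```

In practice the recall target would come from the relative cost of a missed churner versus an unnecessary retention offer, and the cutoff should be chosen on validation data rather than on the test set.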

6. Next Steps

  • Further model interpretation using SHAP or LIME could reveal even more about individual predictions.
  • Testing alternative models or ensembling may yield incremental gains.
  • Integrate the model into business operations for real-time churn risk monitoring.

Overall, the XGBoost model delivered highly accurate churn predictions and actionable business insights, positioning the company to proactively reduce customer loss.
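
Testing alternative models, as suggested in the next steps, could be done with stratified cross-validation so that every candidate is scored on the same folds. The sketch below uses a synthetic dataset from `make_classification` with roughly the same class balance as the churn data; the candidate list and all settings are illustrative assumptions, not results from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with about 27% positives.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.73, 0.27], random_state=42
)

# Fixed, shuffled folds so all models see identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean ROC-AUC across folds for each candidate.
cv_auc = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="roc_auc").mean()
    for name, est in candidates.items()
}
for name, score in cv_auc.items():
    print(f"{name}: {score:.3f}")
```

An `xgboost.XGBClassifier` can be added to `candidates` in the same way, since it follows the scikit-learn estimator interface.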

Model Interpretability with SHAP

To better understand the individual and global impact of each feature on churn predictions, I used SHAP (SHapley Additive exPlanations) values. SHAP provides intuitive visualizations that help explain why the model predicts high churn risk for specific customers, making the results more transparent and actionable for business stakeholders.

In [89]:
import shap

# Explain model predictions using SHAP
explainer = shap.TreeExplainer(xgb_clf)
shap_values = explainer.shap_values(X_test)

# Global feature importance summary plot
shap.summary_plot(shap_values, X_test, plot_type='bar')

# Detailed beeswarm plot (shows how each feature impacts churn predictions for all samples)
shap.summary_plot(shap_values, X_test)
[Figures: SHAP bar summary plot of global feature importance; SHAP beeswarm summary plot]

SHAP Analysis: Explaining Model Predictions

The SHAP summary plot reveals that Churn Score, Tenure Months, and contract type features (especially Contract_Two year) are the most influential in the XGBoost model’s predictions. In a SHAP beeswarm plot, color encodes the feature value (pink for high, blue for low) and points to the right of zero push the prediction toward churn. High churn scores (pink) and short tenure (blue) should therefore both appear on the right side of the plot, pushing the model toward predicting churn.

Contract features demonstrate that customers on longer contracts are less likely to churn, while features such as dependents and specific internet and payment types also play notable roles. SHAP helps confirm that the model is learning patterns consistent with our business intuition and EDA findings, and gives us transparency for individual customer decisions.

Summary of Model Interpretability and Insights

Combining SHAP analysis, XGBoost feature importances, correlation heatmaps, and pairwise plots provides complementary views of customer churn drivers. The model consistently identified short tenure, high churn score, month-to-month contracts, and lack of service add-ons as primary risk factors. These results align with domain knowledge and the earlier EDA, which supports the model's validity.

For the business, this means focusing retention strategies on:

  • New customers in their first year
  • Those with flexible (month-to-month) contracts
  • High churn scores or low engagement with add-on services
  • Customers with high monthly charges but low total lifetime value

Model explainability (via SHAP) makes it possible to deliver actionable, individualized interventions, increasing the practical value of this churn solution.

Conclusion and Business Recommendations

This project combined exploratory data analysis, feature engineering, and an XGBoost classifier to predict customer churn for a California telecommunications provider. On the held-out test set, the model achieved 93% accuracy and a ROC-AUC of 0.983.

Key drivers of churn were a high Churn Score, short customer tenure, month-to-month contracts, lack of service add-ons, higher monthly charges, and certain payment methods. By using SHAP, I further checked that these features consistently influence individual churn predictions, providing actionable insights for business strategy.

Recommended Actions

  • Proactive Retention: Use the model’s risk scores to proactively target customers likely to churn—especially those with short tenure and month-to-month contracts.
  • Service Bundling: Promote and incentivize add-on services (e.g., online security, tech support) to increase stickiness.
  • Contract Incentives: Develop attractive loyalty or contract-length incentives for at-risk customers.
  • Personalized Offers: Leverage SHAP-driven insights to craft personalized retention offers, focusing on the most important factors for each customer segment.
  • Continuous Monitoring: Integrate the churn model into operational dashboards to monitor churn risk and campaign effectiveness in real time.

Next Steps

  • Deploy the model in a live environment for real-time churn prediction.
  • Regularly retrain and monitor model performance as customer behavior and offerings evolve.
  • Explore additional features (e.g., customer support interactions, satisfaction scores) for even stronger predictions.

With this end-to-end solution, the business can identify at-risk customers earlier and act to reduce customer loss.
