Classify Prospective Leads Using Machine Learning

Muhammad Ishla Fakhri
8 min read · May 11, 2024
Image by storyset on Freepik

What is Lead Scoring?

Lead scoring is a methodology used by sales and marketing departments to determine the worthiness of leads or potential customers by assigning values based on their behavior related to their interest in products or services. — TechTarget.com

What are the Benefits of Lead Scoring?

  1. Prioritize Sales Efforts: It helps to identify and prioritize leads that are most likely to convert, allowing sales teams to focus their efforts on the most promising prospects.
  2. Enhance Lead Quality: It improves the quality of leads passed to sales by ensuring that only the leads with the highest potential are pursued.
  3. Improve ROI on Marketing Campaigns: By focusing on high-scoring leads, companies can achieve a higher return on investment for their marketing campaigns.

What is the Predictive Lead Scoring Method?

Predictive lead scoring models use machine learning to build a predictive model from historical customer data. The approach is to analyze past lead behavior, or past interactions between a company and its leads, and find which attributes of that data correlate with a positive business outcome (for instance, a closed deal).

Create a Lead Scoring Model

I will provide detailed guidance on implementing a lead scoring model using Python, with step-by-step instructions to help you understand and execute each part of the process.

Import Libraries

# Data manipulation libraries
import pandas as pd
import numpy as np

# Preprocessing libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# Model selection libraries
from sklearn.model_selection import train_test_split, cross_val_score

# Model building libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import xgboost as xgb
from xgboost import XGBClassifier # Direct import of XGBClassifier for convenience

# Model evaluation libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Visualization libraries
import plotly.figure_factory as ff
import plotly.graph_objects as go

Load Dataset

I use the Lead Scoring dataset from Kaggle.

You can access the dataset using this link: https://www.kaggle.com/datasets/amritachatterjee09/lead-scoring-dataset.

# Dataset from kaggle: https://www.kaggle.com/datasets/amritachatterjee09/lead-scoring-dataset

df = pd.read_csv('Lead Scoring.csv')
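
Before cleaning anything, it helps to see how large the dataset is and which columns actually have gaps. A quick check on the DataFrame loaded above:

# Quick look at the data and its missing values
print(df.shape)
print(df.isnull().sum().sort_values(ascending=False).head(10))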

Data Cleaning

Fill missing values

A few columns in our dataset have missing values, so we need to fill them.

# Identify numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

# Fill missing numeric data with their mean
df[numeric_cols.columns] = numeric_cols.fillna(numeric_cols.mean())

# Identify non-numeric columns
non_numeric_cols = df.select_dtypes(exclude=['int64', 'float64'])

# Fill missing non-numeric data with 'Missing'
df[non_numeric_cols.columns] = non_numeric_cols.fillna('Missing')

For numeric columns, we fill missing values with the column mean; for non-numeric columns, we fill them with the placeholder value “Missing.”
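
A quick sanity check confirms that no missing values remain after the two fills:

# Expect 0 after filling numeric and non-numeric columns above
print(df.isnull().sum().sum())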

Feature Selection and Engineering

Feature Encoding

Feature encoding is crucial for many machine learning algorithms. Many models cannot directly handle non-numeric data (e.g., categorical strings); they require all input data to be numeric, so we must transform categorical features into numbers.

# Define a function to encode non-numeric data types using label encoding.

def encode_non_numeric(df):
    """
    Encodes non-numeric columns in the DataFrame using label encoding.

    Parameters:
    df (pd.DataFrame): The DataFrame to encode.

    Returns:
    pd.DataFrame: A DataFrame with non-numeric columns encoded.
    """
    # Create a copy of the DataFrame to avoid modifying the original data
    encoded_df = df.copy()

    # Identify non-numeric columns
    non_numeric_cols = encoded_df.select_dtypes(exclude=['int64', 'float64'])

    # Initialize LabelEncoder
    le = LabelEncoder()

    # Apply LabelEncoder to each non-numeric column
    for column in non_numeric_cols.columns:
        # Fit and transform the data
        # Use `astype(str)` to ensure proper conversion for mixed types
        encoded_df[column] = le.fit_transform(non_numeric_cols[column].astype(str))

    return encoded_df
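
The modelling steps below assume the encoded DataFrame is stored in df_encode, so apply the function before moving on. One caveat worth noting: label encoding imposes an arbitrary numeric order on categories, which tree-based models tolerate well but which can mislead linear models and SVMs.

# Apply the encoding; df_encode is used throughout the modelling sections
df_encode = encode_non_numeric(df)
df_encode.head()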

Data Modelling

Model Preparation

This part performs data preprocessing and splits the dataset into training and testing sets for the machine learning model.

# Setting 'Lead Number' as the index
df_encode.set_index('Lead Number', inplace=True)

# Dropping unnecessary column 'Prospect ID'
df_encode.drop(['Prospect ID'], axis=1, inplace=True)

# All categorical data are already encoded; proceed with the train-test split
# (train_test_split was already imported above)

# Split the data
X = df_encode.drop('Converted', axis=1)
y = df_encode['Converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking the shape of the train and test sets
X_train.shape, X_test.shape
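
Since the split above is not stratified, it is also worth checking that the target classes are reasonably balanced in both splits. A quick check:

# Share of converted vs. not converted leads in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))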

Model Selection

We try several models to choose the best one for predicting lead conversion: logistic regression, random forest, gradient boosting, an SVM classifier, and the XGBoost classifier.

  1. Logistic Regression is a statistical model that estimates the probability of a binary outcome based on one or more predictor variables. It’s widely used for binary classification tasks such as spam detection or determining whether a loan should be approved.
  2. Random Forest is an ensemble learning method for classification and regression that operates by constructing multiple decision trees during training. The final prediction is made based on the majority vote (for classification) or average prediction (for regression) of the individual trees, which helps achieve higher accuracy and robustness against overfitting.
  3. Gradient Boosting is a machine learning technique that builds models incrementally using an ensemble of weak prediction models, typically decision trees. It improves model predictions by focusing on correcting the mispredictions of previous models through iterative learning.
  4. SVM (Support Vector Machine) Classifier is a powerful and versatile classification technique that finds the hyperplane which best separates different classes in the feature space. SVMs are effective in high-dimensional spaces and are versatile as they can be equipped with different kernel functions to handle non-linear separations.
  5. XGBoost Classifier stands for eXtreme Gradient Boosting. It is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework, providing a scalable, efficient, and fast implementation that has proven effective in numerous machine learning competitions.
# Model selection: one helper function per candidate model

def logistic_regression(X_train, y_train):
    # Raise max_iter to help convergence on unscaled features
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model

def random_forest(X_train, y_train):
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    return model

def gradient_boosting(X_train, y_train):
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    return model

def svm_classifier(X_train, y_train):
    model = SVC()
    model.fit(X_train, y_train)
    return model

def xgboost_classifier(X_train, y_train):
    model = XGBClassifier()
    model.fit(X_train, y_train)
    return model

Model Evaluation

# Model evaluation function
def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions)
    return accuracy, report

# Running and evaluating each model
models = {
    "Logistic Regression": logistic_regression,
    "Random Forest": random_forest,
    "Gradient Boosting": gradient_boosting,
    "SVM": svm_classifier,
    "XGBoost": xgboost_classifier
}

# Compare the models
for name, model_func in models.items():
    print(f"Running {name}...")
    model = model_func(X_train, y_train)
    accuracy, report = evaluate_model(model, X_test, y_test)
    print(f"{name} Accuracy: {accuracy}")
    print(f"Classification Report for {name}:\n{report}\n")

Based on the model evaluation, XGBoost achieves the highest accuracy, roughly 94%, so XGBoost is the model we will use to predict lead scores.
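
The imports at the top include cross_val_score, which the snippets above do not use; as a robustness check, you could verify that the ~94% figure is not an artifact of a single split. A minimal sketch, using the X and y defined during model preparation:

# 5-fold cross-validation with a fresh XGBoost classifier
cv_scores = cross_val_score(XGBClassifier(eval_metric='logloss'), X, y, cv=5, scoring='accuracy')
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")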

Confusion Matrix for XGBoost Model

Based on the image above, the model has relatively low false positive and false negative rates, which indicates it performs well.
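
The heatmap itself can be reproduced with Plotly's figure factory, imported at the top as ff. A minimal sketch, which refits the XGBoost helper to get test-set predictions (the variable names here are my own):

# Confusion matrix heatmap for the XGBoost predictions
cm_model = xgboost_classifier(X_train, y_train)
cm = confusion_matrix(y_test, cm_model.predict(X_test))
fig = ff.create_annotated_heatmap(
    z=cm,
    x=['Predicted Not Converted', 'Predicted Converted'],
    y=['Actual Not Converted', 'Actual Converted'],
    colorscale='Blues'
)
fig.show()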

Fit Data into Model

# Choose XGBoost as the best model.

# Building the XGBoost model
# Note: use_label_encoder is deprecated in recent XGBoost releases and can be omitted there
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Getting probability predictions and rounding to two decimal places
xgb_prob_predictions = [round(prob, 2) for prob in xgb_model.predict_proba(X_test)[:, 1]] # Assuming binary classification

# Function to categorize probabilities
def categorize_probability(prob):
    if 0.1 <= prob <= 0.3:
        return 'Low Probability'
    elif 0.4 <= prob <= 0.6:
        return 'Medium Probability'
    elif 0.7 <= prob <= 1.0:
        return 'High Probability'
    else:
        return 'Undefined'  # For probabilities outside the specified ranges

# Apply categorization to probability predictions
probability_categories = [categorize_probability(prob) for prob in xgb_prob_predictions]

# Convert probabilities to 0 or 1 based on threshold using list comprehension
xgb_predictions = [int(prob > 0.5) for prob in xgb_prob_predictions]

# Evaluate the model using the binary predictions
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
xgb_report = classification_report(y_test, xgb_predictions)

# Print the results
print("XGBoost Model Accuracy:", xgb_accuracy)
print("Classification Report:\n", xgb_report)

# Adding predictions to the test set for review
X_test['Predicted_Probability'] = xgb_prob_predictions
X_test['Probability_Category'] = probability_categories
X_test['Predicted_Conversion'] = xgb_predictions
X_test['Actual_Conversion'] = y_test

Here I reuse the test data to demonstrate the model's final output. In practice you would score new leads that were not part of training or testing; I use the test set only because the dataset is relatively small, so this is purely a demonstration of running the model.

I also added probability categories: low probability if the predicted probability is 0.1–0.3, medium probability if it is 0.4–0.6, and high probability if it is 0.7–1.0 (values falling outside these ranges are labeled “Undefined”). For the metric evaluation, however, any probability > 0.5 is treated as 1, i.e., predicted to convert.

Feature Importances

Based on the image above, these are the Top 5 features that most influence the model (a sketch to reproduce the ranking follows the list):

  1. Lead Quality: Indicates the quality of the lead, based on the data and the intuition of the employee assigned to that lead.
  2. Tags: Tags assigned to customers indicating the current status of the lead.
  3. Lead Origin: The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
  4. What is your current occupation: Indicates whether the customer is a student, unemployed or employed.
  5. Last Notable Activity: The last notable activity performed by the student.
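
To reproduce the ranking without the chart, you can read the importances straight off the fitted model. A minimal sketch, assuming xgb_model and X_train from the steps above:

# Rank features by the model's importance scores
importances = pd.Series(xgb_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))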

Result

# Recoding the predictions into readable classes
X_test['Predicted_Class'] = X_test['Predicted_Conversion'].replace({0: 'Not Converted', 1: 'Converted'})
X_test['Actual_Class'] = X_test['Actual_Conversion'].replace({0: 'Not Converted', 1: 'Converted'})

# Resetting the index of X_test so 'Lead Number' becomes a regular column again
X_test_reset = X_test.reset_index()

# Merging df with X_test_reset on 'Lead Number' so each prediction lines up with
# the right lead; overlapping (encoded) columns from X_test get a '_drop' suffix
result_df = df.merge(X_test_reset, on='Lead Number', suffixes=('', '_drop'))

# Dropping the duplicated columns that carry the '_drop' suffix
result_df = result_df[[col for col in result_df.columns if not col.endswith('_drop')]]

# Dropping the 'Converted' column
result_df.drop('Converted', axis=1, inplace=True)

# Show DataFrame Leads Scoring result
result_df

Based on the above results, we have the following new information for each lead:

1. Predicted Probability: the predicted conversion probability, between 0 and 1.
2. Probability Category: the probability class, i.e., low, medium, or high probability.
3. Predicted Conversion: 0 for not converted and 1 for converted.
4. Predicted Class: the readable label “Converted” or “Not Converted” (the Actual Class column carries the true outcome for comparison).

These results enable more data-driven decision-making and can make sales efforts more effective.
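
For example, to surface the leads a sales team should contact first, a short sketch using result_df from above (column names follow the Result section):

# Hottest leads first: high-probability category, sorted by predicted probability
hot_leads = result_df[result_df['Probability_Category'] == 'High Probability']
hot_leads = hot_leads.sort_values('Predicted_Probability', ascending=False)
print(hot_leads[['Lead Number', 'Predicted_Probability', 'Predicted_Class']].head(10))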

Conclusion

In summary, lead scoring is an essential method that helps sales and marketing teams pinpoint and prioritize prospective clients most likely to purchase. Thus, lead scoring improves the quality of leads and boosts the return on marketing investments.

By employing predictive lead scoring models that use machine learning, like the XGBoost model built here, companies can use past data to anticipate future behavior and make well-informed choices.

This technique makes the sales process more efficient by concentrating on high-value leads and refines marketing tactics, guaranteeing that resources are devoted to the most impactful initiatives.

Thank you for reading, and feel free to connect with me on LinkedIn

GitHub Link: https://github.com/Ishlafakhri/Leads-Scoring-Model
