A Hands-on Introduction to H2O’s AutoML: E-commerce Churn Prediction


Imagine you’re learning to cook, but instead of struggling through chopping onions and simmering sauces, you stumble upon a magical kitchen gadget. This gadget takes your ingredients, whips up a delicious meal, and even cleans up after itself. Sounds incredible, right? Well, that’s exactly how I felt when I first discovered H2O’s AutoML. It can handle all sorts of data and makes creating machine-learning models a breeze. I’ve had such a good time with it, I just had to share what I’ve learned.

Understanding AutoML

Think of it like this: You know how ChatGPT makes having a conversation feel easy? It’s because a lot of hard work and complex processes are going on behind the scenes to understand what you’re saying and respond appropriately. But for you, it’s just a chat. Now, imagine something similar, but for machine learning training and prediction. That’s what AutoML is.

In the world of machine learning, AutoML is like your personal assistant. It does all the heavy lifting that comes with setting up and running machine learning projects. It handles everything from dealing with raw data to picking the best model, all under the hood. The idea is to make machine learning less complicated and more efficient, just like ChatGPT does with conversation. So, whether you’re a seasoned pro or just starting out, AutoML has got your back.

In this blog, I’ll walk you through what makes H2O AutoML so great, and I’ll share a project I’ve been working on. So, stick around and let’s see what H2O AutoML can do!

Diving into H2O

H2O is an open-source platform built for data analysis and machine learning. Developed by H2O.ai, it’s designed to bring fast, scalable machine learning to individuals and organizations, regardless of their size or resources. H2O is written in Java, but it provides APIs for languages like Python, R, and Scala, making it accessible to a wide range of developers and data scientists.

AutoML Workflow Overview from H2O.ai

How is H2O different?

H2O sets itself apart with its ability to handle large datasets that wouldn’t fit into a single machine’s memory.
It breaks big tasks into smaller ones, distributes them across multiple machines, and then collates the results.
This makes it an excellent tool for businesses with large amounts of data to process quickly.
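To make the split-and-combine idea concrete, here is a toy sketch in plain Python. It is not how H2O is implemented; the function names and the chunk size are made up for illustration. Each chunk stands in for the slice of data held by one machine, and only small summaries travel back to be combined.

```python
def chunked(values, chunk_size):
    """Yield successive slices of `values`, standing in for per-machine shards."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def partial_stats(chunk):
    """'Map' step: each worker reports only a sum and a count."""
    return sum(chunk), len(chunk)


def combine(partials):
    """'Reduce' step: merge the per-chunk summaries into one overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Made-up order values, split into chunks of three.
order_values = [12.0, 30.0, 18.0, 40.0, 25.0, 55.0, 10.0]
partials = [partial_stats(c) for c in chunked(order_values, chunk_size=3)]
distributed_mean = combine(partials)
```

The combined result matches the mean you would get from the full list, even though no single step ever looked at all the rows at once.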

H2O AutoML

H2O’s AutoML is an automated machine learning system that simplifies the process of building machine learning models.
It handles everything from data preprocessing and feature engineering to model selection and tuning.
Using H2O’s AutoML requires minimal machine learning knowledge — you just provide your dataset, specify a target variable, and set a time limit.
H2O AutoML trains and tunes various models and then presents a leaderboard of the best ones.
It’s a powerful tool for both beginners and experienced data scientists — it simplifies learning for beginners and saves time and effort for experienced users.

Implementing H2O AutoML

A little while back, I took on a project that revolved around predicting customer churn for an eCommerce company. If you’re not familiar with the term, “churn” refers to customers who stop doing business with a company. As you can imagine, predicting which customers might churn is incredibly valuable for a business — it lets them intervene and try to keep those customers around.

The company had a lot of data on their customers, from their purchase history to their browsing behaviour on the website. The challenge was to use this data to predict which customers were likely to churn in the future.

Given the size and complexity of the data, I decided to use H2O’s AutoML for this task. I had used H2O before and was impressed by its performance with large datasets. But this was my first time using its AutoML functionality, and I was eager to see how it would perform.

Getting Started with H2O’s AutoML

Getting started with H2O’s AutoML was straightforward. After loading the data into H2O, I set up the AutoML run by specifying the target variable (in this case, whether or not a customer churned) and setting a time limit for the run. Then I let H2O’s AutoML do its thing. It automatically preprocessed the data, engineered features, and trained a variety of machine learning models. Once the run was complete, it presented me with a leaderboard of the best models.

Here is a step-by-step guide to getting started with H2O’s AutoML.

Step 1: Dataset

I owe a big thanks to Ankit Verma for the dataset I used in this project, which I found on Kaggle. If you’re interested, you can grab the dataset for yourself there.

Step 2: Install the required Libraries

To start working with H2O’s AutoML, you need to install several Python packages. Run the following commands in a notebook cell (the leading ! passes the command to the shell, so drop it if you are typing into a terminal):

!pip install requests
!pip install tabulate
!pip install "colorama>=0.3.8"
!pip install future
!pip install h2o
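Before moving on, it can help to confirm the installs actually landed. The sketch below is one way to do that with the standard library only; missing_packages is a helper name I made up. It also looks for a java executable, since H2O runs as a Java process and needs a Java runtime to start a local server.

```python
import importlib.util
import shutil


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["requests", "tabulate", "colorama", "future", "h2o"]
missing = missing_packages(required)
java_path = shutil.which("java")  # None if no 'java' executable is on PATH

if missing:
    print("Still need to install:", ", ".join(missing))
if java_path is None:
    print("No 'java' executable found on PATH; H2O needs a Java runtime to start locally.")
```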

Step 3: Initialize H2O and Set Up AutoML

We need to get H2O up and running. To do this, we’ll start by importing the necessary libraries. We’ll need the h2o library itself, of course, and from that, we're also going to import H2OAutoML.

import h2o
from h2o.automl import H2OAutoML

Once we’ve got those, we can initialize our H2O instance. We’re also going to set a limit on the maximum memory size that H2O can use. For this project, I’ve set it to ‘16G’, but you can adjust this according to your system’s capabilities.

h2o.init(max_mem_size='16G')
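If you are not sure what to pass as max_mem_size, one rough approach is to give H2O a fixed share of your machine's RAM. The helper below is only a sketch: heap_size_for and the 50% default are my own assumptions, not an H2O recommendation.

```python
def heap_size_for(total_ram_gb, fraction=0.5, minimum_gb=1):
    """Return a size string such as '8G' using `fraction` of the machine's RAM."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    size_gb = max(minimum_gb, int(total_ram_gb * fraction))
    return f"{size_gb}G"


# Example: a laptop with 16 GB of RAM gets an 8 GB limit.
laptop_heap = heap_size_for(16)
# h2o.init(max_mem_size=laptop_heap)
```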

Step 3 (continued): Import Data and Set Up Variables

Next, we’ll import our dataset and split it into a training set and a test set. We’ll also define our dependent and independent variables.

First, we’ll use the h2o.import_file() function to import our dataset. Make sure to replace “Ecom.csv” with the path to your dataset if it's different. To have a quick look at our data, we'll use the df.head() function.

df = h2o.import_file("Ecom.csv")
df.head()
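Since the later steps rely on the “Churn” and “CustomerID” columns by name, a quick header check can save a confusing error further down. This is a minimal sketch using only the csv module; missing_columns is a hypothetical helper, and it assumes the file is a comma-separated CSV with a header row.

```python
import csv


def missing_columns(csv_path, required):
    """Read only the header row of a CSV and report which required columns are absent."""
    # utf-8-sig strips a byte-order mark if the file has one.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Usage (assumes the file sits next to the notebook, as above):
# missing_columns("Ecom.csv", ["CustomerID", "Churn"])
```

An empty list means every column you asked about is there.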

Next, we’ll split our dataset into a training set and a test set. We’ll use the df.split_frame() function for this, specifying that we want 80% of the data in our training set (leaving the remaining 20% for our test set).

df_train, df_test = df.split_frame(ratios=[.8])
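One detail worth knowing: split_frame() assigns rows at random, so the split is close to 80/20 rather than exact, and it changes between runs unless you pass a seed. The plain-Python sketch below imitates that behaviour to show the idea; random_split and the seed value are illustrative, not H2O's internals.

```python
import random


def random_split(row_ids, train_ratio=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row_id in row_ids:
        (train if rng.random() < train_ratio else test).append(row_id)
    return train, test


rows = list(range(1000))
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # close to 800 / 200, usually not exact

# The H2O call with a fixed seed would be:
# df_train, df_test = df.split_frame(ratios=[0.8], seed=42)
```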

Now, we’ll set up our dependent and independent variables. Our dependent variable (the one we’re trying to predict) is “Churn”. Our independent variables (the ones we’re using to make our predictions) are all the other columns in our dataset except “CustomerID”, which we remove because it is a unique identifier for each customer and carries no predictive signal.

y = "Churn"
x = df.columns
x.remove(y)
x.remove('CustomerID')
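One thing to check at this point: if “Churn” is stored as 0/1 integers, H2O reads it as a numeric column and AutoML treats the task as regression. If you want binary classification instead, convert the target to a factor first. The sketch below shows one way to do both steps; select_features and as_classification are helper names I made up, and the second one assumes it receives an H2OFrame.

```python
def select_features(columns, target, ignore=()):
    """Return predictor names: every column except the target and ignored ones."""
    excluded = {target, *ignore}
    return [name for name in columns if name not in excluded]


def as_classification(frame, target):
    """Convert the target column of an H2OFrame to a categorical (factor) column."""
    frame[target] = frame[target].asfactor()
    return frame


# Usage with the frames from this step:
# x = select_features(df.columns, "Churn", ignore=["CustomerID"])
# df_train = as_classification(df_train, "Churn")
# df_test = as_classification(df_test, "Churn")
```

Building the list with a comprehension also avoids mutating the original column list, which makes the cell safe to re-run.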

And there we have it! We’ve imported our data, split it into a training set and a test set, and set up our dependent and independent variables. Now we’re ready to start using AutoML.

Step 4: Train AutoML Model

With our data set up, we’re now ready to train our AutoML model. For this, we’ll set up an H2OAutoML object, specify some parameters, and let it train on our data.

First, we initialize our AutoML model. We’re setting max_runtime_secs=500, which caps the AutoML run at 500 seconds; max_models=15, which limits the run to 15 models (not counting the stacked ensembles built on top of them); seed=7 for reproducibility; verbosity="info" to get detailed logs; and nfolds=4 for 4-fold cross-validation.

aml = H2OAutoML(max_runtime_secs=500, max_models=15, seed=7, verbosity="info", nfolds=4)

Next, we train the model on our training data with the train() function, specifying our independent and dependent variables and the training frame.

aml.train(x=x, y=y, training_frame=df_train)
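A note on those limits: when both max_runtime_secs and max_models are set, the run stops at whichever limit is reached first. A time budget also makes runs harder to repeat exactly, because how many models finish depends on the machine. The sketch below collects the settings in a dict and offers a variant without the time limit; automl_settings and run_automl are hypothetical helpers, and excluding Deep Learning is an assumption based on it not being reproducible by default.

```python
def automl_settings(reproducible=False):
    """Return keyword arguments for H2OAutoML as a plain dict."""
    settings = {"max_models": 15, "seed": 7, "nfolds": 4, "verbosity": "info"}
    if reproducible:
        # No time budget, so every model trains to completion;
        # Deep Learning is skipped because it is not reproducible by default.
        settings["exclude_algos"] = ["DeepLearning"]
    else:
        settings["max_runtime_secs"] = 500
    return settings


def run_automl(x, y, training_frame, reproducible=False):
    """Train AutoML with the settings above and return the fitted object."""
    from h2o.automl import H2OAutoML  # imported here so automl_settings works without H2O

    aml = H2OAutoML(**automl_settings(reproducible))
    aml.train(x=x, y=y, training_frame=training_frame)
    return aml
```

Calling run_automl(x, y, df_train) would then mirror the two lines above, while run_automl(x, y, df_train, reproducible=True) trades the time cap for repeatability.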

We then retrieve the leaderboard which ranks the models based on their performance.

The model that performed the best was a StackedEnsemble, which was a bit of a surprise — I had expected a deep-learning model to come out on top. But that’s the beauty of AutoML: it can uncover insights and deliver results that might not be obvious if you were doing everything manually.

lb = aml.leaderboard
lb.head(rows=lb.nrows)  # display every row of the leaderboard, not just the first ten
model_ids = list(lb['model_id'].as_data_frame().iloc[:, 0])  # model IDs, best first
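Stacked ensembles often top the leaderboard, but sometimes you want the best single model instead, for example because it is easier to explain. Since the list of model IDs is already in leaderboard order, a small filter does the job. This is a sketch: first_base_model is my own helper, and the example IDs are invented to resemble the ones AutoML generates.

```python
def first_base_model(model_ids):
    """Return the highest-ranked ID that is not a stacked ensemble, or None."""
    for model_id in model_ids:
        if not model_id.startswith("StackedEnsemble"):
            return model_id
    return None


# Hypothetical IDs in leaderboard order.
example_ids = [
    "StackedEnsemble_AllModels_1_AutoML_1",
    "GBM_2_AutoML_1",
    "XGBoost_1_AutoML_1",
]
best_single = first_base_model(example_ids)  # "GBM_2_AutoML_1"

# With a real run: best_model = h2o.get_model(first_base_model(model_ids))
```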

We can check the performance of the top model (the “leader”) on the test set.

aml.leader.model_performance(df_test)

At the end of these steps, you should have a trained AutoML model and a set of metrics describing how well it performs. In this case, the output reports the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Root Mean Squared Logarithmic Error (RMSLE), and R². Because we passed df_test, these numbers describe the held-out test set; the model object also stores the same metrics for the training and cross-validation data. These are regression metrics, which is what H2O reports when the target column is numeric.
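If those metric names are new to you, the sketch below computes four of them by hand in plain Python (RMSLE is left out to keep it short). The labels and scores are made up for illustration, and regression_metrics is my own helper, not an H2O function.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    total_var = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - (mse * n) / total_var  # 1 minus residual sum of squares over total sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Made-up churn labels (0/1) and model scores.
labels = [0, 1, 0, 1]
scores = [0.1, 0.8, 0.3, 0.6]
metrics = regression_metrics(labels, scores)
```

Lower is better for MSE, RMSE, and MAE, while an R² closer to 1 means the model explains more of the variation in the target.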

Step 5: Making Predictions on the Test Data

Now that we have our top-performing model, it’s time to put it to work on our test data. This is done by calling the predict function and passing in our test data.

pred = aml.predict(df_test)

To check out our predictions, we use the head function, which gives us the first few rows of our prediction data:

pred.head()

This will show you the top few predictions made by our model on the test data. It’s always exciting to see what your model has come up with!
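When the target is numeric, the predict column holds a score rather than a yes/no label, so you need a cut-off to decide who counts as a likely churner. Here is a minimal sketch of that step; flag_churners and the 0.5 threshold are assumptions for illustration, and the right threshold depends on how costly a missed churner is for the business.

```python
def flag_churners(scores, threshold=0.5):
    """Return 1 for scores at or above the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical scores, e.g. from pred.as_data_frame()["predict"].tolist()
example_scores = [0.05, 0.72, 0.49, 0.50]
flags = flag_churners(example_scores)  # [0, 1, 0, 1]
```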

AutoML for Your Projects

Whether you’re a seasoned data scientist or a beginner in the field of machine learning, H2O’s AutoML presents a fantastic opportunity to streamline your model-building process. It’s a powerful tool that can help you discover optimal models for your datasets with minimal manual intervention.

H2O’s AutoML is not limited to standard business use cases; it has proven effective in other fields too. A recent example is a study on water quality prediction, in which a team of researchers used H2O-3 AutoML and published a paper detailing their findings. The paper underscores the importance of precise water quality predictions for robust water management and environmental protection.

Water Quality Prediction with H2O AutoML and Explainable AI Techniques by Hamza Ahmad Madni et al.

In conclusion, H2O’s AutoML can be a game-changer for your machine-learning projects. It’s about time you add this powerful tool to your data science toolkit. So, dive in, experiment, and unlock the potential of AutoML for your projects. Happy Modeling!

A Hands-on Introduction to H2O’s AutoML: E-commerce Churn Prediction was originally published in DataDrivenInvestor on Medium.
