Dom Owens is a Data Science Intern at Data Cubed. In this blog post he explains how Random Forest models work, and how machine learning can generate value for your business.


Rise of the machines

If you could predict something about your business, what would it be?

In recent years, there’s been an explosion in computing power. At the same time, there’s been a vast increase in the amount of data being collected. Combined, these give businesses the potential to use machine learning techniques to pick up on patterns hidden inside data. These patterns can then help us make predictions — valuable information for a business in a rapidly changing industry.

Think of any piece of information that you’d really like to know but don’t yet. How long it takes for a subscriber to unsubscribe, the value a new customer might bring to your business over their lifetime, the sales figures of a new offer you’re thinking of launching. All of these are valuable pieces of information when you’re trying to plan for the future.

Machine learning techniques can not only help you make predictions, but can also tell you how accurate that prediction is and how the other information you have is influencing it.


Exploring the Random Forest

Suppose you have a large amount of data with lots of measurements and you want to predict one of two things: a number (say, how many units a product will sell) or a class (say, which newspaper a customer will buy).

The Random Forest algorithm — a set of rules for making decisions — can handle this situation elegantly, giving accurate and robust predictions.

An example of an informal decision tree

Let’s start from the roots. You’re probably familiar with the concept of a decision tree as a way to find the answer to a problem. We ask a question that has two possible answers — often yes or no — and follow the answer to another question. We repeat the process until we get a final answer that we’re satisfied with.

The Random Forest algorithm does the same thing. To split the current branch of the tree, it asks a question of the data. Rather than considering every possible question, it picks the best one from a small, randomly chosen group of candidates: the question that separates the data most cleanly into groups with similar outcomes.

This gives us a tool for making predictions: a new set of data can be fed through the decision tree, producing a prediction for each new observation. But a single decision tree can veer far off course if it gets caught on irregularities in the data, leading to less accurate predictions.
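As a rough sketch of the idea, here's a single decision tree trained with scikit-learn on a tiny made-up dataset (the feature names and numbers are invented for illustration, not taken from any data in this post):

```python
# A single decision tree on a tiny, made-up dataset (illustrative only).
from sklearn.tree import DecisionTreeClassifier

# Toy features: [monthly_spend, months_subscribed]; labels: 1 = left, 0 = stayed.
X = [[80, 2], [75, 3], [20, 30], [25, 28], [85, 1], [22, 35]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Feed new observations through the tree to get a prediction for each.
new_customers = [[78, 2], [21, 32]]
predictions = tree.predict(new_customers)
print(predictions)
```

With data this small and clean the tree fits it perfectly — which is exactly the danger mentioned above: a lone tree happily memorises irregularities in its training data.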

Voting average

Thankfully, we can remedy this by using the wisdom of crowds. We choose a large subset of the original data and create a tree to fit it. Then, with a new subset each time, we fit many more trees, each capable of making predictions for new data. Since each tree fits different data, and the trees split according to random rules, each tree will be unique and give different predictions for the same data. 

We then take an average of all the predictions: the most popular vote when predicting a class, or the mean when predicting a number. This gives us a more accurate prediction than any one tree alone.
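scikit-learn's RandomForestClassifier wraps this whole recipe up — many trees, each fit to a random subset of the data, combined by majority vote. A minimal sketch on the same invented toy data as before:

```python
# A forest of trees, each fit to a random sample of the data (bagging),
# with the final class decided by majority vote. Toy data, illustrative only.
from sklearn.ensemble import RandomForestClassifier

X = [[80, 2], [75, 3], [20, 30], [25, 28], [85, 1], [22, 35]]
y = [1, 1, 0, 0, 1, 0]

# 100 trees; each sees a bootstrap sample and random candidate questions.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# predict() takes the most popular vote across all 100 trees;
# predict_proba() exposes the underlying vote shares.
print(forest.predict([[78, 2]]))
print(forest.predict_proba([[78, 2]]))
```

For a numeric target, RandomForestRegressor works the same way but averages the trees' numeric predictions instead of voting.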

We can even see which factors are important in influencing how the trees make their decisions. Perhaps product sales depend heavily on the weather conditions and your website traffic, but less so on how much is spent on marketing the product.
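Those importance scores come for free after fitting. A sketch with synthetic data (the three feature names below are invented to mirror the example in the text; only the first two are actually wired to the outcome):

```python
# Which measurements drive the forest's decisions? Synthetic data where
# "weather" matters most, "web_traffic" a little, and "marketing_spend" not at all.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
weather = rng.normal(size=n)          # strongly related to the outcome
web_traffic = rng.normal(size=n)      # moderately related
marketing_spend = rng.normal(size=n)  # unrelated noise

# Outcome depends on weather (heavily) and web traffic (less so).
y = (2 * weather + web_traffic + 0.3 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([weather, web_traffic, marketing_spend])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, score in zip(["weather", "web_traffic", "marketing_spend"],
                       forest.feature_importances_):
    print(f"{name}: {score:.2f}")
```

The `feature_importances_` scores sum to 1, so they read as each measurement's share of the forest's decision-making.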


Let’s look at an example

Let’s consider a telecoms business using this example dataset, and see if we can predict how new customers will behave. The dataset contains information on previous customers, such as their gender, the type of contract they hold, and the charges they’ve incurred, as well as whether they’re no longer a customer. Being able to predict customer churn can be very useful for making decisions about pricing and marketing. 

The Error Plot tells us that, as we fit more trees, the model becomes more accurate. This shows us that taking the average of multiple trees is really valuable.

The Variable Importance plot shows us how important the information we’re measuring is. Larger readings correspond to more importance. We can see that charges and type of contract are most important, while the customer’s gender and whether they have a partner are least important.

When we try out the model on new data, it gets predictions right around 80% of the time. It does well at identifying customers who don’t churn, but less well at identifying customers who do. Given that 73% of the customers in the dataset did not churn, this is a useful improvement on simply guessing from the population percentages. We might improve on this by collecting more data, recording more information about each customer, or by tuning the settings of the model algorithm.
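That comparison against guessing can be sketched as follows. The telecoms dataset itself isn't included here, so synthetic, similarly imbalanced data stands in for it:

```python
# Compare a random forest against a naive majority-class guess on held-out data.
# Synthetic, imbalanced data (illustrative only) stands in for the churn dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.73], random_state=0)  # ~73% majority class
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most common class (the "guessing" benchmark).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print("always guess majority:", baseline_acc)
print("random forest:        ", forest_acc)
```

The gap between the two accuracy figures is what the model is actually earning over a naive guess — a useful sanity check whenever the classes are imbalanced.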

The test set error rate tells us the model gets it right nearly 80% of the time.

Many other problems can be solved in a similar way, which brings us back to our original question — if you could predict something about your business, what would it be?


We’ve developed a new predictive modelling tool using Random Forest algorithms, which we’re using to help our clients peer into the future. Our new tool allows us to identify your most valuable clients now, and then lets us go one step further to predict which clients will be the most valuable in the future. We can predict other things too, such as customer churn, or sales figures for a new offer. And we can show you the results within days. 

We’re really excited about this technology. If you are too and would like to know more, email us at hello@data-cubed.co.uk or call us on 0117 25 10 100.