Machine learning: how much data is enough data?

December 23, 2020
Florin Bulgarov, Chief Data Scientist at Postis, reveals some of the secrets behind the algorithms
By Florin Bulgarov

Machine learning. It is already in most of the technology you use, and the level of algorithm integration keeps increasing and accelerating. Countless articles have been published, and many companies, specialists or wannabes talk about AI, machine learning and data science, so it seems the subject is within anyone's reach. But is it?

While everyone paints machine learning (ML) as inherently important, people often forget that it is useful only when it solves a given problem or improves an existing solution to it. It cannot fix anything by itself (at least, not yet). So first of all, a company must correctly define a problem that machine learning can solve.

Moreover, even though software development has been around for decades, ML is still a complex and relatively new field with a learning curve of its own. Twenty years of software development experience will not help you much in building and deploying an ML model. Thus, the second issue to consider when approaching machine learning is the data modeling and data analytics experience of those who develop the algorithms.

Even with a correctly defined problem and the right people, ML still has to pass the most difficult barrier: it needs data (and lots of it). As The Economist wrote, “data is a resource like both sunlight and oil. Like sunlight, it needs to be collected and only a percent of what is collected can be used. Like oil, it needs to be refined.” Thus, any ML strategy must be viewed in terms of data potential, which translates into both quantity and quality.

Simply put, machine learning helps you identify patterns in existing datasets, isolate the factors that contribute to those trends, and anticipate future events. Controlling and adjusting certain parameters then allows you to change behaviors before they happen, increasing the chance of your desired outcome.

Keeping this in mind, data quality becomes highly subjective and task-related, so we will focus on quantity. Consider some of the most common algorithms at work. Spam e-mail? It moves to junk. Road obstacle? The car stops. But in order to recognize spam, you need lots of e-mails, spam or not; and to recognize a road obstacle, you need lots of examples of roads, obstacles and obstacles on roads. Generally, there is no upper limit on how much data you can use (other than hardware limitations), but there are strict requirements on the lower bound. The minimum amount of data also depends on the choice of model and the number of uncorrelated attributes it uses. A simple linear regression with a few features will not require much data, but it will not solve any complex problem either.
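One practical way to judge whether you have "enough" data for a given model is to plot a learning curve: train on progressively larger slices of the dataset and watch whether validation performance is still improving. The sketch below is a minimal illustration of that idea using a synthetic dataset and a plain linear regression; the data, feature count and thresholds are assumptions for demonstration only, not Postis code.

```python
# Minimal sketch: a learning curve as an empirical check of "how much data is enough".
# The dataset is synthetic and purely illustrative; on a real problem you would use
# your own features and target variable.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="r2",
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training examples -> mean validation R^2 = {score:.3f}")

# If the validation score has plateaued, extra data adds little for this model;
# if it is still climbing, you likely need more examples (or a richer model).
```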

Some companies try to get around these requirements by using an already available dataset, built by third-party organizations or by those companies themselves, but for other purposes. This approach might work well in academia and in Kaggle competitions, where the focus is on the model, but in industry practice the focus is on the problem that needs to be solved. There is rarely a dataset available for your company's specific needs, and even if there is, its potential is significantly lower than that of your own data, collected specifically for the given task.

"At Postis, we use machine learning right in our core business: deciding which courier option is the best choice for each and every individual order. This problem is fairly complex and dependent on many different factors, that's why the data collection process cannot happen in just a few months. In 3 years of activity with hundreds and hundreds of different use cases, business objectives, company or product typologies, countless delivery options, supply-chain complexities being implemented within the Platform, we have now the largest and most complex logistics and delivery dataset available in the market".

Florin Bulgarov, Chief Data Scientist at Postis

Here are some of the aspects that determine the minimum amount of data needed to train your model.

Time period

The size of the dataset needed for machine learning is strongly correlated with the seasonality of the data. In the case of the Postis Platform, periods such as Valentine's Day, March 1st / 8th, May 1st, the summer holidays, Back to School, Black Friday or Christmas bring daily order volumes 5 to 10 times bigger than those in regular sales cycles. While these moments are highly profitable for companies involved in distribution and delivery services, those companies are often overwhelmed and struggle to meet market demand. Moreover, home deliveries also depend on the weather and the season. Thus, seasonality can have a significant impact on carrier performance, and to take it into account there has to be at least a full year's worth of data for each client and courier.
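A simple way to enforce the "at least a full year" rule before training is to audit how many distinct calendar months of history exist per client and courier. The sketch below assumes a hypothetical flat file and column names ("client", "courier", "order_date"); it is an illustration of the check, not the Postis pipeline.

```python
# Hedged sketch of a seasonality coverage check: flag (client, courier) pairs
# whose order history spans fewer than 12 distinct calendar months.
# File name and column names are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # assumed layout

months_covered = (
    orders.assign(month=orders["order_date"].dt.to_period("M"))
          .groupby(["client", "courier"])["month"]
          .nunique()
)

# Pairs with fewer than 12 months of history cannot yet reflect Black Friday,
# Christmas, the summer holidays and other seasonal peaks.
too_short = months_covered[months_covered < 12]
print(too_short.sort_values())
```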

Diversity

Complex tasks require many uncorrelated attributes in order to capture the same data from different angles. Couriers perform differently depending on the client, parcel type, size, weight, value, geography, whether they have to collect cash on delivery, and many other aspects. Not only do you need relevant and sufficient data for each of these attributes, you also need data for most of their combinations in order to identify and capture influences and correlations. This means, for example, that it is not enough to have shipment history for each of the 41 counties in Romania; you need a significant number of shipments in each county, for each client, using each available carrier, parcel type and so on. What is more, what happens when the same courier performs differently within the same county? You will need data with finer granularity, on smaller geographical areas such as localities. Each new level of detail exponentially increases the size of the dataset needed for task automation and data-driven decisions.
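This combination requirement can be made concrete with a coverage audit: count how many shipments exist for each observed combination of attributes and flag the under-represented ones. The sketch below uses hypothetical column names and an illustrative threshold; it shows the idea rather than any production rule.

```python
# Illustrative "diversity" audit: a model needs enough examples per combination
# of attributes, not just per attribute. Column names are hypothetical.
import pandas as pd

shipments = pd.read_csv("shipments.csv")  # assumed columns: county, courier, parcel_type

coverage = (
    shipments.groupby(["county", "courier", "parcel_type"])
             .size()
             .rename("n_shipments")
             .reset_index()
)

MIN_EXAMPLES = 100  # illustrative threshold, not a universal rule
thin = coverage[coverage["n_shipments"] < MIN_EXAMPLES]
print(f"{len(thin)} of {len(coverage)} observed combinations are under-represented")

# The number of combinations grows multiplicatively with every new attribute or
# finer geographical level, which is why the required dataset grows so quickly.
```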

Relevance

If we choose a retailer's specific objective - improving delivery times and increasing the number of parcels delivered within the promised time - the problem we are trying to solve is binary: deciding whether a shipment will arrive on time or not. Yet even for such a simple problem, the dataset needed to solve it is huge, because not all of the data previously collected is useful and relevant. Depending on the retailer, its relationship with customers and the type of products it delivers, the on-time delivery percentage within the shipment history varies between 85% and 95%. This means that only 5 to 15% of previous shipments contain the factors we need to analyze in order to improve performance: the shipments that did not reach their customers on time. Thus, even if you wait long enough to account for seasonality, and even if you have a diverse dataset, you still need to wait until you have enough examples for the task you want to improve.
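The arithmetic behind this is simple but sobering. The sketch below, under the assumed (purely illustrative) target of 2,000 late-delivery examples, shows how many total shipments must be collected at different late-delivery rates.

```python
# Back-of-the-envelope sketch: if only 5-15% of shipments are late, how many
# total shipments are needed before enough late examples have been observed?
# The target of 2,000 late examples is an illustrative assumption, not a rule.
def shipments_needed(target_late_examples: int, late_rate: float) -> int:
    """Total shipments required for the expected number of late ones to hit the target."""
    return int(round(target_late_examples / late_rate))

for late_rate in (0.05, 0.10, 0.15):
    total = shipments_needed(2_000, late_rate)
    print(f"late rate {late_rate:.0%}: ~{total:,} total shipments needed")

# At a 5% late rate, roughly 40,000 shipments must be collected just to observe
# 2,000 examples of the very event the model is supposed to learn.
```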

To summarize, machine learning works behind countless technologies that improve and simplify the most diverse aspects of our personal and professional lives, wherever repetitive, frequent actions are involved. If a company wants to use process automation to become more efficient, flexible and fast, digital technologies are key. But before taking any decision on using machine learning for predictive analysis, data-driven decisions and automated optimization, it has to define its own strategy for managing digital information and to identify, based on at least three dimensions - time period, diversity, relevance - the dataset that is necessary and sufficient to start.

Established in 2016, Postis is the first Romanian LogTech startup to disrupt the logistics and transport sectors through IT technologies: open digital platforms, machine learning algorithms, data-driven decisions and process automation. Launched commercially 3 years ago, the Postis Platform integrates all significant distribution and delivery service suppliers with any company from Romania and CEE that needs to increase productivity, simplify operations, optimize costs, scale up, adjust its business model or expand to new territories.

This article has also been published in Market Watch. To read the original article in Romanian, access their website here.

