I’m a Senior Data Analyst at eCabs Technologies. But when people ask me what I do all day, I tell them I’m a storyteller.
Data is a collection of raw and discrete values that make no particular sense at first glance. It usually sits inside a data warehouse, which not only stores, organises, and manages the data but also allows querying and quick analysis. It is a core component of business intelligence and creates a space for number crunching, reporting and scientific study.
As a marketing data analyst, my job involves collecting, organising, and analysing all the relevant data to inform, and sometimes directly answer, business questions, primarily centred around the Marketing department’s needs.
At its core, data analysis is the process of using statistical and mathematical techniques to make sense of the information available to us, turning what looks like Matrix-style streams of numbers into stories that even non-technical colleagues can understand.
Whether I’m looking for patterns in the number of rides requested at particular times of the day, or trying to quantify the reasons behind cancelled pick-ups, app open sessions or passenger ETAs, what I’m really doing is asking questions so I can tell better, more relevant stories that eventually answer vital business questions.
So, while a lot of it is invisible to the naked eye at first, what I’m doing is uncovering information by putting users under an analytical microscope and looking at how they interact with the ride-hailing industry.
I will use this blog space to talk about some of the nuts and bolts of what we do here at eCabs Technologies as we try to improve your mobility experience.
But this first story is special to me.
Asking the right question
In early 2023 I made use of a powerful yet relatively simple supervised learning method from my data analysis toolkit: simple linear regression.
This technique allows me to investigate the relationship between two variables, often referred to as the independent variable (X) and the dependent variable (Y); in this case, the partner driver hours and the user volumes respectively. By using simple linear regression, I can determine how changes in one variable affect the other.
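In its standard form the model reads Y = β₀ + β₁·X + ε, where the slope β₁ tells us how much Y is expected to change for each one-unit change in X, and ε captures the noise the line cannot explain.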
I applied a linear regression analysis to a large data set containing a few years’ worth of values for both the ride volumes of eCabs users and the partner driver hours.
This came about after asking one question: “How many more driver hours would it take to make a noticeable impact on user volumes?” The answer may be intuitive to some, so much so that you may have already guessed what type of relationship exists here, but to what degree?
I wanted to quantify this to a relatively high accuracy, and to approximate how many more people would request rides given a controllable, known increase in the number of drivers available at a given time.
Doing my homework
Before applying this technique, I first needed to ensure that my data respected the standard assumptions and limitations of linear regression. As with any algorithm, we need to check the foundation of assumptions before we apply it; otherwise, any analyst runs the risk of faulty and misleading results.
The first is that simple linear regression assumes there is a linear relationship between the independent and dependent variables.
This may not always be the case though.
There may be non-linear relationships or interactions between the variables that are not captured by a simple linear model. In our scenario we assume linearity over large scales.
Other limitations include the assumptions of independent observations, homoscedasticity (constant error variance), and normally distributed residuals.
If we do not respect these assumptions, then applying the algorithm anyway would introduce errors and inaccuracies that render the results useless.
Outliers and influential data points may also distort the result, skewing the estimates. But for our exercise we may assume that all of these conditions are respected.
Therefore, while such analytical methods are useful for making predictions, it is important to research and respect their limitations, carefully evaluate their assumptions, and ensure the data actually satisfy them, especially when weighing potential sources of error in the interpretation of the results.
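To make this concrete, here is a minimal sketch of how one might sanity-check two of these assumptions in Python. The arrays are synthetic placeholders, not our warehouse data; the Shapiro-Wilk test probes the normality of the residuals, and the residual plot is a quick visual check for homoscedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic placeholder data; the real values come from the warehouse.
rng = np.random.default_rng(42)
driver_hours = rng.uniform(100, 1000, size=200)
user_volumes = 3.0 * driver_hours + rng.normal(0, 50, size=200)

# Fit a simple line and compute the residuals.
slope, intercept = np.polyfit(driver_hours, user_volumes, deg=1)
fitted = slope * driver_hours + intercept
residuals = user_volumes - fitted

# Normality of residuals: a large Shapiro-Wilk p-value is reassuring.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Homoscedasticity: residuals vs fitted values should show no funnel shape.
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```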
“I used a very simple approach”
After carrying out this preliminary analysis, I adopted a very simple approach: I extracted the two relevant fields from our data warehouse and loaded them into arrays in Python.
I imported a few data science toolkits into my script, namely scikit-learn (sklearn) and its metrics module, sklearn.metrics.
I then split the arrays into training and testing sets, as the learning algorithm and the relevant package require.
The model was trained on the training set and then used to make the necessary predictions on the held-out test set.
The resulting coefficient was output together with the mean-squared error, to describe how well these two variables are related and to what degree the predictions can be ‘trusted’.
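For the curious, the whole pipeline fits in a handful of lines. This is a sketch with synthetic placeholder arrays rather than the real extract, but the structure mirrors what I just described:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the two warehouse fields.
rng = np.random.default_rng(0)
driver_hours = rng.uniform(100, 1000, size=(300, 1))                # independent (X)
user_volumes = 3.0 * driver_hours.ravel() + rng.normal(0, 50, 300)  # dependent (Y)

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    driver_hours, user_volumes, test_size=0.2, random_state=0
)

# Train the model and predict on the held-out data.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# The coefficient and mean-squared error describe the relationship
# and how far the predictions can be 'trusted'.
print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean squared error:", mean_squared_error(y_test, predictions))
```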
Using best practices in data analysis
The fitted regression line was overlaid on a scatter plot of the dependent and independent variables to better display the relationship between them.
This forms part of best practice in data analysis and science, as plotting is one of the most concise and diligent ways of communicating results. It also comes full circle with the storytelling part of my job, since a picture speaks a thousand words.
I also derived the equation of this fitted line. As simple as that: if we plug in a value for the number of driver hours, which we directly impact and influence, we can now approximate the user volumes that eCabs can expect.
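Picking up the placeholder objects from the sketch above, the plot and the equation take only a few more lines:

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatter the raw points and overlay the fitted regression line.
plt.scatter(driver_hours, user_volumes, s=10, alpha=0.5, label="observed")
line_x = np.linspace(driver_hours.min(), driver_hours.max(), 100).reshape(-1, 1)
plt.plot(line_x, model.predict(line_x), color="red", label="fitted line")
plt.xlabel("Partner driver hours")
plt.ylabel("User ride volumes")
plt.legend()
plt.show()

# The equation of the fitted line: volumes ≈ slope * hours + intercept.
print(f"volumes ≈ {model.coef_[0]:.2f} * hours + {model.intercept_:.2f}")
```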
We now have a way of influencing our dependent variable (volumes) through our independent one (hours).
The analysis also revealed a clear ‘maximum’ number of drivers beyond which there was no effect on volumes at all: past a certain point, no matter how many more drivers were added, the data showed no noticeable change in users, and the extra drivers would simply be wasting their time.
This is saturation. Knowing where it occurs can be used to optimise hours on the road, mitigating bad impressions and a poor driver experience.
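A single straight line cannot show a plateau on its own, so, as a purely illustrative aside, here is one crude way to look for saturation: bin the driver hours and check where the marginal gain in volumes flattens out. Both the data and the threshold of 1.0 extra rides per hour are invented for the sketch, not the figures we actually used.

```python
import numpy as np

# Invented data whose volumes flatten past roughly 700 driver hours.
rng = np.random.default_rng(1)
hours = rng.uniform(100, 1000, size=500)
volumes = np.minimum(3.0 * hours, 2100.0) + rng.normal(0, 40, size=500)

# Bin the hours and compute the mean volume in each bin.
bins = np.linspace(hours.min(), hours.max(), 10)
bin_ids = np.digitize(hours, bins)
means = np.array([volumes[bin_ids == i].mean() for i in range(1, len(bins))])
centres = (bins[:-1] + bins[1:]) / 2

# Marginal gain: extra rides per extra driver hour, bin to bin.
marginal = np.diff(means) / np.diff(centres)
print("Marginal gain per bin:", np.round(marginal, 2))

# Saturation starts where the gain drops below the (arbitrary) threshold.
print("Saturation begins near:", centres[1:][marginal < 1.0].min(), "hours")
```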
Improving customer and driver experience
This process taught me that it doesn’t always have to be impressive pipelines of complex code crunching millions of data points.
Sometimes it is as simple as seeing how sets of variables grow or decay together, plotting a graph, and finding the equation that best describes their relationship.
This is something that is done in beginner maths and physics. So next time a kid asks, “When will I use this in real life?”, get them to read this.
In the end, I settled on a multiplier that predicts passenger volumes in relation to the number of drivers out on the road with less than a 10% error margin.
The data said: ‘Hey, if you put out, say, X more drivers at this time, you increase the probability of securing a passenger by Y.’
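To make that concrete with entirely made-up numbers: if the fitted slope were, say, 3.2 extra ride requests per driver hour, then adding 50 driver hours at a given time would project roughly 3.2 × 50 = 160 extra requests. The real coefficients stay in-house, but the arithmetic really is that simple.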
This changed how eCabs manages its relationship with all partner drivers.
We could see when we needed to incentivise the supply of driver availability and when we did not, ensuring there was no saturation of drivers.
This did not just improve customer experience, but by transitivity, that of the drivers working on the eCabs platform too.
For eCabs, we translated the formula into cost analyses and revenue projections. It was even fed into marketing and operations plans.
It was a win³.