Regression modelling for solving transportation problems
By: Flytxt Data Science R&D Team
KDD Cup is a globally recognised data science competition in which data scientists and researchers from universities and industry around the world apply their knowledge and skills to challenging knowledge discovery and data mining problems. It is organised annually by ACM SIGKDD (the Special Interest Group on Knowledge Discovery and Data Mining, a leading professional organisation of data miners).
The challenge this year
This year’s KDD Cup problem pertains to building predictive models capable of forecasting the average travel time of vehicles as well as tollgate traffic volume on road networks. The problem posed in this competition is a difficult and significant one for traffic authorities as well as end users. Traffic patterns evolve with time, as factors such as seasonality, weather conditions, business hours, holidays and road accidents influence them. Stochastic variations in traffic patterns make the problem harder to model with traditional approaches.
Traffic authorities need to plan their traffic management strategies, including traffic control, diversions and route maintenance, in a way that causes minimal disruption and inconvenience to users. End users can also plan their travel more effectively if they know about congestion and expected delays on their routes. Many mapping apps provide such capabilities today, enabled by applying analytics algorithms to historical traffic data acquired via mobile apps and various sensors. With these challenges and motivations in mind, the KDD Cup 2017 organisers proposed the following two tasks for predicting future traffic flow and estimating time of arrival.
- Task-1: This task involves predicting average travel time (ATT) of vehicles for a specific route for every 20-minute time window.
- Task-2: This task involves predicting average traffic volume (ATV) at each of the toll gates for every 20-minute time window.
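As a rough illustration of what the two prediction targets look like, the sketch below buckets raw records into 20-minute windows and computes an average travel time per route and a vehicle count per tollgate. The record layouts, the route ID "A-2", the tollgate ID 1 and the function names are assumptions made for this example; they are not the competition's actual data schema.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_MINUTES = 20  # window length stated in the two tasks


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 20-minute window."""
    floored_minute = (ts.minute // WINDOW_MINUTES) * WINDOW_MINUTES
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def average_travel_time(records):
    """records: iterable of (route_id, start_time, travel_time_seconds).

    Returns {(route_id, window_start): mean travel time in seconds}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for route_id, start_time, travel_time in records:
        key = (route_id, window_start(start_time))
        totals[key][0] += travel_time
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def traffic_volume(passes):
    """passes: iterable of (tollgate_id, pass_time).

    Returns {(tollgate_id, window_start): number of vehicles}.
    """
    counts = defaultdict(int)
    for tollgate_id, pass_time in passes:
        counts[(tollgate_id, window_start(pass_time))] += 1
    return dict(counts)


# Hypothetical records, for illustration only.
records = [
    ("A-2", datetime(2016, 10, 18, 8, 5), 60.0),
    ("A-2", datetime(2016, 10, 18, 8, 19, 59), 80.0),
    ("A-2", datetime(2016, 10, 18, 8, 20), 100.0),
]
att = average_travel_time(records)
```

Here the first two trips fall in the 08:00 window and the third in the 08:20 window, so each window gets its own target value for the regression models to predict.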
The Flytxt Data Science R&D team took part in this challenge under the guidance of Prof. Santanu Chaudhury and Dr. Brejesh Lall from the Department of Electrical Engineering, IIT Delhi.
A total of 3547 teams from across the world participated in KDD Cup 2017. The organisers provided historic datasets covering vehicle travel time, vehicle volume, route information, road link structure, tollgate information, weather, etc., for a road network in China. The Flytxt team secured 34th position in task 1 and 30th position in task 2, landing within the top 1% of contestants. Microsoft China secured first position in both tasks.
Flytxt approaches for average travel time and traffic volume predictions:
The average travel time for a particular route depends on route length, route width, link length, number of links, vehicle types, historic traffic patterns observed on the route or link, etc. We modelled the problem as a regression problem and used a variety of regression techniques, including k-nearest neighbours (KNN), support vector regression (SVR), deep learning and boosting (XGBoost Regressor). We preprocessed the historic datasets and generated features based on route properties (e.g. route length, width, historic average travel time for each route), link properties (e.g. link length, link travel time), weather conditions, holidays, etc.
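To make the regression framing concrete, here is a minimal sketch of k-nearest-neighbour regression, one of the techniques listed above, written in plain Python. It is not the team's implementation: the two features (hour of day and historic average travel time), the training values and the function name `knn_predict` are all assumptions chosen for illustration.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict a target as the mean of the k nearest training targets.

    Distance is plain Euclidean distance between feature tuples.
    """
    if len(train_features) != len(train_targets):
        raise ValueError("features and targets must have the same length")
    if k < 1 or k > len(train_features):
        raise ValueError("k must be between 1 and the number of training rows")
    # Pair each training row's distance to the query with its target value.
    distances = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical rows: (hour of day, historic average travel time in seconds).
train_features = [(8.0, 60.0), (8.33, 65.0), (17.0, 120.0), (17.33, 130.0)]
train_targets = [62.0, 66.0, 125.0, 128.0]

prediction = knn_predict(train_features, train_targets, (8.1, 61.0), k=2)
```

In practice the features would normally be scaled before computing distances, since otherwise the feature with the largest numeric range dominates the neighbour search.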
We built six promising models based on route level and link level information to estimate average travel time. For model performance evaluation, Mean Absolute Percentage Error (MAPE) was used. The cross-validation MAPE scores of all six models are shown in Figure-1. We utilised model ensembles to combine individual models to improve the overall predictive performance. A median ensemble model produced the best MAPE scores on unseen data.
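The evaluation metric and the median ensemble described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: MAPE is computed here as a plain mean over all samples (expressed as a fraction, matching the scale of the scores reported in this post), and the three prediction lists stand in for individual models. The numbers are made up for the example.

```python
from statistics import median


def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a fraction (0.1 means 10%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def median_ensemble(model_predictions):
    """Combine models by taking the per-sample median of their predictions.

    model_predictions: list of prediction lists, one per model, equal lengths.
    """
    return [median(values) for values in zip(*model_predictions)]


# Hypothetical travel times (seconds) for two time windows.
actual = [100.0, 200.0]
model_predictions = [
    [90.0, 210.0],   # model 1
    [110.0, 220.0],  # model 2
    [100.0, 150.0],  # model 3, badly off on the second window
]

ensemble = median_ensemble(model_predictions)
score = mape(actual, ensemble)
```

The median is less sensitive than the mean to a single model that is far off for a given window, which is one plausible reason for preferring it when combining heterogeneous models.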
As in task 1, average traffic volume prediction was modelled as a regression problem. We again preprocessed the historic datasets and generated features based on tollgate information, route information, weather, etc. We trained several models, including random forest regression, boosting (XGBoost Regressor) and regression with dynamic Bayesian networks. The XGBoost Regressor produced the best cross-validation MAPE score (0.1355) among all candidates and achieved a MAPE score of 0.1607 on phase-2 data.
A great learning experience
The challenge provided a good learning experience: learning from evolving data (non-stationarity), accounting for stochastic factors such as weather, controlling overfitting and model complexity, etc. We also observed a significant difference in leaderboard MAPE scores between the two phases of unseen data. This could be attributed to the data distribution shifting between the two phases, so that modelling approaches tuned for phase 1 did not work as well on phase-2 data.
Overall, being a part of KDD 2017 was an enriching experience and gave us a great opportunity to rub shoulders with renowned researchers and practitioners in data mining, knowledge discovery, data analytics and big data from across the globe.