AutoML: Bringing in real artificial intelligence capabilities
By: Amit Meher
R&D Manager - Data Sciences
Traditional data science practice focuses on solving a point problem for a specific dataset and domain at a given point in time. However, this strategy scales poorly: the same model rarely delivers optimal results when applied to a different dataset or domain.
A concrete example of this inefficiency can be seen in predicting churners in the telecommunications domain. A churn model developed for one Communication Service Provider (CSP) may not yield good results on a dataset from a different CSP, because subscribers' churn behavior differs across CSPs and may call for a different class of learning algorithms and hyperparameter settings to reach optimal accuracy. A model customized for one CSP also cannot simply be applied to other CSPs because of the heterogeneous data types, data distributions, skewness, missing values, outliers, etc. associated with each of them. Consequently, data scientists end up building customized models at a per-CSP level, which incurs significant overheads. Moreover, even within the same CSP, subscriber behavior may change over time: subscriber behavior in the pre-WhatsApp and post-WhatsApp periods, for instance, is significantly different. Such a change can invalidate the old model and force a new model to be built from scratch to account for the new behavior. The new model in turn requires manual day-to-day monitoring of its performance, identification of abnormalities in the model's performance, and a repeat of all the steps from scratch. These processes consume a great deal of precious human time. The challenge grows further if the same churn model, developed for the telecommunications domain, is applied to predict customer churn in the banking domain.
So the question to address here is: can we formulate a generic framework that is agnostic to dataset and domain, and automatically recommends an optimal model for a given task with minimal or no involvement of human machine learning experts, thereby overcoming the challenges mentioned above? In this context, we discuss one of the most promising research areas, Automated Machine Learning (AutoML), which can bring enormous capability to the data science and machine learning arena in the years to come.
An AutoML framework seeks to automate the process of designing and optimizing machine learning pipelines to solve data science problems. The level of automation can vary with the complexity and scale of the problem. The basic level aims to automatically discover an optimal set of hyperparameters for a given machine learning algorithm on a given dataset. The next level focuses on discovering the optimal combination of machine learning algorithm and hyperparameters for a given dataset. A more advanced level is to discover an optimal end-to-end model pipeline comprising data preprocessing, feature preprocessing, algorithm selection, and hyperparameter tuning steps. Performing these levels of automation, however, requires many iterations of model training under a limited time and resource budget, which makes it challenging. To this end, Bayesian optimization is a promising strategy. It has advantages over naive parameter search strategies such as grid search and random search, especially when computationally expensive algorithms such as Support Vector Machines (SVMs) and deep learning models are part of the AutoML framework: it models the objective with a Gaussian process and uses that surrogate to intelligently choose the next best parameter combination to evaluate. AutoML systems can apply Bayesian optimization in the joint space of design choices (data preprocessing, feature preprocessing, algorithm, and hyperparameter selection) to discover an optimal model pipeline for a given problem. This results in a considerable increase in efficiency when deploying packaged analytics models, which are among the most valuable assets of enterprises focusing on AI.
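To make the search strategy concrete, here is a minimal, self-contained sketch of Bayesian optimization for a single hyperparameter, using a Gaussian process surrogate and the expected-improvement acquisition function. The function names and budget parameters (`bayes_opt`, `n_init`, `n_iter`) are illustrative and not part of any particular AutoML framework.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def expected_improvement(candidates, gp, y_best, xi=0.01):
    """Expected improvement (for minimization) at candidate points."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)       # avoid division by zero
    imp = y_best - mu - xi                # predicted improvement over best so far
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)


def bayes_opt(objective, bounds, n_init=5, n_iter=20, seed=0):
    """Minimize `objective` over a 1-D interval `bounds` = (lo, hi)."""
    rng = np.random.default_rng(seed)
    # Start with a few random evaluations to seed the surrogate.
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([objective(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6, normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        # Score a batch of random candidates and evaluate only the most
        # promising one -- this is where expensive model training is saved.
        candidates = rng.uniform(bounds[0], bounds[1], size=(256, 1))
        x_next = candidates[np.argmax(expected_improvement(candidates, gp, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y)][0], y.min()
```

In a real AutoML setting, `objective` would train a model with the proposed hyperparameter and return its validation error; here any expensive black-box function works the same way.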
In the context of big data, data often arrives as real-time streams. The data may then deviate from the usual i.i.d. assumption and exhibit concept drift, which can render a previously built model ineffective. In such a scenario, the model needs to detect concept drift and adapt to it automatically, without manual intervention. The Bayesian online change point detection algorithm is one prominent algorithm that can detect change points (which characterize the concept drift phenomenon) by probabilistically modeling the distributional changes of features within the data stream seen so far. AutoML systems should leverage such techniques for greater completeness and actionability. Another way to make AutoML more effective is to store previously learnt knowledge (ML pipelines or model configurations) pertaining to different tasks, datasets, and domains, and apply it intelligently to discover a good initial pipeline for a new task and dataset. In other words, AutoML systems should learn from their own historical experience in a lifelong learning setting.
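As a simplified illustration of stream monitoring, the sketch below flags points where the mean of a recent window deviates strongly from a reference window. This is a windowed mean-shift test, not full Bayesian online change point detection (which maintains a posterior over run lengths); the function name, window size, and threshold are illustrative assumptions.

```python
import numpy as np


def detect_drift(stream, window=50, threshold=5.0):
    """Return indices where the recent window's mean deviates from the
    preceding reference window by more than `threshold` standard errors.

    A crude stand-in for a change point detector: a persistent shift in
    the stream's mean produces a run of alarm indices shortly after it.
    """
    alarms = []
    for t in range(2 * window, len(stream)):
        reference = stream[t - 2 * window : t - window]
        recent = stream[t - window : t]
        std_err = np.sqrt(reference.var() / window + recent.var() / window)
        if std_err > 0 and abs(recent.mean() - reference.mean()) / std_err > threshold:
            alarms.append(t)
    return alarms
```

On a synthetic stream whose mean jumps partway through, the first alarm appears shortly after the jump, at which point an AutoML system could retrain or update the deployed model.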
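The idea of reusing previously learnt knowledge can be sketched as a store of past configurations keyed by simple dataset meta-features, from which the most similar past datasets supply warm-start candidates. `PipelineMemory` and the meta-features used here are hypothetical, chosen only to illustrate the mechanism.

```python
class PipelineMemory:
    """Toy store of previously learnt model configurations, keyed by
    dataset meta-features (e.g. number of rows and columns)."""

    def __init__(self):
        self.records = []  # list of (meta_features, config, score) tuples

    def add(self, meta, config, score):
        self.records.append((meta, config, score))

    def warm_start(self, meta, k=3):
        """Recommend configs from the k most similar past datasets,
        ranked by squared distance in meta-feature space."""
        def distance(past_meta):
            return sum((past_meta[f] - meta[f]) ** 2 for f in meta)
        ranked = sorted(self.records, key=lambda record: distance(record[0]))
        return [config for _, config, _ in ranked[:k]]
```

A production system would use richer meta-features (skewness, class balance, missing-value rates) and a learned similarity measure, but the lifecycle is the same: record what worked, then seed the search for a new dataset with it.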
One of the active research areas pursued by Flytxt is building a generic and scalable AutoML framework that provides the above-mentioned capabilities. From Flytxt's perspective, AutoML can enhance the AI capability of an organization in many ways. With AutoML, data scientists are relieved of the repetitive tasks required to build machine learning pipelines and can focus on solving complex data science problems and devising new algorithms. Development and maintenance of packaged analytics models become easier and no longer require extensive human intervention. The model can automatically identify concept drift and decide whether to discard the old model or update it in real time, enabling automatic monitoring. Additionally, it eases the dependency on data scientists for solving common data science problems and brings in self-sufficiency. In other words, an AutoML framework would be intelligent enough to maintain the entire lifecycle of a model, hence providing true AI capability.