Infinite data sets and the evolution of data science
By: Dr. Prateek Kapadia
Chief Technology Officer
Big data essentially concerns itself with large collections of data about events and transactions recorded from the past. Allied terms like “fast data” extend this idea to faster updates of that history. But the underlying analytics processes on big data still analyze the past to predict the future. Data sets in this discourse are large, but always finite.
Fundamentally, however, the physical universe is different. Data sets that correspond to the digital capture of information from events and transactions by and among humans and machines aren’t actually finite – these events, transactions and their data capture will continue into the future. For example, customer behaviours comprise events in the past as well as in the future. Allowing data sets to extend into the (infinite) future requires the data scientist to think about and prepare for a future beyond merely “big” data. Thus, data scientists who predict future customer behaviours will have to contend with infinite data sets.
Dealing with infinite data is different
Theoretically, some function must exist that will map the historical data set (the history) to target values (the prediction). The canonical machine learning problem is to find a computationally efficient approximation of this function. Data scientists then build predictive models using this approximation to foretell future behaviour.
Approximations are derived from compactly representable properties of data sets – most commonly, the statistical distributions that fit the set. However, the properties of finite data sets vary relatively slowly compared with those of infinite data sets, so changes in the statistics of the data must be dealt with on an ongoing basis for infinite data sets. Unrecorded events that lie outside the reach of the data capture, like political upheaval and natural disaster, and changes in macro-economic, cultural and other trends, are inevitable, and they make the distribution non-homogeneous and non-stationary.
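As a minimal sketch of what tracking changing statistics on an ongoing basis can look like, the Python below keeps an exponentially weighted mean and variance that gradually forget old observations. The class name `RunningStats` and the smoothing factor `alpha` are illustrative assumptions; the article does not prescribe a particular estimator.

```python
class RunningStats:
    """Exponentially weighted mean and variance of an unbounded stream.

    Hypothetical helper for illustration: older observations fade out,
    so the estimates can follow a non-stationary distribution.
    """

    def __init__(self, alpha: float = 0.05) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to the newest observation
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def update(self, x: float) -> None:
        """Fold one new observation into the running estimates."""
        self.count += 1
        if self.count == 1:
            self.mean = x  # seed the mean with the first value
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        # Incremental form of the exponentially weighted variance.
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
```

Feeding this object a stream whose level shifts, for example from values near 0 to values near 10, moves `mean` toward the new level within a few dozen updates, whereas a statistic computed once over the history would stay anchored to the past.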
Furthermore, data set representations will change over time for infinite data sets, including global notions of outliers and patterns, and local notions of improved or degraded ability of sensors and variability in availability of streams. Data scientists today build their models assuming homogeneity, stationarity and regularity of representation in the data set. Then they retrain the models when their (perhaps manual) observation of the target variables belies these assumptions.
Target values are used for decisions that affect personal and professional lives. Manual retraining of models, however, implies tolerating inefficient decisions until the retraining is complete. Thus models needing frequent manual training don’t suit the cadence and assurance of decisions required for infinite data sets, and data science must evolve to address this problem.
Methods to deal with infinite data
Infinite data sets require data scientists to advance models and test them at a speed comparable to the arrival of new data, and to balance this speed against the accuracy of the decisions the models drive. Also, multiple models may work on multiple data representations (like different granularities of aggregation); hence the ability to evaluate many models in parallel is necessary. An evolving conceptual tool for such parallel evaluation is ensemble modelling – choosing a combination of outputs from multiple models. We anticipate more focus in the data science community on building efficient, automated ensemble models to deal with infinite data sets.
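One possible way to combine outputs from multiple models automatically is to weight each model by how well it has predicted recently. The sketch below uses an exponential weighting scheme, a technique chosen here for illustration and not named in this article; `OnlineEnsemble`, `Model` and `learning_rate` are hypothetical names. The models are evaluated one after another for simplicity, so real parallel evaluation is left out.

```python
import math
from typing import Callable, Sequence

# Hypothetical model type: maps one numeric feature to a prediction.
Model = Callable[[float], float]


class OnlineEnsemble:
    """Weighted average of several models, re-weighted as targets arrive."""

    def __init__(self, models: Sequence[Model], learning_rate: float = 0.5) -> None:
        if not models:
            raise ValueError("need at least one model")
        self.models = list(models)
        self.learning_rate = learning_rate
        # Start with no preference among the models.
        self.weights = [1.0 / len(self.models)] * len(self.models)

    def predict(self, x: float) -> float:
        """Combine the individual predictions using the current weights."""
        return sum(w * m(x) for w, m in zip(self.weights, self.models))

    def update(self, x: float, y: float) -> None:
        """Shrink each model's weight according to its squared error on (x, y)."""
        scaled = [
            w * math.exp(-self.learning_rate * (m(x) - y) ** 2)
            for w, m in zip(self.weights, self.models)
        ]
        total = sum(scaled)
        if total == 0.0:
            # Every weight underflowed to zero; fall back to uniform weights.
            self.weights = [1.0 / len(self.models)] * len(self.models)
        else:
            self.weights = [s / total for s in scaled]
```

With three candidate models of which only one matches the incoming targets, repeated calls to `update` concentrate nearly all the weight on that model, with no manual selection step. The learning rate trades speed of adaptation against stability, which mirrors the balance between speed and accuracy discussed above.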
The data scientist must also prepare to adapt to many variances in data representation and data set properties over the lifetime of the model. Thus the need for data cleansing and data imputation must be sensed and addressed with time and quality guarantees, as should the need to revise approximations in response to property changes. For hand-crafted (non-AI) models, supervision (semi- or full) for training is inevitable, but preparation for the aforementioned variances makes its automation critical. AutoML-engineered machine learning pipelines will progress to solve these challenges. Data scientists must keep track of developments in this nascent field and adopt state-of-the-art algorithms for automated feature extraction, construction and transformation, automated handling of skewed and missing data, and automated post-processing and calibration of target values.
While handling the expected heterogeneity and non-stationarity of the distributions in the data set and the changes in the data representation, the data scientist should also be able to draw new conclusions based on new correlations. In our parlance this means that evolving approximation functions map new information to the target variables. This variation in the target variable mapping is called concept drift. The data scientist would need mechanisms (automated, as above) to detect and correct for concept drift. While this field is evolving (even a precise definition of concept drift eludes the machine learning community today), the data scientist must closely follow developments – like methods that trigger detection of drift, synthetic drift data generators for model testing, and new classifications of drift handling and correction methods.
In this article we have sought to make the knowledge worker aware of the motivations for the evolution of data science. We have discussed the handling of infinite data sets and the different challenges they pose, like variations over time in the statistical properties of homogeneity and stationarity. We advise the data scientist to prepare to apply techniques from key areas that will direct the evolution of data science for infinite data sets, such as ensemble models, AutoML and concept drift.
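As one hedged illustration of a method that triggers detection of drift, the sketch below applies the Page-Hinkley test to a stream of model errors and raises a flag when the errors rise persistently above their running mean. The choice of test, the class and function names, and the parameter values (`delta`, `threshold`, `min_samples`) are assumptions made for this example, not recommendations from this article.

```python
from typing import Iterable, Optional


class PageHinkley:
    """Page-Hinkley test for a sustained increase in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0,
                 min_samples: int = 30) -> None:
        self.delta = delta              # tolerated drift per observation
        self.threshold = threshold      # alarm level for the test statistic
        self.min_samples = min_samples  # warm-up period before any alarm
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, value: float) -> bool:
        """Add one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        # Accumulate deviations above the running mean, minus the tolerance.
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        if self.n < self.min_samples:
            return False
        return self.cumulative - self.minimum > self.threshold


def first_drift(errors: Iterable[float]) -> Optional[int]:
    """Return the index of the first drift alarm, or None if there is none."""
    detector = PageHinkley()
    for index, error in enumerate(errors):
        if detector.update(error):
            return index
    return None
```

On a stream of errors that stays level, `first_drift` returns `None`; if the error level then jumps and stays high, an alarm follows a few observations after the jump. In an automated pipeline such an alarm could trigger retraining or a re-weighting of models, which is the correction step the paragraph above calls for.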