Approximate K-NN model improves upon the classical K-NN model, lending it scalability in a telecom environment
By: Flytxt R&D Team
Identifying subscribers with similar interests and usage patterns and grouping them together is a very useful data mining technique in telecom. It provides meaningful insights into the usage characteristics of a set of subscribers and helps telcos devise appropriate campaigns and recommendations. The K-Nearest Neighbor (K-NN) model is one of the classical algorithms commonly used for this purpose. Given a set of subscriber attributes, it attempts to find K (a configurable number) subscribers with similar attributes. One advantage of the K-NN model is that it requires no training yet can capture a very complex model. However, a major challenge is its limited scalability, since it relies on a brute-force search over the entire database. This can take a considerable amount of time in a typical telco big data environment, and if the query involves many subscriber attributes, real-time analytics on the data becomes impractical.
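The brute-force search described above can be sketched as follows. This is a minimal illustration, not Flytxt's implementation; the dataset size and the use of Euclidean distance are assumptions for the example.

```python
import numpy as np

def knn_brute_force(profiles, query, k):
    """Return indices of the k subscribers closest to `query`,
    scanning the entire database (O(N * d) distance work per query)."""
    dists = np.linalg.norm(profiles - query, axis=1)
    return np.argsort(dists)[:k]

# Illustrative data: 10,000 subscribers, 36 attributes each.
rng = np.random.default_rng(0)
profiles = rng.random((10_000, 36))
query = rng.random(36)
neighbors = knn_brute_force(profiles, query, k=5)
```

Every query touches all N rows, which is exactly the cost that becomes prohibitive at telco scale.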
Flytxt has developed an Approximate K-NN model based on Locality Sensitive Hashing (LSH), which addresses this scalability issue without compromising much on accuracy. Using this method, subscribers are placed, based on the extent of similarity in their attributes, either in the same bucket or in nearby buckets. This arrangement enables efficient storage and querying in the database.
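One common LSH family for this kind of bucketing is random-hyperplane (SimHash-style) hashing; the sketch below uses it purely as an illustration, since the article does not specify which LSH family Flytxt chose. The number of hyperplanes and bucket layout are illustrative choices.

```python
import numpy as np
from collections import defaultdict

def lsh_signature(x, planes):
    """Hash a vector to a bit-signature: one bit per random hyperplane,
    set when the vector falls on the plane's positive side."""
    return tuple((planes @ x) > 0)

rng = np.random.default_rng(1)
planes = rng.standard_normal((12, 36))   # 12 hyperplanes -> up to 2**12 buckets

# Index every subscriber profile into its bucket.
profiles = rng.random((10_000, 36))
buckets = defaultdict(list)
for i, p in enumerate(profiles):
    buckets[lsh_signature(p, planes)].append(i)
```

Because nearby vectors tend to fall on the same side of most hyperplanes, similar subscribers tend to share a signature, so a query only needs to scan its own bucket rather than the full database.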
Flytxt utilized this model to build an approximate K-NN classifier that produces classification results by searching far fewer samples while maintaining the accuracy of the original model. We used a large dataset from a Tier-1 operator, where the goal was to predict which subscribers would churn in the next 15 days. Each subscriber’s feature profile consisted of 36 KPIs representing their historical usage over the last 3 months. Our classifier achieved accuracy (F1-score) similar to that of a brute-force K-NN classifier while searching only 1/100th of the samples used by the latter. The figure shows the accuracy statistics of the brute-force K-NN classifier and the LSH-based classifier.
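Putting the pieces together, an approximate K-NN classifier of the kind described can be sketched end to end: hash the query, restrict the distance search to the query's bucket, and take a majority vote over the labels of the k nearest candidates. All names, sizes, and the random labels below are illustrative assumptions, not the operator's data.

```python
import numpy as np
from collections import defaultdict, Counter

rng = np.random.default_rng(2)
profiles = rng.random((10_000, 36))      # 36 usage KPIs per subscriber
labels = rng.integers(0, 2, 10_000)      # 1 = churned, 0 = retained (synthetic)
planes = rng.standard_normal((10, 36))   # LSH hyperplanes

def signature(x):
    return tuple((planes @ x) > 0)

# Build the LSH index once.
buckets = defaultdict(list)
for i, p in enumerate(profiles):
    buckets[signature(p)].append(i)

def predict(query, k=5):
    """Classify by majority vote among the k nearest candidates
    found inside the query's LSH bucket only."""
    cand = buckets.get(signature(query), [])
    if not cand:                          # empty bucket: fall back to full scan
        cand = range(len(profiles))
    cand = np.fromiter(cand, dtype=int)
    dists = np.linalg.norm(profiles[cand] - query, axis=1)
    nearest = cand[np.argsort(dists)[:k]]
    return Counter(labels[nearest]).most_common(1)[0][0]

pred = predict(rng.random(36))
```

The speedup comes from the candidate set: a bucket typically holds a small fraction of the database, which mirrors the 1/100th sample fraction reported above.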