Highlights from my 9-year journey in applied machine learning and data science, super grateful!
Tackling social media bias with machine learning - TikTok
Motivation:
Fake News propagation on social media is a widely prevalent problem
People promote themselves, certain news, or agendas by hook or by crook
Action:
Multi-level action was needed as the problem was huge, so I broke it down into smaller chunks
1) Make sure there is a good way to validate the existing label creation process
(Could we run our own experiments to, say, identify fake accounts, and then featurize them?)
2) Supervised models were meant for prediction and unsupervised for labeling
(Utmost care was taken to not use features/correlated features from unsupervised part in supervised part)
(That would be cheating otherwise!!!)
3) Various user profile segmentations were done to identify the high-risk group, the cyberbullying group, the fake-followers group, etc. Actions would differ from person to person: some would need more models, some would be caught in just one go.
4) Supervised models that I put in production included:
a) behavioral models based on neural nets
b) sequential activity falter detection based on Bayesian and turbulence-style models
c) group cheating networks based on graph models and label propagation (see the label-propagation sketch after this list)
d) frequency models based on boosted trees
5) Our metrics were both internal and external and we set up pretty ambitious goals for ourselves
6) The biggest challenge, and the most satisfying part for me: creating at least a thousand new features over time based on domain experience, type of activity, studying the app, understanding the changes, etc.
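To make the graph-based piece concrete, here is a minimal sketch of label propagation over a co-activity graph of accounts. The adjacency matrix, seed labels, and clamping loop below are hypothetical illustrations, not the production pipeline.

```python
# Minimal sketch of graph-based label propagation for flagging coordinated
# ("group cheating") accounts. All data and account indices are hypothetical.
import numpy as np

# Adjacency matrix over 6 accounts: an edge means two accounts share suspicious
# behaviour (e.g. liking the same content within seconds of each other).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
], dtype=float)

# Seed labels: +1 = known fake (from manual review), -1 = known genuine, 0 = unknown.
labels = np.array([1.0, 0.0, 0.0, -1.0, 0.0, 0.0])
seed_mask = labels != 0

# Row-normalise the adjacency matrix so each propagation step averages over neighbours.
D_inv = np.diag(1.0 / A.sum(axis=1))
P = D_inv @ A

# Iteratively propagate labels, clamping the known seeds back after every step.
scores = labels.copy()
for _ in range(50):
    scores = P @ scores
    scores[seed_mask] = labels[seed_mask]

print(np.round(scores, 2))  # accounts 1 and 2 drift towards +1 (the fake cluster)
```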
Results:
1) On external third-party evaluations by government organizations, we improved on multiple levels of manipulation
2) Fake Follow and Fake Like detections increased by more than 30%, and we kept improving consistently over 2+ years
Incremental machine learning - WalmartLabs
Motivation:
Regular model decay and model fine-tuning would take a lot of time
Adversarial Machine Learning problems often need quick actions in real time environments
And even tougher, how do we know what user profiles are engaging in fake activity?
Action:
Incremental Machine Learning came to our rescue after a lot of research and brainstorming
1) Choose a time window to observe model performance and feature drifts
(The time window itself is a hyperparameter that you could learn eventually)
2) See how different is the old and new data (old = on which the model was trained; new = data in the recent time window you choose)
3) Fine-tune your model if there is a significant difference in the data
(Data distribution different --> feature distributions different --> model performance different --> Goodbye Decay!!!)
4) How to establish significance? Various tests to check the distribution difference (see the drift-check sketch after this list)
(Or just let the models do it, wait, how??)
5) Many models today can be made incremental; we created an in-house incremental boosted tree (a warm-start stand-in is sketched after this list)
(Note: neural nets are by default easy to tweak into incremental models, "please save those weights, will you")
6) Of course, if there is no such difference between the new and old data, let the model run, let's not bother it.
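To illustrate step 4, here is a minimal sketch of comparing the old and new windows feature by feature, using a two-sample KS test plus a histogram-based KL divergence (in the spirit of reference 2). Feature names, window sizes, and significance thresholds are illustrative assumptions.

```python
# Minimal sketch of the drift check between the training window ("old") and the
# most recent window ("new"). All data and thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
old = {"txn_amount": rng.lognormal(3.0, 1.0, 5000),
       "items_per_order": rng.poisson(4, 5000).astype(float)}
new = {"txn_amount": rng.lognormal(3.4, 1.1, 2000),   # this distribution has shifted
       "items_per_order": rng.poisson(4, 2000).astype(float)}

def kl_divergence(p_samples, q_samples, bins=30):
    """Histogram-based KL(P || Q) with add-one smoothing to avoid empty bins."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = (p + 1) / (p + 1).sum()
    q = (q + 1) / (q + 1).sum()
    return float(np.sum(p * np.log(p / q)))

drifted = []
for feat in old:
    stat, p_value = ks_2samp(old[feat], new[feat])
    kl = kl_divergence(old[feat], new[feat])
    if p_value < 0.01 or kl > 0.1:        # illustrative significance thresholds
        drifted.append(feat)
print("features flagged for drift:", drifted)
```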
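And a minimal sketch of step 5's conditional fine-tuning. The in-house incremental boosted tree is not public, so scikit-learn's warm_start on a gradient-boosted model stands in here; the data is synthetic and the extra-tree count is an arbitrary choice.

```python
# Minimal sketch of "fine-tune only when drift is significant", using warm_start
# as a stand-in for the in-house incremental boosted tree. Data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X_old, y_old = rng.normal(size=(5000, 10)), rng.integers(0, 2, 5000)
X_new, y_new = rng.normal(loc=0.3, size=(1000, 10)), rng.integers(0, 2, 1000)

model = GradientBoostingClassifier(n_estimators=100, warm_start=True)
model.fit(X_old, y_old)                      # initial training on the old window

drift_detected = True                        # output of the drift check above
if drift_detected:
    model.n_estimators += 50                 # grow the ensemble...
    model.fit(X_new, y_new)                  # ...fitting the new trees on recent data
else:
    pass                                     # no drift: leave the model alone
```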
Results:
1) More fraud/risk profile capture: we caught a whopping 11% more risk profiles on a month-on-month basis
2) Time: it is tough to put into numbers how much time this saved, but we were saving a week every month
3) Other problem areas in our team and in sister teams took notice, and they implemented similar approaches
References:
1) https://dl.acm.org/doi/abs/10.1007/s11063-019-09999-3 (Incremental Boosted Trees)
2) https://arxiv.org/pdf/2210.04865.pdf (KL Divergence Drifts)
Outpatient Services Clustering - Carnegie Mellon University
Motivation:
(Work at Carnegie Mellon University in partnership with a local clinic in Pittsburgh, USA)
Can we check if the medical dosage given to Kidney Disease patients is working well?
Can we monitor their activity, sleep, and food, and analyze some patterns to help them better?
Action:
1) Started with getting the data into a workable format
a) Everyone was given an activity tracker (Fitbit-style) and we got data over a 3-month timeline
b) Patients were asked to maintain a food log
2) We wanted to group patients and find out who was doing well and who was not
(The idea was to see if the medical dosage associated with a particular person was appropriate enough)
3) We tried various clustering techniques, and eventually what worked well was Mahalanobis-distance-based K-means clustering
a) K-means clustering by default in most packages relies on Euclidean Distance
b) We decided to implement it with Mahalanobis distance, especially because of the correlations shown in the data (see the whitening-based sketch after this list)
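A minimal sketch of the Mahalanobis-distance K-means idea, using the equivalence between Mahalanobis distance under a single global covariance and Euclidean distance after whitening the data. The patient features, sample sizes, and cluster count below are illustrative assumptions, not the clinic's data.

```python
# Minimal sketch of Mahalanobis-distance K-means via whitening: transforming the
# data by the inverse Cholesky factor of its global covariance makes Euclidean
# distance in the transformed space equal Mahalanobis distance in the original.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical per-patient features: daily steps, hours of sleep, sodium intake (mg).
X = np.column_stack([
    rng.normal(6000, 2000, 200),
    rng.normal(6.5, 1.0, 200),
    rng.normal(2300, 500, 200),
])

# Whitening transform from the covariance matrix (handles correlated features).
cov = np.cov(X, rowvar=False)
L = np.linalg.cholesky(cov)
X_white = np.linalg.solve(L, (X - X.mean(axis=0)).T).T

# Standard Euclidean K-means on the whitened data == Mahalanobis K-means on X.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_white)
print(np.bincount(kmeans.labels_))  # cluster sizes
```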
Results:
1) We presented our solutions and groupings to the medical staff for better domain understanding
2) They started changing dosages based on our clustering group findings to further evaluate the improvements
Mortgage Risk Data Science - Barclays
I would say this was the project at Barclays that got me interested in data science and applied machine learning
1) I was working as a software engineer on a mortgage application that Barclays offered in the U.K. and South Africa
2) I frequently worked with the risk mitigation team that worked on Mortgage risk, customer profile risk etc.
3) I used to have engaging conversations with the machine learning lead on how they were tackling this challenge
4) I was super interested in learning more about their statistical and machine learning models
5) This was the first time I saw a very detailed application of weight-of-evidence-based logistic regression in production (a small WoE example follows at the end of this list)
6) I was inspired by the team's approach to applied machine learning and their interest in understanding customer profiles
7) After some contemplation for a year, I decided to study and do a master's in the same area to gain a thorough understanding of data science, statistics, machine learning, etc.
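For anyone curious about the technique mentioned in point 5, here is a minimal sketch of weight-of-evidence encoding for one binned feature, the standard precursor to a WoE-based logistic regression scorecard. The bins and counts are made up for illustration, not Barclays data.

```python
# Minimal sketch of weight-of-evidence (WoE) encoding for one binned feature,
# as typically used before logistic regression in credit/mortgage risk models.
import numpy as np
import pandas as pd

# Hypothetical loan-to-value bins with counts of good (repaid) and bad (defaulted) loans.
bins = pd.DataFrame({
    "ltv_bin": ["<60%", "60-80%", "80-95%", ">95%"],
    "good":    [4000,    3000,     1500,     500],
    "bad":     [40,      60,       90,       110],
})

dist_good = bins["good"] / bins["good"].sum()
dist_bad = bins["bad"] / bins["bad"].sum()

# WoE per bin: log of the ratio of the good distribution to the bad distribution.
bins["woe"] = np.log(dist_good / dist_bad)
# Information value summarises how predictive the whole binned feature is.
iv = ((dist_good - dist_bad) * bins["woe"]).sum()

print(bins[["ltv_bin", "woe"]])
print(f"information value: {iv:.3f}")
```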