I am building a system to flag potentially fraudulent transactions. The flagged transactions will then be manually verified, which in turn helps me build a labelled dataset over time.
For now I have transaction data and customer behavior information.
I intend to approach it as follows. The possible fraud cases include:
- Abuse of cashbacks and discounts (coupons / vouchers / auto refund)
- Retailing
- Acquiring sensitive SKUs
Since I don’t really have labelled data, I am going with an unsupervised learning approach (isolation forest). I plan on having 3 modules: Users, SKUs, and Localities.
For the last two I am struggling to set meaningful thresholds. I standardized the slope and the intercept of the sales trend, then divided them to get a compound variable, which I am using for comparison.
Please share thoughts and/or resources.
This seems like it’ll be a really long journey without labeled data. Do you have an idea of how much fraud is in your sample set?