This is a self-paced project for SQL learning and practicing. The data is from IMI BIGDataAIHub, synthesized data from Scotiabank.
I firstly import raw datasets into Microsoft Azure Database, and then use SQL Server Management Studio (SMSS) to process the csv files into SQL tables. After cleaning, I created an analysis table that contains kyc data and actual transaction records for explorative data analysis. In the EDA, I convert raw transaction number into several indicators such as income-transaction ratio, volatility, and high-value ratio. Then I used window function to compare every customer within their occupation group and combined it with indicators to tag suspicious customers.
- The most suspicious customer criteria must be the kyc mismatch. For example, when the customer's occupation doesn't have a usual income (i.e. under 1000), but the total transaction amount is very big, or the high-value transactions are very frequent, then the customer's behaviour is not consistent.
- Identifying fraud is not always by one hard threshold. I observed that a lot of non-fraud customers have high volatility or high-value transactions, so the high volume is not a fraud signal. However, when combining it with other indicators such as group deviation (window function) and transaction ratio, the suspicious candidates become fewer and more precise.
- The global pattern doesn't contribute very much to AML analysis. During EDA, when I look at the entire table, no combination of indicators will perfectly separate fraud data from other non-fraud data. But when setting a particular scene, such as all customers from Ontario, have the same occupation, or have a similar total transaction amount, the rest of the indicators may distinguish fraud by abnormal spikes, a larger high-value ratio, or stable volatility compared to normal customers.