In recent times, the number of fraudulent transactions has risen sharply, creating serious challenges for credit card companies. For many banks, retaining highly profitable customers is the most important business goal, and banking fraud poses a significant threat to it. Because it causes substantial financial loss and erodes trust and credibility, banking fraud is a serious concern for banks and customers alike.
Problem Statement
Finex is not well equipped with the latest financial technologies, and it is becoming difficult for the bank to detect these data breaches in time to prevent further losses. The Branch Manager is concerned about the situation and wants to identify the likely root causes and action areas in order to arrive at a long-term solution that helps the bank generate high revenue with minimal losses.
Data Understanding
The data set contains credit card transactions of around 1,000 cardholders with a pool of 800 merchants from 1 Jan 2019 to 31 Dec 2020. It contains a total of 1,852,394 transactions, of which 9,651 are fraudulent. The data set is highly imbalanced: the positive class (frauds) accounts for only 0.52% of all transactions, so the imbalance must be handled before model building. The feature 'amt' represents the transaction amount. The attribute 'is_fraud' is the class label and takes the value 1 if the transaction is fraudulent and 0 otherwise.
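One common way to handle this kind of imbalance is random oversampling of the minority class. The article does not show how its balanced data was produced, so the following is only a minimal sketch under assumptions: the helper name oversample_minority, the toy values, and the choice of sklearn.utils.resample are illustrative, not the author's method.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(df, label_col="is_fraud", random_state=100):
    """Duplicate minority-class rows at random until both classes are equal in size.

    Assumes a binary label column; the function name is hypothetical.
    """
    counts = df[label_col].value_counts()
    majority_label = counts.idxmax()
    majority = df[df[label_col] == majority_label]
    minority = df[df[label_col] != majority_label]
    # Sample the minority class with replacement up to the majority size
    minority_upsampled = resample(
        minority, replace=True, n_samples=len(majority), random_state=random_state
    )
    # Concatenate and shuffle so the two classes are interleaved
    balanced = pd.concat([majority, minority_upsampled])
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data standing in for the real transactions (values are made up)
toy = pd.DataFrame({
    "amt": [12.5, 80.0, 7.3, 910.0, 45.1, 23.9, 650.0, 5.0],
    "is_fraud": [0, 0, 0, 1, 0, 0, 1, 0],
})
oversampled_toy = oversample_minority(toy)
print(oversampled_toy["is_fraud"].value_counts())
```

A design note: applying the resampling to the training portion only keeps duplicated fraud rows out of the test set, which makes the test scores a more reliable check.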
Exploratory Data Analysis
To understand the data better, we can check the variables that are measured in the dataset.
Overall, only a small percentage of transactions are fraudulent (0.52%).
Most cardholders who transact are between 40 and 60 years old.
We can also observe a few outliers in the transactions, but these are naturally occurring variations, so we leave them unchanged.
No pair of variables is highly correlated.
Fraudulent transactions are shown in red and non-fraudulent ones in blue. Most fraudulent transactions happen between 12 AM and 5 AM or between 9 PM and 12 AM.
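The time-of-day pattern can be checked numerically by grouping on the hour of each transaction. This is a rough sketch on made-up rows; the column name trans_time is a hypothetical stand-in for whatever timestamp column the data set provides, and the night window mirrors the hours quoted in the observation.

```python
import pandas as pd

# Toy transactions; "trans_time" is a hypothetical timestamp column name
toy_tx = pd.DataFrame({
    "trans_time": pd.to_datetime([
        "2019-01-01 01:15", "2019-01-01 03:40", "2019-01-01 10:05",
        "2019-01-01 14:30", "2019-01-01 22:10", "2019-01-01 23:55",
        "2019-01-02 01:20", "2019-01-02 12:00",
    ]),
    "is_fraud": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Hour of day (0-23) and the share of fraudulent transactions in each hour
toy_tx["hour"] = toy_tx["trans_time"].dt.hour
fraud_rate_by_hour = toy_tx.groupby("hour")["is_fraud"].mean()

# Night window from the observation: 12 AM to 5 AM or 9 PM to 12 AM
toy_tx["is_night"] = (toy_tx["hour"] < 5) | (toy_tx["hour"] >= 21)
night_rate = toy_tx.loc[toy_tx["is_night"], "is_fraud"].mean()
day_rate = toy_tx.loc[~toy_tx["is_night"], "is_fraud"].mean()

print(fraud_rate_by_hour)
print(f"night fraud rate: {night_rate:.2f}, day fraud rate: {day_rate:.2f}")
```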
All variables are useful for detecting fraudulent transactions. For further analysis, see the GitHub notebook: https://github.com/Srikar-R/Decision-Tree/blob/main/Credit_Card_Detection_Srikar.ipynb
Using these variables, we now build different models to predict fraudulent transactions.
Model Building
We consider three models, chosen because the dependent variable is categorical:
i) Logistic Regression
ii) Decision Tree
iii) Random Forest
We will pick the model that makes the fewest errors, as measured by precision and recall.
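For reference, precision is the share of predicted frauds that really are frauds, and recall is the share of actual frauds that the model catches. The small example below uses made-up labels to show how both follow from the confusion matrix; it is an illustration only and does not use the article's data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = fraudulent, 0 = non-fraudulent
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # predicted frauds that are real frauds
recall = tp / (tp + fn)     # real frauds that were caught

print(f"precision = {precision:.2f}, recall = {recall:.2f}")

# The library helpers give the same values
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
```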
i) Logistic Regression
We split the dataset into training and test sets:
# Hold out 30% of the oversampled data for testing
X_train, X_test, y_train, y_test = train_test_split(
    oversampled[X], oversampled[y], train_size=0.7, test_size=0.3, random_state=100)
Then we build the model:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
Precision and recall are both above 0.7, which is a reasonable result. Next, we fit a decision tree on the same dataset.
ii) Decision Tree
dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=20, random_state=0)
dt_clf.fit(X_train, y_train)
We get high precision and recall, which is not necessarily a good sign because the model may be overfitted. Lastly, we fit a random forest on the same dataset.
iii) Random Forest
rf_clf = RandomForestClassifier(max_depth=20, n_estimators=50, random_state=345, verbose=1)
rf_clf.fit(X_train, y_train)
The random forest model also shows high precision and recall.
Next, all three models are evaluated on the test dataset to check for overfitting, and based on the results we pick the best model.
Model Evaluation
We evaluate the models on the test data to see whether they overfit.
i) Logistic Regression
ii) Decision Tree
iii) Random Forest
We can observe that the precision and recall values on the test data remain close to those on the training data.
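The train-versus-test comparison described above can be sketched as a loop over the three models. Everything in the data setup is an assumption: make_classification generates synthetic balanced data in place of the oversampled transactions, the names X_demo, X_tr and X_te are illustrative, max_iter=1000 is added so logistic regression converges on this data, and verbose is dropped to keep the output short. The scores it prints say nothing about the article's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced data standing in for the oversampled transactions (assumption)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=100
)

# Same hyperparameters as in the article, except max_iter and verbose (see lead-in)
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(criterion="gini", max_depth=20, random_state=0),
    "Random Forest": RandomForestClassifier(max_depth=20, n_estimators=50, random_state=345),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = {}
    # Score each model on both splits; a large train/test gap suggests overfitting
    for split, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        pred = model.predict(X_part)
        scores[name][split] = (
            precision_score(y_part, pred),
            recall_score(y_part, pred),
        )
    (p_tr, r_tr), (p_te, r_te) = scores[name]["train"], scores[name]["test"]
    print(f"{name}: train P={p_tr:.3f} R={r_tr:.3f} | test P={p_te:.3f} R={r_te:.3f}")
```

If the training scores are much higher than the test scores for a model, that model is likely memorizing the training data rather than generalizing.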
Conclusion
High recall is essential because we do not want to classify fraudulent transactions as non-fraudulent. If fraudulent transactions are labeled as non-fraudulent, many of them will go unnoticed, which hurts the business.
Of the three models, the decision tree is the best choice over the random forest and logistic regression models, since its precision and recall are the highest.
Notes and References
1) Link to Full Python Code: https://github.com/Srikar-R/Decision-Tree/blob/main/Credit_Card_Detection_Srikar.ipynb
2) Libraries used:
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
import scipy.stats as stats
from sklearn.feature_selection import RFE