
Credit Card Fraud Detection





The number of fraudulent transactions has risen sharply in recent years, creating serious challenges for credit card companies. For many banks, retaining highly profitable customers is the most important business goal, and banking fraud poses a significant threat to it. Because it causes substantial financial loss and erodes trust and credibility, banking fraud is a concern for banks and customers alike.




Problem Statement

Finex is not equipped with the latest financial technologies, and it is becoming difficult for the bank to detect these data breaches in time to prevent further losses. The Branch Manager is worried about the situation and wants to identify the possible root causes and action areas in order to arrive at a long-term solution that helps the bank generate high revenue with minimal losses.





Data Understanding

The data set contains credit card transactions of around 1,000 cardholders with a pool of 800 merchants from 1 Jan 2019 to 31 Dec 2020. It contains a total of 1,852,394 transactions, of which 9,651 are fraudulent. The data set is therefore highly imbalanced: the positive class (frauds) accounts for only 0.52% of all transactions, and this imbalance must be handled before model building. The feature 'amt' represents the transaction amount. The attribute 'is_fraud' is the class label; it takes the value 1 if the transaction is fraudulent and 0 otherwise.
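One common way to handle such an imbalance is random oversampling of the minority class. The text does not say which technique was used, so the following is only a minimal sketch under that assumption, run on a small toy frame rather than the real transactions. The column names 'amt' and 'is_fraud' come from the data set description; the toy values and the 1% fraud share are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the real transactions: 1,000 rows, 1% fraud (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amt": rng.uniform(1, 500, size=1000).round(2),
    "is_fraud": [1] * 10 + [0] * 990,
})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Draw fraud rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
oversampled = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

fraud_share_before = df["is_fraud"].mean()
fraud_share_after = oversampled["is_fraud"].mean()
print(f"fraud share before: {fraud_share_before:.2%}, after: {fraud_share_after:.2%}")
```

When oversampling is applied before the train/test split, copies of the same fraud row can land in both sets, so it is generally safer to oversample only the training portion.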




Exploratory Data Analysis

To understand the data better, we first examine the variables recorded in the data set.


  • Overall, only a small percentage of transactions (0.52%) are fraudulent


  • Most cardholders who transact are in the 40-60 age group


  • A few outliers appear in the transactions, but these are naturally occurring variations, so they are left untouched




  • No pair of variables is highly correlated


  • In the plot, fraudulent transactions are shown in red and non-fraudulent ones in blue. Most fraudulent transactions happen between 12 AM and 5 AM or between 9 PM and 12 AM.
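The time-of-day observation in the last bullet can be checked by extracting the hour from each transaction timestamp and comparing fraud rates for night and day. This is a minimal sketch on six made-up rows; the column name 'trans_time' is hypothetical, since the timestamp column is not named in the text, and the night window (9 PM to 5 AM) is taken from the bullet above.

```python
import numpy as np
import pandas as pd

# Toy transactions; 'trans_time' is a hypothetical column name.
df = pd.DataFrame({
    "trans_time": pd.to_datetime([
        "2019-01-01 01:15", "2019-01-01 02:40", "2019-01-01 13:05",
        "2019-01-01 14:30", "2019-01-01 22:10", "2019-01-01 23:55",
    ]),
    "is_fraud": [1, 0, 0, 0, 1, 1],
})

# Hour of day, 0-23.
df["hour"] = df["trans_time"].dt.hour

# Night covers 9 PM to midnight and midnight to 5 AM.
night_mask = (df["hour"] >= 21) | (df["hour"] < 5)
df["period"] = np.where(night_mask, "night", "day")

# Share of fraudulent transactions in each period.
fraud_rate_by_period = df.groupby("period")["is_fraud"].mean()
print(fraud_rate_by_period)
```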


All variables are useful for detecting fraudulent transactions. For further analysis, see the notebook on GitHub: https://github.com/Srikar-R/Decision-Tree/blob/main/Credit_Card_Detection_Srikar.ipynb


Using these variables, we now build several models to predict fraudulent transactions.


Model Building

We build three models. These were chosen because the dependent variable is categorical:


i) Logistic Regression

ii) Decision Tree

iii) Random Forest


We will pick the model that makes the fewest errors, as measured by precision and recall.
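As a reminder of what the two metrics measure, precision is the share of predicted frauds that really are frauds, TP / (TP + FP), and recall is the share of actual frauds that the model catches, TP / (TP + FN). The sketch below computes both with scikit-learn on ten made-up labels; the label vectors are illustrative and are not results from the data set.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels: 1 = fraud, 0 = non-fraud.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one fraud is missed (a false negative) and one legitimate transaction is flagged (a false positive), so both metrics come out at 0.75.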


i) Logistic Regression


We split the oversampled dataset into training (70%) and test (30%) sets:

X_train, X_test, y_train, y_test = train_test_split(
    oversampled[X], oversampled[y], train_size=0.7, test_size=0.3, random_state=100
)

Then we fit the model:

logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

Both precision and recall are above 0.7, which is a reasonably good result. Next, we fit a decision tree on the same data set.


ii) Decision Tree


We get very high precision and recall, which is not necessarily a good sign because it may indicate that the model is overfitted. Lastly, we fit a random forest model on the same data set.


dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=20, random_state=0)
dt_clf.fit(X_train, y_train)




iii) Random Forest

The random forest model also shows high precision and recall.


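# Variable name rf_clf is assumed here, mirroring dt_clf above.
rf_clf = RandomForestClassifier(max_depth=20, n_estimators=50, random_state=345, verbose=1)
rf_clf.fit(X_train, y_train)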



Now, all three models must be evaluated on the test data set to check for overfitting; based on that, we can pick the best model.



Model Evaluation

We evaluate each model on the test data to see whether it overfits.


i) Logistic Regression





ii) Decision Tree





iii) Random Forest



We can observe that precision and recall on the test data are essentially the same as on the training data, so none of the models shows signs of overfitting.
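The train-versus-test comparison described in this section can be sketched as follows. This is an illustration on synthetic data generated with make_classification, not a reproduction of the results above; the helper name precision_recall is hypothetical, and the decision tree settings are copied from the Model Building section.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100
)

model = DecisionTreeClassifier(criterion="gini", max_depth=20, random_state=0)
model.fit(X_train, y_train)


def precision_recall(fitted_model, features, labels):
    """Return (precision, recall) of a fitted model on one data split."""
    predictions = fitted_model.predict(features)
    return precision_score(labels, predictions), recall_score(labels, predictions)


train_scores = precision_recall(model, X_train, y_train)
test_scores = precision_recall(model, X_test, y_test)

# A large positive gap (train much better than test) points to overfitting.
gaps = tuple(train - test for train, test in zip(train_scores, test_scores))

print("train precision/recall:", train_scores)
print("test precision/recall: ", test_scores)
print("train minus test gap:  ", gaps)
```

If the gaps stay close to zero, the model generalizes about as well as it fits the training data; a deep tree on small synthetic data will often show a visible gap.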


Conclusion

A high recall is very important because we do not want to classify fraudulent transactions as non-fraudulent. Every fraudulent transaction labeled as non-fraudulent goes unnoticed, and many such misses will hurt the business.




Of the three models, the decision tree is the best, ahead of the random forest and logistic regression models, because its precision and recall are the highest.


Notes and References



2) Libraries used:



import sklearn
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import silhouette_score
from sklearn.feature_selection import RFE
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree








