
As the use of Blockchain for digital payments continues to rise in popularity, it also becomes susceptible to various malicious attacks. Successfully detecting anomalies within Blockchain transactions is essential for bolstering trust in digital payments. However, the task of anomaly detection in Blockchain transaction data is challenging due to the infrequent occurrence of illicit transactions. Although several studies have been conducted in the field, a limitation persists: the lack of explanations for the model's predictions. This study seeks to overcome this limitation by integrating eXplainable Artificial Intelligence (XAI) techniques and anomaly rules into tree-based ensemble classifiers for detecting anomalous Bitcoin transactions. The SHapley Additive exPlanations (SHAP) method is employed to measure the contribution of each feature, and it is compatible with ensemble models. Moreover, we present rules for interpreting whether a Bitcoin transaction is anomalous or not. Additionally, we introduce an under-sampling algorithm named XGBCLUS, designed to balance anomalous and non-anomalous transaction data, and compare it against other commonly used under-sampling and over-sampling techniques. Finally, the outcomes of various tree-based single classifiers are compared with those of stacking and voting ensemble classifiers. Our experimental results demonstrate that: (i) XGBCLUS enhances TPR and ROC-AUC scores compared to state-of-the-art under-sampling and over-sampling techniques, and (ii) our proposed ensemble classifiers outperform traditional single tree-based machine learning classifiers in terms of accuracy, TPR, and FPR scores.
- Introduction
Blockchain, a chain of blocks that records the history of transactions or other application data in a public ledger, has been regarded as an emerging technology in both academia and industry over the last decade Nofer, Gomber, Hinz and Schiereck (2017). Bitcoin, the first digital cryptocurrency, was proposed in 2008 and subsequently implemented by Satoshi Nakamoto Nakamoto (2008). Although Blockchain was created to support the popular Bitcoin currency, transactions of other digital cryptocurrencies such as Ethereum, Ripple, and Litecoin, as well as health records, transportation data, and IoT applications Yaga, Mell, Roby and Scarfone (2019), are stored in its blocks in a decentralized manner and are managed without the help of a third-party organization. Prominent attributes like trustworthiness, verifiability, decentralization, and immutability have rendered Blockchain an effective integration partner for various Information and Communication Technology (ICT) applications.
Nevertheless, this technology remains susceptible to an array of challenges, encompassing security breaches, privacy concerns, energy consumption, regulatory policies, and issues like selfish mining Monrat, Schelén and Andersson (2019).
Despite its growing popularity in digital payments, Bitcoin remains susceptible to a range of attacks, encompassing temporal attacks, spatial attacks, and logical-partitioning attacks Saad, Cook, Nguyen, Thai and Mohaisen (2019).
To ensure the effective implementation of blockchain technology, the timely detection of malicious behavior or transactions within the network, or the identification of novel instances in the data, is imperative Hassan, Rehmani and Chen (2022). Swift and appropriate actions must be taken to mitigate potential risks. Within the context of blockchain systems, anomaly detection assumes paramount importance. This facet aids in the identification and prevention of potential malicious activities, thereby upholding the system’s integrity Signorini, Pontecorvi, Kanoun and Di Pietro (2018).
Nonetheless, the inherent imbalance between normal and anomalous data in blockchain datasets, such as those of Bitcoin or Ethereum transactions, presents a substantial challenge for conventional anomaly detection methodologies. In many instances, the frequency of anomalous data points significantly pales in comparison to that of normal data points, thereby yielding imbalanced datasets. Such skewed data distribution can exert an adverse impact on the efficacy of anomaly detection algorithms. These algorithms, often biased towards the majority class (normal data), grapple with difficulties in accurately discerning the minority class (anomalous data) Ashfaq, Khalid, Yahaya, Aslam, Azar, Alsafari and Hameed (2022).
Several over-sampling and under-sampling techniques, such as the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic sampling (ADASYN), Random Under Sampling (RUS), and Near-Miss, have been used to handle imbalanced data in various domains. Under-sampling techniques can alleviate the bias towards the majority class and enhance the performance of anomaly detection algorithms El Hajjami, Malki, Berrada and Fourka (2020). In this study, the number of anomalous Bitcoin transactions is far lower than the number of legal transactions; however, it is necessary to identify both the illicit and the normal Bitcoin transactions accurately. Since there are only 108 anomalous cases in this dataset, under-sampling techniques select the same number of non-anomalous cases, and the models learn from the positive samples, which helps them classify the anomalous transactions in the independent test set more correctly. Over-sampling methods, in contrast, generate artificial synthetic data to equalize the minority and majority samples Han, Woo and Hong (2020). Synthetic over-sampled data can reduce false positive rates on the test dataset, but it does not perform well for highly imbalanced data. For this reason, variations of Generative Adversarial Network (GAN) based over-sampling techniques have been investigated by researchers for generating artificial data to balance the minority and majority classes Ahsan, Shi, Ma and Lee Croft (2022). Although there are several under-sampling algorithms, no technique is free from limitations. A major problem with most under-sampling algorithms is that significant instances with a great impact on model training may be missed Saripuddin, Suliman, Syarmila Sameon and Jorgensen (2021). Besides, overfitting, sensitivity to noise, and the introduction of bias into the dataset can reduce the performance of machine learning models Alsowail (2022). Moreover, no single under-sampling technique works best across all datasets. Considering these issues, we propose an under-sampling technique based on the ensemble method. Our proposed algorithm selects a subset of instances from the majority class that has a significant impact on the performance of the machine learning models. The algorithm gives importance to all samples in the dataset by iteratively selecting subsets, which minimizes the chance of missing significant samples. We have also investigated some combined balancing techniques to compare which technique performs better for Bitcoin anomaly detection.
After balancing the data, selecting a machine learning classifier is another challenging task. Tree-based machine learning classifiers have been used in many studies to classify malicious activities Sarker (2023), since faster training can be performed on tree-based classifiers Rashid, Kamruzzaman, Imam, Wibowo and Gordon (2022). However, several studies show that ensemble methods can perform better than a single machine learning algorithm on large-scale data such as Bitcoin transactions Zhou, Cheng, Jiang and Dai (2020). The idea behind the stacked ensemble method is that by combining multiple models, the strengths of each model can be leveraged while mitigating their weaknesses Xia, Chen and Yang (2021). The stacked ensemble method differs from other ensemble methods, such as bagging and boosting, in that it combines models with different algorithms and/or hyperparameters rather than replicating the same model multiple times. This allows for greater model diversity and can lead to better performance. On the other hand, voting ensemble models predict the output using either the majority of individual votes (hard voting) or the highest average of the individual predicted probabilities (soft voting) Yang, Lin, Wu and Wang (2021). Although the complexity and computational cost may increase due to training multiple base models, there are valid reasons for using the stacking ensemble method. Firstly, tree-based models are computationally faster than other Machine Learning (ML) models like SVM or KNN. Secondly, tree-based models can handle the data without any conversion or normalization Pham, Foo, Suriadi, Jeffrey and Lahza (2018). Thirdly, a stacking-based ensemble model can increase performance by minimizing the prediction errors caused by the variance components. Moreover, a voting classifier built only from tree-based models is a powerful and interpretable ensemble learning technique. Combining the predictions of different tree-based algorithms can provide more accurate and robust predictions, making it an attractive choice for various classification and regression tasks. Additionally, the interpretability of tree-based models is preserved in the ensemble, which can be a valuable asset in applications where model transparency and understanding are essential. Considering the effectiveness of the combined tree-based methods, both stacked and voting ensemble models are proposed in this study.
After performing all these tasks for anomaly detection, a question can arise: "Should we trust the prediction of the black-box model?". XAI, eXplainable Artificial Intelligence, is a field of interest that aims to answer this question. It helps increase the explainability and transparency of black-box AI models by making their complex decisions interpretable Ward, Wang, Lu, Bennamoun, Dwivedi and Sanfilippo (2021). Two popular XAI techniques, 'Local Interpretable Model-Agnostic Explanations' (LIME) Ribeiro, Singh and Guestrin (2016) and 'SHapley Additive exPlanations' (SHAP) Lundberg and Lee (2017), have been used by researchers to demonstrate the explainability and transparency of AI models. SHAP has been investigated in our study since it can produce interpretable decisions more quickly and accurately.
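A minimal sketch of how SHAP's TreeExplainer can be applied to a tree-based classifier is shown below; the synthetic data, the feature names, and the XGBoost hyperparameters are illustrative assumptions, not the exact pipeline used in this study.

```python
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Stand-in data; in practice this would be the preprocessed Bitcoin features.
X, y = make_classification(n_samples=2000, n_features=6, weights=[0.95, 0.05],
                           random_state=42)
X = pd.DataFrame(X, columns=["indegree", "in_btc", "out_btc",
                             "total_btc", "mean_in_btc", "mean_out_btc"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)        # exact, fast attribution for trees
shap_values = explainer.shap_values(X_test)  # per-feature contribution per sample
shap.summary_plot(shap_values, X_test)       # global view of feature influence
```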
In this study, an eXtreme Gradient Boosting-based Clustering (XGBCLUS) under-sampling method has been proposed and compared to other state-of-the-art under-sampling techniques to detect anomalous Bitcoin transactions. Besides, two popular over-sampling techniques, SMOTE and ADASYN, have been used to generate synthetic data points for balancing the majority and minority classes, and they are also compared to the combined sampling methods SMOTE with Edited Nearest Neighbor (ENN) and SMOTE with TOMEK Links. A comparative analysis has been performed between the down-sampling and over-sampling methods to decide which sampling technique is appropriate for identifying anomalous transactions. To classify non-illicit and illicit Bitcoin transactions, tree-based ML classifiers such as eXtreme
Gradient Boosting (XGB) Chen, He, Benesty, Khotilovich, Tang, Cho, Chen, Mitchell, Cano, Zhou et al. (2015), Random Forest (RF) Biau and Scornet (2016), Decision Tree (DT) Sharma, Kumar et al. (2016), Gradient Boosting (GB) Natekin and Knoll (2013), and Adaptive Boosting (AdB) Rojas et al. (2009) have been used. Although a single ML classifier may work well for anomaly detection on the training dataset, its final predictions on the independent test dataset may be poor. Therefore, both stacking and voting ensemble models have been proposed to increase the accuracy by combining the outcomes of the individual classifiers. The main reason is that if one of the chosen ML classifiers in the ensemble does not perform well, the risk can be minimized by averaging the outputs of all of them. Cross-validation has been used to avoid overfitting, where the original training dataset is used to create multiple mini train-test splits. Specifically, 10-fold cross-validation has been used by partitioning the data into 10 subsets, known as folds; the algorithm is then iteratively trained on 9 folds while the remaining fold is kept as the validation set. Finally, the correctness of the predictions of the black-box ML models has been verified using the XAI technique, i.e., SHAP analysis. A set of rules is also presented to conduct interpretability analysis for determining whether a Bitcoin transaction is anomalous or not. The contributions of this study are summarised below.
- We introduce an under-sampling algorithm based on eXtreme Gradient Boosting (XGBoost) called XGBCLUS, and we compare it with state-of-the-art methods.
- We also explore various over-sampling and combined sampling techniques for classifying Bitcoin transactions.
- Further, we compare the effectiveness of both under-sampling and over-sampling techniques, and we also compare the tree-based ensemble classifiers with the individual ML classifiers for anomaly detection.
- We explain the predictions of the ensemble models using SHAP (an eXplainable Artificial Intelligence technique) and identify the crucial features that exert the most influence on classifying Bitcoin transactions.
- Lastly, we present a set of rules derived from a tree-based model to conduct interpretability analysis for anomalous transactions.
The rest of the paper is structured as follows: Section 2 provides a summary of recent research involving both supervised and unsupervised machine learning techniques. In Section 3, we outline the specifics of the proposed methods. Comparative results are presented in Section 4. Section 5 contains a discussion, and, lastly, Section 6 concludes the research paper.
- Related Work
Researchers have paid attention to detecting or predicting whether a transaction is anomalous or not using the concept of blockchain intelligence Zheng, Dai and Wu (2019), i.e., introducing Artificial Intelligence (AI) for anomaly detection or fraud detection in blockchain transaction data. Several supervised and unsupervised techniques along with various balancing methods have been applied to detect fraudulent transactions in blockchain networks.
2.1. Supervised Techniques
Chen et al. Chen, Wei and Gu (2021) employed supervised machine learning classifiers, including Random Forest (RF), Adaptive Boosting, MLP, SVM, and KNN, to detect bitcoin theft. RF exhibited the best performance with an F1 value of 0.952, surpassing other unsupervised algorithms. Another study Yin and Vatrapu (2017) explored the Bitcoin ecosystem, categorizing illegal activities and using Bagging and Gradient Boosting for classification. While visualizing results, a notable limitation was the absence of proper balancing techniques for model tuning. Singh et al. Singh (2019) utilized SVM, Decision Tree, and Random Forest for Ethereum network anomaly detection, detecting anomalous transactions with some limitations in dataset coverage. Active learning tools were applied in a study Lorenz, Silva, Aparício, Ascensão and Bizarro (2020) for illegal activity detection in Bitcoin transactions, claiming superiority over unsupervised methods. A comparative study Alarab, Prakoonwit and Nacer (2020) favored ensemble-based methods for classifying non-anomalous and anomalous Bitcoin transactions based on accuracy and F1-score metrics.
2.2. Unsupervised Techniques
In their analysis Pham and Lee (2016), the authors used Bitcoin transaction data to create two graphs for users and transactions, aiming to detect anomalies. Employing unsupervised methods such as SVM, K-means clustering, and Mahalanobis distance, they identified two anomalous users and one anomalous transaction out of 30 cases. However, the method struggled to identify maximum positive cases. Another study Sayadi, Rejeb and Choukair (2019a) utilized the K-means algorithm for anomaly detection in blockchain electronic transactions, also incorporating OSVM to find outliers. This approach suffered from high false positive rates. In a comparative study Arya, Harika, Rahul, Narasimhan and Ashok (2021), various unsupervised learning algorithms, including IForest, One Class SVM, Two Phase Clustering, and Multivariate Gaussian, were evaluated, with the Multivariate Gaussian algorithm showing the highest F1-Score. Evaluating the trimmed K-Means clustering algorithm, Monamo et al. Monamo, Marivate and Twala (2016) successfully identified 5 anomalous activities out of 30 cases related to illicit transactions in the Bitcoin network. An encoder-decoder-based deep learning model was designed to detect anomalous activities in the Ethereum transaction network, claiming to detect illicit activities in the Ethereum network for the first time Scicchitano, Liguori, Guarascio, Ritacco and Manco (2020). Lastly, clustering and role detection methods were applied in a study by Hirshman, Huang and Macke (2013) to identify suspicious users in Bitcoin transaction data, using K-means for clustering and RoIX for role detection.
2.3. Balancing Techniques
Researchers, as highlighted in Li, Cai, Tian, Xue and Zheng (2020), have proposed various undersampling and oversampling methods to enhance evaluation metrics. For instance, in a comparative study by Alarab and Prakoonwit (2022), customized nearest-neighbor undersampling achieved 99% accuracy, outperforming various SMOTE-based oversampling techniques in analyzing Bitcoin and Ethereum transaction data. In the realm of bank transactions, ensemble-based classifiers, particularly SVM SMOTE dataset balancing with the Random Forest classifier, were found effective in detecting fraudulent activities by Taneja, Suri and Kothari (2019). Similarly, authors in Ahmad, Kasasbeh, Aldabaybah and Rawashdeh (2023) proposed an undersampling technique using fuzzy C-means clustering and similarity checks for identifying illicit activities in credit card transactions. Additionally, SMOTE oversampling has been employed in studies such as Prasetiyo, Muslim, Baroroh et al. (2021) and Yang, Zhang, Ye, Li and Xu (2019) to classify illegal activities in credit card transactions, while various undersampling techniques were investigated in Itoo and Singh (2021) and Xuan, Liu, Li, Zheng, Wang and Jiang (2018) to discern normal and illegal activities. Notably, for handling highly imbalanced datasets, Ahmed, Hasan, Hossain and Anderson (2022) explored several undersampling techniques for early product back-order prediction.
- Methodology
The overall methodology of this study is illustrated in Figure 1. Initially, we collected the dataset and conducted necessary preprocessing. Subsequently, the dataset was divided into training and testing data. Data sampling was exclusively applied to the training data, while the test data remained independent. The sampled data was employed to train both individual and ensemble machine learning classifiers. Finally, the independent test data was utilized to validate the models, employing various evaluation metrics including Accuracy, TPR, FPR, and ROC-AUC score. Additionally, we conducted a comparative analysis, combined with eXplainable Artificial Intelligence (XAI) and rules from Decision Tree, which is discussed in Section 4.
3.1. Dataset
The Bitcoin transaction data has been collected from the IEEE Data Portal. It contains a total of 30,248,134 samples, of which 30,248,026 have been labeled as negative samples, i.e., non-malicious transactions. On the other hand, there are only 108 malicious samples, so it is clear that the dataset is highly imbalanced. Exploratory Data Analysis (EDA) has been performed to reveal insights into the dataset. There are 12 attributes along with a label indicating whether a Bitcoin transaction is anomalous or not; anomalous transactions are labeled 1 and non-anomalous transactions 0. A correlation matrix of the features is depicted in Figure 2. Features such as in_btc, out_btc, total_btc, mean_in_btc, and mean_out_btc exhibit strong positive correlations. Conversely, the indegree and outdegree features display weak correlation. While the in_malicious, out_malicious, is_malicious, and all_malicious features are notably correlated with the output feature out_and_tx_malicious, no substantial correlation is observed between these features and indegree, outdegree, in_btc, out_btc, total_btc, mean_in_btc, and mean_out_btc.
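A minimal sketch of how such a correlation matrix could be produced with pandas is shown below; the CSV path is a hypothetical placeholder for the dataset exported from the IEEE Data Portal.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical path; substitute the actual export of the IEEE Data Portal dataset.
df = pd.read_csv("bitcoin_transactions.csv")

corr = df.corr()                      # pairwise Pearson correlations of the attributes
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the transaction features")
plt.tight_layout()
plt.show()
```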
Furthermore, we conducted a hypothesis test for feature selection, employing the T-test to examine differences between positive (anomalous Bitcoin transactions) and negative (non-anomalous Bitcoin transactions) samples. The T-test determines whether a significant difference exists between the means of the positive and negative samples. T-statistic values and corresponding p-values are computed for each attribute and presented in Table 1. Notably, all attributes demonstrate significance with p-values below 0.01, except for outdegree. Additionally, it is worth noting that the in_malicious, out_malicious, is_malicious, and all_malicious features share the same value range as the target feature out_and_tx_malicious. Consequently, this similarity poses a challenge for ML classifiers in effectively discerning Bitcoin transactions, leading to the exclusion of these four features along with outdegree. Ultimately, a set of seven features, including the target feature, is selected for classification. A summary of the selected features is shown in Table 2.
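As a minimal sketch of this feature-selection step, the snippet below computes per-attribute T-statistics and p-values with SciPy; the DataFrame name, the label column, and the use of an independent two-sample T-test are assumptions for illustration.

```python
import pandas as pd
from scipy import stats


def t_test_per_feature(df: pd.DataFrame,
                       label_col: str = "out_and_tx_malicious") -> pd.DataFrame:
    """Independent two-sample T-test of each attribute between positive
    (anomalous) and negative (non-anomalous) transactions."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    rows = []
    for col in df.columns.drop(label_col):
        t_stat, p_val = stats.ttest_ind(pos[col], neg[col])
        rows.append({"Attribute": col, "t value": t_stat, "p value": p_val})
    return pd.DataFrame(rows)

# Attributes with p < 0.01 would be retained, mirroring Table 1:
# significant = t_test_per_feature(df).query("`p value` < 0.01")
```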
Following the feature selection process, the negative and positive samples were segregated, and duplicate entries were eliminated exclusively from the negative samples. To mitigate computational complexity, we opted to retain only 200,000 negative samples, accompanied by 108 positive samples. Nevertheless, the resulting imbalance ratio remains considerably high as shown in Figure 3.
Table 1 T-statistic and p-values of the attributes
Attribute | t value | p value |
---|---|---|
indegree | -14.013838 | 0.000000 |
outdegree | 0.842249 | 0.399648 |
in_btc | -17.229753 | 0.000000 |
out_btc | -16.469202 | 0.000000 |
total_btc | -16.864202 | 0.000000 |
mean_in_btc | -8.727102 | 0.000000 |
mean_out_btc | -16.014732 | 0.000000 |
in_malicious | -68.869826 | 0.000000 |
is_malicious | -3878.622465 | 0.000000 |
all_malicious | -1866.899584 | 0.000000 |
Table 2 Summary of the selected features
Feature Name | Description |
---|---|
Indegree | No. of inputs for a given transaction |
in_btc | No. of bitcoins on each incoming edge to a given transaction |
out_btc | No. of bitcoins on each outgoing edge from a given transaction |
total_btc | Total number of bitcoins for a given transaction |
mean_in_btc | Average number of bitcoins on each incoming edge to a given transaction |
mean_out_btc | Average number of bitcoins on each outgoing edge from a given transaction |
out_and_tx_malicious | Status of a given transaction if it is malicious or not |
3.2. Imbalanced Data Handling
In Bitcoin transactions, the number of illicit transactions is significantly lower than that of normal transactions, leading to an imbalanced dataset. Consequently, machine learning classifiers tend to exhibit bias toward the majority class Rout, Mishra and Mallick (2018). While classification accuracy might appear satisfactory in many cases, a notable discrepancy between TPR and FPR values often arises—indicating that the models struggle to accurately classify anomalies. To address this, it is crucial to balance the dataset using built-in or customized sampling techniques prior to training the classification models. In scenarios involving anomaly detection, fraud identification, or money laundering, positive cases are typically scarce. As such, undersampling techniques can prove effective in rebalancing the dataset while prioritizing accurate identification of positive cases. However, in instances where the number of positive cases or anomalies in the minority class is exceedingly low, ML classifiers may be trained on a limited dataset generated by the undersampling technique. On the other hand, over-sampling methods aim to increase the instance count of the minority class to match that of the majority class. Despite generating artificial data based on a combination of majority and minority samples, these methods can be effective. In our study, we introduce an under-sampling algorithm named XGBCLUS, and we also investigate established under-sampling techniques such as Random Under Sampling (RUS) and Near-Miss. Furthermore, we explore popular over-sampling techniques including SMOTE, ADASYN, as well as combined approaches like SMOTEENN and SMOTETOMEK. We present a comparison between over-sampling and under-sampling methods in Section 4.
3.2.1. Under-Sampling Techniques
We have investigated two under-sampling methods, namely Random Under Sampling and Near-Miss, along with our proposed under-sampling method, which is described in Section 3.2.2. Random Under Sampling (RUS) is a simple technique used to handle class imbalance in datasets. It randomly selects a subset of instances from the majority class. The size of this subset is determined based on the desired balance ratio between the minority and majority classes. The balance ratio $\alpha_{us}$ is defined by Equation 1.
$$\alpha_{us} = \frac{N_m}{N_{rm}}$$ \hspace{1cm} (1)
where $N_m$ is the number of samples in the minority class and $N_{rm}$ is the number of samples in the majority class after resampling. The instances from the minority class and the randomly selected subset of majority-class instances are then combined to form a balanced dataset. Another under-sampling technique, Near-Miss, is also used for balancing the dataset. It reduces the imbalance by retaining a subset of instances from the majority class that are close to instances from the minority class. This down-sampling technique selects instances based on their proximity to the minority class, making it possible to preserve important samples. For each instance in the minority class, Near-Miss calculates its distances to all instances in the majority class using a distance metric such as the Euclidean or Manhattan distance. The algorithm then identifies the $k$ instances from the majority class that are closest to each instance in the minority class. The value of $k$ is typically set as a hyperparameter and determines the degree of under-sampling. We set the value to 1 and hence call it Near-Miss-1. Finally, the instances from the minority class and the selected instances from the majority class are combined to form an under-sampled dataset.
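The following is a minimal sketch of how these two baseline under-samplers could be applied with imbalanced-learn; the synthetic data stands in for the Bitcoin training split, and the variable names are illustrative.

```python
from collections import Counter

from imblearn.under_sampling import NearMiss, RandomUnderSampler
from sklearn.datasets import make_classification

# Stand-in imbalanced data; the real input would be the Bitcoin training split.
X_train, y_train = make_classification(n_samples=5000, weights=[0.98, 0.02],
                                        random_state=42)

# Random Under Sampling: keep a random majority subset so that alpha_us = 1.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Near-Miss-1: keep majority samples closest to the minority class (k = 1).
nm1 = NearMiss(version=1, n_neighbors=1)
X_nm1, y_nm1 = nm1.fit_resample(X_train, y_train)

print(Counter(y_train), Counter(y_rus), Counter(y_nm1))
```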
3.2.2. XGBCLUS Algorithm
XGBCLUS (eXtreme Gradient Boosting-based Clustering) operates by merging clusters (obtained by selecting random instances from the majority class in the training data, equal in number to the positive instances from the minority class) with the eXtreme Gradient Boosting algorithm. The algorithm starts by splitting the whole dataset into training and independent test sets.
Algorithm 1 XGBCLUS algorithm
- Input: Imbalanced training samples $DATA$; number of iterations $k$; number of positive samples $P$; the independent test data; and the XGBoost algorithm.
- Output: Selected under-sampled data
- Initialize $T_{MAX}$ and $F_{MIN}$
- Initialize an empty set $Selected\_Samples$
- for $i = 1$ to $k$ do
- Select $n$ negative samples arbitrarily, equal in number to $P$, and prepare the train data
- Train the model and predict using the test samples
- Calculate True Positive (TP) and False Positive (FP) values
- if $TP > T_{MAX}$ and $FP < F_{MIN}$ then
- Set $T_{MAX} = TP$ and $F_{MIN} = FP$
- Update current $n$ samples in $Selected\_Samples$
- end if
- end for
- if $Selected\_Samples$ is empty then
- Go to step 3 and repeat steps 3–13 after changing the $T_{MAX}$ and $F_{MIN}$ values
- else
- Return $Selected\_Samples$
- end if
In each iteration, the algorithm arbitrarily selects $n$ negative samples equal in number to $P$, and a new training set is prepared to train the XGBoost model. Using the independent test set, the model's True Positive (TP) and False Positive (FP) values are computed. The TP and FP values are compared with $T_{MAX}$ and $F_{MIN}$, respectively. $T_{MAX}$ and $F_{MIN}$ are initialized with arbitrary values, where $T_{MAX}$ represents the maximum true positive value and $F_{MIN}$ the minimum false positive value. If the TP value is greater than $T_{MAX}$ and the FP value is less than $F_{MIN}$, then $T_{MAX}$ and $F_{MIN}$ are updated with the TP and FP values, respectively. At the same time, the current $n$ samples are stored in the $Selected\_Samples$ set. Otherwise, no changes are made in the current iteration.
After the $k$ iterations are finished, the $Selected\_Samples$ set is checked. If the set is empty, the algorithm is run again with new $T_{MAX}$ and $F_{MIN}$ values. Otherwise, the samples in the $Selected\_Samples$ set are the under-sampled data returned by the algorithm. The XGBCLUS algorithm is shown in Algorithm 1.
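A minimal Python sketch of Algorithm 1 is given below; it assumes NumPy arrays for the features and labels and an off-the-shelf XGBoost classifier with illustrative hyperparameters, so it shows the selection loop rather than the exact implementation used in this study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier


def xgbclus(X_train, y_train, X_test, y_test, k=30, random_state=42):
    """Sketch of Algorithm 1 (XGBCLUS): repeatedly draw a random majority
    subset the size of the minority class, train XGBoost, and keep the subset
    that yields the best TP/FP trade-off on the independent test set."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.where(y_train == 1)[0]          # minority (anomalous) samples
    neg_idx = np.where(y_train == 0)[0]          # majority (normal) samples
    t_max, f_min = 0, np.inf                     # initial TP / FP thresholds
    selected = None
    for _ in range(k):
        sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        idx = np.concatenate([pos_idx, sampled_neg])
        model = XGBClassifier(n_estimators=100, random_state=random_state)
        model.fit(X_train[idx], y_train[idx])
        tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
        if tp > t_max and fp < f_min:            # better subset found
            t_max, f_min = tp, fp
            selected = idx
    if selected is None:                         # no subset qualified: retry with a fresh seed
        return xgbclus(X_train, y_train, X_test, y_test, k, random_state + 1)
    return X_train[selected], y_train[selected]
```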
3.2.3. Over-Sampling Techniques
We have also investigated two popular over-sampling and two combined sampling strategies for handling the class imbalance in Bitcoin transaction data. Among them, the Synthetic Minority Over-sampling Technique (SMOTE) is a widely used technique to handle the class imbalance problem, especially for detecting anomalies in Bitcoin transactions. It aims to balance the class distribution by generating synthetic examples of the minority class, thereby mitigating the bias and improving model generalization. For each minority sample $X_i$, it finds the $k$ nearest neighbors from the same class and randomly selects one of them, $X_{z_i}$. A new sample $X_n$ is then generated using Equation 2.
$$X_n = X_i + \lambda \times (X_{z_i} - X_i)$$ \hspace{1cm} (2)
where $\lambda$ is a random number between 0 and 1. Thus, a new synthetic instance is generated in the feature space. This process is repeated until the numbers of samples in the majority and minority classes are the same. Finally, the original minority instances are combined with the newly generated synthetic instances to form a balanced dataset.
ADASYN (Adaptive Synthetic Sampling) uses the same formula to generate new samples but differs in how $X_i$ is selected. It increases the density of synthetic instances in regions that are harder to classify, providing a more refined way of handling class imbalance. For each minority instance, it first calculates how many of its $k$ nearest neighbors belong to the majority class, to gauge how close the minority sample is to the majority class. It then computes the imbalance ratio $\alpha_{os}$ for each minority instance using Equation 3.
$$\alpha_{os} = \frac{N_{rm}}{N_m}$$ \hspace{1cm} (3)
where $N_{rm}$ is the number of samples in the minority class after resampling and $N_m$ is the number of samples in the majority class. This imbalance ratio is used to calculate the desired number of synthetic instances to be generated for the current minority instance. A difficulty ratio is also calculated to determine how hard each instance is to classify; if the ratio is high, more neighbors are considered when generating synthetic instances. By interpolating feature values between the minority instance and its selected neighbors, a new synthetic instance is generated. This process is repeated for all minority instances until the numbers of samples in the majority and minority classes are the same. Finally, the original minority instances are combined with the newly generated synthetic instances to form a balanced dataset.
SMOTEENN is a combination of two resampling techniques, SMOTE (Synthetic Minority Over-sampling Technique) and Edited Nearest Neighbors (ENN). It tackles class imbalance by creating synthetic samples using SMOTE and subsequently refining the dataset using ENN to eliminate potentially noisy or misclassified instances. SMOTEENN aims to provide a more refined approach to balancing imbalanced datasets while also improving the quality of the final dataset by eliminating potential noise. After generating the synthetic instances by SMOTE, for each instance in the dataset, ENN first finds the k nearest neighbors. If the instance’s class is different from the majority class of its neighbors, it removes the instance from the dataset. Thus, ENN eliminates noisy or misclassified instances. Another combined resampling technique is SMOTETOMEK where synthetic instances are generated by SMOTE and the under-sampling technique TOMEK Link identifies pairs of instances from different classes that are closest to each other. For each pair of instances identified as Tomek links, it removes the instance from the majority class. This undersampling method helps in removing instances that are close to the decision boundary and might be misclassified. This can lead to a more balanced, discriminative, and effective dataset for training machine learning models.
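For reference, a minimal sketch of how these over-sampling and combined samplers could be applied with imbalanced-learn follows; the synthetic data and default hyperparameters are illustrative stand-ins for the Bitcoin training split and the settings used in the experiments.

```python
from collections import Counter

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification

# Stand-in imbalanced data; the real input would be the Bitcoin training split.
X_train, y_train = make_classification(n_samples=5000, weights=[0.98, 0.02],
                                        random_state=42)

samplers = {
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "SMOTEENN": SMOTEENN(random_state=42),       # SMOTE followed by ENN cleaning
    "SMOTETOMEK": SMOTETomek(random_state=42),   # SMOTE followed by Tomek-link removal
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, Counter(y_res))                  # class counts after resampling
```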
3.3. Proposed Ensemble Model
The meta-classification ensemble method based on stacked generalization is a Machine Learning (ML) approach used to improve the accuracy of predictions by combining multiple models Rajagopal, Kundapur and Hareesha (2020). The stacking-based ensemble model consists of two levels of classifiers: the base classifiers and the meta-classifier. It starts by training a set of base models using several classifiers. A new dataset is formed from the predictions of the base-level classifiers, and the meta-classifier, also known as a combiner or blender, is then trained on this new dataset. After that, the learned meta-classifier is used to predict on the independent test dataset. The architecture of the proposed stacked-ensemble model is shown in Figure 4.
In our proposed stacking-based ensemble model, Random Forest (RF), Decision Tree (DT), Gradient Boosting (GB), and Adaptive Boosting (AdB) are used as the base models. The training dataset is used to train the base models, and the outputs of the four base models, along with the validation fold, are combined to create a new dataset. Logistic Regression (LR) King (2008), the meta-classifier in our proposed model, then receives this newly formed dataset of base-classifier predictions as input and learns from it. Finally, the test dataset is used to predict anomalous and non-anomalous transactions.
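A minimal sketch of this stacked architecture using scikit-learn's StackingClassifier is shown below; the hyperparameters and variable names are illustrative assumptions rather than the tuned configuration from the experiments.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Four tree-based base learners and an LR meta-classifier, as described above.
base_learners = [
    ("rf", RandomForestClassifier(random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("adb", AdaBoostClassifier(random_state=42)),
]
stacked_model = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10)  # out-of-fold base predictions feed the meta-classifier
# stacked_model.fit(X_balanced, y_balanced)
# y_pred = stacked_model.predict(X_test)
```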
On the other side, the Voting Classifier is constructed using only tree-based models, which are a family of machine learning algorithms known for their robustness and interpretability. The architecture of a Voting Classifier that incorporates tree-based models like Decision Trees (DT), XGBoost (XGB), Gradient Boosting (GB), Random Forest (RF), and AdaBoost (ADB) is shown in Figure 5. The Voting Classifier takes the training set as input and the ensemble is constructed by combining the predictions of several individual tree-based models. Each model in this context refers to a unique instantiation of the tree-based algorithm with a specific set of hyperparameters or configurations. After the Voting Classifier has been trained and evaluated, it is used to make predictions on test data. The Voting Classifier aggregates the predictions of each individual tree-based model using a voting mechanism. The voting can be either “hard” or “soft”. In hard voting, each model in the ensemble casts a single vote for the predicted class label, and the majority class receives the final prediction. For soft voting, the probabilities (confidence scores) of each model’s predicted classes are averaged, and the class with the highest average probability is chosen as the final prediction.
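Similarly, the tree-based voting ensemble could be sketched as follows; the estimators' hyperparameters are again illustrative, and either "hard" or "soft" voting can be selected as described above.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

voters = [
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("xgb", XGBClassifier(random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("adb", AdaBoostClassifier(random_state=42)),
]
# voting="hard" takes the majority class label; voting="soft" averages the
# predicted class probabilities of the five tree-based models.
voting_model = VotingClassifier(estimators=voters, voting="soft")
# voting_model.fit(X_balanced, y_balanced)
# y_pred = voting_model.predict(X_test)
```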
- Results Analysis
In this section, we assess the performance of the proposed ensemble models using both under-sampled and over-sampled data. Additionally, a comprehensive comparative analysis between single classifiers and ensemble classifiers is presented herein. We commence by configuring the experimental environment and subsequently present outputs that scrutinize various facets of the model’s performance concerning Bitcoin anomaly detection. Thorough performance analyses have been conducted on a dataset encompassing both normal and anomaly transactions within the realm of Bitcoin transactions. An overview of the dataset is provided in Section 3. To facilitate performance evaluation, several Python libraries such as Pandas, NumPy, and scikit-learn were employed. The Python script was executed on Colab Pro, utilizing 12.68 GB of RAM and 225.83 GB of disk space.
4.1. Evaluation Metrics
Given that accuracy alone is insufficient to gauge the performance of an anomaly detection system, it is crucial to employ additional metrics such as the True Positive Rate (TPR), which assesses the correct identification of anomalous transactions, and the False Positive Rate (FPR), which evaluates the incorrect flagging of non-anomalous transactions. This aligns with the primary objective of our study. The evaluation metrics accuracy, True Positive Rate (TPR), and False Positive Rate (FPR) have been used to compare the performance of the single models, with and without balanced data, against the proposed ensemble models. Additionally, we use the feature importance score to show the hierarchy of the features. We have also considered the Receiver Operating Characteristic (ROC) score, which compares the True Positive Rate (TPR) against the False Positive Rate (FPR). The performance metrics are defined below:
$TP$ = True Positive: an anomalous transaction is correctly identified as anomalous
$TN$ = True Negative: a non-anomalous or normal transaction is correctly identified as non-anomalous
$FP$ = False Positive: a non-anomalous transaction is incorrectly identified as anomalous
$FN$ = False Negative: an anomalous transaction is incorrectly identified as non-anomalous
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$ \hspace{1cm} (4)
$$TPR = \text{Sensitivity} = \frac{TP}{TP + FN}$$ \hspace{1cm} (5)
$$TNR = \text{Specificity} = \frac{TN}{TN + FP}$$ \hspace{1cm} (6)
$$FPR = \frac{FP}{TN + FP}$$ \hspace{1cm} (7)
AUC is calculated as the area under the curve of Sensitivity (TPR) plotted against $1 - \text{Specificity}$ (FPR).
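For reference, a minimal helper that computes these metrics from model predictions might look as follows; the function name and the use of scikit-learn utilities are illustrative assumptions.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score


def evaluate(y_true, y_pred, y_score):
    """Compute the metrics of Equations (4)-(7) plus ROC-AUC for a binary task."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),                      # sensitivity
        "TNR": tn / (tn + fp),                      # specificity
        "FPR": fp / (tn + fp),
        "ROC-AUC": roc_auc_score(y_true, y_score),  # area under the TPR vs. FPR curve
    }
```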
4.2. Effects of Under-Sampling in Classification
We have kept 20% of the data for the independent test set, and the remaining 80% has been used for training the models.
To demonstrate the data imbalance problem, the classifiers were first trained without balancing the training set. The ML classifiers become biased towards the majority samples, resulting in a high true negative value; however, they cannot identify the positive samples correctly, so the true positive rate is very low and in some cases zero. Table 3 shows the comparison of the accuracy, True Positive (TP), and ROC-AUC scores of the Decision Tree (DT), Gradient Boosting (GBoost), Random Forest (RF), and Adaptive Boosting (AdaBoost) classifiers. The TP values are zero for the DT and RF classifiers, which indicates that no anomalous transactions are correctly identified and all transactions are classified as normal transactions. Although the accuracy seems good enough for these classifiers, the TP score tends to zero, i.e., anomalous transactions are not identified because of the models' bias towards the majority of transactions. Given that detecting anomalous transactions is the primary objective of our study, classifiers may struggle to identify those transactions without balanced data. Therefore, we explored several balancing techniques to enhance the True Positive Rate (TPR) and decrease the False Positive Rate (FPR) values.
In Figure 6, the confusion matrices illustrate the performance of the ensemble classifier under different under-sampling methods, namely Random Under Sampling (RUS), NearMiss-1, and XGBCLUS. Notably, the True Positive (TP) values exhibit an increase compared to the scenario without balancing, where the TP values are consistently zero. However, despite the improvements in TP, the False Positive (FP) values vary among the under-sampling methods. Specifically, NearMiss-1 displays a relatively higher FP count compared to RUS and XGBCLUS, even though the FP is zero in the absence of balancing. The XGBCLUS under-sampling method proposed in our study outperforms the existing methods by achieving the highest True Positive (TP) value along with relatively low False Positive (FP) values. This superiority stems from our algorithm's approach of considering all instances during down-sampling, whereas existing algorithms randomly select instances, leading to the omission of important cases.
Given that the TPR or sensitivity signifies the count of correctly classified positive transactions, down-sampling techniques were explored to equalize the numbers of normal and anomalous transactions. We employed XGBCLUS, our proposed under-sampling method, in conjunction with other established techniques for downsampling. Figure 7 illustrates that the sensitivity values for all single and ensemble ML classifiers witnessed an increase, except for the NearMiss-1 under-sampling algorithm. Notably, the sensitivity values for both single and ensemble classifiers utilizing the XGBCLUS algorithm stand at 0.82, 0.86, 0.86, 0.81, 0.86, 0.81, and 0.91, respectively. These values exceed the sensitivity values obtained without balancing and those from random under-sampling techniques.
While the NearMiss-1 undersampling technique yields a higher sensitivity value compared to the XGBCLUS method, the corresponding FPR value is markedly high, as demonstrated in Table 4. As the False Positive Rate (FPR) decreases, the True Negative Rate (TNR) increases, indicating the correct identification of non-anomalous transactions. The NearMiss-1 undersampling technique exhibits a higher FPR, suggesting its limitation in accurately identifying non-anomalous transactions. In contrast, the random undersampling method produces average FPR values, although they are higher than the FPR values of XGBCLUS. In terms of TPR and FPR values, XGBCLUS outperforms the other under-sampling techniques.
Figure 8 illustrates the enhanced ROC-AUC scores obtained through the utilization of under-sampled data. Notably, the proposed XGBCLUS undersampling technique outperforms RUS and Near-Miss in terms of ROC-AUC scores. The Gradient Boosting (GBoost) classifier achieves the highest ROC-AUC score of 0.92, which stands as the peak among all single and ensemble classifiers. Furthermore, the remaining classifiers achieve ROC-AUC scores ranging from 0.85 to 0.91. This range indicates that the true positive and false positive rates adhere to their expected levels.
4.3. Effects of Over-Sampling in Classification
Although the TPR scores have increased for all ML classifiers following the under-sampling of data, the FPR values remain unsatisfactory. The objective is to elevate the True Positive (TP) rate while reducing the FP rate. To achieve this balance, various over-sampling and combined methods have been applied to balance the training data. In Figure 9, the confusion matrices depict the performance of the ensemble classifier when employing various over-sampling methods, including SMOTE, ADASYN, SMOTEENN, and SMOTETOMEK. Notably, the True Positive (TP) values exhibit a decrease compared to the results obtained with under-sampling techniques. Despite this reduction in TP values, it is crucial to observe that the False Positive (FP).