Demystifying Fraudulent Transactions and Illicit Nodes in the Bitcoin Network for Financial Forensics

05.03.2025
Demystifying Fraudulent Transactions and Illicit Nodes in the Bitcoin Network for Financial Forensics

Blockchain provides the unique and accountable channel for financial forensics by mining its open and immutable transaction data. A recent surge has been witnessed by training machine learning models with cryptocurrency transaction data for anomaly detection, such as money laundering and other fraudulent activities. This paper presents a holistic applied data science approach to fraud detection in the Bitcoin network with two original contributions. First, we contribute the Elliptic++ dataset, which extends the Elliptic transaction dataset to include over 822k Bitcoin wallet addresses (nodes), each with 56 features, and 1.27M temporal interactions. This enables both the detection of fraudulent transactions and the detection of illicit addresses (actors) in the Bitcoin network by leveraging four types of graph data: (i) the transaction-to-transaction graph, representing the money flow in the Bitcoin network, (ii) the address-to-address interaction graph, capturing the types of transaction flows between Bitcoin addresses, (iii) the address-transaction graph, representing the bi-directional money flow between addresses and transactions (BTC flow from input address to one or more transactions and BTC flow from a transaction to one or more output addresses), and (iv) the user entity graph, capturing clusters of Bitcoin addresses representing unique Bitcoin users. Second, we perform fraud detection tasks on all four graphs by using diverse machine learning algorithms. We show that adding enhanced features from the address-to-address and the address-transaction graphs not only assists in effectively detecting both illicit transactions and illicit addresses, but also assists in gaining in-depth understanding of the root cause of money laundering vulnerabilities in cryptocurrency transactions and the strategies for fraud detection and prevention. The Elliptic++ dataset is released at https://www.github.com/git-disl/EllipticPlusPlus.

CCS CONCEPTS * Security and privacy → Intrusion/anomaly detection and malware mitigation; • Applied computing → Network forensics; • Computing methodologies → Machine learning.

KEYWORDS Blockchain, Anomaly Detection, Financial Forensics

ACM Reference Format: Youssef Elmougy and Ling Liu. 2023. Demystifying Fraudulent Transactions and Illicit Nodes in the Bitcoin Network for Financial Forensics. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3580305.3599803

1 INTRODUCTION The emergence of cryptocurrency, initiated by Bitcoin [29], has kindled an adoption of a new exchange system in which financial transactions can be completed without an intermediary. This popularity has been perceived as a catalyst for the integration of cryptocurrencies into the payment systems of traditional financial institutions, enabling processing of exchanges between fiat money and cryptocurrency. To allow for immense volumes of such exchanges, financial institutions must provide rigid security against high risk transactions from fraudulent activities and assure compliance with anti-money laundering (AML) regulations. Although machine learning (ML) models have become a popular tool for anomaly detection and risk assessment of cryptocurrency transactions in recent years, we argue that fraud detection models trained for financial forensics should revolve around two goals: (i) prioritizing higher recall with a (reasonable) trade-off on precision, since the penalty of classifying an illicit transaction/account as licit far outweighs that of the reverse, and (ii) allowing explainability (and derivation of proof) on the risk of transactions/accounts.

Bitcoin [29] is the most widely used cryptocurrency. Analyzing its blockchain will provide a basis of reference. The Elliptic dataset [40] is the largest labelled Bitcoin transaction data publicly available. The dataset consists of over 203k transactions labelled illicit, licit, and unknown. It provides a good entry into fraudulent transaction analysis. However, the Elliptic dataset [40] consists of only Bitcoin transactions, without features of the addresses involved and the different interactions between pairs of addresses. Hence, the forensic analysis using the Elliptic dataset suffers from a prominent downfall: when a model predicts an illicit transaction, the addresses responsible cannot be clearly identified since a transaction may be associated with several input and output addresses.

The first contribution of this paper is the Elliptic++ dataset, which extends the Elliptic dataset to include all the Bitcoin wallet addresses (actors) and their temporal interactions associated to the transactions in the Elliptic dataset. Hence, forensics performed using the Elliptic++ dataset can identify fraudulent transactions and dishonest actors in the Bitcoin network and will also allow explainability as to why a wallet address associated with a transaction is illusive and dishonest. The construction of the Elliptic++ dataset contributes a novel way in collecting and visualizing Blockchain data, using wallet addresses as the center of a risk detection model. We collect Blockchain data associated with the Elliptic dataset (transactions). We first augment each transaction in the transactions dataset with…

additional features, then we crawl the Blockchain to create the Elliptic++ dataset, consisting of the feature-enhanced transactions dataset and the actors (wallet addresses) dataset. The actors dataset includes over 822k labelled wallet addresses, each described with 56 features, and over 1.27M temporal occurrences (interactions) across the same time steps (as those recorded in the Elliptic dataset). With our Elliptic++ dataset, one can perform anomaly detection of fraudulent activities, such as illicit transactions and fraudulent actors. We also include four types of graphs: the Money Flow Transaction Graph, the Actor Interaction Graph, the Address-Transaction Graph, and the User Entity Graph. These graphs contribute to both mining and visualization of the connections of transactions and the interactions among wallet addresses through their transactions.

The second contribution of the paper is performing fraud detection using the Elliptic++ dataset and the four graph representations by combining diverse ML algorithms and feature optimizations. We observe that Random Forest (RF) with feature refinement offers the best-performing model: on the transactions dataset, it achieves 98.6% precision and 72.7% recall compared to 97.5% precision and 71.9% recall when using RF without feature refinement; on the actors dataset, RF with feature refinement achieves 92.1% precision and 80.2% recall, compared to 91.1% precision and 78.9% recall when using RF without feature refinement. Furthermore, the fraud detection using the Elliptic++ dataset allows for in-depth understanding of the root cause of fraudulent activities in cryptocurrency transactions through semantic and statistical explainability, shining light on the strategies for fraud detection and prevention.

2 RELATED WORK

Since the introduction of the Elliptic dataset in 2019 by Weber et al. [40], there have been numerous efforts on data labelling and anomaly detection using ML models. Lorenz et al. [25] assumed minimal access to labels, and proposes an active learning solution by leveraging a smaller subset of the available labels. Oliveira et al. [30] proposed GuiltyWalker, a detection method that computes new features based on the structure of a transaction-to-transation graph and the distance to known illicit transactions. Alarab et al. [3] presented a comparative analysis of the Elliptic dataset using different supervised learning methods. Loa et al. [24] proposed Inspection-L, a graph neural network framework for anomaly detection, and similarly, Alarab et al. [2] proposed a graph-based LSTM with a graph convolutional network to detect illicit transactions.

More generally, there is increasing interest in AML in the context of financial transactions and cryptocurrency networks, such as Bitcoin or Ethereum. Combinatorial optimization methods for identifying transactions from graphs, including statistical deviation and dense subgraph discovery methods, have been explored in [11, 38]. The lack of labelled data and the imbalanced data are two major challenges for fraud detection when using supervised learning [6, 17, 28] or unsupervised learning [14, 27, 28, 31, 32]. Some efforts have also focused on de-anonymizing mixed transactions by Bitcoin-dedicated graph-based approaches [7, 8, 19, 41]. Regarding Bitcoin account profiling, Michalski et al. [33] built a small dataset of 9,000 addresses and applied supervised learning to characterize nodes in the Blockchain as miner or exchange, indicating the need of countermeasures for preserving the desired level of anonymity.

To the best of our knowledge, our Elliptic++ dataset is the largest public Bitcoin account dataset with 822,942 addresses.

Existing works on Ethereum account profiling and phishing account detection exclusively focus on graph representation learning methods [36, 37]. For Ethereum account profiling, [35] and [46] applied either Graph Convolution Network (GCN) [18] or a hierarchical graph attention encoder (HGATE) to infer the identity of Ethereum accounts based on learning latent features using node-embedding and subgraph-embeddings. TTAGNN [21] generate node embedding by combining LSTM encoder and Graph Attention Network (GAT) [39]. For detecting phishing accounts, Trans2Vec [42] and its variants [22, 23] utilize a graph based random walk algorithm on transaction time and amount information. The cascade method [13] leverages statistical feature extraction and a lightGBM-based ensemble algorithm. It is worth to note that given the difference in fundamental settings (UTXO and account models), it is not straightforward to apply the existing graph representation learning models developed for profiling Ethereum accounts to profiling Bitcoin accounts on the Bitcoin network.

Compared to the literature, this paper presents novel contributions from two perspectives. First, it describes and makes publicly available a labelled dataset with Bitcoin blockchain transactions and wallet addresses. The main benefit of this dataset is that it allows more accurate fraud detection models and algorithms to be developed. Second, it uses the dataset to showcase the detection of illicit blockchain transactions, illicit actors, and the risks of de-anonymizing users based on address clustering. For this, it utilizes a number of baseline machine learning models.

3 THE ELLIPTIC++ DATASET

The Elliptic++ dataset consists of 203k Bitcoin transactions and 822k wallet addresses. It leverages elements from the Elliptic dataset [40], a published dataset deanonymizing 99.5% of Elliptic transaction data2, and the Bitcoin addresses dataset obtained by using our Bitcoin Blockchain3 scraping pipeline. A detailed description of the data collection pipeline is included in the Appendix.

3.1 Transactions Dataset

The transactions dataset consists of a time-series graph with 49 distinct time steps, 203,769 transactions (nodes), and 234,355 directed edges representing the payment flows. Each transaction node is labelled as illicit, licit, or unknown; with 2% (4,545) labelled as class-1 (illicit), 21% (42,019) as class-2 (licit), and the remaining transactions are unknown with regard to licit/illicit, hence labelled as class-3. Three csv files are used, as shown in Table 1. Each transaction node has an entry in txs_features.csv, with numerical data for 183 transaction features, and an entry in txs_classes.csv, representing its class label (1: illicit, 2: licit, 3: unknown). Each edge has an entry in txs_edgelist.csv (indexed by two transaction IDs), representing money flow from one transaction to another. Among the 183 node features, 166 features are inherited from the Elliptic dataset, i.e., the time step, 93 local features, representing local information about the transaction, and 72 aggregate features, obtained by


  1. www.kaggle.com/datasets/elliptico/elliptic-data-set
  2. www.kaggle.com/datasets/alexbenzik/deanonymized-995-pct-of-elliptic-transactions
  3. www.blockchain.com

Table 1: Data structure for an example transaction node in the transactions dataset: (1) features, row of all 183 feature values for txId, (2) edgelist, all incoming and outgoing edges involving txId, and (3) classes, the class label for txId.

txs_features.csvtxs_edgelist.csvtxs_classes.csv
txIdTime stepLF_1
27214556024-0.155493
txId1txId2
272145560296926618
272145560272145556
299475624272145560

Table 2: Transactions dataset augmented features.

FeatureDescription
BTC_inTotal BTC incoming
BTC_outTotal BTC outgoing
each has 5 values: total, min, max, mean, median
TXs_inNumber of incoming transactions
TXs_outNumber of outgoing transactions
Addr_inNumber of input addresses
Addr_outNumber of output addresses
BTC_totalTotal BTC transacted
FeesTotal fees in BTC
SizeTotal transaction size
single value

Figure 1: Number of transactions by time step.

aggregating transaction information one-hop forward/backward. The remaining 17 node features are gathered by our Elliptic++ data collection pipeline (with the exception of the 0.5% of transactions that were not deanonymized) as augmented features and are shown in Table 2. Figure 1 shows the distribution of transactions across the three classes in each of the 49 time steps.

3.2 Actors (Wallet Addresses) Dataset

The actors (wallet addresses) dataset is a graph network of 822,942 wallet addresses, each with 56 features as shown in Table 3. Five csv files are used, as shown in Table 4. Each address has an entry in wallets_features.csv, with numerical data for the time

Table 3: Actors (wallet addresses) dataset features.

FeatureDescription
Transaction related:
BTC_transactedTotal BTC transacted (sent+received)
BTC_sentTotal BTC sent
BTC_receivedTotal BTC received
FeesTotal fees in BTC
Fees_shareTotal fees as share of BTC transacted
Time related:
Blocks_txsNumber of blocks between transactions
Blocks_inputNumber of blocks between being an input address
Blocks_outputNumber of blocks between being an output address
Addr_interactionsNumber of interactions among addresses
each has 5 values: total, min, max, mean, median
Classclass label: {illicit, licit, unknown}
Transaction related:
TXs_totalTotal number of blockchain transactions
TXs_inputTotal number of dataset transactions as input address
TXs_outputTotal number of dataset transactions as output address
Time related:
TimestepsNumber of time steps transacting in
LifetimeLifetime in blocks
Block_firstBlock height first transacted in
Block_lastBlock height last transacted in
Block_first_sentBlock height first sent in
Block_first_receiveBlock height first received in
Repeat_interactionsNumber of addresses transacted with multiple times
single value

Table 4: Data structure for an example address in the actors dataset: (1) features, row of 56 feature values for address, (2) classes, the class label for address, (3) AddrAddr, all edges involving address in the actor interaction graph, (4) AddrTx, all edges as involving address as the input address in the address-transaction graph, and similarly (5) TxAddr, as the output address.

addresstime steptxs_input···lifetime_blocks···Addr_interactions_median
39sfuA8pY4UfybgEZi7uvA13jkGzZpsg5K23420···18145···1

Figure 2: Number of wallet addresses by time step.

input_addressoutput_address
39sfuA8pY4UfybgEZi7uvA13jkGzZpsg5K1ML…kTL
input_addresstxId
39sfuA8pY4UfybgEZi7uvA13jkGzZpsg5K272145560

3.3 Graph Visualization

In this section, we provide a detailed explanation with visualization of the Elliptic++ dataset using four types of graphs: Money Flow Transaction Graph, Actor Interaction Graph, Address-Transaction Graph, and User Entity Graph. Each graph has its own unique importance.

  • The Money Flow Transaction Graph shows BTC flow from one transaction to the next, allowing exploration of the spatial and temporal patterns surrounding a given transaction.
  • The Actor Interaction Graph shows the pairwise interactions among input and output addresses of transactions, showing the density of the k-hop neighborhoods of wallet addresses.
  • The Address-Transaction Graph is a heterogenous graph showing flow of BTC across transactions and addresses, allowing an evaluation of purposes of a transaction and the relationships among addresses of the same transaction.
  • The User Entity Graph is an address cluster graph that allows for potential linking of addresses controlled by a specific user, which further de-anonymizes their identity and purpose.

3.3.1 Money Flow Transaction Graph.

This is a directed graph where nodes represent transactions and edges represent directed BTC flows from one transaction to the next. Figure 3 visualizes the distribution of all transactions in time step 32. We choose three transactions (one per class) representing the unknown (yellow), illicit (blue), and licit (red) classes, and display sub-graphs of their four-hop neighbourhoods as shown in the left, middle, and right of Figure 3. All neighbour nodes are shown, though for visual clarity, some edges are abbreviated with dotted arrows signifying that a particular node has two (or more) outgoing edges of the same pattern as the graph is traversed further. This displays the potential utility of exploring the spatial and temporal patterns surrounding a given transaction.

It is important to note that some time steps produce sparse transaction graphs, e.g. time step 14, while others produce dense transaction graphs, e.g. time step 32. We provide a comparison in the Appendix (see Figure 14). This difference in graph density drastically affects the node neighborhood patterns, with some time steps creating shallow, wide money flow transaction graphs, while…

Figure 3: Money flow transaction graphs for selected unknown (left), illicit (middle), and licit (right) transactions in time step 32 (top shows distribution of TS 32 transactions).

other time steps creating deep, narrow money flow transaction graphs. Moreover, although there are only 2% illicit transactions within the dataset, they are evidently spread out across each time step and hence the ML model trained for anomaly detection over the earlier sequence of time steps can be leveraged to perform forensics on the later part of the sequence of time steps.

3.3.2 Actor Interaction Graph. This is a directed graph where nodes represent addresses and edges represent pairwise interactions among input and output addresses of transactions. Figure 4 visualizes the distribution of all addresses in time step 32. Unlike Figure 3 where transaction interactions are captured in terms of money flows, Figure 4 displays the interactions at the address level as opposed to transaction level. The grey area is due to the density of edges. Here, the density of the address graphs at a given time step impact the density of the ( k )-hop neighborhood of a wallet address node. We provide a comparison of actor interaction graphs at time steps 14 and 32 in the Appendix (see Figure 15).

Figure 4: Distribution of wallet addresses in time step 32.

3.3.3 Address-Transaction Graph. This is a heterogeneous directed graph supported by our Elliptic++ dataset. It has two types of nodes, the transaction node (circle) and the address node (square), and two types of edges, the sender-to-transaction edges, each of which represents the BTC flow from an input address (sender) to a transaction, and the transaction-to-receiver edges, each of which represents the BTC flow from a transaction to an output address. Figure 5 shows an address-transaction graph of selected transactions and addresses anchored at a given actor (i.e., the Illicit Actor 13) in time step 32. The flow and quantity of input and output addresses provide information regarding the purposes of a transaction and the relationships among addresses connected by the same transaction. We provide a visual comparison of the distributions of all transactions and addresses at both time steps in the Appendix (see Figure 16).

3.3.4 User Entity Graph. This graph is generated by address clustering analysis rather than direct construction using the Elliptic++ dataset. Clustering addresses involves taking the set of all addresses ( A ) as training data to build a clustering model which creates a set of disjoint subsets ( U = { U_1, U_2, …, U_n } ) (where ( U_1, U_2, …, U_n ) represents ( n ) clusters of addresses in ( A )) such that ( \bigcup_{i=1}^{n} C_i = A ). By treating each cluster (sub-graph) as one user, we construct a user-entity graph of ( n ) nodes, where each node represents a user in the Bitcoin network and the edges represent interactions between unique users.

Figure 5: An address-transaction graph for selected nodes in time step 32 featuring Illicit Actor 13 at the root.

Figure 6: A user entity graph created by grouping transactions with ( \geq 1 ) common element in each set into unique users and adding edges across users if a path existed between groups of transactions in the original graph.

Table 5: The distribution of the number of illicit actors (that appear in 1, 2 − 4, and ≥ 5 time steps) by time step.

| Time step | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | |———–|—|—|—|—|—|—|—|—|—|—-|—-|—-|—-|—-|—-|—-| | illicit actor appears in 1 time step | 68 | 69 | 42 | 186 | 36 | 8 | 233 | 116 | 384 | 26 | 185 | 47 | 342 | 74 | 193 | 174 | 118 | | illicit actor appears in 2 − 4 time steps | 5 | 1 | 3 | 15 | 2 | 1 | 28 | 7 | 21 | 11 | 7 | 2 | 4 | 0 | 6 | 8 | 8 | | illicit actor appears in ≥ 5 time steps | 0 | 0 | 0 | 0 | 2 | 2 | 5 | 4 | 3 | 3 | 2 | 1 | 2 | 0 | 2 | 1 | 6 |

Time step1819202122232425262728293031323334
illicit actor appears in 1 time step7712240561269237653210967044397372130131759270447
illicit actor appears in 2 − 4 time steps4520161817409211752112423711
illicit actor appears in ≥ 5 time steps245453121921946852

Table 7: The distribution of the number of transactions and addresses by time step for training and testing sets.

4 FRAUD DETECTION METHODOLOGY

4.1 Dataset Preprocessing

We used an effective 70/30 train-test split with respect to time steps for both the transactions and actors datasets, with time steps 1 to 34 for training and time steps 35 to 49 for testing. Figure 7 graphically shows the distribution of data points (top for transactions, bottom for actors) of all three classes by time step for both the training and testing sets. We provide a detailed distribution of the number of transactions and addresses for all three classes in each of the 49 steps in the Appendix (Tables 12 and 13 respectively). Due to the underlying class imbalance across illicit and licit classes, normalization and standardization transformations are applied. The augmented features in the transactions dataset and all features in the actors dataset are transformed by scaling each feature using the MinMaxScaler to the range (0, 1), reducing imbalance and assisting with model convergence [20].

4.2 Machine Learning Models

Previous studies [30, 40] indicate that ensemble methods perform better than graph neural networks, hence the ML models used for evaluation included Random Forest (RF) [10], Multilayer Perceptrons (MLP) [9], Long Short-Term Memory (LSTM) [15], and Extreme Gradient Boosting (XGB) [12]. We also include Logistic Regression (LR) [16] as the baseline. LR is a single layer neural network which estimates the probability of an event, such as licit or illicit, based on independent variables. RF is an ensemble of classification trees trained on samples with random subsets of features for each decision, evaluating a final decision from averaging all decision trees. MLP is an artificial neural network with at least three layers where data features are fed into input neurons that assign probability vectors for classes as outputs. The Scikit-learn python library was used for LR (default parameters with 1000 max iterations), RF (default parameters with 50 estimators), and MLP (parameters: 1 hidden layer with 50 neurons, 500 epochs, Adam optimizer, 0.001 learning rate). LSTM is a recurrent neural network that has feedback connections and is capable of learning long-term dependencies (sequences of data as compared to single data points). The TensorFlow python library was used for LSTM (hyper-parameters: sigmoid activation, Adam optimizer, 30 epochs, binary cross-entropy loss function, 15 embedding output dims).

XGB is a supervised learning algorithm which predicts variables by combining estimates of prior models. The XGBoost python library was used for XGB (default parameters with “multi:softmax” objective and 2 classes).

4.3 Fraud Detection Evaluation Metrics

The metrics used to verify the models were Precision (ratio of correct classifications), Recall (proportion of actual positive labels correctly classified), F1 Score (harmonic mean of precision and recall), and Micro-Avg F1 (Micro-F1) Score (ratio of correct classifications to total classifications). In some cases to distinguish between classifiers with close performance, the Matthews Correlation Coefficient (MCC) is used due to its suitability for unbalanced datasets and its prior use on financial fraud and cryptocurrency-related studies [1, 5, 44, 47]. We provide the formal definition of these metrics in Section A.2 of the Appendix. In addition, to gain deeper understanding of the different ML models and their classification performance for illicit transactions and illicit addresses, we conduct the following three case studies: (i) EASY cases: all models classify an illicit transaction correctly; (ii) HARD cases: all models classify an illicit transaction incorrectly; and (iii) AVERAGE cases: some models failed to classify an illicit transaction but ≥ 1 models classified correctly.

5 RESULTS AND ANALYSIS

5.1 Statistical Analysis of the Dataset

The building blocks of the Elliptic++ dataset are the transaction features and address features, which directly impact the quality of information gained by the models and quality of classification explainability into the root cause of fraudulent activities. Figure 8 shows three chosen features for the transactions (left) and actors (right) datasets, with feature values in y-axis and time steps in x-axis. We conjecture that important dataset features will show a reflective trend that clearly distinguishes between the illicit and licit classes. Such features will provide a level of interpretation into the risk assessment of a transaction or an actor. In Figure 8, the green curves of the features in the top two rows show the trend of licit transactions (left) and licit actors (right), which are distinctly different from the red curves of illicit transactions (left) and dishonest actors (right). Conversely, some features may not contribute to the detection of illicit transactions and actors, such as the two features in the bottom row. For example, the curves for the illicit and licit transactions in Local Feature 53 can be clearly differentiated through visual analysis, while Local Feature 15 does not display any visual clue on illicit v.s. licit transactions over all time steps. Similar observations are found in the actors dataset.

We can further expand the statistical analysis on the behavioral trends of actors by their features captured in our Elliptic++ dataset, e.g., life span, number of transactions involved, and distribution of actors, which provide additional level of explainability to both the dataset and the classification/detection model performance for illicit/licit/unknown actors. Figure 9 shows the timeline of 15 illicit actors that transact in ≥ 5 time steps. For instance, Illicit Actor 1 transacts in 15 time steps, while Illicit Actor 15 transacts in only 6 time steps. A similar figure for illicit actors existing in only 1 time step (14,007 illicit actors) is included in the Appendix (see Figure 19). Moreover, the number of involved transactions within each time step varies across illicit actors as shown in Figure 10 for the five selected actors. Table 5 (shown on the previous page) provides the distribution of illicit actors across time steps categorized into 3 sets.

Regarding the Bitcoin users, clustering addresses using the previously discussed four steps in Section 3.3.4 created 146,783 user entities. Table 6 shows some relevant statistics. Each user controlled a varying number of addresses, with 98.72% of users controlling ≤ 10 addresses, while only 0.02% of users controlled ≥ 1K addresses.

Table 6: Statistics for Bitcoin users in the Elliptic++ Dataset.

# Users146,783
# Addresses per User: Min1
# Addresses per User: Median1
# Addresses per User: Mean2.73
# Addresses per User: Max14,885
% Users w/ 1 − 10 Addresses98.72%
% Users w/ 11 − 1K Addresses1.26%
% Users w/ 1K − max Addresses0.02%

5.2 Model Evaluation and Analysis

Table 7 and Table 8 show the results of all models trained on the transactions dataset and the actors dataset respectively. From Table 7, the results for the transactions dataset in Elliptic++ (labelled TX) show an increase in performance in most of the metrics for all of the models, compared to the Elliptic dataset (labelled EC). This can be attributed to our addition of 17 augmented features to each transaction, which in turn improves the generalization performance and the explainability of both the transaction dataset and the fraudulent transaction detection models. It is observed that RFTX is the best-performing model (followed by XGBTX and MLP^TX) with a precision of 97.5% and recall of 71.9%. Although RF^TX and RF^EC have comparable precision and recall performance, RF^TX is better as the MCC values are 0.83 vs 0.81. Table 7 also shows the results of 2− and 3−classifier ensembles by selecting the top 3 models (RF, XGB, MLP). The best performing 2− and 3−classifier ensembles are those using Elliptic++. In comparison, the LR, MLP^EC, and LSTM models are ineffective with < 50% precision/recall (highlighted in red). Table 8 shows the results for the actors dataset. It is observed that RF^AR is the best-performing model (followed by XGB^AR and MLP^AR) with a precision of 91.1% and recall of 78.9%. Also, the best 2− and 3−classifier ensembles (RF+XGB^AR and RF+MLP+XGB^AR) show increases in precision (95.9%, 93.3% vs 91.1%), but decreases in recall (53.0%, 57.2% vs 78.9%), indicating the member models of 2− and 3−classifier ensembles are not complimentary and have high negative correlation [43]. The LR^AR and LSTM^AR models are ineffective with extremely low recall (highlighted in red).

5.3 EASY, HARD, and AVERAGE cases Analysis

To further understand the performance of RF models against other models, and the performance results of their best performing 2− and 3−classifier ensembles, we analyze their performance in terms of the EASY, HARD, and AVERAGE cases as defined in Section 4.3. Table 9 (shown on the next page) provides a temporal split of the classification results across the test time steps of 35 to 49 for the EASY case, HARD case, and AVERAGE case respectively. It is important to note that AVERAGE cases present the opportunities for further optimizations. For the AVERAGE cases (correct classification by 1 ≤ x ≤ 4 models), all combinations are shown for each individual model, for the 2/3 models scenario we show top 2 combinations, and for the 4 models scenario we show top 1 combination. First, the case where the 4 models RF, MLP, XGB, and LR correctly classify the transaction accounts for 71% of the AVERAGE cases. Second, the cases where RF classifies incorrectly only make up for < 1% of the AVERAGE cases. This motivates us to focus on optimization of the RF model with feature refinement.

Table 7: Illicit transactions results using individual/ensemble of classifiers. EC refers to classification on Elliptic dataset [40], TX is on our Elliptic++ transactions dataset.

ModelPrecisionRecallF1 ScoreMicro-F1
LR^EC0.3260.7070.4460.886
LR^TX0.3280.7070.4480.884
RF^EC0.9400.7240.8180.979
RF^TX0.9750.7190.8280.980
MLP^EC0.4760.6730.5580.931
MLP^TX0.6110.6130.6120.949
LSTM^EC0.6650.3500.4590.946
LSTM^TX0.7090.2230.3390.942
XGB^EC0.8120.7170.7610.971
XGB^TX0.7930.7180.7540.969

2 classifiers ensemble, selecting top 3 classifiers

ModelPrecisionRecallF1 ScoreMicro-F1
RF+MLP^EC0.9870.6240.7650.975
RF+MLP^TX0.9890.6350.7730.975
RF+XGB^EC0.9600.7040.8120.979
RF+XGB^TX0.9770.7060.8200.979
MLP+XGB^EC0.4570.7370.5640.926
MLP+XGB^TX0.9740.5960.7390.972

3 classifiers ensemble, selecting top 3 classifiers

ModelPrecisionRecallF1 ScoreMicro-F1
RF+MLP^AR0.9470.7190.8170.979
RF+MLP^XGB0.9620.7230.8260.980
RF+MLP+XGB^AR0.9770.7060.8200.979
RF+MLP+XGB^TX0.9740.5960.7390.972

Table 8: Illicit actors results using individual/ensemble of classifiers. AR is classification on our Elliptic++ actors dataset.

ModelPrecisionRecallF1 ScoreMicro-F1
LR^AR0.4770.0460.0830.964
RF^AR0.9110.7890.8450.990
MLP^AR0.7080.5020.5870.974
LSTM^AR0.9220.0330.0640.965
XGB^AR0.8690.5340.6620.980

2 classifiers ensemble, selecting top 3 classifiers

ModelPrecisionRecallF1 ScoreMicro-F1
RF+MLP^AR0.9670.4030.5680.978
RF+XGB^AR0.9590.5300.6820.982
MLP+XGB^AR0.9290.3240.4810.975

3 classifiers ensemble, selecting top 3 classifiers

ModelPrecisionRecallF1 ScoreMicro-F1
RF+MLP+XGB^AR0.9330.5720.7090.983

Table 9: Classification results on the testing dataset showing distributions of EASY, HARD, AVERAGE cases among time steps 35 to 49. Total of each case is 49, 243, and 791 respectively. For the AVERAGE case, distributions are shown for cases where only 1 model (all shown), only 2 models (top 2 shown), only 3 models (top 2 shown), and only 4 models (top 1 shown) classified correctly.

Time Step353637383940414243444546474849TOTAL
EASY320255113000000049
HARD40107428636222041212753243
AVERAGE791

LR | 0 | 0 | 3 | 0 | 2 | 3 | 0 | 6 | 2 | 3 | 1 | 0 | 1 | 9 | 2 | | RF | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | MLP | 0 | 0 | 1 | 1 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | LSTM | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | XGB | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | RF,XGB | 4 | 0 | 0 | 1 | 2 | 1 | 7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | LR,MLP | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | RF,MLP,XGB | 5 | 6 | 0 | 8 | 3 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | LR,RF,XGB | 6 | 1 | 10 | 27 | 18 | 10 | 5 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | RF,MLP,XGB,LR | 124 | 24 | 12 | 57 | 45 | 55 | 81 | 159 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | |

Figure 11: Top 10 and bottom 10 features for the transactions dataset (left) and actors dataset (right).

5.4 Model Optimization by Feature Refinement

Given that the RF is the best performing model, we explore feature refinement to further optimize RF. By examining the results from decision trees, we combine feature importance, permutation feature importance, and drop column feature importance to show the top 10 and bottom 10 features by importance on both transactions dataset and actors dataset, as shown in Figure 11a for TX and Figure 11b for AR. For the transactions dataset, the 17 augmented features produced a collective 12% importance, with the transaction size feature alone responsible for 4.1%. For the actors dataset, 35% of the features were responsible for 80% of the importance, with the total address interaction as the most important feature. Interestingly, a connection can be made to the set of top and bottom 10 features with the features highlighted in Figure 8. It can be seen that there is a link from the green highlight to the top 10 features, and likewise with the red highlight and bottom 10 features. This solidifies that features providing visual proof of the risk attribute to a larger classification importance, and vice versa. The top and bottom 2 of each feature type in both datasets are included in the Appendix Section A.5 for reference. Using this analysis, we run

Table 10: Illicit transactions classification results using selected features, labelled as $\psi$. Micro-F1 denotes Micro-Avg F1.

ModelPrecisionRecallF1 ScoreMicro-F1
RF$^{TX}$0.9750.7190.8280.980
RF$^{TX’}$0.9860.7270.8360.981
RF+XGB$^{TX}$0.9770.7060.8200.979
RF+XGB$^{TX’}$0.9870.7170.8260.980
RF+MLP+XGB$^{TX}$0.9620.7230.8260.980
RF+MLP+XGB$^{TX’}$0.9680.7290.8340.980

the best $1-2-3$ classifiers ensembles with selected features instead of the full set of features. Tables 10 and 11 show the model performance results for transactions and wallet addresses (actors) respectively. Interestingly, the feature refined models show an average improvement of 0.92%, 1.17%, and 0.89% for precision, recall, and F1 score respectively on the transactions dataset, and similarly 1.07%, 3.1%, and 1.13% on the actors dataset.

The increase in recall is also evident in the RF trees voting. Figures 12a and 12b show the cumulative votes of 50 RF trees for 9 chosen transactions/addresses before (left) and after (right) selecting important features for both the transactions and actors datasets respectively. This demonstrates a trend that when some non-contributing features are dropped, those transactions/addresses that are close to the correct manifold in classification will in turn be classified correctly by more trees, increasing in the number of correct votes.

Figure 12: Cumulative votes for 50 RF trees for (a) chosen transactions and (b) addresses before (left) and after (right) feature selection and refinement.

(a) Transactions dataset.

(b) Actors dataset.

6 CONCLUDING REMARKS

With the rapid growth of the cryptocurrency ecosystem, there is a growing demand for robust financial forensics on the blockchain networks. This paper makes three original contributions.

First, we collect and contribute the Elliptic++ dataset, combining over 203k transactions and 822k addresses, and providing four graph representations (Money Flow Transaction Graph, Actor Interaction Graph, Address-Transaction Graph, User Entity Graph) that allow for mining and visualizations of connections for anomaly detection. This enables the detection of both illicit transactions and illicit accounts in Bitcoin networks.

Second, we leverage the four unique graph representations to showcase the fraud detection of illicit transactions, illicit actors (wallet addresses), and the risks of de-anonymization of users using clustering of addresses. We demonstrate the utility of the Elliptic++ dataset for detecting fraudulent transactions and illicit accounts using representative machine learning approaches, including Random Forest, Logistic Regression, Multilayer Perceptrons, LSTM, and Extreme Gradient Boosting. We analyze why Random Forest (RF) is the best-performing model for fraud detection, achieving 97.5% precision and 71.9% recall on the transactions dataset, and 91.1% precision and 78.9% recall on the actors dataset. We also provide comparative analysis of their performance and provide the explainability of the ML algorithms through visualization of all four types of graphs. In addition, we also provide explainability of the performance comparison through detailed analysis on the time series of selected transaction features, and the time series of account features over the time steps of the Elliptic++ dataset.

Third, to further improve the accuracy of the ML algorithms for detecting fraudulent transactions and illicit accounts, we employ ensemble methods to improve the generalization performance of individual ML algorithms. We study ensemble learning with two and three member models, with detailed analysis on the effectiveness of ensemble methods through EASY, HARD and AVERAGE cases. Motivated by our ensemble learning analysis, we show that model training using selective features instead of all extracted features of transactions and of addresses can further improve RF precision and recall performance by 0.92% and 1.17% respectively for the transactions dataset, and 1.07% and 3.1% respectively for the actors dataset. The combination of ensembles and importance-based feature pruning not only demonstrates the utility of the Elliptic++ dataset for fraud detection in Bitcoin network, but also showcases the importance of root cause analysis of fraudulent activities through semantic and statistical explainability of ML algorithms.

The Bitcoin blockchain is the first and the largest cryptocurrency network, boasting as the most widely used cryptocurrency. We believe that the methodology used for collecting the Elliptic++ dataset and the approach developed to exhibit the utilities of this dataset for fraud detection in Bitcoin network can be leveraged to analyze other types of currencies and blockchain networks, including Monero, ZCash, and Ethereum, to name a few. Moreover, the graph representations and the use of time series representations of transaction/account features over the time steps provides guidelines for developing financial forensics models that can be beneficial to other blockchains with high performance and explainability.

The Elliptic++ dataset and its tutorials are made publicly available at https://www.github.com/git-disl/EllipticPlusPlus. We conjecture that making this dataset publicly available will enable the applied data science researchers to conduct financial forensics on the Bitcoin cryptocurrency network and develop more effective fraud detection models and algorithms.

Acknowledgement. This research is partially sponsored by the NSF CISE grants 2302720 and 2038029, an IBM faculty award (002114), and a CISCO edge AI grant (001927). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

Table 11: Illicit actors classification results using selected features, labelled as $\psi$ and compared with $AR$. Micro-F1 denotes Micro-Avg F1.

ModelPrecisionRecallF1 ScoreMicro-F1
RF$^{AR}$0.9110.7890.8450.990
RF$^{AR^P}$0.9210.8020.8580.990
RF+XGB$^{AR}$0.9590.5300.6820.982
RF+XGB$^{AR^P}$0.9670.5430.6860.982
RF+MLP+XGB$^{AR}$0.9330.5720.7090.983
RF+MLP+XGB$^{AR^P}$0.9450.6010.7180.984

Useful information for enthusiasts:

Contact me via Telegram: @ExploitDarlenePRO