Detecting Malicious Accounts showing Adversarial Behavior in Permissionless Blockchains

05.03.2025

Different types of malicious activities have been flagged in multiple permissionless blockchains such as Bitcoin and Ethereum. While some malicious activities exploit vulnerabilities in the blockchain infrastructure, others target its users through social engineering techniques. To address these problems, we aim to automatically flag blockchain accounts that originate such malicious exploitation of the accounts of other participants. To that end, we identify a robust supervised machine learning (ML) algorithm that is resistant to bias induced by the over-representation of certain malicious activities in the available dataset, and that is also robust against adversarial attacks. We find that most of the malicious activities reported thus far, for example in the Ethereum blockchain ecosystem, behave statistically similarly. Further, the ML algorithms previously used for identifying malicious accounts show a bias towards the malicious activity that is over-represented. Subsequently, we identify that Neural Networks (NN) hold up best in the face of such bias-inducing datasets while also being robust against certain adversarial attacks.

1 Introduction

Blockchains can be modeled as ever-growing temporal graphs, where interactions (also called transactions) happen between different entities. In a blockchain, various transactions are grouped to form a block. These blocks are then connected together to form a blockchain. A typical blockchain is immutable and is characterized by properties such as confidentiality, anonymity, and non-repudiability. Irrespective of the type of blockchain (explained next), these properties achieve a certain level of privacy and security. There are mainly two types of blockchains: permissionless and permissioned. In permissioned blockchains, all actions on the blockchain are authenticated and authorized, while permissionless blockchains do not require such aspects for successful transactions. Permissionless blockchains usually also support a native crypto-currency.

Many malicious activities, such as ransomware payments in crypto-currency and Ponzi schemes [4], happen due to the misuse of permissionless blockchain platforms. A malicious activity is one where accounts perform illegal acts such as Phishing, Scamming, and Gambling. In [3], the authors survey different types of attacks and group them based on the vulnerabilities they target in the permissionless blockchain. Thus, one question we ask is: can we train a machine learning (ML) algorithm to detect malicious activity and generate alerts to safeguard other accounts?

There are various state-of-the-art approaches, such as [1, 19, 31], that use ML algorithms to detect malicious accounts in various permissionless blockchains such as Ethereum [6] and Bitcoin [18]. In [1], the authors outline an approach to detect malicious accounts where they consider the temporal aspects inherent in a permissionless blockchain including the transaction based features that were used in other related works such as [19, 31].

Nonetheless, these related works have various drawbacks. They only study whether an account in the blockchain is malicious or not. They do not classify or comment upon the type of malicious activity (such as Phishing or Scamming) the accounts are involved in. Further, as they do not consider the type of malicious activities, they do not study the bias induced by the imbalance in the number of accounts associated with particular malicious activities and, thus, fail to detect differences in the performance of the identified algorithm on the different kinds of malicious activities. Additionally, they do not study the performance of the identified algorithm when any adversarial input is provided. An adversarial input is defined as an intelligently crafted account whose features hide the malicious characteristics that an ML algorithm uses to detect malicious accounts. Such accounts may be designed to evade ML based detection of their maliciousness.

Thus, we are motivated to answer: (Q1) can we effectively detect malicious accounts that are represented in a minority in the blockchain? Further, we ask and answer: (Q2) can we effectively detect malicious accounts that show behavior adversarial to ML based detection? We answer these two questions using a three-fold methodology.

(R1) First, we analyze the similarity between the different types of malicious activities that are currently known to exist. Here, we examine whether under-represented malicious activities have any similarity with other malicious activities. (R2) We then study the effect of bias induced in the ML algorithm, if any, by the malicious accounts attributed to a particular malicious activity that is represented in large numbers. We then identify the ML algorithm that can efficiently detect not only the largely represented malicious activities on the blockchain, but also those that are under-represented. Following the state-of-the-art approaches, we train and test different ML algorithms on the dataset and compute the recall considering all malicious activities under one class. Next, to understand the robustness of the identified ML algorithm, we test it on newly tagged malicious accounts with a motivation to understand whether such accounts can be captured in the future. (R3) We also test the robustness of the ML algorithm on adversarial inputs. Here, we use Generative Adversarial Networks (GAN) [13] to first generate adversarial data using the already known feature vectors of malicious accounts and then test the identified ML model on such data to study the effect of adversarial attacks.

To facilitate our work, we focus on the Ethereum blockchain [6], a permissionless blockchain. Ethereum has mainly two types of accounts – Externally Owned Accounts (EOA) and Smart Contracts (SC) – where transactions involving EOAs are recorded in the ledger while transactions between SCs are not. There are various vulnerabilities in Ethereum [8] that attackers exploit to siphon off Ethers (the crypto-currency used by Ethereum). Our study is not exhaustive over all possible attacks on a blockchain; we study only those for which we have example accounts. Note that our work is also applicable to other permissionless blockchains that focus on crypto-currency transactions. We choose Ethereum because of the volume, velocity, variety, and veracity of its transaction data. Currently, Ethereum has more than 14 different types of known fraudulent activities.

Our results show that certain malicious activities behave similarly when expressed in terms of feature vectors that leverage the temporal behavior of the blockchain. We also observe that a Neural Network (NN) performs relatively better than the other supervised ML algorithms, and that the volume of accounts attributed to a particular malicious activity does not induce a bias in the results of the NN, contrary to the other supervised ML algorithms. Moreover, when adversarial data is encountered, the NN's performance remains better than that of the other supervised ML algorithms. Henceforth, note that when we refer to the performance of an ML algorithm, we mean the recall achieved on a testing dataset.

In summary, our contributions in this paper are as follows:

  1. The similarity scores reveal that similarity exists between the different types of malicious activities present until 7th Dec 2019 in the Ethereum blockchain, and that they can be clustered mostly into 3 clusters.
  2. All the state-of-the-art ML algorithms used in the related studies are biased towards the dominant malicious activity, i.e., ‘Phishing’ in our case.
  3. We identify that a Neural Network based ML model is the least affected by the imbalance in the numbers of the different types of malicious activities. Further, when adversarial inputs based on the transactional data in the Ethereum blockchain are provided as test data, the NN model is resistant to them. When the NN is trained on some adversarial inputs, its balanced accuracy increases by 1.5%. When trained with adversarial data, most other algorithms also regain their performance, with RandomForest having the best recall.

The rest of the paper is organized as follows. In section 2, we present the background and the state-of-the-art techniques used to detect malicious accounts in blockchains. In section 3, we present a detailed description of our methodology. This is followed by an in-depth evaluation along with the results in section 4. We finally conclude in section 5, providing details on prospective future work.

2 Background and Related Work

Between 2011 and 2019, there have been more than 65 exploit incidents on various blockchains [3]. These attacks mainly exploit vulnerabilities present in the consensus mechanism, client, network, and mining pools; examples include the Sybil attack, the Denial of Service attack, and the 51% attack. In specific cases such as Ethereum, SC based vulnerabilities are also exploited. For instance, the Decentralized Autonomous Organization (DAO) attack[^1] exploited the Reentrancy bug [22] present in the SC of the DAO system. While some of the attacks exploit bugs and vulnerabilities, other exploits target users of the blockchain. The users are sometimes not well-versed with the technical aspects of the blockchain, while at other times they get easily influenced by various social engineering attempts. Such exploits are also present in permissionless blockchains such as Ethereum. From the social engineering perspective, Phishing is the most common malicious activity present in Ethereum, where it is represented by more than 3000 accounts [12]. Table 1 presents the different known malicious activities that are reported to exist in the Ethereum blockchain and are used by us in this work. Note that some activities have similar descriptions but are marked differently in Ethereum.

A variety of techniques and approaches have been used to detect, study, and mitigate such different types of attacks and vulnerabilities in blockchains. In [8], the authors survey different

[^1]: Understanding dao hack: https://www.coindesk.com/understanding-dao-hack-journalists

| Malicious Incident/Activity | Description |
| --- | --- |
| Lendf.Me Protocol Hack | Exploit of a reentrancy vulnerability arising due to usage of the ERC-777 token [20]. |
| EtherDelta Website Spoof | Attackers spoofed the official EtherDelta website so that users transact through it [23]. |
| Ponzi Scheme | An attacker enticed a user to lend him crypto-currency, which the attacker used to repay the debt of a previously scammed user. |
| Parity Bug | Bug in a multi-sig Parity wallet which caused freezing of assets. |
| Phishing | When attackers pose as legitimate entities to lure individuals into transacting with them, e.g., the Bancor Hack [21]. |
| Gambling | Accounts involved in gambling activities, which is illegal in many countries. |
| Plus Token Scam | A Ponzi scheme [5]. |
| Compromised | Accounts whose addresses were either compromised or were scammed. |
| Scamming | Accounts which are reported to be involved in frauds. |
| Cryptopia Hack | No official description, but unauthorized and unnoticed transfers happened from the Cryptopia wallet to other exchanges [14]. |
| Bitpoint Hack | “Irregular withdrawal from Bitpoint exchange’s hot wallet” [30]. |
| Fake Initial Coin Offerings | Fake startups aimed to siphon off crowd-funded investments. |
| Upbit Hack | Speculation that an insider carried out malicious activity when the exchange was moving funds from a hot to a cold wallet [15]. |
| Heist | Accounts involved in various hacks, such as the Bitpoint Hack. |
| Spam Token | No official description of the activity. |
| Suspicious | No official description, but accounts are mainly involved in various scams. |
| Scam | No official description, but accounts are mainly involved in various scams. |
| Unsafe | No official description, but accounts are mainly involved in various scams. |
| Bugs | Accounts whose associated activities caused issues in the system. |

Table 1: Malicious Activities on Ethereum blockchain as reported by Etherscan.

vulnerabilities and attacks in the Ethereum blockchain and provide a discussion of the different defenses employed. We classify these defenses into 3 groups: (a) those that deploy honeypots to capture and analyse transactions in the blockchain, (b) those that use a wide variety of machine learning algorithms to analyse transactions, and (c) those that study vulnerabilities in the system, such as the bytecode of smart contracts, to analyse malicious activities.

In [9], the authors deployed a honeypot and studied the attacks that happen in the Ethereum blockchain. They analyzed the HTTP and Remote Procedure Call (RPC) requests made to the honeypot and performed behavioral analysis of the transactions. They found adversaries follow specific patterns to steal crypto-currency from the blockchain. Nonetheless, in some instances such honeypots are also compromised, for example the honeypot at the address – ‘0x2f30ff3428d62748a1d993f2cc6c9b55df40b4d7’.

In [1], the authors present a survey of the different state-of-the-art ML algorithms that are used to detect malicious accounts in a blockchain transaction network, and then present the need for the temporal aspects present in blockchains as new features. Here, we refrain from surveying again the methods already presented in [1]; instead, we present their findings and the new techniques that have been reported since.

In [1], the authors categorized the features into two main categories: transaction based and graph based. With respect to the transaction based features, they reported the use of features such as Balance In, Balance Out, and Active Duration. With respect to the graph based features, they identified the extensive use of features such as the clustering coefficient [26] and the in/out-degree. The authors, motivated to capture the diversity in the transactions, found that the use of temporal features further enhanced the robustness of the ML algorithm towards identifying malicious accounts in a blockchain. These features amounted to a total of 59 and were related to inter-event transaction properties such as the stability of the neighborhood (referred to as attractiveness) and the non-uniformity (referred to as burst [16]) present in the degree, inter-event time, gas-price, and balance. A complete list of the features used in [1] is presented in appendix B. Using such an enriched feature vector, they validated their approach and achieved a high recall (> 78%) on the entire class of malicious accounts present in their test dataset.

In [2], the authors used Graph Convolutional Networks (GCN) to detect money-laundering related malicious activities in the Bitcoin blockchain. They developed a Bitcoin transaction graph where the transactions were represented as the nodes while the flow of Bitcoin (the crypto-currency used in the Bitcoin blockchain) was represented as the edges. They used transaction based features such as the amount of Bitcoins received and spent by an account and the Bitcoin fee incurred by a transaction. Using GCN, they achieved an F1 score of 0.77 on their dataset. Similarly, in [17], the authors constructed a transaction graph with similar features as in [2] to detect malicious activities in the Bitcoin blockchain. They compared the use of unsupervised, supervised, and active learning approaches, and observed that the active learning techniques performed better.

In [27], the authors explored the transactions carried out by different accounts involved in Phishing activity. They analyzed the transaction data and proposed trans2vec, a network embedding algorithm, to extract features from the Ethereum blockchain data. They then used the extracted features with a One Class Support Vector Machine (OC-SVM) to detect accounts involved in phishing activities, and achieved a recall score of 89.3% on the malicious class. Although they focused on phishing activities, they did not discuss the applicability of trans2vec to other types of malicious activities.

In [29], the authors developed a framework to analyze the transactions on the Ethereum blockchain and detect various attacks which exploit different vulnerabilities therein. They replayed all transactions related to a particular address and monitored the Ethereum Virtual Machine (EVM) state. They then applied logic rules on the transactions to detect abnormal behavior associated with a particular vulnerability, studying only the Suicidal, UncheckedCall, and Reentrancy vulnerabilities.

In all the above-mentioned works, the authors did not distinguish between the behaviors of the different types of malicious activities present in permissionless blockchains. Further, no discussion is provided on the effect of adversarial data on these approaches. To the best of our knowledge, in the field of blockchain security, the current work is the first that studies data poisoning and evasion and tries to safeguard against such adversarial attacks.

As adversarial data might not be present in the dataset, GAN [13] is one of the most commonly used techniques to generate adversarial data for testing an approach. Based on Neural Networks, GANs were originally proposed for use in the fields of computer vision and machine translation, but over the years the technique has gained popularity in various sub-domains of cyber-security, such as intrusion detection systems [7]. The architecture of a GAN consists of two basic components: a generator and a discriminator. The generator generates fake data that has similar characteristics as the original data. The fake generated data, along with the real data, is passed to the discriminator, which discriminates the input data and identifies whether it is real or fake. Both the generator and the discriminator are trained iteratively over the dataset. Over time, the generator becomes more intelligent, making it hard for the discriminator to correctly classify the real and the fake data. There are many variants of GANs available.[^2] We do not describe all the techniques and variants of GAN as this is out of the scope of this work. However, here we only describe CTGan [28], which we use to generate fake data for every malicious activity represented by accounts in our dataset. Our choice of CTGan is based on the fact that CTGan is able to generate tabular data. In CTGan, the generator and discriminator models contain 2 fully connected hidden layers, which capture the correlation in the feature vectors given to them. The generator model uses the Rectified Linear Unit (ReLU) activation function along with batch normalization to reduce over-fitting. ReLU is defined as the positive part of the argument passed to activate the neuron. The discriminator has leaky ReLU activation along with dropout regularization [25] implemented in each hidden layer. When using CTGan for data generation, for the best results, the authors recommend the number of epochs (where one epoch represents a single iteration over the dataset) to be greater than 300. One limitation of CTGan is that the generator model needs at least 10 feature vectors to generate adversarial data.
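For illustration, the following is a minimal sketch of driving such tabular data generation, assuming the current API of the open-source `ctgan` package; the DataFrame `malicious_features` is a hypothetical name for the feature vectors of one malicious activity.

```python
# Minimal sketch of tabular adversarial-data generation with CTGan;
# `malicious_features` is a hypothetical pandas DataFrame holding the
# feature vectors of one malicious activity (at least 10 rows, per
# CTGan's limitation).
from ctgan import CTGAN

synthesizer = CTGAN(epochs=1000)           # CTGan's authors recommend > 300 epochs
synthesizer.fit(malicious_features)        # all 59 features here are continuous
fake_accounts = synthesizer.sample(1000)   # synthetic feature vectors
```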

3 Our Approach

In this section, we describe in detail our three-fold approach towards answering our research questions.

3.1 Computing similarity amongst malicious accounts

We compute the cosine similarity measure amongst accounts attributed to different known malicious activities to understand if the malicious activities have similar behavior. We acknowledge that there are other methods to quantify similarity, but in this work we use cosine similarity as it is widely adopted. As the accounts are associated with a specific type of malicious activity, besides computing the individual similarity, we compute and analyse the pairwise cosine similarity among the accounts associated with the malicious activities. Assume a dataset $D_a$ of malicious and benign accounts in a permissionless blockchain. We assume that each account is attributed to one and only one malicious activity; in reality, an account can have multiple associations. Consider two malicious activities, $M_1$ and $M_2$, from a set of malicious activities $M$, with sets of accounts $A_{M_1}$ and $A_{M_2}$ associated with them, respectively. We compute the cosine similarity $CS_{i,j}$ such that $i \in A_{M_1}$, $j \in A_{M_2}$, and $A_{M_1} \cap A_{M_2} = \emptyset$, and then identify the probability of it being greater than or equal to 0, $p(CS_{i,j} \geq 0)$. If $CS_{i,j} \geq 0$ for all $i$ and $j$, then we say that the two malicious activities, $M_1$ and $M_2$, are similar.
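To make the computation concrete, the following is a minimal sketch of estimating $p(CS_{i,j} \geq 0)$ for a pair of activities; the DataFrame `features` and the tag series `labels` are hypothetical names for the feature-vector dataset and the per-account activity tags.

```python
# Minimal sketch: pairwise cosine similarity between the accounts of two
# malicious activities and the probability p(CS_{i,j} >= 0).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def p_similar(features, labels, m1, m2):
    """features: (n_accounts x 59) DataFrame; labels: activity tag per account."""
    a_m1 = features[labels == m1].to_numpy()   # accounts in A_{M_1}
    a_m2 = features[labels == m2].to_numpy()   # accounts in A_{M_2}
    cs = cosine_similarity(a_m1, a_m2)         # |A_{M_1}| x |A_{M_2}| matrix
    return float(np.mean(cs >= 0))             # fraction of non-negative pairs
```

For example, `p_similar(features, labels, 'Phishing', 'Scamming')` would return the probability plotted for that pair in Figure 5b.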

Then, we use a clustering algorithm with the motivation that accounts associated with the same malicious activity would cluster together and show homophily [10]. We use the K-Means algorithm to validate if similarity indeed exists between two malicious activities. Here, we assume an upper limit on $k$ (the hyperparameter for K-Means) and use $k = \|M\| + 1$, where $\|M\|$ represents the size of the set of different malicious activities, i.e., the number of different activities under consideration, and the $+1$ part represents the benign cluster. However, our results show that most of the accounts, even if they are associated with different malicious activities, cluster together. Note that, in terms of the number of clusters found, the best case scenario would be that malicious accounts associated with different malicious activities cluster separately, while the worst case would be that all malicious accounts, irrespective of their associated malicious activity, cluster together.
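A minimal sketch of this validation step, reusing the hypothetical `features` and `labels` names from the previous sketch:

```python
# Minimal sketch: K-Means over the malicious accounts with k = ||M|| + 1
# clusters (one per malicious activity plus one for benign behavior), then
# a cross-tabulation of activity tags versus cluster ids to check homophily.
import pandas as pd
from sklearn.cluster import KMeans

k = labels.nunique() + 1                               # ||M|| + 1
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
print(pd.crosstab(labels, pd.Series(kmeans.labels_, name='cluster')))
```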

3.2 Bias Analysis

The distribution of the number of accounts associated with each $M_i \in M$ is not uniform. This increases the sensitivity of the ML algorithm towards the $M_i \in M$ that is more prominent, i.e., has more associated accounts, thereby inducing a bias in the selected model towards the $M_i$ with the most associated accounts. To understand the effect of the number of malicious accounts attributed to a particular $M_i$ on the ML algorithm, we segment $D_a$ into different training and testing sub-datasets and use them to train and test the ML algorithms. Let the set of different training and testing sub-datasets be $C = \{C_0, C_1, \cdots, C_n\}$, where each element $C_i$ represents a specific split of $D_a$. Let $Tr^{C_i}$ denote the training dataset, which contains 80% of randomly selected accounts from $C_i$, and $Ts^{C_i}$ denote the test dataset, which contains the remaining 20% of accounts. The different $C_i$'s we use are:

  • Null model or $C_0$: This is our baseline sub-dataset. Here we do not distinguish between the types of malicious activities; rather, we only consider whether an account is malicious or not. Note that here, based on the above notation, the training dataset is represented as $Tr^{C_0}$ and the testing dataset as $Ts^{C_0}$.

Let $A_{s_0}^{M_1}$ represent the set of accounts associated with the $M_1$ type of malicious activity in the testing dataset $Ts^{C_0}$, and let $A_{r_0}^{M_1}$ represent the corresponding set in the training dataset $Tr^{C_0}$. As our aim here is to analyze the bias caused by a malicious activity, for example $M_1$, we analyse the results obtained when training and testing the ML algorithm using different combinations of accounts associated with the $M_1$ activity. For instance, we analyse the performance of an ML algorithm when accounts associated with $M_1$ are not present in the training dataset but are present in the testing dataset. Below we list all such combinations that we use:

  • $C_1$: Here we train on $Tr^{C_0}$ but test on $Ts^{C_1}$, where $Ts^{C_1} = Ts^{C_0} - A_{s_0}^{M_1}$, i.e., we train on the 80% of the dataset, but we remove all the accounts associated with activity $M_1$ from the testing dataset. Ideally, in this case, the ML algorithm should perform similarly to $C_0$ since the training dataset is the same.
  • $C_2$: Here, we train on $Tr^{C_2} = Tr^{C_0} - A_{r_0}^{M_1}$ but test on $Ts^{C_2}$, which is the same as $Ts^{C_0}$, i.e., we remove all the accounts associated with activity $M_1$ from the training dataset, but we keep the accounts associated with $M_1$ in the testing dataset. Ideally, in this case, the ML algorithm should misclassify the accounts associated with $M_1$ that are present in $Ts^{C_2}$. In case adverse results are obtained, it would mean that there is a bias.
  • $C_3$: Here, we train on $Tr^{C_2}$ and test on $Ts^{C_3}$, which is the same as $Ts^{C_1}$, i.e., we remove all the accounts associated with activity $M_1$ from both the training and the testing dataset. Ideally, in this case, the ML algorithm should perform similarly to $C_0$ since no accounts associated with the $M_1$ activity are present in $Tr^{C_2}$ and $Ts^{C_3}$. In case adverse results are obtained, it would mean that there is a bias.
  • $C_4$: Here, we train on $Tr^{C_4} = Tr^{C_0} + A_{s_0}^{M_1}$ and test on $Ts^{C_4}$, which is the same as $Ts^{C_1}$, i.e., we remove all the accounts associated with activity $M_1$ from the testing dataset and add them to the training dataset. Ideally, in this case, the ML algorithm should perform similarly to $C_1$ since no accounts associated with the $M_1$ activity are present in $Ts^{C_4}$.

Note that while the above four configurations test different scenarios, together they cover all the scenarios required to understand the effect of a malicious activity, $M_1$, on the ML algorithm. For the sake of completeness, we also consider the following training and testing sub-dataset:

  • $C_5$: Here, we perform an 80-20 split of the number of accounts associated with each malicious activity in $M$. We then collect all these 80% data-splits, along with 80% of the benign accounts, to create the training dataset, $Tr^{C_5}$. Similarly, we collect all the remaining 20% splits to create the testing dataset, $Ts^{C_5}$. We then train on the resulting 80% of the dataset and test on the remaining 20%. Ideally, in this case, the ML algorithm should perform similarly to $C_0$. A sketch of how these sub-datasets can be constructed is shown below.
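The following is a minimal sketch of constructing the sub-datasets $C_0$ to $C_5$; `df` is a hypothetical DataFrame containing the feature vectors, a boolean `malicious` column, and an `activity` column holding the malicious-activity tag (empty for benign accounts). These names are ours, not from the original implementation.

```python
# Minimal sketch of the bias-analysis splits described above.
import pandas as pd
from sklearn.model_selection import train_test_split

tr_c0, ts_c0 = train_test_split(df, test_size=0.2, random_state=0)       # C_0

m1 = 'Phishing'                                       # over-represented activity
ts_c1 = ts_c0[ts_c0.activity != m1]                   # C_1: drop M_1 from test
tr_c2 = tr_c0[tr_c0.activity != m1]                   # C_2: drop M_1 from train
ts_c3 = ts_c1                                         # C_3: train on tr_c2, test without M_1
tr_c4 = pd.concat([tr_c0, ts_c0[ts_c0.activity == m1]])  # C_4: move test M_1 into train
ts_c4 = ts_c1

# C_5: an 80-20 split performed per activity (activities with very few
# accounts would need special handling in practice).
tr_parts, ts_parts = [], []
for _, group in df.groupby(df.activity.fillna('benign')):
    cut = int(0.8 * len(group))
    tr_parts.append(group.iloc[:cut])
    ts_parts.append(group.iloc[cut:])
tr_c5, ts_c5 = pd.concat(tr_parts), pd.concat(ts_parts)
```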

Among the supervised ML algorithms, in [1], the authors showed that the ExtraTrees Classifier (ETC) performs best on the data they had. We use ETC with the identified hyperparameters on the above-mentioned sub-datasets to identify the bias induced by a malicious activity. Further, as mentioned before, it is possible that an ETC-based classifier might fail to capture new or adversarial data. Thus, we also apply different supervised ML algorithms on $C_0$ and identify the algorithm that achieves the best recall on the entire malicious class. We then execute the best identified ML algorithm on the above-mentioned sub-datasets. To test the performance of the different supervised ML algorithms on new data, we collect the new and more recent malicious accounts' transaction data ($D_b$) and execute the identified algorithms on it.

3.3 Adversarial Analysis

The newly collected data ($D_b$) shows low similarity with the existing malicious activities; however, such data does not qualify as adversarial. We therefore use CTGan [28] to generate adversarial data for the malicious activities and use this new adversarial data ($D_g$) to validate our findings. Here, we use $D_g$ only in the test dataset, i.e., we perform training on $Tr^{C_0}$ while we perform our tests on $Ts^{D_g}$, which includes $D_g$ and all the benign accounts in $Ts^{C_0}$. Further, we also perform tests after training the different ML algorithms when 1%, 5%, and 80% of such adversarial feature vectors are added to the training dataset.
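A minimal sketch of this evaluation protocol, reusing the hypothetical split names from section 3.2 and assuming $D_g$ is a DataFrame with the same feature columns and `malicious = True` on every row:

```python
# Minimal sketch: train on Tr^{C_0} (optionally augmented with a fraction
# of D_g), then test on the benign accounts of Ts^{C_0} plus the remaining
# adversarial vectors; returns the recall on the adversarial rows.
import pandas as pd

def adversarial_recall(model, tr_c0, ts_c0, d_g, adv_train_frac=0.0):
    adv_train = d_g.sample(frac=adv_train_frac, random_state=0)
    adv_test = d_g.drop(adv_train.index)
    train = pd.concat([tr_c0, adv_train])
    X = lambda d: d.drop(columns=['malicious', 'activity'])
    model.fit(X(train), train.malicious)
    preds = model.predict(X(adv_test))
    return preds.mean()            # fraction of D_g detected as malicious
```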

In summary, Table 2 provides a list of the notations we use. Here, note that we first generate the feature vector dataset using the same approach as listed by the authors in [1]. Then, towards methodology one (R1), we compute the cosine similarity amongst the malicious accounts in $D_a$ to understand how similar they are. We also apply unsupervised ML algorithms such as K-Means on the malicious accounts in $D_a$ to identify if malicious accounts cluster together (see Figure 1). Towards methodology two (R2), we divide $D_a$ into different training and testing sub-datasets, $C = \{C_0, C_1, \cdots, C_5\}$, and execute different supervised ML algorithms to understand the bias induced by a particular malicious activity and identify the performance of the reported classifier on the transaction data made available until 27th May 2020 (see Figure 2). Towards methodology three (R3), we first filter out all the malicious accounts from $D_a$ and use CTGan separately on each malicious activity to generate adversarial data ($D_g$) for all the activities where we have more than 10 accounts. Note that there are two ways in which we can generate adversarial data: (1) using the feature vectors of malicious accounts, and (2) using the feature vectors of benign accounts. These represent, respectively, (1) evading the ML algorithm to avoid being detected as malicious, and (2) evading the ML algorithm to avoid being detected as benign and thus getting classified as malicious. In this work, we generate adversarial data using malicious accounts. Further, for CTGan, we use the default parameters to synthesize the data and use 1000 epochs (an epoch represents one iteration over the dataset) for fitting. We then compare the performance of various state-of-the-art supervised ML algorithms on $D_g$ (see Figure 3).

4 Evaluation

In this section, we first present the data we use and then provide a detailed analysis of our results.

4.1 Data Used

We use the external transactions data present in the Ethereum main-net blockchain. Ethereum [6] is one of the most widely adopted permissionless blockchain platforms. It uses the Proof-of-Work (PoW) consensus mechanism to validate the transactions of its users. Ethereum provides users with the functionality to deploy additional programs called smart contracts, which can be used to control the flow of Ethers. In Ethereum, EOAs are accounts/wallets owned by a real entity or person, wherein the hash of the public key of the owner of an EOA is the address of the EOA. SCs, on the other hand, are similar to EOAs, with the exception that they contain code to automate certain tasks, such as sending and receiving Ethers, and invoking, creating, and destroying other smart contracts when needed. SCs can be created and invoked both by EOAs and by other SCs. There are two types of transactions which occur on the Ethereum blockchain: external and internal transactions. External transactions occur between different EOAs, and between EOAs and SCs, and are recorded on the blockchain ledger. Internal transactions are created by and occur between different SCs and are not recorded on the blockchain ledger. Further, SCs can execute external calls, which are then recorded on the blockchain ledger. An external transaction typically has information about the blockHash (hash of the block in which the transaction is present), blockNumber (another identifier of the block in which the transaction is present), the account from which the transaction is invoked, the account to which the transfer was made, gas (the amount the account in the from field of the transaction is willing to pay the miners to include the transaction in the block), gasPrice (cost per unit gas), the transaction hash, balance (Ethers transferred), and the timestamp of the block. Such transactions are then grouped together into blocks before being published onto the blockchain.

We use the Etherscan API [11] to get the transaction data of 2946 malicious accounts that were marked until 7th December 2019. As the number of benign accounts was more than 117 Million, we perform under-sampling to get the external transactions of 680,314 benign accounts. Note that these 680,314 benign accounts and 2946 malicious accounts collectively constitute our dataset $D_a$ (defined in section 3). Since, using a similar approach in [1], the authors obtained good results on a similar dataset by including the temporal features, we use the same set of 59 features ($F$) in our work as well. In short, these 59 features can be classified under: (a) temporal graph based features, which include indegree, outdegree, attractiveness, inter-event time burst, degree burst, and clustering coefficient, and (b) transaction based features, which include activeness, crypto-currency balance, and fee paid.
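For illustration, the external (normal) transactions of an account can be pulled from Etherscan's `account/txlist` endpoint roughly as follows; the API key is a placeholder and error handling is omitted.

```python
# Minimal sketch of fetching an account's external transactions via the
# Etherscan API; YOUR_API_KEY is a placeholder.
import requests

def external_transactions(address, api_key):
    params = {'module': 'account', 'action': 'txlist', 'address': address,
              'startblock': 0, 'endblock': 99999999, 'sort': 'asc',
              'apikey': api_key}
    resp = requests.get('https://api.etherscan.io/api', params=params)
    # Each record carries the blockHash, blockNumber, from, to, gas,
    # gasPrice, hash, value, and timeStamp fields described above.
    return resp.json().get('result', [])
```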

Etherscan provides, for each malicious account, a caution warning message so that other accounts can exercise caution while transacting with it. Besides such warning messages, Etherscan also provides information about the malicious activity the account is involved in. These activities are: Phishing (2590 accounts), Scamming (168 accounts), Compromised (21 accounts), Upbit Hack (123 accounts), Heist (13 accounts), Gambling (8 accounts), Spam Token (10 accounts), Suspicious (4 accounts), Cryptopia Hack (3 accounts), EtherDelta Hack (1 account), Scam (1 account), Fake ICO (2 accounts), Unsafe (1 account), and Bugs (2 accounts). Thus, in the dataset, there are in total 14 different types of malicious activities. For our different training and testing sub-datasets, we currently only focus on 'Phishing' as it is the most prominent malicious activity in our dataset. Further, note that all the 'Bitpoint Hack' accounts were also tagged under 'Heist'; therefore, we only use 'Heist'. In $D_a$, we observe that 101 unique EOAs created 153 different malicious SCs. These EOAs are not marked as malicious in the dataset. Most of the EOAs created only one malicious SC, while one EOA created 15 malicious SCs. There are only 3 SCs that were created by 3 different SCs, which in turn were created by 2 different EOAs. However, we refrain from adding these accounts to our analysis so as to not change any ground truth. We do not reveal the identities of these accounts because we do not want to malign any of their future transactions.

As this list is dynamic, between 8th December 2019 and 27th May 2020, 1249 more accounts were identified as malicious by Etherscan. On top of that, 487 accounts out of the 2946 previously identified malicious accounts continued to transact until 27th May 2020. These 1736 malicious accounts in total constitute our dataset $D_b$ (defined in section 3). The accounts in $D_b$ are associated with: Upbit Hack (691 accounts), Parity (131 accounts), Phishing (842 accounts), Ponzi (38 accounts), Gambling (28 accounts), Compromised (1 account), Unknown (2 accounts), Lendf.Me Hack (2 accounts), Plus Token Scam (1 account), and Heist (1 account). We again notice that, in this case also, some accounts have more than 1 label associated with them. Based on our assumption, to ease our process, we associate these accounts with only one type of malicious activity. When analyzing $D_b$, we remove from $Tr^{C_0}$ all those accounts that are common to $D_a$ and $D_b$, move them to $Ts^{C_0}$, and retrain the different supervised ML algorithms to identify their performance.

4.2 Results

All our results are averaged over 50 iterations on our dataset and are generated using Python3.

4.2.1 R1: Similarity Analysis

Some of the malicious activities present in the dataset have similar definitions. We calculate the cosine similarity score between all the malicious accounts present in $D_a$ to identify the similarity that exists between them (see Figure 4). From the figure, we observe that in many cases $CS_{i,j} < 0$, as expected, because they potentially belong to different malicious activities. Nonetheless, we also observe that some pairs of accounts have high similarity between them. There are two reasons for such high similarity: (1) these accounts could actually belong to the same malicious activity, and (2) although the accounts might represent the same malicious activity, in Ethereum they might be marked differently, i.e., these accounts have been marked for malicious activities that have similar descriptions. To understand this further, we check all the accounts associated with a pair of malicious activities and mark the two activities as similar if all the accounts in one type of malicious activity have a cosine similarity of more than 0 with all the malicious accounts in the other type. Figure 5 depicts the probabilities $p(CS_{i,j} < 0)$ (see Figure 5a) and $p(CS_{i,j} \geq 0)$ (see Figure 5b) between all the accounts related to two malicious activities. Note that the two figures are complementary to each other as $p(CS_{i,j} < 0) = 1 - p(CS_{i,j} \geq 0)$. From the figures, we notice that many activities show a high probability of similarity with other malicious activities, for example, 'Bugs'

| Tag Name | Total | Cluster 1 | Cluster 2 | Cluster 3 |
| --- | --- | --- | --- | --- |
| Phishing | 2590 | 1700 | 492 | 398 |
| Upbit Hack† | 123 | 32 | 1 | 90 |
| Scamming | 168 | 116 | 27 | 25 |
| Heist‡ | 13 | 9 | 1 | 2 |
| Compromised‡ | 21 | 17 | 1 | |
| Unsafe | 1 | 1 | | |
| Spam Token† | 10 | 1 | 3 | 6 |
| Bugs | 2 | 2 | | |
| EtherDelta Hack | 1 | 1 | | |
| Cryptopia Hack | 3 | 2 | 1 | |
| Gambling‡ | 8 | 7 | | |
| Suspicious | 4 | 3 | 1 | |
| Fake ICO | 2 | 2 | | |
| Scam | 1 | 1 | | |

† not well clustered in cluster 1, ‡ not well clustered in 3 clusters

Table 3: Clustering of the malicious accounts into the three largest clusters (K-Means, $k = 15$).

and 'Unsafe'. There are $\approx 158$ pairs of malicious activities where $p(CS_{i,j} \geq 0) > 0.5$. Further, we also notice that within 'Phishing', although most accounts are similar, there are some accounts which show dissimilarity.

We use the K-Means algorithm to analyze the clustering patterns of all the malicious accounts present in the dataset $D_a$ to further understand if the accounts identified as similar show homophily. Here, we use $k = 15$. We see that most of the malicious accounts, irrespective of the malicious activity they belong to, cluster together in at most 3 clusters, except for the accounts that belong to the malicious activities 'Heist', 'Compromised', and 'Gambling' (see Table 3). Further, except for the accounts associated with 'Upbit Hack' and 'Spam Token', all other malicious activities had most of their accounts in the cluster containing the largest number of malicious accounts (cluster #1).

Therefore, we infer that different malicious activities in a blockchain such as Ethereum behave in a similar manner. The same label could be used to depict certain malicious activities, such as 'Phishing' and 'Fake ICO'. Currently, we do not need 14 different labels, as most of the malicious activities can be captured in at most 3 clusters.

4.2.2 R2: Bias Analysis

We first test ETC, as [1] reported it to produce the best results, on the different training and testing sub-datasets ($C_i$) to understand the bias induced by the imbalance in the number of accounts associated with a particular malicious activity. For ETC, we use the hyperparameters reported in [1]. These are: class_weight = 'balanced', criterion = 'entropy', max_features = 0.3, max_samples = 0.3, min_samples_leaf = 14, min_samples_split = 20, and n_estimators = 200. All other hyperparameters are kept at their defaults. In our dataset, since the largest number of malicious accounts is associated with the 'Phishing' activity, we choose 'Phishing' for our study, and our training and testing sub-datasets are configured accordingly. Note that this analysis is also valid for other malicious activities. From Table 4, we observe that, for $C_2$ and $C_3$, the recall on the accounts tagged as malicious deteriorates significantly. For $C_2$, we expected such results because there are no 'Phishing' accounts in the training dataset ($Tr^{C_2}$) while they are present in the test dataset ($Ts^{C_0}$). However, for $C_3$, the recall on the malicious accounts was below expectation. This proves the existence of a bias in ETC towards 'Phishing'. To understand further which malicious activities are impacted by such bias, we study the confusion matrix and the distribution of malicious accounts therein.
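In scikit-learn terms, this configuration corresponds to the following instantiation (a sketch; note that `max_samples` only takes effect when `bootstrap=True`, which is our assumption, as the paper does not state it):

```python
# The reported ETC hyperparameters expressed with scikit-learn.
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=200, criterion='entropy',
                           class_weight='balanced', max_features=0.3,
                           max_samples=0.3, min_samples_leaf=14,
                           min_samples_split=20,
                           bootstrap=True)  # assumed; required for max_samples
```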

From Table 5, we observe that for $C_2$, although only the accounts attributed to the 'Phishing' activity are removed from the training dataset, more than 50% of the accounts attributed to 'Scamming' are misclassified. The same results are observed for $C_3$, where the accounts associated with 'Phishing' are removed from both the training dataset ($Tr^{C_3}$) and the test dataset ($Ts^{C_3}$). However, if 'Phishing' related accounts are present in the training dataset, as in $C_0$, $C_1$, and $C_4$, fewer accounts tagged as 'Scamming' are misclassified. A similar observation is made for the accounts associated with 'Fake ICO'. This is consistent with the results of the previous subsection (R1), where we saw that 'Phishing' and 'Scamming' based accounts show high similarity. This validates the results we obtain using the different configurations.

| Data Until | Sub-dataset | ETC (Mal) | ETC (Ben) | NN (Mal) |
| --- | --- | --- | --- | --- |
| 07/12/2019 | $C_0$ | 0.70 | 0.99 | 0.83 |
| | $C_1$ | 0.76 | 0.99 | 0.86 |
| | $C_2$ | 0.28 | 0.99 | 0.79 |
| | $C_3$ | 0.59 | 0.99 | 0.83 |
| | $C_4$ | 0.79 | 0.99 | 0.87 |
| | $C_5$ | 0.70 | 0.99 | 0.88 |

Table 4: Recall obtained by ETC and NN on the malicious (Mal) and benign (Ben) classes for the different sub-datasets.

ETC ExtraTreesClassifier(class_weight = ‘balanced’, criterion = ‘entropy’, max_features = 0.3, max_samples = 0.3, min_samples_leaf = 14, min_samples_split = 20, n_estimators = 200)

NN NeuralNetworks(epochs = 50, regularization = l2(0.0001), dropout = 0.5, loss=’binary crossentropy’, optimizer=’adam’)
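For concreteness, a minimal Keras sketch consistent with these reported hyperparameters follows; the number and width of the hidden layers are our assumptions, as the architecture itself is not reported.

```python
# Minimal Keras sketch matching the reported NN hyperparameters; the
# two 64-unit hidden layers are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

nn = tf.keras.Sequential([
    layers.Input(shape=(59,)),                     # the 59 features of [1]
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.0001)),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.0001)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),         # malicious vs benign
])
nn.compile(loss='binary_crossentropy', optimizer='adam',
           metrics=[tf.keras.metrics.Recall()])
# nn.fit(X_train, y_train, epochs=50)  # X_train, y_train are assumed
```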

Since ETC shows a bias, we execute different supervised ML algorithms on the different training and test sub-datasets created from $D_a$. Here, using the features we have, we study those ML algorithms that were used in the related works. Table 6 depicts the recall obtained when the different supervised ML algorithms were applied on $C_0$. Here, we keep the hyperparameters of the supervised ML algorithms either at their defaults or at the values reported in the related works. Apart from the supervised ML algorithms specified in Table 6, we also test the performance of GCN using our features on the dataset $D_a$; GCN achieves a recall score of 32.1%. From this set of ML algorithms, we identify that the Neural Network (NN) (with hyperparameters: epochs = 50, regularization = $l2(0.0001)$, dropout = 0.5, loss = 'binary crossentropy', optimizer = 'adam') performs the best on $C_0$ and achieves a recall of 0.83 on the malicious class. Although the recall on the malicious class is high for NN, the recall on the benign class drops to 0.94, which is relatively lower; this results in a balanced accuracy (average recall over all classes) of 88.5%, which is still high overall. To understand if NN is biased towards 'Phishing', we execute NN on the different training and test sub-datasets we have. We find that NN has better recall on $C_2$ and $C_3$ as well (see Table 4). Further, we find that NN was able to correctly classify most of the accounts associated with the different malicious activities irrespective of the bias (see Table 5).

Next, we perform experiments to understand which supervised ML algorithm performs best in the scenario where new data on already known malicious activities is presented. From Figure 6, we find that the similarity between the malicious accounts in $Ts^{C_0}$ and those in $D_b$ is low. We then test the performance of the different supervised ML algorithms using $D_b$ to understand if these new malicious accounts are classified correctly by the already trained supervised ML algorithms, i.e., we train on $Tr^{C_0}$ while testing on all the benign accounts in $Ts^{C_0}$ and all the malicious accounts in $D_b$. Table 6 also presents the recall scores obtained in this scenario. Here, we again observe that the Neural Network performs better than the rest of the supervised ML algorithms and was able to correctly classify more than 434 accounts out of the 1736 malicious accounts. Thus, we get further confirmation that NN performs the best and is not affected by the bias caused by the high number of accounts associated with the 'Phishing' malicious activity in the Ethereum transaction data.

4.2.3 R3: Adversarial Analysis

Next, to answer our other research question and test the effect of adversarial data on the supervised ML algorithms, we first generate adversarial data from the accounts attributed to particular malicious activities and then test the various supervised ML algorithms on the generated dataset ($D_g$). For each malicious activity where the number of associated accounts is sufficiently large, we generate 1000 adversarial samples. For malicious activities that are moderately represented, we generate 50 adversarial samples. Since all the other malicious activities had less than 10 accounts, we did not generate adversarial accounts for them. Thus, in $D_g$, the distribution of accounts associated with the different malicious activities is as follows: Phishing (1000 accounts), Scamming (1000 accounts), Upbit Hack (1000 accounts), Spam Token (50 accounts), Compromised (50 accounts), and Heist (50 accounts).

