BitcoinHeist: Topological Data Analysis forRansomware Detection on the Bitcoin Blockchain

06.03.2025
BitcoinHeist: Topological Data Analysis forRansomware Detection on the Bitcoin Blockchain

Proliferation of cryptocurrencies (e.g., Bitcoin) that allow pseudo-anonymous transactions, has made it easier for ransomware developers to demand ransom by encrypting sensitive user data. The recently revealed strikes of ransomware attacks have already resulted in significant economic losses and societal harm across different sectors, ranging from local governments to health care.

Most modern ransomware use Bitcoin for payments. However, although Bitcoin transactions are permanently recorded and publicly available, current approaches for detecting ransomware depend only on a couple of heuristics and/or tedious information gathering steps (e.g., running ransomware to collect ransomware related Bitcoin addresses). To our knowledge, none of the previous approaches have employed advanced data analytics techniques to automatically detect ransomware related transactions and malicious Bitcoin addresses.

By capitalizing on the recent advances in topological data analysis, we propose an efficient and tractable data analytics framework to automatically detect new malicious addresses in a ransomware family, given only a limited records of previous transactions. Furthermore, our proposed techniques exhibit high utility to detect the emergence of new ransomware families, that is, ransomware with no previous records of transactions. Using the existing known ransomware data sets, we show that our proposed methodology provides significant improvements in precision and recall for ransomware transaction detection, compared to existing heuristic based approaches, and can be utilized to automate ransomware detection.

I. INTRODUCTION

This decade has been marked with the rise of blockchain based technologies. In its core, blockchain is a distributed public ledger that stores transactions between two parties without requiring a trusted central authority. On a blockchain, two unacquainted parties can create an immutable transaction that is permanently recorded on the ledger to be seen by the public. One of the first applications of Blockchain has been the Bitcoin cryptocurrency [27]. Bitcoin’s success has ushered an age known as the Blockchain 1.0 [37], and currently there exist more than 1000 Blockchain based cryptocurrencies.

Bitcoin transactions can be created anonymously, and participation in the network does not require identity verification. A payment can be requested by delivering a public Bitcoin address (i.e., a short string) to a sender by using anonymity networks such as Tor [10]. This ease of usage and worldwide transaction availability of Bitcoin have been noticed by malicious actors as well. The pseudo-anonymity of cryptocurrencies has attracted the interest of a diverse body of criminals, transnational terrorist groups, and illicit users.

Cryptocurrency related crime and, more generally, criminal abuse of blockchain technologies are nowadays recognized as the fastest-growing type of cybercrime [18], [31].

Among the malicious usage of cryptocurrencies, ransomware payments continue to attract an ever increasing attention. Although encrypting files and resources for ransom has a long history, receiving ransom payments securely had never been simple until the emergence of Bitcoin. For example, in 1989 the first documented ransomware AIDS Trojan demanded payments via international money order or cashiers check sent to a P.O. box in Panama. Worldwide transactability of Bitcoin protects ransomware operators from revealing their identity to collect the payments. As Paquet-Clouston et al. [30] state, “The combination of strong and well-implemented cryptographic techniques to take files hostage, the Tor protocol to communicate anonymously, and the use of a cryptocurrency to receive unmediated payments provide altogether a high level of impunity for ransomware attackers”. Starting with CryptoLocker in 2013, the world has seen an explosion of ransomware that uses Bitcoin. Many organizations that have been hit by ransomware are asked to pay significant amounts using Bitcoin. For example, Jackson County, Georgia hit with Ryuk in March, 2019 needed to pay bitcoins equal to $400,000, since attackers also compromised the backup data [7].

Furthermore, using cryptocurrencies for ransomware payments appears to be substantially more prevalent than have been previously realized. As noted by Hernendex-Castro et al. [15], among the respondents to their survey, “the prevalence of the Cryptolocker ransomware (3.4%) seems much higher than expected. The proportion of Cryptolocker victims that claim to have agreed to pay the ransom to recover their files (41%) seems to be much larger than expected (3% was conjectured by Symantec, 0.4% by Dell SecureWorks)”. Hence, understanding the ransomware payments and its overall economical impact is an emerging challenge of critical societal importance.

Although there have been efforts to analyze the cryptocurrency transactions such as Bitcoin using various heuristics (e.g., “co-spending” heuristic that is based on the idea that input addresses to the same transaction must belong to the same person since private keys associated with those accounts are needed for creating valid transactions [26]) to detect ransomware payments, to our knowledge, none of the previous efforts has leveraged advanced data science tools to detect malicious ransomware payments.

In this paper, our goal is to identify Bitcoin addresses that are used to store and trade Bitcoins gained through ransomware activities. To achieve this goal, we propose a scalable data-driven Bitcoin transaction analytics framework which is substantially more effective in detecting ransomware payment related addresses, compared to the existing heuristic based approaches.

The key thrust behind our proposed approach is the intrinsic capability to observe the complete blockchain graph and, as a result, to track and analyze dynamics of the associated blockchain topological and geometrical properties at a multi-resolution level. A natural question arises: What does local topological blockchain graph structure tells us on anomalous and malicious patterns? For example, does a suspiciously repeating occurrence of a certain blockchain payment pattern tend to be associated with ransomware payment behavior? We show that by using local topological information available on the Bitcoin transaction graph, we can achieve significant improvement in detecting malicious Bitcoin addresses associated with ransomware payments.

Significance of our contributions can be summarized as follows:

  • To our knowledge, this is the first project that leverages advanced data analytics and machine learning techniques for ransomware related bitcoin address detection.
  • We propose a simple, tractable and computationally efficient framework to extract features related to Bitcoin transactions which exhibit high utility in predicting ransomware related activities.
  • Using the ground truth data collected by various studies, we develop a novel topological data analysis (TDA) based ransomware detection approach that delivers significantly higher precision and recall compared to existing heuristic based procedures.
  • In addition to detecting new addresses associated with a known ransomware family, we show that our new methodology could be used to detect the emergence of new ransomware families.

II. RELATED WORK

The success of Bitcoin [27] has encouraged hundreds of similar digital coins [39]. The underlying Blockchain technology has been adopted in many use cases and applications. With this rapidly increasing activity, there have been numerous studies analyzing the blockchain technology from different perspectives.

The earliest results aimed at tracking the transaction network to locate coins used in illegal activities, such as money laundering and blackmailing [3], [29]. These findings are known as the taint analysis [9].

Bitcoin provides pseudo-anonymity; although all transactions are public by nature, user identification is not required to join the network. Mixing schemes [24], [33] exist to hide the flow of coins in the network. Research articles have shown that some Bitcoin payments can be traced [26]. As a result, obfuscation efforts [28] by malicious users have become increasingly sophisticated.

In ransomware analysis, Montreal [30], Princeton [17] and Padua [8] studies have analyzed networks of cryptocurrency ransomware, and found that hacker behavior can help us identify undisclosed ransomware payments. Datasets of these three studies are publicly available.

Early studies in ransomware detection use decision rules on amounts and times of known ransomware transactions to locate undisclosed ransomware (CryptoLocker) payments [19]. More recent studies are joint efforts between researchers and blockchain analytics companies; Huang et al. identify shared hacker behavior and use heuristics to identify ransomware payments [17]. The authors estimate that 20000 victims have made ransomware payments. However, these studies do not extract features and build machine learning models to detect ransomware payments and families.

Feature extraction has been studied for ransomware detection in specific domains. In software code analysis, Cryptolock inspects ransomware programs and their activity for malicious characteristics [34]. In this line of work, studies on ransomware for mobile devices extract software code features to catch malicious programs [23], [2]. However, these studies are mainly targeted for detecting ransomware before it can infect a system, and do not consider Bitcoin transactions.

III. BACKGROUND AND PRELIMINARIES

A. Ransomware

Ransomware is a type of malware that infects a victim’s data and resources, and demands ransom to release them. In two main types, ransomware can lock access to resources or encrypt their content. In addition to computer systems, ransomware can also infect IoT and mobile devices [23].

Ransomware can be delivered via email attachments or web based vulnerabilities. More recently, ransomware have been delivered via mass exploits. For example, CryptoLocker used Gameover ZeuS botnet to spread through spam emails. Once the ransomware is installed, it communicates with a command and control center. Although earlier ransomware used hard-coded IPs and domain names, newer variants may use anonymity networks, such as TOR, to reach a hidden command and control server.

In the case of asymmetric encryption, the encryption key is delivered to the victim’s machine. In some variants, the encryption key is created on the victim’s machine and delivered to the command center.

Once resources are locked or encrypted, the ransomware displays a message that asks a certain amount of bitcoins to be sent to a bitcoin address. This amount may depend on the number and size of the encrypted resources. After payment, a decryption tool is delivered to the victim. However, in some cases, such as with WannaCry, the ransomware contained a bug that made it impossible to identify who paid a ransomware amount.

B. Bitcoin Graph Model

Largely rooted in the existing network analysis methodology, earlier Bitcoin data analytics techniques approached Bitcoin data by creating a graph that employs only a single type of node. Such analytic procedures are referred to as transaction and address graph approaches.

In the transaction graph approach, addresses are ignored and edges are created among transaction nodes [12, 32]. Naturally, the transaction graph is acyclic and a transaction node cannot have new edges in the future.

In the address graph approach [36], transactions are ignored and edges are created among address nodes. However, Bitcoin does not store input to output coin flows explicitly – all inputs are gathered at the transaction, and directed to output addresses at once. As a result, inputs of a transaction must be connected to all its output addresses, which may create large cliques if too many addresses are involved in a transaction.

Single node type approaches do not provide a faithful representation of the Blockchain data. The loss of information about addresses or transactions seems to have an impact in predictive models [14, 1]. In contrast, we model the Bitcoin graph as a heterogeneous network with two node types: addresses and transactions.

Bitcoin Graph Model We consider a directed weighted graph (\mathcal{G} = (V, E, B)) created from a set of transactions (TX) and input and output addresses contained in (TX). On (\mathcal{G}), (V) is a set of nodes, and (E \subseteq V \times V) is a set of edges. (B = {\text{Address, Transaction}}) represents the set of node types. For any node (u \in V), it has a node type (\phi(u) \in B). For each edge (e_{u,v} \in E) between adjacent nodes (u) and (v), we have (\phi(u) \neq \phi(v)), and either (\phi(u) = {\text{Transaction}}) or (\phi(v) = {\text{Transaction}}). That is, an edge (e \in E) represents a coin transfer between an address node and a transaction node.

This heterogeneous graph model subsumes the homogeneous case (i.e., (|B| = 1)), where only transaction or address nodes are used, and edges link nodes of the same type. Here, we focus on the case where each address node is linked (i.e., input or output address of a transaction) via a transaction node to another address node. We use (\Gamma^o_u) and (\Gamma^i_u) to refer to predecessors (in-neighbors) and successors (out-neighbors) of an address (u), respectively.

IV. METHODOLOGY

In this paper, we state the following five questions to analyze ransomware behavior on the Bitcoin blockchain: 1. What features extracted from the Bitcoin network can be used to detect ransomware behavior? 2. Does a given ransomware family (e.g., Cryptolocker) show the same behavior on the Bitcoin blockchain over time? 3. How similar is the behavior of different ransomware operators on the Bitcoin blockchain? 4. Can we detect Bitcoin ransom payments that are not reported to law agencies or Blockchain Analytics companies? 5. Based on the information about existing ransomware families at a given time, can we detect the emergence of a new ransomware on the Bitcoin blockchain?

To address questions 1-5, we formulate two primary research problems: i) detecting undisclosed payments to addresses that belong to a known ransomware family and ii) predicting the emergence of a ransomware family unknown to the date. We start by stating the notations used in our problem definitions. Symbols used in this manuscript are given in Tab. [X]

Let ({a_u}_{u \in Z^+}) be a set of addresses, and let each address (a_u) be associated with a pair ((\vec{x}u, y_u)), where (\vec{x}u \in \mathcal{R}^D) is a vector of its features and (y_u) is its label. That is, depending on a setting, (y_u) can designate a white (i.e., non-ransomware) address or a ransomware address. Furthermore, we associate timestamp (t_u) to represent the time when the address (a_u) first appeared in a blockchain transaction. An address can appear in the blockchain multiple times. Let (f_1, \ldots, f_n) be labels of known ransomware families (see Table [X]) which have been observed until time point (t), and let (f_0) be a label of addresses which are not known to belong to any ransomware family and are assumed to be white addresses. Before time point (t), if we observe (l) addresses (a{l+1}, \ldots, a{l+z}), then we form their (D \times l)-matrix of features (X_t = {\vec{x}_1, \ldots, \vec{x}_l}) and a vector of labels (Y_t = {y_1, \ldots, y_l} \in {f_0, f_1, \ldots, f_n}).

We formally define our research problems as follows:

Problem 1 [Existing Family Detection]: Let (rs) be a known ransomware family of interest. Let (Y_t \subseteq Y) be such that (\forall y_j \in Y_t, y_j \in {f_0, f_{rs}}) and (X_t \subseteq X) be the corresponding matrix of features. (If at time point (t), (Y_t \cap {f_{rs}} = \emptyset), increase (t) such that (Y_t) contains at least one (f_{rs}).) Now, let ({a_{l+1}, \ldots, a_{l+z}}) be a set of addresses whose set of labels (Y_t = {y_{l+1}, \ldots, y_{l+z}}) is unknown, and let (X_t = {\vec{x}{l+1}, \ldots, \vec{x}{l+z}}) be a set of their corresponding observed features. Furthermore, let (t’ > t), and (t < \min{t_{a_{l+1}}, \ldots, t_{a_{l+z}}}). The problem is to predict all addresses (a_m \in {a_{l+1}, \ldots, a_{l+z}}) such that (y_m = f_{rs}), using their currently available set of features (X_{t’}) and previous history ((X_t, Y_t)).

Problem 2 [New Family Prediction]: Let (rs’) be a new, yet unobserved ransomware family, and (f_{rs’}) be its label. Let ((X_t, Y_t)) be a pair of the sets of features and labels, respectively, such that at time point (t), (\forall y_j \in Y_t, y_j \neq f_{rs’}).

Now, let ({a_{l+1}, \ldots, a_{l+z}}) be a set of addresses whose set of labels (Y_t = {y_{l+1}, \ldots, y_{l+z}}) is unknown, and let (X_t = {\vec{x}{l+1}, \ldots, \vec{x}{l+z}}) be a set of their corresponding observed features. Furthermore, let (t’ > t), and (t < \min{t_{a_{l+1}}, \ldots, t_{a_{l+z}}}). The problem is to predict all addresses (a_m \in {a_{l+1}, \ldots, a_{l+z}}) such that (y_m \neq f_{rs}) and (a_m) is associated with the new ransomware (rs) (i.e., (y_m = f_{rs})), using their currently available set of features (X_{t’}) and previous history ((X_t, Y_t)).

A. Graph features for classification

On the heterogeneous Bitcoin network, the in-neighbors (\Gamma^o_u) of a transaction (tx_n) is defined as the set of transactions (not addresses) whose one or more outputs are input to transaction (tx_n). The out-neighbors of (tx_n) are denoted as (\Gamma^i_n).

A transaction has inputs and outputs; the sum of output amounts of a transaction (tx_n) is defined as (A(n) = \sum_{a_u \in \Gamma^o_n} A_u(n)), where an output address (a_u) receives (A_u(n)) coins.

On the Bitcoin network, an address may appear multiple times with different inputs and outputs. An address (u) that appears in a transaction at time (t) can be denoted as (a_t). As most addresses appear only once, for sake of simplicity, we omit (t) throughout our notations when it is clear from the context. In order to mine address behavior in time, we divide the Bitcoin network into 24 hour long windows by using the UTC-6 timezone as reference. This window approach serves two purposes. First, the induced 24 hour network allows us to capture how fast a given coin moves in the network. The speed is measured by the number of blocks in a 24 hour window that contain a transaction involving the coin. In maximum, a coin can appear in 144 blocks (24 hours, 6 blocks per hour). Coin speed may disclose certain information on transaction intent. For example, coins that are moved too frequently in the network may be involved in money laundering. Second, the temporal order of transactions within the window helps us distinguish transaction activity from different geolocations. Temporal information of transactions, such as the local time, has been found useful to cluster criminal transactions (see Figure 7 in [25]).

On the heterogeneous Bitcoin network, in each snapshot we extract the following six features for an address: income, neighbors, weight, length, count, loop.

Income of an address $u$ is the total amount of coins output to $u$: $I_u = \sum_{t_3 \in TX_u} A_{t_3}^o (u)$.

Neighbors of an address $u$ is the number of transactions which have $u$ as one of its output addresses: $|\Gamma_u^o|$.

We define the next four address features by using their time ordered position in the defined 24 hour time window. We denote time of a window with the earliest time $t$ of transactions contained in it. For each window, we first locate the set of transactions that do not receive outputs from any earlier transaction within the studied window $t$, i.e., $TX = {\forall tx_n \in TX, s.t., \Gamma^o_n = {a_1^o, \ldots, a_t^o}, t^0 \leq t < t}$. These transactions consume outputs of transactions that have been generated in previous windows. For simplicity, we refer to a transaction $tx \in TX$ as a starter transaction.

Weight of an address $u$, $W_u$, is defined as the sum of fraction of coins that originate from a starter transaction and reach $u$.

Length of an address $u$, $L_u$, is the number of non-starter transactions on its longest chain, where a chain is defined as an acyclic directed path originating from any starter transaction and ending at address $u$. A length of zero implies that the address is an output address of a starter transaction.

Count of an address $u$, $C_u$, is the number of starter transactions which are connected to $u$ through a chain, where a chain is defined as an acyclic directed path originating from any starter transaction and ending at address $u$.

Loop of an address $u$, $Q_u$, is the number of starter transactions which are connected to $u$ with more than one directed path.

Rationale: Our graph features are designed to quantify specific transaction patterns.

Loop is intended to count how many transaction i) split their coins; ii) move these coins in the network by using different paths and finally, and iii) merge them in a single address. Coins at this final address can then be sold and converted to fiat currency (see Figure 7 in [25] for examples of such patterns).

Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address (see aggregations in Figure 1 of [17] for a potential application of this pattern).

Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount (what percent of these transactions’ output?) of transactions.

Length is designed to quantify mixing rounds [24] on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin (see the mixing rounds in Figure 2 of [33]).

Example 1 (Graph Features): Consider the toy network in Figure 1. We begin from defining the starter transactions $TX = {tx_1, tx_3, tx_4, tx_5}$ as in this specific window, they receive no coins from an earlier transaction.

We assign each of the four transactions a weight of 1, regardless of their output amounts $A^o(\cdot)$. Addresses $a_1$ and $a_2$ receive $W_{a_1} = 1/|\Gamma_{tx_1}^o| = 1/2$ and $W_{a_2} = 1/|\Gamma_{tx_1}^o| = 1/2$ from transaction $tx_1$, respectively. Address $a_9$ receives $1/2$ of the weight from $a_1$. Hence, $W_{a_9} = (1/2) \times (1/2)$. Consider $a_{10}$ which receives weights from $tx_3$ and $tx_4$. Weight of $tx_3$ flows through $a_5$ and $a_9$, whereas the weight of $tx_4$ flows through $a_7$ and $tx_7$ to reach $a_{10}$. Hence, $W_{a_{10}} = W_{a_9} + W_{a_5} + W_{a_6} = (1/2) + [1/2 + 1/2]$.

Length of $a_1$ and $a_2$ is 0, but $L_{a_3} = L_{a_4} = 1$, since $a_4$ and $a_5$ can be reached from $tx_1$ through a non starter transaction $tx_2$. Furthermore, $a_{10}$ can be reached from $tx_3$ or $tx_4$, hence $L_{a_{10}} = 1$. Although $a_8$ and $a_9$ appear later than $a_3$ and $a_4$, lengths of $a_8$ and $a_9$ are shorter, because these addresses are direct outputs of the starter transaction $tx_5$.

The address $a_{10}$ has three chains: i) $tx_3 \rightarrow a_5 \rightarrow tx_6 \rightarrow a_{10}$, ii) $tx_4 \rightarrow a_6 \rightarrow tx_6 \rightarrow a_{10}$ and $tx_4 \rightarrow a_7 \rightarrow tx_7 \rightarrow a_{10}$, but these chains start from three starters only (i.e., $tx_3$ and $tx_4$). Hence, its count $C_{a_{10}} = 2$.

Fig. 1: A toy network of 10 addresses and 7 transactions. Dashed edges indicate transaction outputs from earlier windows; $t_1$, $t_3$, $t_4$, and $t_5$ are starter transactions. Coin amounts are shown on edges. Transaction outputs are equal to transaction inputs, i.e., transaction fees are 0.

Transaction ( tx_4 ) has two chains that reach ( a_{10} ) through ( tx_6 ) and ( tx_7 ), separately. These two chains form a loop from ( tx_4 ) to ( a_{10} ). Hence, ( O_{a_{10}} = \lvert {tx_4} \rvert = 1 ). All other nodes have zero loops.

Feature Standardization. In data pre-processing, we use feature standardization to make the values of each feature have zero-mean and unit-variance. We first compute mean and standard deviation for each feature and subtract the mean from each feature value. Then we divide each feature value by its standard deviation.

B. Ransomware address classification and clustering

Before detailing our TDA model, in this section we first outline four classification and clustering models that we employ in this paper.

Similarity Search:

We use addresses contained in a specific time window ( t ), and compute pairwise similarity to known ransomware addresses from the last ( l ) days.

Heuristics

Following two heuristics that are defined by Meiklejohn et al. [26] are used in our experimental evaluation.

Co-spending: “If two addresses are inputs to the same transaction, they are controlled by the same user”.

Transition: “If we observe one transaction with addresses A and B as inputs, and another with addresses B and C as inputs, then we conclude that A, B, and C all belonged to the same user”.

We did not implement the change heuristic proposed in [26], because addresses discovered by the change heuristic are not guaranteed to belong to the same user.

Unsupervised approaches

DBSCAN density clustering: Density-based spatial clustering of applications with noise (DBSCAN) is a density-based non-parametric clustering algorithm. DBSCAN can mark outlier points that lie alone in low-density regions as noise [11].

Hierarchical clustering: We use k-means clustering with Forgy algorithm based initial seed selection [13].

Tree based approaches

Random forest is a supervised ensemble of multiple simple decision trees to estimate the dependent variables of the data [16].

C. Topological Data Analysis Models: TDA Mapper

In this project we introduce the concepts of topological data analysis (TDA) into detection of ransomware patterns on Blockchain. The fundamental idea of TDA is to extract hidden data patterns via systematic analysis of data shapes such as, cycles and flares, quantified at various resolution scales [5], [21].

TDA offers the following multi-fold benefits which are particularly useful in the context of ransomware detection. First, TDA assesses data shapes in a coordinate-free manner, which implies that we can systematically compare patterns of data obtained under various data collection frameworks. Second, TDA analyzes properties of data shapes that are robust to minor data perturbations. Hence, TDA tends to deliver more consistent results on hidden data patterns even under noisy data collection schemes which is a typical scenario in money laundering studies. Finally, TDA offers a low dimensional description of some key properties of the high-dimensional data by using a finite combinatorial object for systematic extraction of data shape patterns. In this paper, we employ the Mapper method of [35], [5] which is a highly customizable TDA tool allowing for a systematic multi-resolution glimpse into organization and functionality behind the underlying data generating process. By complementing more traditional clustering and projection pursuit approaches with a systematic insight on data geometry and topology, Mapper often recovers hidden data patterns that are otherwise inaccessible with conventional data analytic techniques.

The key idea behind Mapper is the following. Let ( U ) be a total number of observed addressed and ( {\vec{x}u}{u=1}^U \in R^D ) be a data cloud of address features. Select a filter function ( \xi : {\vec{x}u}{u=1}^U \rightarrow R ).

Let ( I ) be the range of ( \xi ), that is, ( I = [m, M] \in \mathbb{R} ), where ( m = \min \xi(\vec{x}_u) ) and ( M = \max \xi(\vec{x}_u) ). Now place data into overlapping bins by dividing the range ( I ) into a set ( S ) of smaller overlapping intervals of uniform length, and let ( u_j = {u : \xi(\vec{x}u) \in I_j} ) be addresses corresponding to features in the interval ( I_j \in S ). For each ( u_j ) perform a single linkage clustering to form clusters ( {u{jk}} ).

To find the number of clusters, Mapper analyzes empirical distribution of edge lengths at which each cluster is merged based on the rationale that internal distances (i.e., within a cluster) are expected to be lower than external distances (i.e., in-between clusters) and distributions of internal and external distances are disjoint. Let ( {u_{jk}} ) be addressed in the ( k )-th cluster of the ( j )-th interval. View each cluster as a node and draw an edge between two nodes ( k ) and ( m ) if clusters ( {u_{jk}} ) and ( {u_{lm}} ) contain overlapping addresses, i.e., ( {u_{jk}} \cap {u_{lm}} \neq \emptyset ).

As a result, Mapper produces a low dimensional representation of the underlying data structure in the form of “cluster tree” graph ( CT ) where each “cluster” is a branch of some single connected component rather than a disconnected component on its own as in conventional clustering analysis. In Fig. 2, we show an example of the produced Mapper graph. Each node may contain three sets of addresses: past ransomware addresses, past non-ransomware addresses, and addresses of the current time window, whose labels are unknown. If current addresses are contained in clusters that also contain many We filter the TDA mapper graph of using each of our six graph features. As a result, we obtain six filtered graphs $CT_1, \ldots, CT_6$ for each time window. Afterwards, we assign a suspicion, or risk score to an address $a_u$, as outlined in Algorithm 1:

Algorithm 1: TDA filtering with multiple attributes.

Input: A set of networks $CT \in {CT_1, \ldots, CT_6}$; filter threshold $\epsilon_1$; inclusion threshold $\epsilon_2$; set of past ransomware addresses $RS$; set of past non-ransomware addresses $NRS$.

Output: A set of suspicious addresses.

$P : Map \leftarrow$ Initialize scores of all $l$ addresses with 0;

foreach network $CT \in {CT_1, \ldots, CT_6}$ do

  • $A_c \leftarrow$ select all addresses in $C_c$;
  • $V \leftarrow A_c \cap RS$;
  • if $|V| \geq \epsilon_1 \times |RS|$ then
  • $\text{if } |A_c| \leq \epsilon_2 \times |CT, V|$ then
  • $P_u \leftarrow 1 + P_u$;
  • $q_t \leftarrow \text{quantile}(P, q)$;
  • $S = {\forall a_u \in P | P_u \geq q_t}$;
  • return $S$;

Ransomware addresses in each cluster. If both inclusion and size thresholds, $\epsilon_1$ and $\epsilon_2$, respectively, are satisfied, addresses in the cluster have their suspicion scores incremented.

Parameters. We use two parameters to control what we learn from mapper clusters: inclusion and size parameters. The inclusion parameter $\epsilon_1$ limits what can be learned when very few ransomware addresses are contained in the cluster. The size threshold $\epsilon_2$ prevents learning when cluster includes too many addresses. Such phenomenon usually happens if a filtering feature does not exhibit a sufficiently discriminating performance during a specific time window, and all addresses are lumped together. For example, in Figure 2 the largest cluster contains 78 past ransomware addresses. Given this finding and the cluster size of 537 addresses, the risk scores of the remaining 459 addresses are to be elevated – that is, the procedure results in testing of too many addresses as potentially suspicious. We further use a quantile threshold $q$ on addresses, and label addresses suspicious only if they are in the top $1 – q$ of all addresses. We emphasize that by controlling $q$, $\epsilon_1$ and $\epsilon_2$ parameters, we can avoid making predictions when evidence of past ransomware is not sufficiently strong.

We will denote TDA models with the $TDA_{q, \epsilon_1, \epsilon_2}$ notation. TDA models may deliver nested results; for example $TDA_{0.5, 0.1}$ may return the same set of suspicious addresses as $TDA_{0.7, 0.2}$ results. For such cases, we prefer the most restrictive model; i.e., the model with the highest $q$, highest $\epsilon_1$ and lowest $\epsilon_2$.

In summary, the key benefit of the Mapper approach is its capability to recover hidden similarities (i.e., connections) between these “clusters”, or groups of addresses, that are typically unavailable with traditional clustering techniques.

V. Ransomware Behavior

Dataset. We have downloaded and parsed the entire Bitcoin transaction graph from 2009 January to 2018 December. Using a time interval of 24 hours, we extracted daily transactions on the network and formed the Bitcoin graph. We filtered out the network edges that transfer less than $B0.3$, since ransom amounts are rarely below this threshold. As mentioned in Sec. VII, in final verification we used the full graph to discover smaller amounts that we may have missed due to this filtering.

Our ransomware dataset is a union of datasets from three widely adopted studies: Montreal [30], Princeton [17] and Padua [8]. The combined dataset contains 24,486 addresses.

from 27 ransomware families. From these addresses, we extract the features listed in Tab. XI. The column $N$ shows the number of times addresses from a given family appear in the Bitcoin blockchain. The unique column gives the number of unique addresses for each family. In 24 ransomware families, at least one address appears in more than one 24-hour time window. Fig. 3 shows a histogram of appearances. CryptoLocker has 13 addresses that appear more than 100 times each. The CryptoLocker address ILz6hvEtahH6LZwM4WWhfgyo6zRJvQuqT appears for a maximum of 420 times. Four addresses have conflicting ransomware labels between Montreal and Padua datasets. APT (Montreal) and Jigsaw (Padua) ransomware families have two and one multisig addresses, respectively. All other addresses are ordinary addresses that start with ‘1’.

A. A summary on ransomware features

LengthWeightNeighborCountLoopIncome#Address Rank
00.5210B1327
00.5210B1.2250
01.0210B1189
01.0110B0.5178
00.5210B0.8160
01.0110B1146
01.0210B1.2127
00.5210B1.25119
00.5110B0.5118
01.0110B2117

We compute six feature values of 48,168 ransomware address appearances. We find 30K unique feature values (i.e., unique rows of data), which we term patterns. Table lists the top 10 most frequent patterns in ransomware address features. Here, 54 of the top 100 most frequent patterns of all addresses are also found in the top 100 of ransomware patterns. The Rank column shows the rank of a pattern in all transactions. We find that four of the top-10 ransomware address patterns do not appear in the top-100 patterns of all addresses. Furthermore, length=0 implies that ransomware addresses receive payments from starter transactions; in the given window, such addresses have no past history. Finally, weight 0.5 implies that the starter transaction pays the ransomware address, and uses a change address. We also find that a ransomware address typically receives all coins of the transaction, i.e., weight=1.0.

Table XI shows mean values of the graph features for ransomware addresses. CryptoLocker, CryptXXX, Locky, Cerber, DMALockerv3 and CryptoWall have more than 100 unique addresses. The Razy ransomware exhibits an outlier behavior in five features. In weight, only DMALocker has a value > 1. WannaCry family is an outlier in both length and loop features. Figure 3 depicts distributions of two features, count and loop. When compared to non-ransomware addresses, ransomware addresses exhibit more profound right skewness in distributions of feature values.

B. Feature Relevance for ransomware behavior

We compute feature values for each appearance of ransomware addresses, and visualize the six-dimensional feature data in Fig. 5 for the six largest families. Each data point is given a location in a two-dimensional map by the t-stochastic neighborhood embedding [22]. We show results for three perplexity values; perplexity “can be interpreted as a smooth measure of the effective number of neighbors” [22]. Distances and sizes may not be meaningful in TSNE, however, groups of data points exhibit interesting patterns. Each family can be seen to have small groups of co-clustering addresses, implying that a few addresses have similar features. As perplexity increases, we find that addresses tend to cluster together, with DMALocker and CryptXXX addresses appearing at the center. CryptoWall and CryptoLocker addresses appear together in many groups. The most important insight to be gained from TSNE results is that addresses from a ransomware family do not all exhibit the same behavior – each family have a multitude of patterns that repeat across addresses.

C. Ransomware behavior similarity

We use ransomware labels as external ground truth, and cluster ransomware addresses in up to 20,000 clusters. We compute cluster purity for each family, which we define as the percentage of addresses from a family appearing in clusters where every address in the cluster belongs to the same ransomware family. In computing purity, we only consider clusters that contain more than one data point. Fig. 6 depicts purity for the eight biggest ransomware families by address count; Cerber CryptoWall and CryptoLocker reach almost 40% purity. As in Fig. 6, we find an optimal number of $k = 12500$ clusters.

Next, we analyze the resulting clusters for relationships between ransomware families. In Tab. II, we show the number of clusters that are shared by ransomware pairs. CryptoWall addresses appear in 1152 pure clusters (all addresses belong to CryptoWall). CryptoLocker and CryptoWall addresses appear in 2587 impure clusters. In Tab. III, we further look into these shared clusters and count how many addresses are co-clustered in families. We see that 5122 CryptoLocker addresses (1st row, 2nd column) are clustered with 4826 CryptoWall addresses (2nd row, 1st column). From Tab. III, we again see that CryptXXX more often co-clusters with CryptoLocker. The case of CryptoLocker-CryptoWall is intriguing; CryptoLocker addresses cluster more often with CryptoWall addresses than with other CryptoLocker addresses.

In August 2014, the CryptoLocker ransomware was taken down by Operation Tovar: a consortium constituting FBI, Interpol, security software vendors and several universities. Two months earlier, the CryptoWall ransomware had been discovered. CryptoLocker spread through email attachments, whereas in addition to attachments, CryptoWall uses kits hosted on compromised websites or malicious ads. CryptXXX appeared in April 19, 2016.

Clustering uses address features that encode transaction behaviors of two sets of users: initial ransom payers, and hackers that move paid ransom amounts on the network. As initial payers are unknown to each other, we do not expect them to exhibit similar behavior. We believe that the similarity implied by the Fig. 4: Distributions of feature values for ransomware and non-ransomware addresses. Ransomware addresses are skewed for both features. The skew also exists in weight and length features (not shown here).

Fig. 5: Two dimensional t-Stochastic Neighbor embeddings for addresses from six ransomware families. The perplexity can be interpreted as a measure of the effective number of neighbors to be considered for each point.

Fig. 6: Clustering ransomware addresses by using six features.

clustering results can only be due to shared hacker behavior. As such, CryptoLocker, CryptoWall and CryptXXX may be created and run by the same hackers. As a supporting evidence, for CryptXXX, the security company Symantec warns that “Definitions (of the detection tool) prior to September, 2016 may detect this threat as Trojan.CryptoLocker.AN”) [38] (We could not find such a note for other ransomware families).

D. Address features over time

In our detection and prediction problems, we assume that ransomware addresses exhibit similar feature patterns in time, and we can learn to identify these patterns. However, the TSNE results in Sec. V-B show that a global behavior may not exist.

CryptoCryptoCryptNoobDMA
WallLockerCerberLockyXXX
115225879761403374
6901220841379142
6474935621820
2962668755
49245
628
2

In fact, when we compute ransomware patterns, Tab. IV shows that only 10 families have feature patterns that are repeated in time (see Tab. I patterns). For example, Cerber has 3491 addresses use 564 unique patterns more than once. Some of

CryptoCryptoCryptNoobDMA
WallLockerCerberLockyXXX
3145512217373015624
4826176623801843618
34152285188597631696
407730231512832460
102410771467355155
344164189525
14010721616

In fact, when we compute ransomware patterns, Tab. IV shows that only 10 families have feature patterns that are repeated in time (see Tab. I patterns). For example, Cerber has 3491 addresses use 564 unique patterns more than once. Some of these pattern repeating addresses appear in hundreds of time windows: the address tLk3ly/5820yGj3klhRHp0q+6ZBOqZtP/ appears in 420 windows. We cluster it in 66 clusters multiple times (as given in Sec. V-C); in one cluster it appears six times. Overall, 100 addresses from CryptoLocker, CryptoWall, NoobCrypt, DMA Locker v3, Globe Imposter, DMA Locker and Crypto XXX families have repeating patterns in time. These results offer evidence that although we cannot talk of a general behavior, small groups of addresses from at least 10 families can be used to discover other ransom addresses.

TABLE IV: Repeating patterns in ransomware transactions.

RansomwareUnique pattern# addresses
Cerber5643491
Locky4032704
CryptoWall3891406
CryptoLocker2481201
CryptXXX1411018
DMA Locker v31956
Globe v337
DMA Locker36
Globe Imposter12
NoobCrypt12

VI. RANSOMWARE DETECTION AND PREDICTION

In this section we first discuss two factors that we account for in training our models.

Class imbalance. Compared to 24,486 known ransomware addresses from 2013 to 2018, there exist 200K daily transactions on average on the Bitcoin network, where each transaction contains an average of 5 input and output addresses. As a result, we observe a large number of ( f_0 ) (i.e, non-ransomware) addresses, leading to a class imbalance problem in classification. To reduce class imbalance, we train on a limited number of addresses; that is, we employ ( N \in {300, 600, 1000} ) samples of ( f_0 ) and ( f_1, \ldots, f_n ) (i.e., ransomware) addresses each. Our choice is due to the fact that on most days we observe much fewer than 100 unique ransomware addresses.

Metrics. For each model and ransomware family, we compute precision and recall by using overall sums of TN, FN, FP and TP values across multiple time windows. Precision and recall are computed as ( \frac{TP}{TP + FP} ) and ( \frac{TP}{TP + FN} ), respectively. As we observe considerably more non-ransomware addresses, we identify the positive likelihood ratio (( PLR = \frac{TP}{FP} )) as an important metric, since PLR quantifies the effort needed for analyzing the blockchain by using model prediction results. For example, a ( PLR ) of 1 implies that for every TP address, the analyst has to manually analyze one FP address, unnecessarily.

Parameters. In all models, we report the optimal parameters that maximize F1 scores in predictions as follows: In DB-SCAN, we experimented with ( \epsilon = 0.05, \ldots, 1 ) values. Random Forest uses ( ntree=500 ) and ( mtry=|X_t|/3 ). XGBoost uses the gtree booster and rounds = 25. For TDA computations, we use the TDAMapper RStats package [https://github.com/paulpearson/TDAMapper] with parameters overlap=40, interval = 80.

A. Existing family detection

We outline our experimental settings for the detection problem as follows:

  1. Select a ransomware family ( rs ) whose new addresses will be detected at time ( t’ ).
  2. For ( t < t’ ), use a training length ( l ), and create a dataset ( X_t ) which holds features and labels of addresses observed between times ( t – l ) and ( t ).
  3. Create an ( f_0 ) sample of size ( N ) from ( X_{0[t-l,t]}^n ) without replacement where ( \forall x_u \in X_{0[t-l,t]}, y_u = f_0 ) and ( N = \frac{|X_{0[t-l,t]}|}{|X_{0[t-l,t]}|} ).
  4. Create a ransomware sample of size ( N ) from ( X_{rs[t-l,t]} ) without replacement where ( \forall x_u \in X_{rs[t-l,t]}, y_u = f_{rs} ) and ( N \leq \frac{|X_{rs[t-l,t]}|}{|X_{rs[t-l,t]}|} ).
  5. Using the ground truth data at ( t’ ), find all ransomware addresses for ( t’ ); ( X_{t’}^{rs} ).
  6. Using the ground truth data at ( t’ ), take a sample of ( M = 1000 ) white (i.e., ( f_0 )) addresses without replacement: ( X_{t’}^{0} ).
  7. Remove past known addresses from ( X_{t’}^{rs} ) , i.e., ( X_{t’}^{rs} \leftarrow X_{t’}^{rs} \setminus X_{t’}^{0} ).
  8. Use features ( {X_{0[t-l,t]}^n \cup X_{rs[t-l,t]} } ) and labels ( {Y_{0[t-l,t]}^n \cup Y_{rs[t-l,t]} } ) as the training data, and classify ( {X_{t’}^{0} \cup X_{t’}^{rs} } ).

We emphasize four aspects of existing family detection: i) [Step 7]: from the test dataset we remove appearances of addresses that have appeared in the past (i.e., ( t < t’ )), because their labels are already known. ii) [Step 4]: If an address appears in multiple windows, its each appearance will have (potentially) different features in ( X_{t-l,t} ) with the same ( rs ) label. iii) [Step 7]: On many days, we will not have ( N ) past ( rs ) addresses to train from. iv) Most importantly, we learn a model for each ransomware family. In our experiments, we show that these models do not share the same characteristics.

Heuristics. By employing the co-spending and spending heuristics with all past history (i.e., ( N = |X_t| ), and ( l = \infty )), we discover only 40 unique addresses from Crypto Locker (Padua), Crypto Wall (Padua), Crypto Tor Locker 2015 (Montreal), Crypto Tor Locker 2015 (Padua) families.

Naive similarity search. An interesting benchmark for detecting undisclosed payments from existing families is to compute the similarity of ( X_t ) addresses to the past addresses in ( X_t ). If existing families exhibit repeating patterns over time, the similarity search can match new addresses to known ransom addresses. In fact, Fig. 7 shows that this strategy may be effective. By using exact matches to known ransom patterns of the past, in six days similarity search reveals more than 50 addresses each day, while predicting a maximum of 73 false Fig. 7: Ransomware detection with exact feature matches.

positives on 2013 day 336. These results offer supporting evidence of utility of our features. However, this naive approach creates 21,371 FP addresses overall, which makes it unfeasible for operational use by security analysts.

RS MethodlNTPFPFNTN#wPrecRecF1PLR
Locky TDA240 3004512350508221110.1610.9000.2730.192
COSINE90 30023954168139901463691940.0540.3750.0950.057
Crypto TDA240 600217308715511200150.0660.5830.1180.070
Wall DBSCAN240 6007281896079416913590.0370.4780.0690.038
Crypto TDA240 300439968621222129340.0430.6740.0810.045
Locker DBSCAN60 3009354277129511316670.0210.7600.0420.022
Cerber TDA120 300187517445923027290.0350.2890.0620.036
XGBOOST240 600160664730772793741694360.0330.1810.0560.034
Crypt TDA90 30077246027111057140.0300.2210.0530.031
XXX COSINE30 6005892087261042952650.0270.4910.0520.028

Table V shows the main results of our models. For each ransomware family, TDA has a hyper-parameter set that produces the best model. However, these parameter values are not the same across families. Sample size (N) and training length (l) parameters are different as well. For each family, we also provide the best non-TDA model for comparison. The Locky ransomware has the best results with a precision of 0.161 and recall of 0.9. In general, models yield better recall than precision. In Table V, #w is the number of windows where a model makes at least one label prediction. By using the q, ϵ1 hyper-parameters, TDA models avoid predicting labels when level of confidence in the derived classification is low.

Similar to TDA, DBSCAN can ignore data points in clustering; two of the best non-TDA results are delivered by DBSCAN models. In the best TDA models for each ransomware family, we predict 16.59 false positives for each true positive. In turn, this number is 27.44 for the best non-TDA models.

As TDA models appear to be sensitive to the selected hyper-parameters, we turn our attention to these values. For increasing threshold values, recall decreases from 0.53 (q = 0.0) to 0.37 (q = 0.9), but precision and PLR values do not change. In Figure 8, we present model performance for ϵ1 (i.e., threshold on how many past ransomware addresses should be contained in the cluster) and ϵ2 (i.e., threshold on at most how many addresses the cluster can contain). As Figure 8 indicates, TDA models tend to be more sensitive to ϵ2 values. By limiting ϵ2 to small values, we reach higher precision on average. However, this leads to fewer predictions, as ϵ2 defines at most how many data points can be grouped together (see sec. IV-B).

Our results indicate that making predictions in every time window may not be desirable, as it leads to lower F1 values. Both TDA and DBSCAN can ignore some data points, thereby making predictions only when enough data exists. As we can control the sensitivity of TDA models through hyper-parameter selection, a natural question is to ask whether more restrictive settings should be preferred in TDA. Figure 9 depicts TDA performance when settings allow for lower (i.e., fewer windows) or higher (i.e., more windows) sensitivity. Precision, recall and F1 values are averaged for TDA models. We find that less sensitive models yield higher precision, but lower recall values. Cerber results show the greatest change in terms of sensitivity, whereas Locky has the best and most stable results. For some families, such as CryptXXX, the least sensitive models deliver the highest precision values. These findings imply that some TDA models are able to make predictions for very few windows only, but the results can be very accurate.

B. New Family Prediction

We outline our experimental settings for the detection problem as follows. Given features X_t and known labels Y_t of past addresses at time t and features of addresses observed between windows t − l and t:

By using the ground truth data at t′, take a sample of M=1000 white addresses without replacement to be used in

For t < t′, use a training length l, and create a dataset X_{t-l,t} ⊆ X_t which holds features of addresses observed between windows t − l and t.

Create an f_0 sample of size N from X_{t-l,t}^0 ⊆ X_{t-l,t} without replacement where ∀x_u ∈ X_{t-l,t}^0, y_u = f_0 and N = ⌊X_{t-l,t}^0⌋.

Create a ransomware sample of size N from X_{t-l,t}^rs ⊆ X_{t-l,t} without replacement where ∀x_u ∈ X_{t-l,t}^rs, y_u ≠ f_0 and N = ⌊X_{t-l,t}^rs⌋.

Relabel all addresses in Y_{t-l,t}^{rs} with the label f_r.


Useful information for enthusiasts:

Contact me via Telegram: @ExploitDarlenePRO