How I Checked Over 1 Trillion Mnemonics in 30 Hours to Win Bitcoin

09.02.2024

The story began when Alistair Milne tweeted that he planned to give away 1 Bitcoin, held in a wallet created from a 12-word mnemonic, to the first person who could crack the mnemonic.

– It is known that for 8 known words there are 2⁴⁰ (~ 1.1 trillion) possible mnemonics.
– To test one mnemonic, we must generate a seed from the mnemonic, a master private key from the seed, and an address from the master private key.
OK. Sounds easy. Let’s get down to business.
– I wrote a CPU version in Rust to evaluate the performance of a CPU-based solver. My MacBook could only check ~1,250 mnemonics per second, which means it would take about 25 years to check all 2⁴⁰ possible mnemonics.
– I ported all the necessary code for generating and validating mnemonics (SHA-256, SHA-512, RIPEMD-160, EC Addition, EC Multiplication) to OpenCL C, a programming language for running code on the GPU.
– The GPU version could check ~143,000 mnemonics per second, meaning it would take ~83 days to check all 2⁴⁰ possible mnemonics.
– I wrote a server application that distributes the work in batches of ~16 million mnemonics to a pool of GPU workers. Each GPU worker requests the next batch of work from the server, performs the work, and writes the result back to the server.
– I spent ~$350 renting GPUs from Vast.ai (plus ~$75 of free credit from Azure).
– I was worried about other people doing the same thing, so I included a huge 0.01 BTC fee in the pre-arranged transaction.
I didn’t think even that would be enough, and thought there might be a “race to zero” where people would continually increase fees, trying to force miners to include their particular transaction in the next block.

And now, in more detail, how it all happened:

Alistair Milne tweeted that he planned to give away 1 bitcoin in a wallet created using a 12-word mnemonic.
I assumed this meant it was generated as a BIP-39 mnemonic, and this was later confirmed when he provided the first few words and, eventually, confirmed in a tweet that the entire list was made up of BIP-39 words.
The BIP-39 mnemonic is generated using words from a fixed list of 2048 potential words. Each word is represented by an integer from 0 to 2047, corresponding to its position in the list. In binary, you need 11 bits to represent a number up to 2047.
Therefore, to represent a 12-word mnemonic you need 12 * 11 or 132 bits. BIP-39 uses 128 bits of randomness and then uses SHA-256 to generate a 4-bit checksum that is appended to the end of the original 128 bits, producing the 132 bits required for 12 words.
This means that if we want to enumerate all possible 12-word mnemonics, we need to count from 0 to 2¹²⁸-1, where each number can be interpreted as a single mnemonic, then converted to an address, and then compared with the address containing 1 BTC. At first glance this seems quite simple, but we quickly realize that this number is far too large for simple brute force.
Luckily, we don’t actually have to test all 2¹²⁸ possibilities, because Alistair was going to release one seed word every couple of days. Each seed word we collect reduces the number of candidates we need to test by a factor of 2¹¹ or 2048.
He also mentioned that he was going to release the last 3 or 4 words all at once to prevent brute force (or so he thought). That meant that for my plan to work, I would need to be able to brute-force at least the last 4 words, and do it quickly enough to claim the prize.

With 8 words, this means we know 8 * 11 or 88 bits out of the 128 bits we are trying to calculate. This means that there are “only” 2^(128–88) or 2⁴⁰ possible mnemonics that we would need to test. That’s 1,099,511,627,776 or roughly 1.1 trillion possible mnemonics.
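The arithmetic works out as follows (a trivial check, in Python):

```python
# Search space for a 12-word BIP-39 mnemonic when 8 words are known.
ENTROPY_BITS = 128            # randomness behind a 12-word mnemonic
BITS_PER_WORD = 11            # each word indexes a 2048-entry list
known_words = 8

known_bits = known_words * BITS_PER_WORD        # 88 bits pinned down
candidates = 2 ** (ENTROPY_BITS - known_bits)   # 2**40 left to test

print(candidates)  # 1099511627776 (~1.1 trillion)
```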
I wasn’t sure how long it would take to try 1 trillion possibilities, so I wrote a quick program to run a basic test, so I’d have some idea of what I was dealing with and whether I thought it was even possible.

The strategy I was going to use was to calculate the start and end of the number range I needed to iterate over, based on the set of known input words. For each number, I calculated the address corresponding to that number and then checked whether it was the address that contained 1 BTC. If it was the right address, I would create and broadcast a transaction transferring the funds to a wallet I controlled.
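As a sketch, that range calculation might look like this in Python (using the first 8 word indices from the worked example later in the article; this is illustrative, not the author's actual Rust code):

```python
def search_range(known_indices):
    """Range of 128-bit entropy values to enumerate, given the indices
    of the leading known words of a 12-word BIP-39 mnemonic."""
    prefix = 0
    for idx in known_indices:
        prefix = (prefix << 11) | idx        # pack 11 bits per word
    unknown_bits = 128 - 11 * len(known_indices)
    start = prefix << unknown_bits
    end = start + (1 << unknown_bits)        # exclusive upper bound
    return start, end

# 8 known words => 2**40 candidates in [start, end)
start, end = search_range([206, 488, 1799, 1329, 908, 1161, 1961, 168])
print(end - start)  # 1099511627776
```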

The first version I wrote in Rust used existing libraries (rust-bitcoin, rust-wallet and ring) to handle all the hashing and elliptic curve math. Running some quick tests showed that using the CPU for this was not practical.

My laptop (dual-core 2.5 GHz Intel Core i7) could only do ~1,250 mnemonic-to-address checks per second, or about 108,000,000 per day. This means it would take my processor about 25 years to generate and test the 1 trillion possibilities needed to crack the mnemonic knowing only 8 words.
To achieve this in 1 day, I would need to improve performance by about 9000 times the current speed.

My next attempt was to rent a more powerful machine to see how fast the same CPU-only version could run. I rented a 32-core CPU-optimized machine from Digital Ocean and was able to benchmark ~8,000 checks per second, which is only ~6x better than my laptop.
To finish in a day, I would still need about a 1000x increase in performance. I didn’t think I could get close to this by trying to optimize the CPU code, since it was probably already pretty well optimized in the existing libraries I was using.

Enter the GPU
Over the past decade, there has been an increase in the use of GPUs to perform general-purpose programming. In Bitcoin, we saw this relatively early when people started using GPUs to perform the operations required for mining.
So how exactly does a GPU help us solve this problem faster? It turns out that a single GPU core is actually slower than its CPU counterpart when used for general purpose programming. Performance gains are usually seen when you can effectively parallelize a program. This is because a single GPU device typically has thousands of cores that you can use for your computing.
Luckily for us, our task parallelizes very well. Each of the 2⁴⁰ numbers we want to test requires the same calculation (number -> mnemonic -> seed -> master private key -> address). This means we can give each GPU core one candidate number and run thousands of candidates in parallel.
I quickly learned about OpenCL, an open standard for writing software that will run on virtually any device with a GPU. OpenCL C is very similar to the C programming language, with some differences. One of these differences is how memory works. There are four main types of memory available on a GPU (global, constant, local, and private). Global memory is shared across all GPU cores and is very slow to access; you want to minimize its usage as much as possible. Constant and private memory are very fast, but limited in size. I believe most devices support only 64 KB of constant memory. Local memory is shared by a work-group of workers, and its speed is somewhere between global and constant.
My goal was to fit everything I needed into the 64 KB of constant memory and never read from global or local memory, to maximize program speed. This turned out to be a little tricky, because the standard secp256k1 precomputed multiplication table alone was exactly 64 KB. Luckily, I was able to precompute a smaller table that was only 32 KB, but ran about 75% slower than the full table. The BIP-39 wordlist took up another ~20 KB, and the SHA-2 constants took up another ~6 KB, so I had already used ~58 KB of the 64 KB available to me. This left me with about 6 KB of wiggle room to work with. Ideally, I could do all the calculations on the GPU. This meant number -> mnemonic -> seed -> master private key -> address all computed by the GPU, written entirely in OpenCL.
What exactly needs to be implemented to handle each of these steps? Let’s dive into each step we need to take to get from a number to a Bitcoin address. If you’re not interested in these details, just skip ahead to the Implementation in OpenCL section.

Converting a Number to a 12-Word BIP-39 Mnemonic
Let’s look at an example of how to convert a number into a 12-word mnemonic.
First, let’s start with a really big number:
34,267,283,446,455,273,173,114,040,093,663,453,845
From here we need to convert this number into its 128-bit binary representation:

00011001110001111010001110000011110100110001011100011001001000100111110101001000101010000011111100000111001010100110001010010101

We can get the last 4 bits (the checksum) by computing the SHA-256 hash of this value and taking the first 4 bits of the result. In this case, we get a checksum of 0101.
Now we add the checksum to the end and split our 132 bits into groups of 11 bits:
| 00011001110 | 00111101000 | 11100000111 | 10100110001 | 01110001100 | 10010001001 |
| 11110101001 | 00010101000 | 00111111000 | 00111001010 | 10011000101 | 00101010101 |

We then convert each group of 11 bits into a number representing the index:
| 206 | 488 | 1799 | 1329 | 908 | 1161 | 1961 | 168 | 504 | 458 | 1221 | 341 |
Finally, we use them as indexes into the English BIP-39 word list to find each corresponding word:
border dial thought plastic huge bun bright bench disease deer obvious click

That’s how we can map any number into a 12-word mnemonic. This step costs us only 1 SHA-256 computation.
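This mapping is easy to reproduce with Python's standard library (an illustrative sketch using the example number above; the 2048-word list lookup is omitted):

```python
import hashlib

# 128 bits of entropy from the worked example above
entropy = 34267283446455273173114040093663453845

# 4-bit checksum = first 4 bits of SHA-256 of the 16 entropy bytes
checksum = hashlib.sha256(entropy.to_bytes(16, "big")).digest()[0] >> 4

# Append the checksum to get 132 bits, then split into 12 x 11-bit indices
bits = (entropy << 4) | checksum
indices = [(bits >> (11 * i)) & 0x7FF for i in reversed(range(12))]
print(indices[0])  # 206 -> "border"
```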


Mnemonic to Seed

The next step is to take this 132-bit mnemonic string and use it to generate a 64-byte binary seed. How do we expand a 132-bit string into 64 bytes?
BIP-39 does this by using a password-based key derivation function (PBKDF2) with HMAC-SHA512 as the hash function, the string “mnemonic” (plus an optional passphrase) as the salt, and the 12-word mnemonic as the password. It uses 2048 iterations, and each iteration requires two SHA-512 calculations. This means that this step costs a total of ~4096 SHA-512 computations.
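This step maps directly onto `hashlib.pbkdf2_hmac` in Python's standard library. The sketch below uses the well-known BIP-39 test vector mnemonic, not anything from the giveaway:

```python
import hashlib

# BIP-39: seed = PBKDF2(HMAC-SHA512, password=mnemonic,
#                       salt="mnemonic"+passphrase, 2048 rounds, 64 bytes)
mnemonic = ("abandon abandon abandon abandon abandon abandon "
            "abandon abandon abandon abandon abandon about")
passphrase = "TREZOR"  # passphrase used by the official test vectors

seed = hashlib.pbkdf2_hmac(
    "sha512",
    mnemonic.encode("utf-8"),
    ("mnemonic" + passphrase).encode("utf-8"),
    2048,
    dklen=64,
)
print(seed.hex()[:16])  # c55257c360c07c72 (BIP-39 test vector #1)
```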
This is similar to how many websites store hashed passwords in their database. The main idea is to slow down the process of guessing a large number of passwords when trying to crack someone’s password hash. You can control the time it takes to verify a single password by increasing the number of iterations or using a slower hash function such as scrypt or bcrypt.


Seed to BIP-32 Master Private Key

Once we have the seed, we need to convert it into a BIP-32 (HD) wallet. You can read the full BIP for all the details, but at a high level, BIP-32 defines a way to create a master key pair from a seed, and then use that master key pair to generate up to 2³² child key pairs.
This is a great setup for wallet software, as it allows the user to easily back up a single secret (their mnemonic) while still being able to generate nearly infinite addresses (for all practical purposes). It has other nice benefits, such as letting you generate child public keys without access to the private keys (great for businesses that need to generate receive addresses without keeping private keys on their server).
To convert our seed into a master private key according to BIP-32, we compute HMAC-SHA512(“bitcoin seed”, mnemonic_seed). HMAC-SHA512 produces 64 bytes of output. We take the first 32 bytes as the master private key, and the remaining 32 bytes are used later as a “chain code” to “extend” the key when generating child key pairs. Computing the HMAC-SHA512 of our 64-byte seed requires only two SHA-512 computations.
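In Python this step is a one-liner with the standard `hmac` module (a sketch; the all-zero seed is just a placeholder):

```python
import hashlib
import hmac

seed = bytes(64)  # placeholder 64-byte seed for illustration

# BIP-32: I = HMAC-SHA512(key=b"bitcoin seed", data=seed)
I = hmac.new(b"bitcoin seed", seed, hashlib.sha512).digest()
master_private_key = I[:32]  # left 32 bytes
chain_code = I[32:]          # right 32 bytes, used for child derivation
print(len(master_private_key), len(chain_code))  # 32 32
```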


Master Private Key to Address

This step involves taking our BIP-32 master private key and deriving the child key pairs along the derivation path needed to reach the address that contains the bitcoin. If we look at the address in a block explorer, we can see that it was generated using BIP-49 P2WPKH-nested-in-P2SH.
This means that the derivation path has the format m/49’/coin_type’/account’/change/address_index.
Determining the derivation path was a huge risk for this project. I assumed that Alistair simply generated a new wallet and the only transaction was to deposit 1 BTC. With this assumption, this means that the derivation path for the first address would be m/49’/0’/0’/0/0.
This means we need to take the master private key and derive three hardened child private keys and then two normal child private keys.
Each hardened private key requires one HMAC-SHA512 computation (2 SHA-512 hashes) and one secp256k1 scalar addition.
Each normal private key has the same requirements as a hardened key, plus the need to calculate the associated public key from the private key. To calculate a public key, we perform an elliptic curve multiplication of the scalar represented by the private key by the secp256k1 generator point G.

The last step is to take the calculated public key and convert it to a P2SH-P2WPKH address. This involves creating the correct script and then using hash160 (SHA-256 followed by RIPEMD-160) to obtain the script hash, and then a double SHA-256 to calculate the 4-byte checksum.

In total, this step costs 10 SHA-512, 3 SHA-256 and 1 RIPEMD-160 hashes. It also costs us 5 EC scalar additions and 3 EC multiplications.
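The hardened part of that derivation (m/49'/0'/0') can be sketched without any EC point math; only a scalar addition mod the secp256k1 group order is needed. This is an illustrative sketch with placeholder keys, and it skips BIP-32's validity checks (IL < n, non-zero child key):

```python
import hashlib
import hmac

# secp256k1 group order
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def ckd_priv_hardened(k_par: bytes, c_par: bytes, index: int):
    """BIP-32 hardened child key: one HMAC-SHA512 + one scalar addition."""
    data = b"\x00" + k_par + index.to_bytes(4, "big")
    I = hmac.new(c_par, data, hashlib.sha512).digest()
    child = (int.from_bytes(I[:32], "big")
             + int.from_bytes(k_par, "big")) % N
    return child.to_bytes(32, "big"), I[32:]  # (child key, chain code)

# Walk m/49'/0'/0' from a placeholder master key and chain code
k, c = b"\x01" * 32, b"\x02" * 32
for i in (49, 0, 0):
    k, c = ckd_priv_hardened(k, c, 2**31 + i)  # 2**31 marks "hardened"
print(len(k), len(c))  # 32 32
```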

Total Cost

We have to do all these steps for EVERY mnemonic we want to try:
Number to mnemonic – 1 SHA-256
Mnemonic to seed – 4096 SHA-512
Seed to private key – 2 SHA-512
Private key to address – 10 SHA-512, 3 SHA-256, 1 RIPEMD-160, 5 EC additions, 3 EC multiplications
At first glance, it appears that the seed generation step will be the slowest, although without some testing it is difficult to know how the cost of a SHA-512 hash compares to that of the EC operations. It turns out that both are relatively slow compared to the other steps, but seed generation is at least an order of magnitude more expensive than anything else.
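Tallying the per-mnemonic hash counts above makes the bottleneck explicit:

```python
# SHA-512 work per candidate mnemonic, from the steps listed above
sha512_seed_gen = 4096        # mnemonic -> seed (PBKDF2, 2048 rounds x 2)
sha512_other = 2 + 10         # seed -> master key, key -> address
sha512_total = sha512_seed_gen + sha512_other

share = sha512_seed_gen / sha512_total
print(f"{share:.1%} of SHA-512 work is seed generation")  # 99.7%
```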


Implementation in OpenCL

So, to implement this whole flow in OpenCL, I would need a way to do SHA-256, SHA-512, RIPEMD-160, EC addition, and EC multiplication. I would also need to orchestrate all of this together to solve my problem.
My strategy was to find open-source C implementations of all the algorithms and port them to OpenCL C. I started with SHA-256 and SHA-512 because they were all I needed to calculate number -> mnemonic -> seed, and I knew that creating the seed was the slowest part.
I eventually got seed generation running in OpenCL, and my first test using an Nvidia 2080Ti showed that I could generate 142,857 seeds per second. Wow! Now we were getting somewhere.
This meant that a single 2080Ti could generate ~12 billion seeds per day, which still meant it would take 83 days to generate just one trillion seeds. That was much better than the 25 years my processor was going to take. However, this was still only generating seeds, and a seed alone is not enough to know whether a mnemonic is correct. I still needed to implement the rest of the seed-to-address conversion to be able to check whether a candidate was the correct mnemonic.
I then went back and tested my CPU version of the seed-to-address generation to see how long it would take. Digital Ocean’s 32-core machine could process about 52,000 seeds per second. This was pretty decent, but after the improvements to OpenCL seed generation it became the bottleneck and would still take over 221 days to complete. Additionally, I would need to coordinate moving the seeds generated on the GPU back to the CPU to finish processing them. That coordination would cost time and require a more complex program.
My goal was to move the entire calculation to the GPU.
This meant that I needed to be able to do EC math (addition and multiplication) in OpenCL. I took the open source libsecp256k1 implementation that Bitcoin uses and ported it to OpenCL C. To do this, I needed to first understand how the libsecp256k1 library was structured and what exactly I needed to port in order to generate an address from the master private key.
It turned out that I needed to port about 2000 lines of code. Luckily, OpenCL C and standard C are similar enough that no major changes were needed. The main changes were that I had to implement some memory-management functions such as memcpy, memset and memzero, which OpenCL does not provide. I also removed the blinding that occurs during EC multiplication, and precomputed a 32 KB table instead of the default 64 KB table.
After all was said and done, I ran some sanity tests and was surprised to find that my OpenCL implementation actually worked correctly. I then reran the tests using the 2080Ti and saw that the time added by calculating the address from the seed on the GPU was negligible, adding only a few hundred milliseconds per 1 million seeds. I was now able to run the entire process 100% on the GPU, but it would still take ~80 days to enumerate the 1 trillion possible mnemonics on a single 2080Ti.
I then tried running it on several different graphics cards (1080, 1080Ti, 2070, 2070Ti, Tesla K80, Tesla P100 and Tesla V100). To my surprise, the performance of the GPUs did not change much. Even the top-of-the-line Tesla V100 was only 15% faster, but cost nearly 4 times as much to rent.
On the lower end, the 1080/1070 were about 3-4 times slower than the 2080Ti, but cost only about half as much to rent. The 2080Ti seemed like the most cost-effective card for this problem. But how many could I get, and how would I orchestrate them all? To solve this in 24 hours, I would need roughly 80 2080Tis.


GPU Pool Orchestration
If I wanted to distribute this problem across multiple GPUs, the simple answer would be to split the 2⁴⁰ numbers into 80 equal parts and run each part on one GPU. Unfortunately, things weren’t that simple. First I had to figure out how I was going to get access to all these machines. My first thought was to use the major cloud providers (AWS, Google Cloud and Microsoft Azure). I quickly learned that these companies have strict quotas on the number of GPUs you can provision with a new account (for some of them, ZERO!).
Luckily, I came across a GPU marketplace (Vast.ai) that lets people with unused GPUs rent them out to anyone who wants access. They had a good supply of 1080s and 1080Tis, but not as many 2080Tis as I needed. Inventory fluctuated constantly depending on how many people were renting, how much they were willing to pay, and how many providers were online at the time.
This meant that I couldn’t easily allocate exactly 80 2080Ti’s and distribute my workload evenly between them.
I ended up creating a simple centralized server that would distribute the work. This is very similar to how a mining pool works. Each GPU worker would request a batch of work from the centralized server, perform the work, and then write the result (and a solution, if found) back to the server. Each worker would continue doing this in a loop until there was no more work left. This meant I could spin up as many cards as I wanted, as quickly as I wanted, and the central server could keep track of which batch of work to hand out next. No worker needed to know anything about which piece of work it was getting; it could just blindly ask the server for the next batch.
This solution does introduce network latency. I needed to make the batch size large enough that the added network latency was a small percentage of the total time. I ended up using a batch size of 16,777,216, which means 65,536 batches of work to cover all 2⁴⁰ possible mnemonics.
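The server's batch bookkeeping can be sketched as follows (illustrative only; the batch size and totals are from the text):

```python
BATCH_SIZE = 16_777_216        # 2**24 mnemonics handed out per request
TOTAL = 2**40                  # candidates with 8 known words
NUM_BATCHES = TOTAL // BATCH_SIZE

def batch_range(batch_id: int):
    """Half-open range of candidate offsets for one GPU worker."""
    start = batch_id * BATCH_SIZE
    return start, start + BATCH_SIZE

print(NUM_BATCHES)  # 65536 batches cover all 2**40 candidates
```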
It took a 2080Ti just under 2 minutes to process a batch of 16,777,216 mnemonics. This means a network latency of under 1 second added less than 1% of additional computation time.

Testing the System
It ended up being a more complex system than I had originally anticipated, and I was worried that there might be a bug or that it wouldn’t work properly when the time came. I ended up doing a full test to make sure everything actually worked.
I created a wallet with a new BIP-39 mnemonic and transferred 0.0001 BTC into it. I then initialized my system with 9 known seed words (so it wouldn’t take as long or cost as much). I rented a pair of 2080Tis on Vast.ai and let them rip. Within 20 minutes, the system found the solution and moved the 0.0001 BTC to a Trezor wallet that I controlled. I felt prepared, although I wasn’t sure I could get enough processing power in time to complete the task when it mattered.
I felt there was still room to optimize the SHA-512 code I was using by porting the extremely optimized SHA-512 implementation that hashcat uses. Since SHA-512 was the most heavily used primitive and the current bottleneck, any improvement here could dramatically reduce the number of GPUs needed. I was about halfway through the implementation when Alistair released the 8th word.

The Big Day

I immediately threw out the optimized SHA-512 I was working on and went back to the version I knew worked. I started renting as many 2080Ti, 2080, 2070 and 1080Ti cards as I could from Vast.ai. In the meantime, I was able to increase my GPU quota on Azure (plus $200 of free credit for new accounts) to rent up to 40 GPUs there. Unfortunately, the machines I had access to were only about half as powerful as a 2080Ti. Still, that meant I was getting roughly 20 2080Tis’ worth of compute for free from Azure alone.
At its peak, I was testing about 40 billion mnemonics per hour, which meant it would take about 25 hours to check all 1 trillion mnemonics. I knew that on average it would only take 50% of that time (depending on what the ninth word actually was). If the word started with A, it would finish in about an hour; if it started with Z, it would take the full 25 hours. Of course, my testing speed wasn’t a consistent 40 billion per hour, as machines on Vast.ai came and went whenever other renters outbid me or hosts went offline. I had to constantly watch the list of available machines and rent more as they appeared.
During the day, when no solution was found, I became worried that it wouldn’t work because my approach could have failed for many reasons:
– I assumed that the words were being released in the correct order. If they weren’t, there would be 8! (factorial) possible orderings (making even 8 known words nearly impossible to brute force), and my code didn’t try different permutations anyway.
– I assumed that I had transcribed all 8 words correctly. While most of them were obvious, there were a couple where I felt there might be other possibilities. I didn’t try any of those alternatives, only the 8 words I thought were correct.
– I assumed that he used the first address of the first HD wallet account, derived at m/49’/0’/0’/0/0. If he had used any other derivation path (or a second or later address), I wouldn’t have found it. I was REALLY nervous about this, because I couldn’t know exactly what derivation path he was using. Luckily, he had used a completely fresh wallet without creating any additional addresses before depositing the 1 BTC.
After a full day of work, the status page of my production server showed it was about 85% of the way through all the possibilities, and I had pretty much given up hope that it would work. I very nearly shut it down at that point to implement a version that tested more than just the first address, because I was convinced my assumption was wrong.
I couldn’t bring myself to actually stop it, since I had gotten that far, so I just let it continue. To my surprise, a little later that evening (at 91%), almost 30 hours and just over 1 trillion checks (1,000,710,602,752) later, it found the solution!
I couldn’t believe it worked. I nervously connected my Trezor to check my balance, to make sure I hadn’t messed up the code that generated and signed the transaction sweeping the BTC. To my relief, it showed 0.99 BTC!
Then I nervously waited for confirmation. I was worried that someone else (or even Alistair himself) would try to steal my bitcoin by continually increasing the miner fee so that miners would include their transaction instead of mine. This is one of the reasons I started with a relatively high fee (0.01 BTC). I didn’t have time to implement a tool that monitored the mempool for competing transactions and automatically increased my fee, but I think it would be a good idea for the future.
However, after a few minutes my transaction was included in the block. Wow, it really worked!

Final Thoughts

I had a lot of fun and learned a lot working on this problem. I’ve been thinking about ways to run these kinds of giveaways while avoiding the possibility of someone writing software to win. It’s actually not that simple.
Even if he decided to release the last 5 words at once to prevent brute force, I could still have software running, waiting for me to enter each word as it was released, and automatically sweep the coins faster than any human.
I thought he could have used a deeper derivation path, and although my original version of the software would have missed this, it would have been trivial, and not much slower, to check multiple derivation paths for each candidate. The problem, even for humans, is that the wallet software handling the recovery process only scans a small number of unused addresses before stopping, so he couldn’t go too deep if he wanted users to be able to recover the wallet with regular software.
Another idea is to release the words out of order, which would make brute force impossible early on, but it still doesn’t stop me from using software to enumerate all possible orderings and sweep the BTC faster than any human could once enough words have been released.
In the end, software will always be faster than humans at these kinds of tasks. I think the only real defense is to make discovering the actual words the hard part: a kind of puzzle where solving it gives you all the words at once, so the race is about solving the puzzle rather than how fast you can type the words.

