blockchain coding

So you want to know about blockchain coding but maybe don’t know where to start. In this article, we’ll explore introductory information about the most commonly used languages in blockchain and general programming concepts. By the end, you’ll be able to write your own blockchain from scratch in a few lines of Python code!

It’s advisable to learn to code and understand at least the basics of programming before diving into this article. But anyone with minimal or even no knowledge can get a well-rounded introduction since each aspect will be thoroughly explained.

Languages

Blockchain technology is an exciting and rapidly developing industry. As such, the vast majority of projects are developed in an open source manner. This means the code is available for anyone to copy, modify, or redistribute. The original blockchain, Bitcoin, is a prime example of open source development: Bitcoin Core, the most popular Bitcoin client, currently has over 600 contributors from around the world.

Because of the nature of open source development, many different languages are used. There is no official or standard language. In fact, you could write blockchain code in just about any computer language as long as it is Turing complete (i.e. given enough time and memory, it can compute anything that is computable). Funnily enough, redstone, an in-game engineering system in the popular game Minecraft, can be used to create a Turing machine. And Microsoft’s PowerPoint is also Turing complete if you try really, really hard.

Since blockchains are abstract pieces of data, any Turing complete language could hypothetically interact with the network.

Best options for blockchain coding languages

But before you try to write a blockchain in PowerPoint or a smart contract with 3D blocks, we should consider the more practical options that are proven in practice and widely used by open source projects. Most of the common languages you have probably heard of are Turing complete. But some languages make certain kinds of projects easier than others.

Languages like Ethereum’s Solidity were developed specifically for blockchain. Solidity helps in that it supports object-oriented programming, which makes it much easier for people to read and write the language. This is a super important characteristic of good, maintainable software, since you often have to go back to code to make modifications. So you always want it to be self-evident what the code actually does.

We can go to GitHub to find out which blockchain coding languages are used in the client software of most blockchains. That way we have a good idea of where to start.

Finding the most popular coding languages

Looking at the top 10 blockchains on Coinmarketcap.com (excluding tokens and stablecoins), we can sort the blockchain client languages by popularity. This results in C++, C, Python, Java, and then everything else. You may have noticed that Solidity, Ethereum’s custom language, is not on this list.

If you are working in the blockchain industry, however, it is likely that you would be working on a project on Ethereum and therefore be working with Solidity. About 45 of the top 100 projects on Coinmarketcap are tokens. The vast majority of these are on Ethereum. But this article is more about the basics of blockchain coding with the end goal of understanding how it works on a technical level.

So if you are looking to become a developer in the industry, you would probably want to put Solidity at the top of that list, followed by the others. Of course, you will have to learn multiple languages, as all developers do across all industries.

Let’s continue with a little history on these languages.

C

C is one of the early general-purpose programming languages. It was initially created in 1972 from an earlier language called B. The B language was a bit slow and lacked some features. So the developers of the Unix operating system at Bell Labs, Ken Thompson and Dennis Ritchie, created the successor C, which had both speed and usability baked right into the design. Over four decades later it is still one of the most popular languages around, and it’s used in countless applications and operating systems. Like most languages, it has been updated a few times since then.

C++

C++ was invented in 1979 by a Danish computer scientist named Bjarne Stroustrup. Bjarne took features and ideas from an older language called Simula and combined them with the C language, although it had influences from other languages as well. The combination was originally called “C with Classes” but after further revision was renamed to C++.

C++ was the language which Satoshi Nakamoto originally used in the first implementation of Bitcoin.

Python

Python is a high-level, interpreted language with a strong emphasis on readability and whitespace. Being an interpreted language means that the source code is not compiled to machine code ahead of time. Instead, when the code is run, an interpreter translates it into an intermediate form such as bytecode and executes that. Python is also dynamically typed, so you can significantly reduce the amount of code you need to write, since it removes declarations of things like variable types, which are required in languages like C. It was released in 1991 by Guido van Rossum. Python actually got its name from the British comedy sketch group Monty Python!

Python has risen significantly in popularity in the last few years. This is particularly because it’s a lot quicker to write programs in when compared to languages like C. Generally, you don’t need to worry about memory allocation and other small details. But also it’s because of the rising popularity of data science and machine learning applications, for which Python has tons of fantastic libraries.

Java

Like Python, Java is another object oriented language and was created around the same time. It was originally created for interactive television. Turns out it was a bit too complicated and advanced for the industry at the time. Eventually, its name was changed from Oak to Java after Java coffee.

Java syntax is very similar to C. The similarity was intentional so that experienced programmers in the industry could easily transition and pick it up without any major hurdles. Java was designed with the philosophy of “write once, run anywhere,” so that Java code could easily run across many platforms without the need to recompile. This is a great feature for a language to have if you are developing applications for decentralized networks like the internet or blockchains, since there are many different types of machines and platforms that will connect to these networks.

Solidity

We have to mention Solidity in this article. It is the de facto smart contract language, even if we aren’t going to use it in our blockchain. Solidity was designed to be an easy-to-use object oriented and high-level language. Its syntax is completely new, but not that different from languages like JavaScript (not to be confused with Java).

It was initially proposed by Gavin Wood in 2014 and then was further developed by the Ethereum team.

What is a Blockchain Technically?

To oversimplify, a blockchain is very much like a linked list.

First, to understand what a linked list is, you should know what a pointer is. A pointer is an address to a specific place in your computer’s memory. If you have ever programmed in a language like C or C++ before, you should be fairly familiar with pointers already. In practice, they are a bit tricky to get the hang of, but they allow you more granular control over how your program uses memory. This can be useful in making your program run a lot more efficiently.

Pointers can reference other pointers, structs or basically any other data type. We won’t have to worry about pointers in this article beyond understanding what a linked list is. This is because we will be using Python, which is called a “high level” language. A lot of the nitty gritty stuff is simplified and hidden like all of the referencing and handling of pointers. It’s done automatically by the language and is abstracted away to make the code easier to read and write.

What is a linked list?

A linked list is just a sequence of elements or objects that are “linked” together by pointers. Each element is made up of two pieces. One piece is the data you are trying to record and the second piece of data is a pointer or “link” to the next element.

Unlike more typical lists used in programming, such as arrays, a linked list has to be walked through one item at a time to reach the element you want. Between that and the extra space in memory needed for the pointers, linked lists can be a bit slower. With arrays (another type of list), you can just use an index to access the specific item you want because of the way they allocate space in memory.
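To make the idea concrete, here is a minimal sketch of a linked list in Python 3. The names Node, next_node and to_list are our own choices for illustration. Python hides the raw pointers, so the “link” is simply a reference to the next object.

```python
class Node:
    """One element of a singly linked list: a value plus a link to the next node."""

    def __init__(self, data, next_node=None):
        self.data = data
        self.next_node = next_node


def to_list(head):
    """Walk the links from head to tail, collecting each node's data."""
    items = []
    node = head
    while node is not None:
        items.append(node.data)
        node = node.next_node
    return items


# Build the list back to front so each node can point at the one after it.
third = Node("c")
second = Node("b", third)
first = Node("a", second)

print(to_list(first))
```

Notice that reaching the third element means following two links, which is exactly the sequential walk described above.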

Cryptographic hash pointers

With a blockchain, instead of using a regular pointer as a reference, each block uses a cryptographic hash pointer, which contains the hash of the previous block. A hash is the output of a hash function, which is normally called a “one-way function.” It allows you to input some piece of data and get an output of a fixed length that is, for all practical purposes, unique to that input.

You can’t reverse the data from the output hash very easily. But you can easily prove that a hash is correct or that your data has not been tampered with by putting your data into the hash function and verifying that the data does in fact map to that specific output of the hash function. And so using this, you can include the hash of the previous block, as well as other pieces of data such as transactions and timestamps.

When we have a really good hashing function that has all of the right qualities, we can see that changing even a small piece of the input data will give us a very different output hash.
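As a quick illustration, the sketch below uses SHA-256 from Python’s standard hashlib library (Python 3 syntax). The helper name sha256_hex and the two sample messages are made up for this example. It hashes two messages that differ by a single character and then verifies the first one by hashing it again.

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = sha256_hex("send 1 coin to Alice")
tampered = sha256_hex("send 9 coin to Alice")

print(original)
print(tampered)

# Verifying data means hashing it again and comparing with the recorded hash.
print(sha256_hex("send 1 coin to Alice") == original)
print(tampered == original)
```

Both digests have the same length, yet they look nothing alike, even though only one character of the input changed.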

The properties needed for a good hash function do vary slightly. It depends on the specific application you are using it for. But here are some of the general attributes that you would want to have.

Compression

Hash functions must produce a fixed length output. No matter the length of the input string (also called the “plaintext” input or “message”), you always want the output to be the same length. If the input is shorter than the output, you would just add some sort of “padding” to the initial input.

Easy Computation

The hash function should be fairly efficient. It’s possible to create hash functions that have the other properties but are very inefficient and slow to compute. This just comes down to the practicality of using the function especially when using it over a decentralized network. We need as much speed as we can get wherever and whenever we can get it.

Preimage resistance

Preimage resistance means that, given only an output hash, it should be infeasible to find an input (the “pre-image”) that produces it. Even if you try many input texts in your hash function, you shouldn’t find a correlation or mapping between the input and a specific output character or combination of characters.

Collision resistance

Collision resistance means that two unique inputs will not produce the same output hash. Or, that it’s extremely improbable to find two such inputs.

Near-collision resistance

Similar to collision resistance, but it should be unlikely to find two output hashes that are even somewhat similar.

Non-correlation

The input text is not correlated in any way to the output text. Ciphers like the Caesar cipher (discussed below) would not fulfill this criterion. This is because each element of the input is correlated 1:1 to an element in the output text. With a good hash function, you should not find any such correlation.

How to write your first blockchain in Python

We will be using trinket.io to write a proof of work blockchain from scratch in Python 2 all in browser!

There are definitely secure libraries available in Python that include SHA-256, the standard Bitcoin hashing algorithm, among others. One of these standard libraries is called hashlib. If you are not aware of what a “library” is, it’s basically just a bunch of pre-built functions or small programs that are usually open source. That way, rather than having to reprogram something from scratch every time, you can just import a library. No need to reinvent the wheel. Libraries are used in virtually all programs and programming languages.

Understanding hash functions

But it’s important to have an understanding of what hash functions actually do. It’s also good to know what makes one secure as well as understand some basic cryptography. So we will write our hashing function from scratch.

Again, we will assume you have some programming knowledge. But feel free to follow along without that knowledge. You can also learn the basics of Python at learnpythonthehardway.org. But here is a simple program demonstrating some of the basic syntax of Python. It should be a good way for you to get a feel for it if it’s new to you.

Try running the code and see what happens.
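A short stand-in program along those lines could look like the sketch below. It is written in Python 3 syntax (an assumption on our part, since the article’s own examples target Python 2), and the names in it are invented purely for illustration.

```python
# Variables need no type declarations.
name = "Satoshi"
block_count = 3


# A function is defined with def, and indentation marks its body.
def greet(person):
    return "Hello, " + person + "!"


print(greet(name))

# A for loop over a range of numbers, with an if/else inside it.
for height in range(block_count):
    if height == 0:
        print("block", height, "is the genesis block")
    else:
        print("block", height, "follows block", height - 1)

# A list holds a sequence of items and is indexed from zero.
coins = ["BTC", "ETH", "LTC"]
print(coins[0], len(coins))
```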

Basic cryptography

With blockchain coding, it’s first things first: the hashing function. There are libraries that you can use that already have standard hashing functions. But for the sake of learning, we will write some from scratch. Of course, it’s going to be super simple and not secure, so I don’t recommend using it for anything! But it’s important to understand how everything works and to have a grasp of basic cryptographic principles if you are going to be learning how to build a CRYPTO-currency and blockchain.

To understand hashing, you need to understand cryptography and look at some basic and early encryption algorithms. We’ll start off with the Caesar cipher and the Vigenère cipher.

Caesar cipher

The Caesar cipher is named after Julius Caesar because he apparently used this cipher in some of his own correspondence. It is one of the earliest and most well-known encryption methods. It works quite simply by shifting each letter by a “shift,” which can also be referred to as a key. For example, if the shift is 1 and you input some plaintext like “abc,” the Caesar cipher will output “bcd” because each letter is shifted by one index in the alphabet. You can change the shift to any index in your alphabet to encrypt the text differently.

In Roman times, literacy rates may have been quite a bit lower than in the modern world, documents could not be copied with a quick CTRL-C, and encryption wasn’t a widely known idea, so the Caesar cipher may have been sufficient for such an application. But in the modern world, when a computer can do billions of operations per second, it’s not a very tall order to try 25 possible character shifts until it’s broken. This cipher could be broken in less than a second on a regular computer. It could even be broken fairly quickly by hand.

Vigenère cipher

If we introduce more elements to our shift key, however, something interesting happens. If our key has several elements, then we can have a key like “btc” (using characters instead of the index here). And with an input of “abc” we get the output “bue”.

This cipher is called the Vigenère cipher due to a misattribution. It was actually created by Giovan Battista Bellaso, an Italian cryptologist, in the 1500s, not by Vigenère. It can be a secure cipher as long as the key is a completely random set of characters and the key is at least as long as the input message you are trying to encrypt. A cipher like this, with a pre-shared key that is used only once, cannot be cracked and is called a one-time pad.

Now let’s do an implementation of these ciphers in Python. The only difference, made for programming simplicity, is that the alphabet we are using is the ASCII table. So the results will be slightly different from the traditional ciphers and will utilize more characters.

Press run code and see what the output is!
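One way these two ciphers might be written over the ASCII table is sketched below (Python 3 syntax). The function names and the decrypt flag are our own, and wrapping at 128 assumes plain 7-bit ASCII input.

```python
def caesar(text, shift):
    """Shift every character by the same amount, wrapping around the ASCII table."""
    return "".join(chr((ord(ch) + shift) % 128) for ch in text)


def vigenere(text, key, decrypt=False):
    """Shift each character by the ASCII code of the matching key character."""
    sign = -1 if decrypt else 1
    shifted = []
    for i, ch in enumerate(text):
        shift = ord(key[i % len(key)]) * sign  # the key repeats if it is too short
        shifted.append(chr((ord(ch) + shift) % 128))
    return "".join(shifted)


secret = caesar("abc", 1)
print(secret)              # bcd
print(caesar(secret, -1))  # abc

scrambled = vigenere("blockchain", "btc")
print(repr(scrambled))     # repr() because some results are unprintable characters
print(vigenere(scrambled, "btc", decrypt=True))
```

Decrypting is just shifting in the opposite direction with the same key, which is the reversibility that separates encryption from hashing.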

If you can understand this code or create your own implementation, congratulations! You now have the power to encrypt and decrypt messages! Maybe play with it and see if you can come up with something on your own.

Hashing vs cryptography

Now, as it turns out, there is a subtle distinction to be made between encryption and cryptographic hashing, even though they share a lot of similarities. In blockchain coding, it’s important to differentiate. Encryption is meant to be reversible if you have the right key. It’s also generally used to send private messages, even if the channel is insecure. Hashing, on the other hand, is not reversible. Even if you know the exact algorithm that was used, there is no information to be gained from the output hash.

So there is also no key with these functions. Even inputting potential versions of the message shouldn’t give you a clue about the original input message. The hashes all look unique and random. Moreover, the only way to know what the original message was is to input the correct message.

A simple hashing function

A very simple hashing function takes the value of each input character on the index table and adds them up. So ABC = 1 + 2 + 3 = 6. The idea here is that you can’t easily figure out what exact combination of letters leads to that sum. Let’s see what that function looks like in Python. For the index, we will use the ASCII table instead of a regular alphabet.
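A minimal sketch of such a function (Python 3 syntax, with the name simple_hash chosen for this example) could be:

```python
def simple_hash(text):
    """Add up the ASCII code of every character in the input."""
    total = 0
    for ch in text:
        total += ord(ch)
    return total


print(simple_hash("ABC"))   # 65 + 66 + 67 = 198
print(simple_hash("this"))
print(simple_hash("hits"))
```

Because addition ignores order, any rearrangement of the same letters ends up with the same sum.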

If you run this code, you will notice that the inputs “this” and “hits” actually produce the same output hash. This is bad and will not work, because there are too many collisions! Let’s try again. But this time, let’s multiply each number by its index in the input string before we add it to the sum.
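Here is a sketch of that second attempt (Python 3 syntax). The name weighted_hash is ours, and we count positions from 1 rather than 0 so that the first character is not multiplied away to nothing.

```python
def weighted_hash(text):
    """Multiply each character's ASCII code by its 1-based position, then sum."""
    total = 0
    for position, ch in enumerate(text, start=1):
        total += position * ord(ch)
    return total


print(weighted_hash("this"))
print(weighted_hash("hits"))
```

Now the order of the letters matters, so “this” and “hits” no longer collide.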

This is a lot better but still insufficient as similar inputs produce similar outputs due to the similar lengths of the input. Even if you couldn’t figure out exactly what the inputs were, you could gain some information about them.

A stronger hashing function

We need something much much stronger for a blockchain. So let’s take a crack at it and see if we can create something that’s reasonably strong.

Altogether, our hashing function is 90 lines of code. It’s not terribly complicated to understand if you are familiar with Python. Here is an explanation of how it works:

Hashing function explanation

First, in the hash() function we created an alphabet. This can be anything of your choosing, like the ASCII table. In this example, it is just uppercase letters and digits. Any other characters are omitted and removed from the input string.

The next step is to either extend or compress our input message as we want a fixed length output string. We do this inside the chomp() function which will divide the string into multiple equal length parts if it’s too long, and then it simply adds each character from the new separated strings together by whatever their index was in the alphabet.

If the index is greater than the length of the alphabet, then it simply rolls over back to 1 (or A in this case) and continues counting from there. And if the string is too short, then we add “padding,” which in this case is just copying the alphabet onto the input string until it is the desired length.

Once we are here, we already have more or less gibberish for a string. But it still may be possible to reverse and you might be able to observe some common patterns in the output. So now we will put it through an actual cipher or “mixer” algorithm.

Mixer algorithms

Here is the simple version of how it works. With an input text of “AB” the function returns A^2 + B + (index of A) x 2. This is in the character_map() function, which is the key part of the function.

We start our transformation of the plaintext by inputting the first two letters of the string. So A and B, or character 1 and character 2. Then we modify the string with the new output and perform the action on the next pair of characters, overlapping. So now we do the transformation on character 2 and character 3, and so on and so forth until we have done it on all the characters. This interlinks the characters so that a small change anywhere in the text will have a downstream effect on all the other characters of the output hash.

Using multiplication and a rollover

The multiplications in the remapping of the characters are helpful in obscuring the input further. This is because it causes numbers and combinations of numbers to “blow up” and roll back over to 1. Since the combination of numbers and the multiplication could lead to huge or smaller numbers, the roll over obscures which it could be and makes it hard to know what the original index was even if you know the exact hashing function used.

The first hashing function, which just used the sum of ASCII characters, could give you some indication as to the length of the input string as well as the characters used, because longer strings and higher characters would lead to a higher number. But using multiplication and a rollover makes it much, much more difficult to know what the input was. You can’t tell if the sum of characters is large or small!

How the cipher works

That may seem a little complicated. And you might be able to create a simpler function. But it’s designed in such a way as to take into account the combination of the current and the next character in the string of plaintext. This is important because it means that if we iterate the cipher multiple times over, the influence of a character at the beginning of the text will carry over to the next character of text, and so on and so forth. This continues until it affects all the characters in the text to some degree. So it should create very different outputs even for slightly different inputs: “AAAA” and “AAAB” give outputs that are totally dissimilar despite their inputs being 75% the same.

The avalanche effect

This is also known as the avalanche effect or butterfly effect. Even the slightest change in initial conditions will drastically change the end result. It is one of the important and desirable qualities of a well-designed hash function.

If we then run our message through several rounds of this, it becomes increasingly difficult to determine what the original message was. In our function, we will just leave the default value as 3 for efficiency. But the more rounds, the more secure (in theory).
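To tie the pieces together, here is a much shorter sketch of a hash built along the lines described above. It is not the 90-line function discussed in this article, only our own reading of the description, and it is certainly not secure. Several details are assumptions: the function is called toy_hash so it does not shadow Python’s built-in hash(), lowercase input is converted to uppercase, the output length defaults to 16, the “index” in the mixing formula is taken to be the position in the string, and the last character is wrapped around to feed the first. It uses Python 3 syntax.

```python
import string

# Alphabet from the text: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits
SIZE = len(ALPHABET)


def chomp(text, length):
    """Pad or fold text so that it is exactly `length` characters long."""
    if len(text) < length:
        # Too short: append copies of the alphabet as padding, then cut to size.
        while len(text) < length:
            text += ALPHABET
        text = text[:length]
    # Too long: fold the extra parts back in, adding indices and rolling over.
    slots = [0] * length
    for position, ch in enumerate(text):
        slot = position % length
        slots[slot] = (slots[slot] + ALPHABET.index(ch)) % SIZE
    return [ALPHABET[value] for value in slots]


def character_map(a, b, position):
    """Mix two neighbouring characters into one: a^2 + b + 2 * position."""
    a_index = ALPHABET.index(a)
    b_index = ALPHABET.index(b)
    return ALPHABET[(a_index ** 2 + b_index + 2 * position) % SIZE]


def toy_hash(message, length=16, rounds=3):
    """Return a fixed-length toy hash of message. Not secure."""
    cleaned = "".join(ch for ch in message.upper() if ch in ALPHABET)
    chars = chomp(cleaned, length)
    for _ in range(rounds):
        for i in range(length):
            j = (i + 1) % length  # wrap around so the last character feeds the first
            chars[j] = character_map(chars[i], chars[j], i)
    return "".join(chars)


print(toy_hash("AAAA"))
print(toy_hash("AAAB"))
```

Running it on “AAAA” and “AAAB” is a quick way to see the avalanche effect for yourself, and changing rounds shows how extra passes scramble the result further.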

Now that the hard part of building the hashing function is over, we can move on to more blockchain coding and actually create our blockchain!

Coding the blockchain

So what we do here is start by creating a new object class called Block. It takes in an index, timestamp, and previous hash as initial variables, then simply combines them all together in a plaintext string. Then it runs them through the hashing function we wrote earlier.
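A sketch of what such a class could look like is shown below (Python 3 syntax). Two things here are assumptions rather than the article’s own code: we add a data field for the block’s message, and we swap in SHA-256 from hashlib in place of the hand-written hashing function so the example stands on its own.

```python
import hashlib
import time


class Block:
    """A block that records its position, creation time, data and parent hash."""

    def __init__(self, index, timestamp, data, previous_hash):
        self.index = index
        self.timestamp = timestamp
        self.data = data
        self.previous_hash = previous_hash
        self.hash = self.hash_block()

    def hash_block(self):
        """Join the fields into one plaintext string and hash it."""
        plaintext = "{}{}{}{}".format(
            self.index, self.timestamp, self.data, self.previous_hash
        )
        # SHA-256 stands in for the hand-written hashing function.
        return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


block = Block(0, time.time(), "first block", "0")
print(block.index, block.hash)
```

Because the hash covers every field, changing any one of them afterwards would no longer match the stored hash.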

Creating the genesis block

Next, we create a helper function called create_genesis_block(), since we have to start our blockchain from somewhere. Then we add another helper called next_block(), which just iterates to a new block with a new date-time and message.

Then all we have to do is create the first block and put it in a list and then iterate for as many blocks as we want to create. Now we have a cryptographic linked list which may technically be a blockchain. However, we are still missing a critical element of the blockchain, which is the consensus algorithm. Who gets the right to add the next block? With this current design, it would simply be whoever has the fastest internet connection and can add blocks before everyone else. This would hardly be decentralized if at all. Not to mention there is little cost for someone who adds malicious transactions to a new block.
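The whole chain-building step can be sketched as follows (Python 3 syntax). The helper names come from the text, but the bodies are our own minimal versions, the Block class is a compact stand-in, and SHA-256 from hashlib again replaces the hand-written hash.

```python
import hashlib
import time


class Block:
    def __init__(self, index, timestamp, data, previous_hash):
        self.index = index
        self.timestamp = timestamp
        self.data = data
        self.previous_hash = previous_hash
        plaintext = "{}{}{}{}".format(index, timestamp, data, previous_hash)
        # SHA-256 stands in for the hand-written hashing function.
        self.hash = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def create_genesis_block():
    """The first block has no parent, so its previous hash is a placeholder."""
    return Block(0, time.time(), "Genesis Block", "0")


def next_block(last_block):
    """Build a block that points back at last_block through its hash."""
    index = last_block.index + 1
    return Block(index, time.time(), "Block number {}".format(index), last_block.hash)


blockchain = [create_genesis_block()]
for _ in range(5):
    blockchain.append(next_block(blockchain[-1]))

for block in blockchain:
    print(block.index, block.previous_hash[:12], "->", block.hash[:12])
```

Each printed line shows a block’s previous hash matching the hash of the block before it, which is the “cryptographic linked list” in action.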

Coding a Proof-of-Work blockchain

We will replace our next_block() function with a function called mine(). This function will allow us to have a decentralized proof of work blockchain. Proof of work implements two important elements: computational power and randomness. The computational power ensures that miners are generally acting in good faith because they are burning electricity in order to confirm the next block. Since energy isn’t free, there is a high cost to adding blocks, and so the reasoning is that rational actors won’t burn money for no reason.

The randomness helps to keep things decentralized and fair. It’s not always the miner with the most computing power that wins the block reward, even if that is the case on average. This gives an incentive to the smaller miners on the network, because they occasionally still get something even if they aren’t number one in overall mining power. The incentive for mining, which is the block reward, is the transaction fees + newly minted coins. This is the case on most proof of work blockchains.

How do the nodes agree?

So how do all the nodes on the network agree on who actually won the block reward? How does a miner know when it has won? Well, this is done through something called difficulty. It’s somewhat self-explanatory: it’s how much energy is spent to find a block on average. The network will periodically adjust the difficulty to maintain a consistent interval between blocks. If more miners join or leave the network, the total power being used changes, and this would decrease or increase the average time to find a block.

Under the hood of Proof-of-Work

How does a miner win a block reward? By getting an output hash whose sum is less than or equal to the variable called the “target”. If your blockchain’s hashing function is designed properly and has the qualities listed earlier in this article, there should be virtually no way to compute or know in advance what input will produce a certain output. So the only way to get a hash with a sum that’s at or below the target is to literally try as many random inputs (block data + nonces) to the hashing function as you can, until you get a hash that has the values you want. Again, this random hashing takes time and energy, or work, hence proof of work.

The sum of the output hash in our case will just be the index of each element added together. So if our output hash was:

ABC = 1 + 2 + 3 = 6

If our target was say 5 then we would have to try again. Now if we got lucky and our output hash was AAA:

AAA = 1 + 1 + 1 = 3

Reaching consensus

Now that our output hash is lower than the target, we would broadcast our input message. We broadcast this with the other important information to the network. Then the nodes can quickly verify that with the data we sent them, the output of the hashing function does in fact sum to 3. Now they all confirm it. Once 51% or more of the nodes confirm it, we have reached consensus, permanently adding the block to the chain.

This mechanism is quite clever because it utilizes the fact that the outputs of (good) hashing functions produce a random, approximately normally distributed set of sums. If you remember from taking a stats class, a normal distribution or Gaussian distribution fits a bell curve, which means it’s very predictable how often a random sample will lead to a certain outcome or, in our case, a specific sum.

Each character of a hash is uniformly distributed. That means no character happens more frequently than any other character. With letters valued A = 1 through Z = 26, a single character (N = 1) averages 13.5. If you increase the length of your output hash to N = 2, it would produce sums around the number 27 on average. If N was changed to 3, the sums would look even more like a bell curve, and the average would increase to about 40.
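You can check these figures by brute force. The sketch below (Python 3 syntax, with function names of our own) assumes letters valued A = 1 through Z = 26, as in the ABC = 6 example, and counts every possible hash of length 1, 2 and 3. Under that assumption the exact averages come out to 13.5, 27 and 40.5.

```python
from itertools import product

LETTER_VALUES = range(1, 27)  # assumed indexing: A = 1 ... Z = 26


def sum_counts(length):
    """Count how many hashes of the given length produce each possible sum."""
    counts = {}
    for combo in product(LETTER_VALUES, repeat=length):
        total = sum(combo)
        counts[total] = counts.get(total, 0) + 1
    return counts


def mean_sum(length):
    """Average sum over every possible hash of the given length."""
    counts = sum_counts(length)
    weighted = sum(total * n for total, n in counts.items())
    return weighted / float(sum(counts.values()))


for length in (1, 2, 3):
    print(length, mean_sum(length))
```

Printing sum_counts(3) shows the shape too: sums near the middle can be made in hundreds of ways, while the smallest and largest sums can each be made in only one.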

Adjusting network difficulty

We know that the outcome is always going to be normally distributed. So we can easily predict the probability of a sum and therefore the time and energy needed on average to find a block.

Imagine a network with only 1 miner, where the average time to try 100 hashes is 1 minute and the difficulty is set to 1 in 100 (1% odds). Our average block time would be 1 minute. If 99 more miners joined the network (and they all had the same computer and latency), the mining power would increase 100-fold, and new blocks would be discovered 100 times a minute. The network difficulty would then adjust so that any individual attempt by a miner has its odds of discovering a block reduced from 1% to 0.01%, and the block time of 1 minute is maintained.
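The arithmetic in this example can be written out as a couple of tiny helper functions (Python 3 syntax). The function names are hypothetical, and the model assumes every miner hashes at the same speed.

```python
def blocks_per_minute(miners, hashes_per_miner_per_minute, odds_per_hash):
    """Expected number of blocks found per minute across the whole network."""
    return miners * hashes_per_miner_per_minute * odds_per_hash


def adjusted_odds(miners, hashes_per_miner_per_minute, target_blocks_per_minute=1.0):
    """Per-hash odds that bring the network back to the target block rate."""
    return target_blocks_per_minute / (miners * hashes_per_miner_per_minute)


# One miner at 1% odds per hash, then 100 miners before and after adjustment.
print(blocks_per_minute(1, 100, 0.01))
print(blocks_per_minute(100, 100, 0.01))
new_odds = adjusted_odds(100, 100)
print(new_odds, blocks_per_minute(100, 100, new_odds))
```

With 100 miners the adjusted odds work out to 1 in 10,000 per hash, which is the 0.01% figure above.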

Even though we reduce the target sum linearly, the odds of getting that sum shrink much faster than linearly, because low sums sit in the thin tail of the bell curve.
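To see how quickly the odds fall, the sketch below (Python 3 syntax) counts every three-letter hash whose sum is at or below a given target. It assumes a 26-letter alphabet with A = 1, and the function name is our own.

```python
from itertools import product


def odds_at_or_below(target, length=3, alphabet_size=26):
    """Exact probability that `length` letters (A = 1 ...) sum to target or less."""
    values = range(1, alphabet_size + 1)
    hits = sum(1 for combo in product(values, repeat=length) if sum(combo) <= target)
    return hits / float(alphabet_size ** length)


for target in (40, 30, 20, 10):
    print(target, odds_at_or_below(target))
```

Halving the target from 20 to 10 cuts the number of winning hashes from 1,140 to 120 out of 17,576, so the odds drop by a factor of about 9.5 rather than 2.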

The beauty of Proof of Work (PoW)

This is very ingenious because no node on the network knows the exact hash of the next block until it’s discovered. Proof of Work makes sure everyone is honest purely by mathematics!

In proof of work implementations, if you are a miner, you take all the data you want to add in the next block and then add a randomly generated number called a nonce to that data to give it a different variation. Then you run your data through the hashing algorithm. This way you can see if the output hash is lower than the target. If it’s not, you generate a new nonce and try again. Rinse and repeat until you discover a block! And that’s it! That’s how a simple mining algorithm works in blockchain coding.

You will see in our proof of work blockchain that the mine() function is very similar to the next_block() function, except that it has a while loop which continually tries a new hash with a different nonce until it gets the right sum. With the parameters in this blockchain and a single miner, it takes about 4 seconds to find a block. If you set the target variable lower, it will take longer; set it higher and it will be quicker.
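A minimal sketch of such a mining loop is shown below (Python 3 syntax). It differs from the version described in this article in a few labeled ways: SHA-256 from hashlib replaces the hand-written hash, the “sum” is the total of the 64 hexadecimal digits (each worth 0 to 15), the nonce is a simple counter rather than a random number, and the target of 420 is a hypothetical value picked so a block is found after a handful of tries.

```python
import hashlib
import time

TARGET = 420  # hypothetical target; lower values make blocks harder to find


def hash_sum(digest):
    """Add up the value (0 to 15) of every hexadecimal digit in the digest."""
    return sum(int(digit, 16) for digit in digest)


def block_digest(index, timestamp, data, previous_hash, nonce):
    """Hash the block contents together with the nonce."""
    plaintext = "{}{}{}{}{}".format(index, timestamp, data, previous_hash, nonce)
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def mine(index, previous_hash, data, target=TARGET):
    """Try nonces until the hash of the block contents sums to target or less."""
    timestamp = time.time()
    nonce = 0
    while True:
        digest = block_digest(index, timestamp, data, previous_hash, nonce)
        if hash_sum(digest) <= target:
            return {"index": index, "timestamp": timestamp, "data": data,
                    "previous_hash": previous_hash, "nonce": nonce, "hash": digest}
        nonce += 1  # no luck: change the nonce and hash again


def verify(block, target=TARGET):
    """Any node can check a claimed block with a single hash."""
    digest = block_digest(block["index"], block["timestamp"], block["data"],
                          block["previous_hash"], block["nonce"])
    return digest == block["hash"] and hash_sum(digest) <= target


block = mine(1, "0", "Block number 1")
print(block["nonce"], hash_sum(block["hash"]), verify(block))
```

Finding the block takes many hashes, while verify() needs only one, and that asymmetry is what lets the other nodes confirm a miner’s work cheaply.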

Your own Proof of Work blockchain!

So now we have an entire Proof of Work blockchain all in our browser in only about 170 lines of code made completely from scratch! Of course this doesn’t take into account many of the more complicated parts like wallets and actually running a node on a real network. If you want to take it to the next step, you can follow the snakecoin tutorial by Gerald Nash. This will lead you right through creating an operational blockchain from scratch in Python. Additionally, it’s also where some of the code here was adapted from.  

Blockchain Coding Resources

If after this you are still excited about blockchain development there are lots of online resources for learning to code blockchain.

Cryptography

Dr. Rob Edwards from San Diego State University has a good video series on YouTube about hashing functions and programming them in C.

And if you want a more academic guide to building hash functions, you can see this paper, which builds hash functions with compression from block ciphers, similar to the hash function we built in this article.

Stanford and Coursera also have a great cryptography course that you can take online, which is rated 4.8 stars out of 5.

Programming

Codecademy

Codecademy is one of the best places to learn the basics of the more common languages. Many people highly recommend it, whether you are trying to learn programming or just trying to pick up the syntax of a new language.

Cryptozombies.io

Cryptozombies is a popular platform for learning the basics of Solidity. It teaches you by having you solve coding puzzles in your browser to build a zombie Dapp. It’s definitely a fun way to learn Dapp development for Ethereum, and it’s completely free!

Jameson Lopp Bitcoin development resources

Jameson Lopp has a great resource page for learning Bitcoin development specifically, but lots of the information will be applicable to other blockchains as well.
