The Hash, the Merkle Tree, and Bitcoin

Although cryptocurrencies have been mainstream for quite a while, I still think the popular press has not done a great job explaining the details of how they work. There are several ideas behind a cryptocurrency like Bitcoin but the main one is the concept of a cryptographic hash. In simple terms, a hash is a way to transform an input sequence of characters (i.e. a string) into an output string such that it is hard to recreate the input string from the output string. A transformation with this property is called a one-way function. It is a machine where you get an output from an input but you can’t get the input from the output and there does not exist any other machine that can get the input from the output. A hash is a one-way function where the output has a standard form, e.g. 64 characters long. So if each character is a bit, e.g. 0 or 1, then there are 2^{64} different possible hashes. What makes hashes useful are two properties. The first, as mentioned, is that it is a one-way function and the second is that two different inputs do not give the same hash, called collision avoidance. There have been decades of work on figuring out how to do this and institutions like the US National Institutes of Standards and Technology (NIST) actually publish hash standards like SHA-2.

Hashes are an important part of your life. If a company is responsible, then only the hash of your password is stored on their servers, and when you type your password into a website, it goes through the hashing function and the hash is checked against the stored version. That way, if there is security breach, only the hash list is stolen. If the company is very responsible, then your name and all of your information is also only stored in hash form. Part of the problem with past security breaches is that the companies stored actual information instead of hashed information. However, if you use the same password on different websites then the hash would be same if the same standard was used. Some really careful companies will “salt” your password by adding a random string to it (that is hopefullly stored separately) before hashing. Or they will rehash your hash with salt. If you had a perfect hash, then the only way to break it would be to guess different inputs and see if it matches the desired output. The so-called complex math problem that Bitcoin solves before validating a transaction (and getting a reward) is finding a hash with a certain property but more on this later.

Now, one of the problems with hashing is that you need to deal with inputs of various sizes but you want the output to have a single uniform size. So even though a hash could have enough information capacity (i.e. entropy) to encode all of the world’s information ten times over, it is computationally inconvenient to just feed the complete text of Hamlet directly into a single one-way function. This is where the concept of a Merkle tree becomes important. You start with some one-way function that takes inputs of some fixed length and it scrambles the characters in some way that is not easily reversible. If the input string is too short then you just add extra characters (called padding) but if it is too long you need to do something else. The way a Merkle tree works is to break the text into chunks of uniform size. It then hashes the first chunk, adds that to the next chunk, hash the result and repeat until you have included all the chunks. This repeated recursive hashing is the secret sauce of crypto-currencies.

Bitcoin tried to create a completely decentralized digital currency that could be secure and trusted. For a regular currency like the US dollar, the thing you are most concerned about is that the dollar you receive is not counterfeit. The way that problem is solved is to make the manufacturing process of dollar bills very difficult to reproduce. So the dollar uses special paper with special marks and threads and special fonts and special ink. There are laws against making photocopiers with a higher resolution than the smallest print on a US bill to safeguard against counterfeiting. A problem with digital currencies is that you need to prevent double spending. The way this is historically solved is to have all transactions validated by a central authority.

Bitcoin solves these problems in a decentralized system by using a public ledger, called a blockchain that is time stamped, immutable and verifiable. The block chain keeps track of every Bitcoin transaction. So if you wanted to transfer one Bitcoin to someone else then the blockchain would show that your private Bitcoin ID has one less Bitcoin and the ID of the person you transferred to would have one extra Bitcoin. It is called a blockchain because each transaction (or set of transactions) is recorded into a block, the blocks are sequential, and each block contains a hash of the previous block. To validate a transaction you would need to validate each transaction leading up to the last block to validate that the hash on each block is correct. Thus the blockchain is a Merkle tree ledger where each entry is time stamped, immutable, and verifiable. If you want to change a block you need to change all the blocks before it.

However, the blockchain is not decentralized on its own. How do you prevent two blocks with two different hashes? The way to achieve that goal is to make the hash used in each block have a special form that is hard to find. This underlies the concept of “proof of work”. Bitcoin uses a hash called SHA-256 which consists of a hexadecimal string of 64 characters (i.e. a base 16 number, usually with characters consisting of the digits 0-9 plus letters a-f). Before each block gets added to the chain, it must have a hash that has a set number of zeros at the front. In order to do this, you need to add some random numbers to the block or rearrange it so that the hash changes. This is what Bitcoin miners do. They try different variations of the block until they get a hash that has a certain number of zeros in front and then they race to see who gets it first. The more zeros you need the more guesses you need and thus the harder the computation. If it’s just one zero then one in 16 hashes will have that property and thus on average 16 tries will get you the hash and the right to add to the blockchain. Each time you require an additional zero, the number of possibilities decreases by a factor of 16 so it is 16 times harder to find one. Bitcoin wants to keep the computation time around 15 minutes so as computation speed increases it just adds another zero. The result is an endless arms race. The faster the computers get the harder the hash is to find. The incentive for miners to be fast is that they get some Bitcoins if they are successful in being the first to find a desired hash and earning the right to add a block to the chain.

The actual details for how this works is pretty complicated. All the miners (or Bitcoin nodes) must validate that the proposed block is correct and then they all must agree to add that to the chain. The way it works in a decentralized way is that the code is written so that a node will follow the longest chain. In principle, this is secure because a dishonest miner who wants to change a previous block must change all blocks following it and thus as long as there are more honest miners than dishonest ones, the dishonest ones can never catch up. However, there are issues when two miners simultaneously come up with a hash and they can’t agree on which to follow. This is called a fork and has happened at least once I believe. This gets fixed eventually because honest miners will adopt the longest chain and the chain with the most adherents will grow the fastest. However, in reality there are only a small number of miners that regularly add to the chain so we’re at a point now where a dishonest actor could possibly dominate the honest ones and change the blockchain. Proof of work is also not the only way to add to a blockchain. There are several creative ideas to make it less wasteful or even make all that computation useful and I may write about them in the future. I’m somewhat skeptical about the long term viability of Bitcoin per se but I think the concepts of the blockchain are revolutionary and here to stay.

2021-06-21: some typos fixed and clarifying text added.

A Covid-19 Manhattan Project

Right now there are hundreds if not thousands of Covid-19 models floating out there. Some are better than others and some have much more influence than others and the ones that have the most influence are not necessarily the best. There must be a better way of doing this. The world’s greatest minds convened in Los Alamos in WWII and built two atomic bombs. Metaphors get thrown around with reckless abandon but if there ever was a time for a Manhattan project, we need one now. Currently, the world’s scientific community has mobilized to come up with models to predict the spread and effectiveness of mitigation efforts, to produce new therapeutics and to develop new vaccines. But this is mostly going on independently.

Would it be better if we were to coordinate all of this activity. Right now at the NIH, there is an intense effort to compile all the research that is being done in the NIH Intramural Program and to develop a system where people can share reagents and materials. There are other coordination efforts going on worldwide as well.  This website contains a list of open source computational resources.  This link gives a list of data scientists who have banded together. But I think we need a world wide plan if we are ever to return to normal activity. Even if some nation has eliminated the virus completely within its borders there is always a chance of reinfection from outside.

In terms of models, they seem to have very weak predictive ability. This is probably because they are all over fit. We don’t really understand all the mechanisms of SARS-CoV-2 propagation. The case or death curves are pretty simple and as Von Neumann or Ulam or someone once said, “give me 4 parameters and I can fit an elephant, give me 5 and I can make it’s tail wiggle.” Almost any model can fit the curve but to make a projection into the future, you need to get the dynamics correct and this I claim, we have not done. What I am thinking of proposing is to have the equivalent of FDA approval for predictive models. However, instead of a binary decision of approval non-approval, people could submit there models for a predictive score based on some cross validation scheme or prediction on a held out set. You could also submit as many times as you wish to have your score updated. We could then pool all the models and produce a global Bayesian model averaged prediction and see if that does better. Let me know if you wish to participate or ideas on how to do this better.

Catch-22 of our era

The screen on my wife’s iPhone was shattered this week and she had not backed up the photos. The phone seems to still be functioning otherwise so we plugged it into the computer to try to back it up but it requires us to unlock the phone and we can’t enter in the password. My wife refused to pay the 99 cents or whatever Apple charges to increase the disk space for iCloud to automatically back up the phone, so I suggested we just pay the ransom money and then the phone will back up automatically. I currently pay both Apple and Dropbox extortion money. However, since she hadn’t logged onto iCloud in maybe ever, it sent a code to her phone under the two-factor authentication scheme to type in to the website, but of course we can’t see it on her broken screen so that idea is done. We called Apple and they said you could try to change the number on her iCloud account to my phone but that was two days ago and they haven’t complied. So my wife gave up and tried to order a new phone. Under the new system of her university, which provides her phone, she can get a phone if she logs onto this site to request it. The site requires VPN and in order to get VPN she needs to, you guessed it, type in a code sent to her phone. So you need a functioning phone to order a new phone. Basically, tech products are not very good. Software still kind of sucks and is not really improving. My Apple Mac is much worse now than it was 10 years ago. I still have trouble projecting stuff on a screen. I will never get into a self driving car made by any tech company. I’ll wait for Toyota to make one; my (Japanese) car always works (my Audi was terrible).

Missing the trend

I have been fortunate to have been born at a time when I had the opportunity to witness the birth of several of the major innovations that shape our world today.  I have also managed to miss out on capitalizing on every single one of them. You might make a lot of money betting against what I think.

I was a postdoctoral fellow in Boulder, Colorado in 1993 when my very tech savvy advisor John Cary introduced me and his research group to the first web browser Mosaic shortly after it was released. The web was the wild west in those days with just a smattering of primitive personal sites authored by early adopters. The business world had not discovered the internet yet. It was an unexplored world and people were still figuring out how to utilize it. I started to make a list of useful sites but unlike Jerry Yang and David Filo, who immediately thought of doing the same thing and forming a company, it did not remotely occur to me that this activity could be monetized. Even though I struggled to find a job in 1994, was fairly adept at programming, watched the rise of Yahoo! and the rest of the internet startups, and had friends at Stanford and Silicon Valley, it still did not occur to me that perhaps I could join in too.

Just months before impending unemployment, I managed to talk my way into being the first post doc of Jim Collins, who just started as a non-tenure track research assistant professor at Boston University.  Midway through my time with Jim, we had a meeting with Charles Cantor, who was a professor at BU then, about creating engineered organisms that could eat oil. Jim subsequently recruited graduate student Tim Gardner, now CEO of Riffyn, to work on this idea. I thought we should create a genetic Hopfield network and I showed Tim how to use XPP to simulate the various models we came up with. However, my idea seemed too complicated to implement biologically so when I went to Switzerland to visit Wulfram Gerstner at the end of 1997,  Tim and Jim, freed from my meddling influence, were able create the genetic toggle switch and the field of synthetic biology was born.

I first learned about Bitcoin in 2009 and had even thought about mining some. However, I then heard an interview with one of the early developers, Gavin Andresen, and he failed to understand that because the supply of Bitcoins is finite, prices denominated in it would necessarily deflate over time. I was flabbergasted that he didn’t comprehend the basics of economics and was convinced that Bitcoin would eventually fail. Still, I could have mined thousands of Bitcoins on a laptop back then, which would be worth tens of millions today.  I do think blockchains are an important innovation and my former post-bac fellow Wally Xie is even the CEO of the blockchain startup QChain. Although I do not know where cryptocurrencies and blockchains will be in a decade, I do know that I most likely won’t have a role.

I was in Pittsburgh during the late nineties/early 2000’s in one of the few places where neural networks/deep learning, still called connectionism, was king. Geoff Hinton had already left Carnegie Mellon for London by the time I arrived at Pitt but he was still revered in Pittsburgh and I met him in London when I visited UCL. I actually thought the field had great promise and even tried to lobby our math department to hire someone in machine learning for which I was summarily dismissed and mocked. I recruited Michael Buice to work on the path integral formulation for neural networks because I wanted to write down a neural network model that carried both rate and correlation information so I could implement a correlation based learning rule. Michael even proposed that we work on an algorithm to play Go but obviously I demurred. Although, I missed out on this current wave of AI hype, and probably wouldn’t have made an impact anyway, this is the one area where I may get a second chance in the future.

 

 

AlphaGo and the Future of Work

In March of this year, Google DeepMind’s computer program AlphaGo defeated world Go champion Lee Sedol. This was hailed as a great triumph of artificial intelligence and signaled to many the beginning of the new age when machines take over. I believe this is true but the real lesson of AlphaGo’s win is not how great machine learning algorithms are but how suboptimal human Go players are. Experts believed that machines would not be able to defeat humans at Go for a long time because the number of possible games is astronomically large, \sim 250^{150} moves, in contrast to chess with a paltry \sim 35^{80} moves. Additionally, unlike chess, it is not clear what is a good position and who is winning during intermediate stages of a game. Thus, any direct enumeration and evaluation of possible next moves as chess computers do, like IBM’s Deep Blue that defeated Gary Kasparov, seemed to be impossible. It was thought that humans had some sort of inimitable intuition to play Go that machines were decades away from emulating. It turns out that this was wrong. It took remarkably little training for AlphaGo to defeat a human. All the algorithms used were fairly standard – supervised and reinforcement backpropagation learning in multi-layer neural networks1. DeepMind just put them together in a clever way and had the (in retrospect appropriate) audacity to try.

The take home message of AlphaGo’s success is that humans are very, very far away from being optimal at playing Go. Uncharitably, we simply stink at Go. However, this probably also means that we stink at almost everything we do. Machines are going to take over our jobs not because they are sublimely awesome but because we are stupendously inept. It is like the old joke about two hikers encountering a bear and one starts to put on running shoes. The other hiker says: “Why are you doing that? You can’t outrun a bear.” to which she replies, “I only need to outrun you!” In fact, the more difficult a job seems to be for humans to perform, the easier it will be for a machine to do better. This was noticed a long time ago in AI research and called Moravec’s Paradox. Tasks that require a lot of high level abstract thinking like chess or predicting what movie you will like are easy for computers to do while seemingly trivial tasks that a child can do like folding laundry or getting a cookie out of a jar on an unreachable shelf is really hard. Thus high paying professions in medicine, accounting, finance, and law could be replaced by machines sooner than lower paying ones in lawn care and house cleaning.

There are those who are not worried about a future of mass unemployment because they believe people will just shift to other professions. They point out that a century ago a majority of Americans worked in agriculture and now the sector comprises of less than 2 percent of the population. The jobs that were lost to technology were replaced by ones that didn’t exist before. I think this might be true but in the future not everyone will be a software engineer or a media star or a CEO of her own company of robot employees. The increase in productivity provided by machines ensures this. When the marginal cost of production goes to zero (i.e. cost to make one more item), as it is for software or recorded media now, the whole supply-demand curve is upended. There is infinite supply for any amount of demand so the only way to make money is to increase demand.

The rate-limiting step for demand is the attention span of humans. In a single day, a person can at most attend to a few hundred independent tasks such as thinking, reading, writing, walking, cooking, eating, driving, exercising, or consuming entertainment. I can stream any movie I want now and I only watch at most twenty a year, and almost all of them on long haul flights. My 3 year old can watch the same Wild Kratts episode (great children’s show about animals) ten times in a row without getting bored. Even though everyone could be a video or music star on YouTube, superstars such as Beyoncé and Adele are viewed much more than anyone else. Even with infinite choice, we tend to do what our peers do. Thus, for a population of ten billion people, I doubt there can be more than a few million that can make a decent living as a media star with our current economic model. The same goes for writers. This will also generalize to manufactured goods. Toasters and coffee makers essentially cost nothing compared to three decades ago, and I will only buy one every few years if that. Robots will only make things cheaper and I doubt there will be a billion brands of TV’s or toasters. Most likely, a few companies will dominate the market as they do now. Even, if we could optimistically assume that a tenth of the population could be engaged in producing goods and services necessary for keeping the world functioning that still leaves the rest with little to do.

Even much of what scientists do could eventually be replaced by machines. Biology labs could consist of a principle investigator and robot technicians. Although it seems like science is endless, the amount of new science required for sustaining the modern world could diminish. We could eventually have an understanding of biology sufficient to treat most diseases and injuries and develop truly sustainable energy technologies. In this case, machines could be tasked to keep the modern world up and running with little need of input from us. Science would mostly be devoted to abstract and esoteric concerns.

Thus, I believe the future for humankind is in low productivity occupations – basically a return to pre-industrial endeavors like small plot farming, blacksmithing, carpentry, painting, dancing, and pottery making, with an economic system in place to adequately live off of this labor. Machines can provide us with the necessities of life while we engage in a simulated 18th century world but without the poverty, diseases, and mass famines that made life so harsh back then. We can make candles or bread and sell them to our neighbors for a living wage. We can walk or get in self-driving cars to see live performances of music, drama and dance by local artists. There will be philosophers and poets with their small followings as they have now. However, even when machines can do everything humans can do, there will still be a capacity to sustain as many mathematicians as there are people because mathematics is infinite. As long as P is not NP, theorem proving can never be automated and there will always be unsolved math problems.  That is not to say that machines won’t be able to do mathematics. They will. It’s just that they won’t ever be able to do all of it. Thus, the future of work could also be mathematics.

  1. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

The simulation argument made quantitative

Elon Musk, of Space X, Tesla, and Solar City fame, recently mentioned that he thought the the odds of us not living in a simulation were a billion to one. His reasoning was based on extrapolating the rate of improvement in video games. He suggests that soon it will be impossible to distinguish simulations from reality and in ten thousand years there could easily be billions of simulations running. Thus there are a billion more simulated universes than real ones.

This simulation argument was first quantitatively formulated by philosopher Nick Bostrom. He even has an entire website devoted to the topic (see here). In his original paper, he proposed a Drake-like equation for the fraction of all “humans” living in a simulation:

f_{sim} = \frac{f_p f_I N_I}{f_p f_I N_I + 1}

where f_p is the fraction of human level civilizations that attain the capability to simulate a human populated civilization, f_I is the fraction of these civilizations interested in running civilization simulations, and N_I is the average number of simulations running in these interested civilizations. He then argues that if N_I is large, then either f_{sim}\approx 1 or f_p f_I \approx 0. Musk believes that it is highly likely that N_I is large and f_p f_I is not small so, ergo, we must be in a simulation. Bostrom says his gut feeling is that f_{sim} is around 20%. Steve Hsu mocks the idea (I think). Here, I will show that we have absolutely no way to estimate our probability of being in a simulation.

The reason is that Bostrom’s equation obscures the possibility of two possible divergent quantities. This is more clearly seen by rewriting his equation as

f_{sim} = \frac{y}{x+y} = \frac{y/x}{y/x+1}

where x is the number of non-sim civilizations and y is the number of sim civilizations. (Re-labeling x and y as people or universes does not change the argument). Bostrom and Musk’s observation is that once a civilization attains simulation capability then the number of sims can grow exponentially (people in sims can run sims and so forth) and thus y can overwhelm x and ergo, you’re in a simulation. However, this is only true in a world where x is not growing or growing slowly. If x is also growing exponentially then we can’t say anything at all about the ratio of y to x.

I can give a simple example.  Consider the following dynamics

\frac{dx}{dt} = ax

\frac{dy}{dt} = bx + cy

y is being created by x but both are both growing exponentially. The interesting property of exponentials is that a solution to these equations for a > c is

x = \exp(at)

y = \frac{b}{a-c}\exp(at)

where I have chosen convenient initial conditions that don’t affect the results. Even though y is growing exponentially on top of an exponential process, the growth rates of x and y are the same. The probability of being in a simulation is then

f_{sim} = \frac{b}{a+b-c}

and we have no way of knowing what this is. The analogy is that you have a goose laying eggs and each daughter lays eggs, which also lay eggs. It would seem like there would be more eggs from the collective progeny than the original mother. However, if the rate of egg laying by the original mother goose is increasing exponentially then the number of mother eggs can grow as fast as the number of daughter, granddaughter, great…, eggs. This is just another example of how thinking quantitatively can give interesting (and sometimes counterintuitive) results. Until we have a better idea about the physics underlying our universe, we can say nothing about our odds of being in a simulation.

Addendum: One of the predictions of this simple model is that there should be lots of pre-sim universes. I have always found it interesting that the age of the universe is only about three times that of the earth. Given that the expansion rate of the universe is actually increasing, the lifetime of the universe is likely to be much longer than the current age. So, why is it that we are alive at such an early stage of our universe? Well, one reason may be that the rate of universe creation is very high and so the probability of being in a young universe is higher than being in an old one.

Addendum 2: I only gave a specific solution to the differential equation. The full solution has the form Y_1\exp(at) + Y_2 \exp(ct).  However, as long as a >c, the first term will dominate.

Addendum 3: I realized that I didn’t make it clear that the civilizations don’t need to be in the same universe. Multiverses with different parameters are predicted by string theory.  Thus, even if there is less than one civilization per universe, universes could be created at an exponentially increasing rate.

 

Chomsky on The Philosopher’s Zone

Listen to MIT Linguistics Professor Noam Chomsky on ABC’s radio show The Philosopher’s Zone (link here).  Even at 87, he is still as razor sharp as ever. I’ve always been an admirer of Chomsky although I think I now mostly disagree with his ideas about language. I do remember being completely mesmerized by the few talks I attended when I was a graduate student.

Chomsky is the father of modern linguistics. He turned it into a subfield of computer science and mathematics. People still use Chomsky Normal Form and the Chomsky Hierarchy in computer science. Chomsky believes that the language ability is universal among all humans and is genetically encoded. He comes to this conclusion because in his mathematical analysis of language he found what he called “deep structures”, which are embedded rules that we are consciously unaware of when we use language. He was adamantly opposed to the idea that language could be acquired via a probabilistic machine learning algorithm. His most famous example is that we know that the sentence “Colorless green ideas sleep furiously” makes grammatical sense but is nonsensical while the sentence “Furiously sleep ideas green colorless”, is nongrammatical. Since, neither of these sentences had ever been spoken nor written he surmised that no statistical algorithm could ever learn the difference between the two. I think it is pretty clear now that Chomsky was incorrect and machine learning can learn to parse language and classify these sentences. There has also been field work that seems to indicate that there do exist languages in the Amazon that are qualitatively different form the universal set. It seems that the brain, rather than having an innate ability for grammar and language, may have an innate ability to detect and learn deep structure with a very small amount of data.

The host Joe Gelonesi, who has filled in admirably for the sadly departed Alan Saunders, asks Chomsky about the hard problem of consciousness near the end of the program. Chomsky, in his typical fashion of invoking 17th and 18th century philosophy, dismisses it by claiming that science itself and physics in particular has long dispensed with the equivalent notion. He says that the moment that Newton wrote down the equation for gravitational force, which requires action at a distance, physics stopped being about making the universe intelligible and became about creating predictive theories. He thus believes that we will eventually be able to create a theory of consciousness although it may not be intelligible to humans. He also seems to subscribe to panpsychism, where consciousness is a property of matter like mass, an idea championed by Christof Koch and Giulio Tononi. However, as I pointed out before, panpsychism is dualism. If it does exist, then it exists apart from the way we currently describe the universe. Lately, I’ve come to believe and accept the fact that consciousness is an epiphenomenon and has no causal consequence in the universe. I must credit David Chalmers (e.g. see previous post) for making it clear that this is the only recourse to dualism. We are no more nor less than automata caroming through the universe, with the ability to spectate a few tens of milliseconds after the fact.

Addendum: As pointed out in the comments, there are monoistic theories, as espoused by Bishop Berkeley, where only ideas are real.  My point about the only recourse to dualism is epiphenomena for consciousness, is if one adheres to materialism.

 

 

 

 

 

R vs Matlab

It has now been almost half a year since I switched from Matlab to open source software and I’ve been amazed at how easy the transition has been. I had planned to replace Matlab with Python, Julia, and R but I have found that R and Julia have been sufficient for my requirements. Maxima is also more than adequate to replace Mathematica. I have become particularly attached to R especially after I started to use R Studio as the interface. I had only used R before as just a statistics tool but it really is a complete programming platform like Matlab. It has very nice graphics capabilities and I find the language very nice to program in. I really like its use of lists where I can pile sets of any type and any size into one object. I also like how R Studio can save your work into Projects, which keeps the whole environment and history in one place. I can then switch between multiple projects and everything comes back. The only thing I miss from Matlab is the command completion history feature, where I could easily find a previous command by just typing the first few letters. Also, I haven’t quite figured out how to output R data into a text file seamlessly yet. I seem to always get extraneous row or column information. I use Julia when I want to write a code that needs to loop fast but for everything else I’ve been reaching for R.

Paper on new version of Plink

The paper describing the updated version of the genome analysis software tool Plink has just been published.

Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee

GigaScience 2015, 4:7  doi:10.1186/s13742-015-0047-8

Abstract
Background
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

Findings
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, View MathML-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).

Conclusions
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Keywords: GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics

 

This project started out with us trying to do some genomic analysis that involved computing various distance metrics on sequence space. Programming virtuoso Chris Chang stepped in and decided to write some code to speed up the computations. His program, originally called wdist, was so good and fast that we kept asking him to put in more capabilities. Eventually,  he had basically replicated the suite of functions that Plink performed so he contacted Shaun Purcell, the author of Plink, if he could just call his code Plink too and Shaun agreed. We then ran a series of tests on various machines to check the speed-ups compared to the original Plink and gcta. If you do any GWAS analysis at all, I highly recommend you check out Plink 1.9.

The tragic life of Walter Pitts

Everyone in computational neuroscience knows about the McCulloch-Pitts neuron model, which forms the foundation for neural network theory. However, I never knew anything about Warren McCulloch or Walter Pitts until I read this very interesting article in Nautilus. I had no idea that Pitts was a completely self-taught genius that impressed the likes of Bertrand Russell, Norbert Wiener and John von Neumann but was also a self-destructive alcoholic. One thing the article nicely conveys was the camaraderie and joie de vivre that intellectuals experienced in the past. Somehow this spirit seems missing now.

Open source software for math and science

Here is a list of open source software that you may find useful.  Some, I use almost every day, some I have not yet used, and some may be so ubiquitous that you have even forgotten that it is software.

1. XPP/XPPAUT. Bard Ermentrout wrote XPP in the 1980’s as a dynamical systems tool for himself. It’s now the de facto tool for the Snowbird community.  I still find it to be the easiest and fastest way to simulate and visualize differential equations.  It includes the equally excellent bifurcation continuation software tool AUTO originally written by Eusebius Doedel with contributions from a who’s who list of mathematicians.  XPP is also available as an iPad and iPhone App.

2. Julia. I only learned about Julia this spring and now I use it for basically anything I used to use Matlab for.  It’s syntax is very similar to Matlab and it’s very fast. I think it is quickly gaining a large following and may be as comprehensive as Python some day.

3. Python often seems more like a way of life than a software tool. I would probably be using Python if it were not for Julia and the fact that Julia is faster. Python has packages for everything. There is SciPy and NumPy for scientific computing, Pandas for statistics, Matplotlib for making graphs, and many more that I don’t yet know about.  I must confess that I still don’t know my way around Python but my fellows all use it.

4. R. For statistics, look no further than R, which is what academic statisticians use. It’s big in Big Data.  So big that I heard that Microsoft is planning to write a wrapper for it. I also heard that billionaire mathematician James Simons’s hedge fund Renaissance Technologies uses it.  For Bayesian inference there is now Stan, which implements Hamilton Monte Carlo.  We tried using it for one of our projects and had trouble getting it to work but it’s improving very fast.

5. AMS-Latex. The great computer scientist Donald Knuth wrote the typesetting language TeX in 1978 and he changed scientific publication forever. If you have ever had to struggle putting equations into MS Word, you’ll realize what a genius Knuth is. Still TeX was somewhat technical and thus LaTeX was invented as a simplified interface for TeX with built-in environments that are commonly used. AMS-Latex is a form of LaTeX that includes commands for any mathematical symbol you’ll ever need. It also has very nice equation and matrix alignment tools.

6. Maxima. Before Mathematica and Maple there was Macsyma. It was a symbolic mathematics system developed over many years at MIT starting in the 60’s. It was written in the programming language Lisp (another great open source tool but I have never used it) and was licensed by MIT to a company called Symbolics that made dedicated Lisp machines that ran Macsyma.  My Thesis advisor at MIT bought one of these machines (I think it cost him something like 20 thousand dollars, which was a lot of money back then) and I used it for my thesis. I really loved Macysma and got quite adept at it. However, as you can imagine the Symbolics business plan really didn’t pan out and Macysma kind of languished after the company failed. However, after many trials and tribulations, Macsyma was reborn as the open source software tool Maxima and it’s great.  I’ve been running wmMaxima and it can do everything that I ever needed Mathematica for with the bonus that I don’t have to find and re-enter my license number every few months.

7. OpenOffice. I find it reprehensible that scientific journals force me to submit my papers in Microsoft Word. But MS Office is a monopoly and all my collaborators use it.  Data always comes to me in Excel and talks are in PowerPoint. For my talks, I use Apple Keynote, which is not open source. However, Apple likes to completely overhaul their software so my old talks are not even compatible with the most recent version. I also dislike the current version. The reason I went to Keynote is because I could embed PDFs of equations made in LaTeXiT (donation ware). However, the new version makes this less convenient. PDFs looked terrible in PowerPoint a decade ago. I have no idea if this has changed or not.  I have flirted with using OpenOffice for many years but it was never quite 100% compatible with MS Office so I could never fully dispense with Word.  However, in my push to open source, I may just write my next talk in OpenOffice.

8. Plink The standard GWAS analysis tool is Plink, originally written by Shaun Purcell.  It’s nice but kind of slow for some computations and was not being actively updated.  It also couldn’t do some of the calculations we wanted.  So in steps my collaborator Chris Chang who took it upon himself to write a software tool that could do all the calculations we needed. His code was so fast and good that we started to ask him to add more and more to it. Eventually, it did almost everything that Plink and gcta (tool for estimating heritability) could do and thus he asked Purcell if he could just call it Plink. It’s currently called Plink 1.9.

9. C/C++  We tend to forget that computer languages like C, Java, Javascript, Ruby, etc. are all open source software tools.

10. Inkscape is a very nice drawing program, an open source Adobe Illustrator if you will.

11. GNU Project. Computer scientist Richard Stallman kind of invented the concept of open software. He started the free software foundation and the GNU Project, which includes GNU/Linux, the editor emacs, gnuplot among many other things.

Probably the software tools you use most that are currently free (but may not be forever) are the browser and email. People forget how much these two ubiquitous things have completely changed our lives.  When was the last time you went to the library or wrote a letter in ink?

Julia update

I’m totally committed to Julia now.  It is blitzing fast and very similar to Matlab with some small differences (improvements). I particularly like the fact that when you declare an array with one argument, like x = zeros(10), it immediately gives you a vector of length 10 and not a 10 X 10 matrix. The broadcast function is also very useful.  There are still many things I don’t yet know how to do in Julia like how to import and export data.  Plotting is also not fully solved. I’ve been using PyCall to import matplotlib.pyplot, which works pretty well but not perfectly. There are things I miss in Matlab like context dependent history recall, i.e. I can recall an old command line by just typing the first few letters.  If anyone knows how to do this in Julia please let me know.  Right now, I’m hitting the uparrow button continuously until I find the line I want. I do worry that my abandoning Matlab means someone will lose their job, which is certainly not my intention. However, I am well ahead of schedule for zero Matlab.

MCMC for linear models

I’ve been asked in a comment to give a sample of pseudo code for an MCMC algorithm to fit a linear model ax + b to some data. See here for the original post on MCMC. With a linear model, you can write down the answer in closed form (see here), so it is a good model to test your algorithm and code.  Here it is in pseudo-Julia code:

#  initial guess for parameters a and b 
a=0
b=0
# construct chi squared, where D is the data vector and x is the vector of the
# independent quantity
chi = norm(D - (a*x +b))^2;
for n = 1 : total;
# Make random guesses for new normally distributed a and b with mean old a and b
# and standard deviation asig and bsig
at = a + asig * randn()
bt = b + bsig * randn() chit = norm(D - (at*x + bt))^2;
# Take ratio of likelihoods, sigma is the data uncertainty
ratio=exp((-chit + chi)/(2*sigma^2));
# Compare the ratio to a uniform random number between 0 and 1, 
# keep new parameters if ratio exceeds random number
if rand() < ratio a = at;
b = bt; chi = chit; end

end

# Keep running until convergence

Julia vs Python update

We’ve been using Julia for a little over a month now and I am quite pleased thus far. It is fast enough for us to get decent results for our MCMC runs; runs for which Python and Brian were too slow. Of course, we probably could have tried to optimize our other codes but Julia installs right out of the box and is very easy to program in. We still have issues for plotting but do have minimal plot functionality using PyCall and importing matplotlib.pyplot. One can also dump the results to a file and have either Python or some other software make the plots. I already find myself reaching for Julia when I want to write a quick code to solve a problem instead of relying on Matlab like I used to. While I would like to try PyDSTool, the lack of a trivial installation is holding us back. For you budding software developers out there, if it takes more than one click on a link to make it work on my computer, I am likely to go to something else. The reason I switched from Linux to Mac a decade ago was that I wanted Unix capability and the ability to print without having to consult with a sys admin.

Numerical integration

Solving differential equations numerically is a very mature field with multiple algorithms carefully explained in multiple textbooks. However, when it comes down to actually solving them in practice by nonspecialists, a lot of people, myself included, will often resort to the Euler method. There are two main reasons for this. The first is that it is trivial to remember and code and the second is that Moore’s law has increased the speed of computing sufficiently that we can get away with it. However, I think it’s time to get out of the eighteenth century and at least move into the nineteenth.

Suppose you have a differential equation \dot{x}=f(x,t), where x can be a scalar or a vector in an arbitrary number of dimensions.  The Euler method simply discretizes the time derivative via

\frac{dx}{dt} \rightarrow \frac{x(t+h)-x(h)}{h}

so that x at time step t+h is given by

x(t+h)=x(t)+h*f(x(t),t)

Now what could be simpler than that? There are two problems with this algorithm. The first is that it is only accurate to order h so you need to have very small steps to simulate accurately. The second is that the Euler method is unstable if the step size is too large. The way around these limitations is to use very small time steps, which is computationally expensive.

The second most popular algorithm is probably (4th order) Runge-Kutta, which can generally use larger time steps for the same accuracy. However, a higher order algorithm is not necessarily faster because what really slows down a numerical scheme is the number of function evaluations it needs to make and 4th order Runge-Kutta needs to make a lot of function evaluations. Runge-Kutta is also plagued by the same stability problems as Euler and that can also limit the maximum allowed step size.

Now, these problems have been known for decades if not centuries and work-arounds do exist. In fact, the most sophisticated solvers such as CVODE tend to use adaptive multi-step methods, like Adams-Bashforth (AB) or it’s implicit cousin Adams-Moulton and this is what we should all be implementing in our own codes. Below I will give simple pseudo-code to show how a 2nd order Adams-Bashforth scheme is not much more complicated to code than Euler and uses the same number of function evaluations. In short, with not much effort, you can speed your code immensely by simply switching to AB.

The cleverness of AB is that it uses previous steps to refine the estimate for the next step. In this way, you get a higher order scheme without having to re-evaluate the right hand side of your ODE. You simply reuse your computations. It’s the ultimate green scheme. For the same ODE example above 2nd order AB is simply

x(t+2h)=x(t+h)+h\left(\frac{3}{2}f(x(t+h),t+h)-\frac{1}{2}f(x(t),t)\right)

You just need to store the previous two evaluations to compute the next step.  To start the algorithm, you can just use an Euler algorithm to get the first step.  

Here is the 2nd order AB algorithm written in pseudo Julia code for a one dimensional version of the ODE above.

# make x a two dimensional vector for the two time steps

# Set initial condition

x[1]=0;

# Store evaluation of function f(x,t) for later use

fstore[1]= f(x[1],h)

# Take Euler step

x[2] = x[1] + h*fstore[1]

# store function at time = 2h

fstore[2]  = f(x[2],2*h)

# Store computed values for output

xout[1]=x[1]

xout[2]=x[2]

# Take AB steps

for t = 3:Tfinal

     # set update indices for x based on parity of t

     index2 = t%2+1

     index1 = (t-1)%2+1

     # 2nd order AB step

     x[index1] = x[index2]+h*(1.5*fstore[2]-0.5*fstore[1]))

     # update stored function

     fstore[index1] = f(x[index1],t*h)

     xout[t]=x[index1]

end

return xout

Zero Matlab 2015

So I have bitten the bullet and am committed to phasing out Matlab completely by 2015.  I have Julia installed and can sort of run code in it although I have no idea how to change directories.  cd() doesn’t seem to work for me.  I have also tried to install matplotlib on my MacBook Pro running OS 10.8 using a SciPy superpack but it does not seem to work yet.  When I try to plot something, nothing happens but I have some bouncing rocket ships in my dock.  Feel free to let me know what I’m doing wrong.

 

Addendum: I installed iPython using Pip and I can plot out of iPython

The need for speed

Thanks for all the comments about the attributes of Python and Julia. It seems to me that the most prudent choice is to learn Python and Julia. However, what I would really like to know is just how fast these languages really are and here is the test. What I want to do is to fit networks of coupled ODEs (and  PDEs) to data using MCMC (see here). This means I need a language that loops fast. An example in pseudo-Matlab code would be

for n = 1:N

for i = 1:T

y(i+1) = M\y(i)

end

Compare to data and set new parameters

end

where h is a parameter and M is some matrix (say 1000 dimensional), which is sometimes a Toeplitz matrix but not always. Hence, in each time step I need to invert a matrix, which can depend on time so I can’t always precompute, and do a matrix multiplication. Then in each parameter setting step I need to sum an objective function like the mean square error over all the data points. The code to do this in C or Fortran can be pretty complicated because you have to keep track of all the indices and call linear algebra libraries. I thus want something that has the simple syntax of Matlab but is as fast as C. Python seems to be too slow for our needs but maybe we haven’t optimized the code. Julia seems like the perfect fit but let me know if I am just deluded.

Julia vs Python

I was about to start my trek up Python mountain until Bard Ermentrout tipped me to the Julia language and I saw this speed table from here (lower is faster):

Fortran Julia Python R Matlab Octave Mathe-matica JavaScript Go
gcc 4.8.1 0.2 2.7.3 3.0.2 R2012a 3.6.4 8.0 V8 3.7.12.22 go1
fib 0.26 0.91 30.37 411.36 1992.00 3211.81 64.46 2.18 1.03
parse_int 5.03 1.60 13.95 59.40 1463.16 7109.85 29.54 2.43 4.79
quicksort 1.11 1.14 31.98 524.29 101.84 1132.04 35.74 3.51 1.25
mandel 0.86 0.85 14.19 106.97 64.58 316.95 6.07 3.49 2.36
pi_sum 0.80 1.00 16.33 15.42 1.29 237.41 1.32 0.84 1.41
rand_mat_stat 0.64 1.66 13.52 10.84 6.61 14.98 4.52 3.28 8.12
rand_mat_mul 0.96 1.01 3.41 3.98 1.10 3.41 1.16 14.60 8.51

Julia is a dynamic high level language like MATLAB and Python that is open source and developed at MIT. The syntax looks fairly simple and it is about as fast as C (Fortran looks like it still is the Ferrari of scientific computing). Matlab is fast for vector and matrix operations but deadly slow for loops. I had no idea that Mathematica was so fast. Although Julia is still relatively new and not nearly as expansive as Python, should I drop Python for Julia?

The MATLAB handcuff

The first computer language I learned was BASIC back in the stone age, which led directly to Fortran. These are procedural languages that allow the infamous GOTO statement, now shunned by the computer literati. Programming with the GOTO gives you an appreciation for why the Halting problem is undecidable.  Much of what I did in those days was to track down infinite loops. I was introduced to structured programming in university, where I learned Pascal. I didn’t really know what structured programming meant except that I no longer could use GOTO and there were data structures like records. I was forced to use APL at a summer job. I have little recollection of the language except that it was extremely terse and symbolic. It was fun to try to construct the shortest program possible to do the task. The ultimate program was the so-called “APL one liner”. APL gave me first hand experience of the noncomputability of Kolmogorov complexity. In graduate school I went back to Fortran, which was the default language to do scientific computing at that time. I also used the computer algebra system called Macsyma, which was much better than Mathematica. I used it to do Taylor expansions and perturbation theory. I was introduced to C and C++ in my first postdoc. That was an eye-opening experience as I never really understood how a computer worked until I programmed in C. Pointer arithmetic was a revelation. I now had such control and power. C++ was the opposite of C for me. Object oriented programming takes you very far away from the workings of a computer. I basically programmed exclusively in C for a decade – just C and XPP, which was a real game changer. I had no need for anything else until I got to NIH. It was only then that I finally sat down and programmed in MATLAB. I had resisted up to that point and still feel like it is cheating but I now almost do all of my programming in MATLAB, with a smattering of R and XPP of course. I’m also biased against MATLAB because it gave a wrong answer in a previous version. At first, I programmed in MATLAB as I would in C or Fortran but when it came down to writing the codes to estimate heritability directly from GWAS (see here), the matrix manipulating capabilities of MATLAB really became useful. I also learned that statistics is basically applied linear algebra. Now, when I code I think instinctively in matrix terms and it is very hard for me to go back to programming in C. (Although I did learn Objective C recently to write an iPhone App to predict body weight. But that was mostly point-and-click and programming by trial and error. The App does work though (download it here). I did that because I wanted to get a sense of what real programmers actually do.) My goal is to switch from MATLAB to Python and not rely on proprietary software. I encourage my fellows to use Python instead of MATLAB because it will be a cinch to learn MATLAB later if they already know Python. The really big barrier for me for all languages is to learn the ancillary stuff like what do you actually type to run programs, how does Python know where programs are, how do you read in data, how do you plot graphs, etc? In MATLAB, I just click on an icon and everything is there. I keep saying that I will uncuff myself from MATLAB one day and maybe this is the year that I actually do.

Big Data backlash

I predicted that there would be an eventual push back on Big Data and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do true prior free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than an exponential. Hence, Moore’s law can never catch up. At some point, someone will need to exercise some judgement in which case Big Data is not really different from the ordinary data that we deal with all the time.