We need a Google genome

I listened to two Long Now Foundation talks on my way to Newark, Delaware and back yesterday for my colloquium talk. These podcasts tend to be quite long, so they were perfect for the drive. The first was by environmental activist and journalist Mark Lynas and the second by National Geographic photographer Jim Anderson. Both were much more interesting than I expected. Lynas, who originated the anti-genetically modified organism (GMO) food movement in Europe in the 1990s, has since changed his mind and become more pragmatic. He now advocates for a more rational environmental movement that embraces technological solutions such as GMO foods and nuclear energy. He argues that particulate matter from coal-fired generating plants kills more people in a single year than nuclear power has over its entire history of use. I have always felt that nuclear power is the only viable technology for reducing carbon emissions. I have also argued previously that I'm more worried about the acidification of the ocean due to CO2 than an increase in temperature. I think we should start building CANDU reactors now and head towards fast breeder reactors.

Jim Anderson talked about the loss of diversity of domesticated plants and animals and how they are essential for the survival of humans. For the first 9,900 years of agriculture, we increased the diversity of our foodstuffs. For the last hundred, we have gone in the other direction. We used to have hundreds to thousands of varieties of fruits and vegetables, and now we're down to a handful. There are at most 5 varieties of apples I can buy at my local supermarket, yet a hundred years ago each orchard would produce its own variety. This leaves us extremely vulnerable to diseases. The world's banana supply is dominated by one variety (the Cavendish), and it is under siege by a fungus that threatens to wipe it out. The Irish potato famine was so severe because they relied on only two varieties that were both susceptible to the same blight. Our firewall against future blights is seed banks, where we try to preserve as many varieties as we can. However, not all seeds can remain viable forever. Many must be replanted every few years so that new seeds can be harvested. This replanting is often done by amateur horticulturists. The podcast made me think that with the cost of genome sequencing dropping so rapidly, what we need now is for someone, like Google, to start sequencing every living thing and making the results publicly available, like Google Books. In fact, if sequencers become cheap enough, this could be done by amateurs. You would find some plant or animal, document it as well as you can, and upload the sequence to the virtual seed bank. This could be a record of both wild and domesticated species. We could then always resurrect one if we needed to. There is also potential for mischief with highly dangerous species like smallpox or anthrax, so we would need to have a public discussion over what should be available.

Financial crisis on Frontline

If you have any interest at all in economics, finance, or the financial crisis, then I highly recommend PBS Frontline's series Money, Power and Wall Street, which can be accessed here. I watched episode two last night, on how the 2008 bank bailouts were engineered. Then-New York Fed president Tim Geithner favored bailouts, following the game plan established by Robert Rubin, Larry Summers and Alan Greenspan during the 1990s. Then-Treasury Secretary Hank Paulson was initially hesitant because of the moral hazard. However, he did approve the bailout of Bear Stearns in March 2008, in which the US Federal Reserve basically gave JP Morgan 30 billion dollars to buy it. Paulson had a total change of heart after he let Lehman Brothers fail in September 2008. The credit market seized up, and he started to panic when even non-financial institutions complained that they would have trouble operating because their access to loans had dried up. This led to the Troubled Asset Relief Program (TARP), under which Paulson forced all the major banks to take money without any conditions. I thought that we should have let the banks fail in 2008. The US Fed could then have temporarily nationalized the failed banks and started anew. They could even have retained most of the lower-level staff so that institutional knowledge would not be completely lost. Instead of bolstering insolvent banks, we could have lent to small businesses and homeowners directly.

Known unknown unknowns

I listened today to the podcast of science fiction writer Bruce Sterling's Long Now Foundation talk from 2004, "The Singularity: Your Future as a Black Hole". The talk is available here. Sterling describes some of the ideas in mathematician and science fiction writer Vernor Vinge's conception of the singularity as a scary moment in time when superhuman intelligence ends the human era and we have no way to predict what will happen. I won't address the issue of whether such a moment will happen in the near future or ever. I've posted about it in the past (e.g. see here, here and here). What I do want to discuss is whether there can exist events or phenomena so incomprehensible that they would reduce us to a quivering pile of mush. I think an excellent starting point is former US Secretary of Defense Donald Rumsfeld's infamous statement from 2002 regarding the link between Iraq and weapons of mass destruction prior to the Iraq war, where he said:

[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don’t know.

While Rumsfeld was mocked by the popular media for this seemingly inane statement, I actually think (geopolitical consequences aside) that it was the deepest thing I ever heard him say. Rumsfeld is a Bayesian! There is a very important distinction between known unknowns and unknown unknowns. In the first case, we can assign a probability to the event. In the second we cannot. Stock prices are known unknowns, while black swans are unknown unknowns. (Rumsfeld's statement predates Nassim Taleb's book.) My question is not whether we can predict black swans (by definition we cannot) but whether something can ever occur that we wouldn't even be able to describe, much less understand.

In Bayesian language, a known unknown would be any event for which a prior probability exists. Unknown unknowns are events for which there is no prior. It's not just that the prior is zero, but that the event is not even included in our collection (i.e. sigma algebra) of possibilities. Now, the space of all possible things is uncountably infinite, so it seems likely that there will be things that you cannot imagine. However, I claim that simply acknowledging that there exist things I cannot possibly ever imagine is sufficient to remove the surprise. We've witnessed enough in the post-modern world to assign a nonzero prior to the possibility that anything can and will happen. That is not to say that we won't be very upset or disturbed by some event. We may read about some horrific act of cruelty tomorrow that will greatly perturb us, but it won't be inconceivably shocking. The entire world could vanish tomorrow and be replaced by an oversized immersion blender, and while I wouldn't be very happy about it and would be extremely puzzled by how it happened, I would not say that it was impossible. Perhaps I won't be able to predict what will happen after the singularity arrives, but I won't be surprised by it.
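To make the distinction concrete, here is a minimal sketch in Python of a discrete Bayesian update whose hypothesis space includes an explicit catch-all, so that no observation has zero prior. The hypothesis names and probabilities are made up purely for illustration.

```python
# Hypothetical discrete Bayesian update. Known unknowns are the named
# hypotheses; the "something-else" entry is a catch-all that assigns
# nonzero prior mass to everything we cannot enumerate.

def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses given per-hypothesis likelihoods."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

prior = {"stock-up": 0.45, "stock-down": 0.45, "something-else": 0.10}

# A surprising observation: very unlikely under the named hypotheses,
# quite likely under the catch-all.
likelihood = {"stock-up": 0.01, "stock-down": 0.01, "something-else": 0.5}

posterior = bayes_update(prior, likelihood)
# Mass shifts toward the catch-all, but the update stays well defined:
# the "surprise" was anticipated in the aggregate.
```

An event outside the sigma algebra would be one with no entry in `prior` at all; the catch-all is exactly the move of folding all such events into one bucket with nonzero mass.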

Friends on Quirks and Quarks

Two of my old colleagues were interviewed recently on the CBC radio science show Quirks and Quarks. This is the show I used to listen to in my youth in Canada. In March, astrophysicist Arif Babul, a classmate of mine at the University of Toronto, talked about his recent work on abnormal clumping of dark matter at a collision site between clusters of galaxies. Here is the link. Neuroscientist Sebastian Seung, whom I've known since graduate school, talked about his recent book Connectome. Link here. I was impressed by how well both were able to explain their work in clear and simple terms. Their use of metaphors was particularly good. I think these are two very good examples of how to talk about science to the general public.

Financial Fraud and incentives

I highly recommend this EconTalk podcast on financial fraud with William Black. It gives a great summary of how financial fraud is perpetrated. It also clearly highlights the difference in opinions on the cause of the recent financial crisis. What most people seem to agree on is that the financial crisis was caused by a system-wide Ponzi scheme. Lots of mortgages were issued (some were liar loans where the recipients clearly were not qualified, while some were just risky); these loans were then packaged into mortgage-backed securities (e.g. CDOs) by investment banks, who then sold them to other banks and institutional investors like hedge funds and pension funds. As long as housing prices increased, everyone made money. Homeowners didn't care if they couldn't pay back the loan because they could always sell the house. The mortgage lenders didn't care if the loans went bad because they didn't hold the loans and made money off the fees. Like a classic Ponzi scheme, when the bubble burst everyone lost money, except for those who got out early.

The motivation for the mortgage lenders seems quite straightforward. They were making money on fees, and the more mortgages, the more fees. When they ran out of legitimate people to lend to, they just lent to riskier and riskier people. The homeowners were simply caught up in the bubble mania of the time. I remember people telling me I was an idiot for not buying and that house prices never go down. The question in dispute is what the incentives were for the investment banks and institutional investors to go along with this scheme. Why were they fueling the bubble? Libertarians like Russ Roberts, the host of EconTalk, believe that the banks didn't care if the bubble burst because they knew they would be bailed out by the government. To him, the moral hazard due to government intervention was the culprit. The pro-regulators, like Black, believe that this was a regulatory failure in that the incentives were to commit and reward fraud. The institutional investors simply didn't do the due diligence that they should have because the money was rolling in. He cites the example of Enron, where incredible profits were booked through accounting fraud, which not only didn't raise alarms among investors but attracted even more money, creating more incentives to perpetuate the fraud. He also notes that thousands of people were prosecuted for fraud in the eighties after the Savings and Loan crisis, but no one has been prosecuted this time around. He believes that unless we criminalize financial fraud, this will only continue. The third view, shared by people like Steve Hsu and Felix Salmon, is that the crisis was mostly due to incompetence (Steve calls this bounded cognition). The investment banks had too much confidence or didn't fully understand the risks in their financial instruments. They thought they could beat the system. The fourth possibility is that people knew it was a Ponzi scheme but either felt trapped and couldn't act or thought they could get out in time. In that case it was a pure market failure, in which the individually rational course of action led to a less efficient outcome for everyone.


Heritability and GWAS

Here is the backstory for the paper in the previous post. Immediately after the human genome project was completed a decade ago, people set out to discover the genes responsible for diseases. Traditionally, genes had been discovered by tracing the incidence of a disease in families that exhibit it. This was a painstaking process, a classic example being the discovery of the gene for Huntington's disease. The completion of the human genome project provided a simpler approach. The first step was to create what is known as a haplotype map, or HapMap. This is a catalog of all the common genomic differences between people. Human genomes differ by only about 0.1%. These differences include single nucleotide polymorphisms (SNPs), where a given base (A, C, G, T) is changed, and copy number variations (CNVs), where the number of copies of a segment of DNA differs. There are about 10 million common SNPs.

A genome-wide association study (GWAS) usually looks for differences in SNPs between people with and without a disease. The working hypothesis at the time was that common diseases like Alzheimer's disease or Type II diabetes should be due to differences in common SNPs (i.e. common disease, common variant). People thought that the genes for many of these diseases would be found within a decade. Towards this end, companies like Affymetrix and Illumina began making microarray chips, which consist of tiny wells holding snippets of complementary DNA for a small segment of the sequence around each SNP. SNP variants are found by seeing which DNA fragments in genetic samples (e.g. saliva) bind to the complementary strands in the array. A classic GWAS then considers the differences in SNP variants observed between disease and control groups. For any finite sample, there will always be fluctuations, so a statistical criterion must be established to evaluate whether a variant is significant. This is done by computing the probability, or p-value, of the variant occurring by random chance. However, in a set of a million SNPs, which is standard for a chip, the probability that at least one SNP is randomly associated with a disease is up to a million-fold higher than for a single test if the SNPs are all independent (the family-wise error is bounded by the sum of the probabilities of each event; see the Bonferroni correction). Since SNPs are not always independent, this is a very conservative criterion.
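The multiple-testing arithmetic above can be sketched in a few lines; the numbers are illustrative, matching the million-SNP chip mentioned in the text.

```python
# Family-wise error for m independent tests at per-test level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m, bounded above by m * alpha.

m = 1_000_000        # SNPs on a typical chip
alpha = 0.05         # conventional per-test significance threshold

fwer = 1 - (1 - alpha) ** m      # essentially 1: a false hit is near-certain
bonferroni_alpha = alpha / m     # corrected per-SNP threshold: 5e-8
```

The corrected threshold of 5e-8 is, as it happens, the genome-wide significance level conventionally quoted in GWAS work; and because neighboring SNPs are correlated (linkage disequilibrium), the effective number of independent tests is smaller than m, which is why the text calls Bonferroni conservative.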

The first set of results from GWAS started to be published shortly after I arrived at the NIH in 2004, and they weren't very promising. A small number of SNPs were found for some common diseases, but they conferred only a very small increase in risk, even when the diseases were thought to be highly heritable. Heritability is a measure of the proportion of phenotypic variation between people explained by genetic variation. (I'll give a primer on heritability in a following post.) The one notable exception was age-related macular degeneration, for which five SNPs were found to be associated with a 2 to 3 fold increase in risk. The difference between the heritability of a disease as measured by classical genetic methods and what was found to be explained by SNPs came to be known as the "missing heritability" problem. I thought this was an interesting puzzle but didn't think much about it until I saw the results from this paper (summarized on Steve Hsu's blog), which showed that different European subpopulations could be separated by simply projecting onto two principal components of 300,000 SNPs of 6,000 people. (The principal components are the eigenvectors of the 6000 by 6000 correlation matrix of the 300,000-dimensional SNP vectors of each person.) This was a revelation to me because it implied that although single genes did not carry much weight, the collection of all the genes certainly did. I decided right then that I would show that the missing heritability for obesity was contained in the "pattern" of SNPs rather than in single SNPs. I did this without really knowing anything at all about genetics or heritability.
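As a rough sketch of the projection described above (not the paper's actual pipeline), here is how one might compute the top two principal components from a genotype matrix. The data are random stand-ins, and the matrix is shrunk from 6,000 by 300,000 so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 200, 500                  # stand-ins for 6,000 x 300,000
# Genotypes coded as minor-allele counts (0, 1, or 2) per SNP per person.
genotypes = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)

# Standardize each SNP column, then form the people-by-people matrix
# X X^T / n_snps, whose eigenvectors give the principal component scores
# (the dual of PCA on the SNP vectors, as in the text).
X = (genotypes - genotypes.mean(axis=0)) / (genotypes.std(axis=0) + 1e-12)
C = X @ X.T / n_snps                         # n_people x n_people
eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order

# Coordinates of each person on the top two principal components:
pc_scores = eigvecs[:, -2:][:, ::-1] * np.sqrt(eigvals[-2:][::-1])
# pc_scores has shape (n_people, 2); in the real data, a scatter plot of
# these two coordinates separated the European subpopulations.
```

With random genotypes the two coordinates show no structure, of course; the point of the paper's result was that with real SNP data they do.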
