Protein Computer

Google's DeepMind has recently figured out how to determine the shape of a 100-residue protein (a protein consisting of a chain of 100 amino acids) in a few hours, and this will improve. That of course enables knowing the shape of every natural protein, tells you what shape a drug should be to bind to any given natural protein, and greatly accelerates the discovery of drugs that bind to that shape.

It also lets you build a database of millions of proteins indexed by shape. You could then use CAD/CAM to assemble things out of those shapes, and since you can build those proteins, you can build those assemblies.

(We've been able to build arbitrary proteins for a long time. You start with a root for a DNA chain, wash it with the first DNA base you want, wash it again with the second base you want, and so forth, building up a DNA chain of your choosing. Then you transcribe that DNA into RNA and feed it to a ribosome, which produces the protein of your choosing. You have to build with DNA rather than RNA because only DNA is stable enough to last through this slow build process. So we could build an arbitrary protein; we just didn't have a way of predicting how that arbitrary protein would behave.)

What is worth building at the molecular level? A computer, both computation and storage. To be competitive with silicon in speed, it needs to move around minimal mass. This probably means reconfiguring or shifting charge in fixed proteins, rather than moving around whole molecules and doing chemical reactions. Electrical heating and water cooling should be built into the scaffolding. Not only would the elements be at least an order of magnitude smaller than silicon's, it is also natural to construct them in 3D rather than 2D. You build components and let them snap together.

That comes out to about a million computers per cubic millimeter, or a quadrillion per cubic meter. This is just a new base for computers; the higher-level logic stays the same. A modern CPU is about a billion transistors. With a wild guess of 1000 atoms per transistor and 1/10th of the volume being transistors, that's 10^13 atoms. WolframAlpha tells me 10^13 atoms of activated charcoal is about 10^-15 cubic meters, or 10^-6 cubic millimeters.
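
To sanity-check those numbers, here is the arithmetic as a short Python sketch. The 10^-15 cubic meter CPU volume is simply the WolframAlpha figure above, not something derived independently.

    # Back-of-envelope density of protein-scale CPUs, using the guesses above.
    transistors_per_cpu = 1e9
    atoms_per_transistor = 1000
    transistor_fraction = 0.1                    # transistors are ~1/10 of the volume
    atoms_per_cpu = transistors_per_cpu * atoms_per_transistor / transistor_fraction  # ~1e13

    cpu_volume_m3 = 1e-15                        # WolframAlpha figure for ~1e13 atoms of charcoal
    mm3_in_m3 = 1e-9                             # cubic meters per cubic millimeter
    print(mm3_in_m3 / cpu_volume_m3)             # ~1e6 computers per cubic millimeter
    print(1.0 / cpu_volume_m3)                   # ~1e15, a quadrillion per cubic meter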

The internet says the speed limit for protein folding is about N/100 microseconds for a protein with N residues. So if a NOR gate is accomplished by slightly moving 2 amino acids, that's a speed limit of 1/50th of a microsecond (20 nanoseconds) per gate. There are typically 20-50 NOR-gate delays per clock cycle, so that implies a clock speed of around 1 MHz, roughly 2000x slower than today's datacenter chips. A NOR gate based on protein electron transfer, or redistribution of charge, might be faster. I expect anything that requires moving atomic nuclei to be slow.
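
For concreteness, the same gate-speed estimate as a sketch; the 3 GHz comparison point is my assumption for a current datacenter chip.

    # NOR-gate timing from the N/100-microsecond folding speed limit.
    residues_moved = 2                           # amino acids shifted per gate switch
    gate_delay_s = (residues_moved / 100) * 1e-6 # 0.02 microseconds = 20 ns per gate

    for depth in (20, 50):                       # NOR-gate delays per clock cycle
        clock_hz = 1 / (depth * gate_delay_s)
        print(depth, clock_hz / 1e6, "MHz")      # 2.5 MHz and 1.0 MHz
    print(3e9 / 1e6)                             # ~2000-3000x slower than an assumed 3 GHz chip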

Computer storage is going to be more reliable if it is based on moving atomic nuclei rather than trapping electrons, but it will also be slower. So look for two-stage computer storage, where the first stage moves electrons or shifts charge, and an asynchronous later stage moves atoms into distinctive positions. Reading memory still needs to rely on charge, even if atomic positions are what persist the data; otherwise it will be slow.
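
As a software analogy, the two-stage idea looks like a write-behind buffer: writes land in the fast charge-based stage and drain asynchronously into the slow atom-position stage. This is only an illustration of the data flow, not a claim about how the protein hardware would work.

    # Toy two-stage store: fast volatile stage in front, slow durable stage behind.
    import queue, threading, time

    fast_stage, slow_stage = {}, {}
    pending = queue.Queue()

    def flusher():
        while True:
            addr, value = pending.get()
            time.sleep(0.01)                     # stands in for slowly moving atoms into place
            slow_stage[addr] = value
            pending.task_done()

    threading.Thread(target=flusher, daemon=True).start()

    def write(addr, value):
        fast_stage[addr] = value                 # caller only waits for the charge-based stage
        pending.put((addr, value))

    def read(addr):
        return fast_stage.get(addr, slow_stage.get(addr))   # reads stay on the charge path

    write("a", 42)
    pending.join()
    print(read("a"))                             # 42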

Even if these are 2000x slower than the computers we have today, this allows many orders of magnitude more of them. Cloud computing can already split up large jobs into arbitrarily many small pieces, where each piece runs on its own computer and the pieces run simultaneously. The size of jobs can scale with the number of computers even if each individual computer doesn't get faster. Cloud computing deals with clusters of thousands of computers today; local clusters would then need to scale to trillions of times more machines. This will probably be handled by layers, where a group of computers from one layer is treated as a single computer by the layer above it.
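
A quick count of how deep that layering has to go, assuming (my guess) each layer groups about a thousand machines into one logical computer:

    # Layers needed to manage ~1e15 machines with 1000-way grouping per layer.
    total_machines = 10**15                      # one cubic meter of protein computers
    fanout = 1000                                # assumed machines grouped per layer

    layers, machines = 0, 1
    while machines < total_machines:
        machines *= fanout
        layers += 1
    print(layers)                                # 5 layers of 1000-way grouping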


A vaguely similar thing is yeast memory. It'll be MUCH slower than a true protein computer, but may be easier to build, because the yeast cells have already solved a bunch of the engineering challenges.

Yeast can be engineered down to a genome of about 600K base pairs. It already has junk DNA, and more can be added, up to around 12M base pairs per cell. The basis of yeast memory is to store data in that 12M: each base pair is 2 bits, and there are 8 bits per byte, so that's 3MB per cell. Yeast knows how to maintain the integrity of its DNA, keep the cell alive, and split (copying the DNA).
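
The capacity arithmetic, for what it's worth:

    # Bytes of storage per engineered yeast cell.
    base_pairs = 12_000_000                      # spare/junk DNA capacity per cell
    bits = base_pairs * 2                        # each base pair encodes 2 bits
    print(bits / 8 / 1e6, "MB per cell")         # 3.0 MB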

You'd need more than one cell in case it died, but a small cluster will survive for a while if you feed it. If you have several clusters, they would die off independently, not all from the same cause. I'm guessing 4 separate clusters of 4 cells each would be reliable enough for computer storage; the number of cells per cluster and the number of clusters can be adjusted until it is.
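
Here is a toy version of that reliability estimate, assuming cell deaths are independent, a cluster survives as long as any one cell does, and a 10% per-cell death rate per refresh interval that is purely my guess:

    # Chance of losing one 3 MB payload during a refresh interval.
    p_cell_dies = 0.1                            # assumed, per cell per interval
    cells_per_cluster = 4
    clusters = 4

    p_cluster_lost = p_cell_dies ** cells_per_cluster    # every cell in the cluster dies: 1e-4
    p_data_lost = p_cluster_lost ** clusters             # every cluster dies: 1e-16
    print(p_cluster_lost, p_data_lost)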

In addition to the yeast, you need feeding tubes and waste disposal ... essentially a whole artificial organ. There should be multiple paths so that blocking one tube doesn't kill off a large swath of cells. The yeast takes care of maintaining and replicating the actual data, but the feeding tubes are a still-unsolved engineering challenge. Getting the data in requires a similar network, this time of wires, plus some translation at the cells from electronic bits into DNA to be written. Getting data out is the reverse.

A yeast cell with a genome of 12M base pairs averages about 9e-17 cubic meters. If you need 4 clusters of 4 cells each (wild guess) to store 3MB, and cells occupy 1/4 of the volume (another wild guess), that's about 500 exabytes per cubic meter.
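
The density arithmetic behind that 500-exabyte figure:

    # Storage density of yeast memory under the guesses above.
    cell_volume_m3 = 9e-17                       # per cell with a 12M base-pair genome
    cells_per_payload = 4 * 4                    # 4 clusters of 4 cells holding one 3 MB copy set
    packing_fraction = 0.25                      # cells occupy 1/4 of the total volume

    volume_per_payload = cells_per_payload * cell_volume_m3 / packing_fraction  # ~5.8e-15 m^3
    bytes_per_m3 = 3e6 / volume_per_payload
    print(bytes_per_m3 / 1e18, "exabytes per cubic meter")   # ~520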

Data will accidentally mutate over time. This is similar to existing SSDs. The trick is to keep it mostly reliable for some period of time, then read it, do error correction, and write it again. New data would be appended to the yeast genome, and old data would fall off the far end, either discarded (because it's no longer needed) or written again as new data. If one cluster dies unexpectedly, read the data it held from the remaining clusters and append new copies of that data to other clusters.
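
A minimal sketch of that refresh loop, with an in-memory Cluster object standing in for the real thing; the majority vote is a stand-in for whatever error-correcting code you would actually use.

    # Periodic scrub: read every copy, correct by majority vote, rewrite everywhere.
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class Cluster:
        payload: bytes                           # what a DNA read of this cluster returns
        alive: bool = True

    def refresh(clusters):
        reads = [c.payload for c in clusters if c.alive]   # each read is minutes-slow in reality
        corrected, _ = Counter(reads).most_common(1)[0]    # crude error correction by vote
        for c in clusters:
            c.payload, c.alive = corrected, True           # re-append good data; regrow dead clusters
        return corrected

    copies = [Cluster(b"data"), Cluster(b"dXta"), Cluster(b"data"), Cluster(b"data", alive=False)]
    print(refresh(copies))                       # b'data'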

Reading this data usually means looking up arbitrary offsets in the DNA. This is similar to gene expression. I wasn't able to find any timings on that at all, but new proteins are not produced until at least 15 to 30 minutes after they are requested. So data retrieval is likely similarly slow, on the order of tens of minutes. Tape storage today is similar. You'd have to write data in parallel and read data in parallel across trillions of clusters to get decent throughput.
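
A rough throughput number, assuming (my guess) about a 20-minute read latency per 3 MB cluster payload:

    # Parallelism needed for respectable read throughput out of yeast memory.
    payload_mb = 3
    read_seconds = 20 * 60                       # assumed latency for one DNA read-out
    per_cluster_mb_s = payload_mb / read_seconds # ~0.0025 MB/s per cluster

    target_mb_s = 1000                           # 1 GB/s
    print(target_mb_s / per_cluster_mb_s)        # ~400,000 clusters read in parallel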