DNA Storage 04/13/2020
66 Slides7.92 MB
DNA Storage 04/13/2020
Outlines DNA storage overall structure Encode/ECC code Basic operations: synthesis and sequencing Some existing works for DNA storage: DNA archive storage system OligoArchive for database system Puddle: A Dynamic, Error-Correcting, Full-Stack Microfluidics Platform
Basics Nucleotides: molecules that form the building blocks of DNA. Adenine (A) Thymine (T) Cytosine (C) Guanine (G), Naturally occurring DNA double helix with two strands of nucleotides DNA for data storage: oligonucleotide (oligo) a sequence of nucleotides Synthetic/Artifical DNA: synthesized using a chemical process that assembles the DNA one nucleotide at a time.
Overall structure of DNA storage Organick, Lee, et al. "Random access in large-scale DNA data storage." Nature biotechnology 36.3 (2018): 242.
Encoding Optimal Capacity/Density 2bit/nt: 2 bits per base pair (BP is one pair of nucleotides) A 00, C 01, G 10 and T 11 Optimal density is just an ideal case: Biochemical limitation G C base pair: harder to break - less efficient PCR. homopolymer runs - higher sequencing error rates. (e.g., AAAAAAAAAAATTTTTT) DNA strands not only contain payload information Primers Internal addressing/indexing ECC codes Erlich et al., Science 355, 950–954 (2017) 3 March 2017
Encoding examples in DNA storage Early Works (inefficient error-prone ) Run number of 0 or 1. Morse code: (C dot, T dash, A word space and G letter space) [Church’12] (A, C 0 and T, G 1) George M. Church et al. Science 2012;337:1628
Goldman’13 Ternary code in combination with the Huffman encoding Rotation Code Prevent homopolymers Parity Check Overlapping redundancy
polyprimer key (PPK) Reed-Solomon (RS) Code Error Correction Multiple Copy XOR Reed–Solomon codes Huffman encoding Fountain codes Other codes: comma code, the comma-free code and the alternating code. encode words in a text - rarely used. Different Coding Strategy
DNA Storage: Synthesis / Sequencing Synthesis oligonucleotide arrays using a chemical process that assembles the DNA one nucleotide at a time. Sequencing 1st gen: division of a long DNA strand 2nd gen: fluorescence-based detection and automated analysis require DNA template amplification - error 3rd gen: nanopore sequencing Real-time, single-molecule sequencing Nanopore Sequencing: https://patentimages.storage.googleapis.com/34/ce/6b/21fc9150a93517/US5795782.pdf
DNA Synthesis https://www.youtube.com/watch?time continue 3&v rD5uNAMbDaQ&feature emb title
https://en.wikipedia.org/wiki/Polymerase chain reaction
Sanger sequencing:
Limitations of DNA Synthesis/Sequencing Data layout and random access Max oligo length: ranges from a few hundred to a few thousand nucleotides at best. No logical addressing like block-based disk and tape. Address should be encoded along Synthesis & sequencing errors Synthesis: longer oligo more errors and truncated by products Sequencing: short-read - more substitution errors; long-read inserts or deletes spurious nucleotides. Error correction code is necessary
A DNA-Based Archival Storage System James Bornholt† Randolph Lopez† Douglas M. Carmean‡ Luis Ceze† Georg Seelig† Karin Strauss‡ † University of Washington ‡ Microsoft Research ASPLOS’16 Presented by: Fenggang Wu 7/12/2019
Summary Context: the exponential growth rate easily exceeds our ability to store it. Opportunity: DNA is extremely dense and long lasting. Problem: How to store and retrieve data based on DNA? Solution: DNA-based archival storage system Key-value store. Random access. Evaluations using wet lab experiment and simulation. Demonstrate feasibility, random access, and robustness.
Motivation
Background DNA Basics https://www.genome.gov/Pages/Education/Modules/BasicsPresentation.pdf
Background PCR: polymerase chain reaction PCR: a method for exponentially amplifying the concentration of selected sequences of DNA within a pool. Primers: The DNA sequencing primers are short synthetic strands that define the beginning and end of the region to be amplified.
Background DNA Synthesis Arbitrary single-strand DNA sequences can be synthesized chemically, nucleotide by nucleotide. Synthesizing error limits the size of the oligonucleotides ( 200 nucleotides). truncated byproducts Parallel synthesize [1]: synthesize 10 5 different oligonucleotides in parallel. [1] S. Kosuri and G. M. Church. Large-scale de novo DNA synthesis: technologies and applications. Nature Methods, 11:499–507, 2014.
Background DNA sequencing The DNA strand of interest serves as a template for PCR. Fluorescent nucleotides are used during this sequencing process. Read out the complement sequence optically. Read error. ( 1%)
Overview basic unit: DNA strand that is roughly 100-200 nucleotides long, capable of storing 50-100 bits total. data object: maps to a very large number of DNA strands. The DNA strands will be stored in pools stochastic spatial organization structured addressing: impossible address: embedded into the data stored in a strand
Interface and Addressing Object Store: Put(key, value) / Get(key). Random access: mapping a key to a pair of PCR primers. write: primers are added to the strands read: those same primers are used in PCR to amplify only the strands with the desired keys. Separating the DNA strands into a collection of pools: Primers: reaction/limited number Error-prone.
System Operation
Encoding Base 4 encoding: 00, 01, 10, 11 A, T, G, C. Error prone: synthesis, PCR, sequencing (substitutions, insertions, and deletions of nucleotides) Base 3 Huffman code rotation code
Data Format To aid decoding, two sense nucleotides (“S” in Figure 6) indicate whether the strand has been reverse complemented (this is done to avoid certain pathological cases)
Adding Redundancy Goldman Encoding XOR Encoding
Other Factors Primer effectiveness Different error rate in the location of DNA strand Tunable redundancy
Evaluation Wet lab experiment Simulation
Experiment
Simulation
Summary DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. Background: Basics, Synthesize, PCR, Sequencing DNA archival system design: Overview, addressing (primer), data format (payload), encoding, redundancy. Evaluation (Experiment, Simulation) feasibility, random access, and robustness Conclusion: practicality. Time to borrow back from the biotechnology indusy.
Background
Limitations of DNA Synthesis/Sequencing Data layout and random access Max oligo length: ranges from a few hundred to a few thousand nucleotides at best. No logical addressing like block-based disk and tape. Address should be encoded along Synthesis & sequencing errors Synthesis: longer oligo more errors and truncated by products Sequencing: short-read - more substitution errors; long-read inserts or deletes spurious nucleotides. Error correction code is necessary
Limitations of DNA Synthesis/Sequencing Structural complexity Patterns amplifying both synthesis and sequencing errors homopolymer repeats (multiple consecutive occurrence of the same nucleotide) high GC content (higher Gs and Cs than As and Ts) Low-complexity regions - more error in sequencing same sequence of simple amino acids at similar positions. Encoder should avoid such problematic oligonucleotide sequences
Design
Overview Encoding/Decoding Synthesizing/Sequencing
Design DNA Data Storage Query Processing Encoding Data Decoding Data Selection/Projection Join
DNA Storage -- Encoding Extraction and preprocessing dictionary encoding: convert variable length string fields into fixed length integers DNA data representation Integers stream - ternary digit (Huffman code, base3) - Nucleotide (rotation code) Schema-aware encoding primary keys - primer. short records - pack into one oligo. long records, breaking by attributes and store in multiple oligos. Error correction metadata Detect: parity nucleotide; Correction: duplicating and reverse complementing oligo
DNA Storage – Encoding (Cont’d)
DNA Storage -- Decoding Merging the forward and reverse reads perform valid check to remove invalid oligos Translate back to data Nucleotides - trinary digit (base 3) - data Restore the records using the data
Query Processing DNA used for computation: excellent parallelism Hamiltonian path problem [1] or the strategic assignment problem [27]. See paper replicate logical gates using DNA polymerase [5] Neither provides new computational capacity, nor is particularly fast But, it can work on countless oligos in parallel Usage in Query Processing PCR DNA assembly methods
Query Processing -- Selection & Projection Selection: Select target attribute (row) Primer: attribute we want to use, Projection: refers to subset of the set of all columns found in a table, that you want returned. Only amplify attributes of our interest. SELECT Orders.OrderID, Orders.CustomerName FROM Orders WHERE Orders.OrderDate ‘7/25/2019’;
Query Processing -- Join SELECT Orders.OrderID, Customers.CustomerName FROM Orders INNER JOIN Customers ON Orders.CustomerID Customers.Cus tomerID WHERE Orders.OrderID 12345; CustomerID OrderID CustomerName CustomerID
Evaluation
Setup PostgreSQL v10.3 pg oligo dump - file pg oligo restore - file Sequencing: Illumina NextSeq 500 platform TPC-H SF 10 4 dataset: 12KB, 44 records
Encoding and Synthesis Oligo length: 138 nucleotides 91 for data, 5’ primer 26, 3’ primer 21 Result: 1941 oligos total (103 ng): 346 oligos for dictionary, 150 oligos for 12KB of TPC-H data, and irrelevant oligos
Query Processing Selection: a single record is successfully selected. Join: one pair of matching oligos joins in a background of 105 irrelevant or nonmatching oligos.
Summary DNA is promising Storage Media: Density, durability OligoArchive: an architecture for using DNA as the archival tier of a relational DBMS Data Archival: Dump and restore Data Query: selection, projection, join
Extensions Not all SQL semantics are supported complex WHERE clause: non-equi selection complex join: non-equi-join multi-table join File system semantic: how to efficiently support modification? Scalability to large data sets Parallelism approximate string matching? Selection and projection using one oligo?
References DNA Synthesizing: https://www.youtube.com/watch?v rD5uNAMbDaQ PCR: https://www.youtube.com/watch?v 3XPAp6dgl14 DNA Sequencing: https://www.youtube.com/watch?v ONGdehkB8jU Gibson Assembly Cloning: https://www.youtube.com/watch?v tlVbf5fXhp4
Puddle: A Dynamic, ErrorCorrecting, Full-Stack Microfluidics Platform ASPLOS’2019 Presented by Bingzhe Li 2020/01/14
Lighting Talk https://www.youtube.com/watch?v uwiINEcYXLQ&list PL T9xA0eFRMdxHuQwkhLoa5No-wTSx3Tc&index 14&t 0s
Motivation Microfluidic device automates wetlab procedures by manipulating small chemical or biological samples. Current designs: Inflexible Error-prone Prohibitively expensive Difficult to program
Microfluidic Hardware (a) Channel-based: offer high precision and low cost at scale. (b) Liquid handling robots: more general, aiming to emulate a lab technician with robotic arms controlling pipettes (c) DMF: flexibility at small size and potentially at low cost called electrowetting on dielectric: activating electrodes in certain patterns can move, mix, or split droplets anywhere on the chip
Basic Controls in DMF devices Turning individual electrodes on or off To move a droplet from one location to another, a controller must activate the electrodes along that path in sequence DAG (directed acyclic graph issue)/routine issue in VLSI
Dynamic Microfuidic Programming Vision of this paper: a microfluidics platform that can combine computation and fluidic manipulation in an unrestricted, high-level programming model a runtime system that provides a high-level API for microfluidic manipulations. APIs Handling errors Hardware implementation
APIs A example Python program interfacing Complete Puddle API:
Error Handling Two reasons cause API calls fail: Invalid arguments Using a consumed droplet id (not reuse droplets ids that have been consumed) Invalid arguments:
Implementation Three levels of the stack
PurpleDrop DMF Device PurpleDrop, all together, the components cost on the order of 300, orders of magnitude less than most other microfluidic systems. DMF hardware: mother board contains the electrical components such as high-voltage controllers, shift registers, etc. daughterboard contains the electrodes that hold the droplets The daughterboard is removable, allowing different configurations of electrodes with the same motherboard. Electronic control: Raspberry Pi 3B The Raspberry Pi runs Linux and sports a quad-core 1.2 GHz ARMv7 processor as well as GPIO pins Peripherals: a heater, temperature sensor, and the ability to do both input and output of droplets. Input and output are driven by small peristaltic pumps which carry droplets to/from test tube reservoirs or other devices
Planning and Execution The core of planning is placement and routing Allocation constraints: A mix requests a rectangle slightly larger than the droplet to move and agitate droplets Place constraints (heater position) Execution, Monitoring, and Rollback Execution: activating the electrodes Monitoring: computer vision system to check the states of droplets on the board A rollback consists of deleting the record and replanning all commands which have not been completed.
Evaluation Error system Computer vision Error correction PCR and Thermocycling DNA sequencing
Computer vision accuracy Error correction
PCR and Thermocycling Polymerase chain reaction (PCR) using thermocycle as subroutine amplifies DNA in a solution We performed 8 cycles of PCR which required 2 replenishments to avoid evaporation. The procedure doubled the amount of DNA in our 10 microliter sample
DNA Sequencing Using Puddle and the MinION sequencer To our knowledge, this is the first time that computation and protocol execution are merged in this way