Browsing my genome

Back in Feb 2018, I spat in a tube and put it in the post. On April 12, 2018, 23andMe looked at the DNA in that tube. Recently, I discovered that I could download their raw report. Follow along with me as I browse my genome.

23andMe gave me a zip file, which contains a text file, genome_James_Fisher_v5_Full_20191230024357.txt. The file looks like this:

# This data file generated by 23andMe at: Mon Dec 30 02:43:57 2019
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been
# individually validated for accuracy. As such, this data is suitable only for research,
# educational, and informational use and not for medical or other use.
#
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://you.23andme.com/p/0123456789abcdef/tools/data/download/
#
# More information on reference human assembly builds:
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/
#
# rsid       chromosome  position  genotype
rs548049170  1           69869     TT
rs13328684   1           74792     --
rs9283150    1           565508    AA
i713426      1           726912    --
...
...
...
i704756      MT          16524     A
i705255      MT          16525     A
i4000757     MT          16526     G
i701671      MT          16526     G

The first thing I noticed was that this file didn’t “look like DNA”! I was expecting to see strings like CTCATCTCTCTTG.... Where are the famous nucleotides?

The file links me to the Genome Reference Consortium’s Human Build 37. Sounds awfully sci-fi! From that site, you can download references for each chromosome. For example, chromosome 15 is the reference sequence NC_000015.9. From that page, select “Send to”, choose “File”, with the format “FASTA”, and it will serve you a file sequence.fasta. This 99-megabyte file looks like:

>NC_000015.9 Homo sapiens chromosome 15, GRCh37.p13 Primary Assembly
...
CCTTGTAGAGGCCCCCTGGATGGCACCAAGATCGGCCCTGGCAAGTAGGTGACCCTGACTTCAGAGCCCT
TGCCTGAGGGCCTGGCCTGGCAGCTCTGCTGTTAGAAGCAGGAGGTGTGCAGGGGGTGGGGAGCAGCCCA
GCCTCTGTGATCTTCTCCATGGCAGGATCTCCCAGCAGGTAGAGCAGAGCCGGAGCCAGGTGCAGGCCAT
TGGAGAGAAGGTCTCCTTGGCCCAGGCCAAGATTGAGAAGATCAAGGGCAGCAAGAAGGCCATCAAGGTA
GTCCCCATACCCCTGTGTCCTGAGGCTACTGGGCAGTCCCTCCATTTCCCCGTGCCTCTGAGGCTGCCCA
GTCTCTGCCCTGCTGCCCACCTGTACCTTGAGCTTTCTTCTCGCCCAGGCTTCCAACTCCACCCTCTCCT
...

This looks more like DNA! This file is in FASTA format, which spells out the full sequence of nucleotides, one letter per base.
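To make the format concrete, here is a minimal sketch of reading a FASTA file in JavaScript. The parseFasta name is my own invention for illustration. It assumes a single-record file like the sequence.fasta above, where the first line starts with > and every other line is a chunk of the sequence. The two sequence lines in the example are a shortened toy input, not the real start of chromosome 15.

```javascript
// A minimal FASTA reader for a single-record file (an assumption: real
// FASTA files may contain many records, which this sketch does not handle).
function parseFasta(text) {
  const lines = text.split(/\r?\n/);
  // The first line is the description, prefixed with '>'.
  const description = lines[0].startsWith('>') ? lines[0].slice(1) : '';
  // Every remaining non-empty line is a chunk of the sequence.
  const sequence = lines
    .filter(line => line.length > 0 && !line.startsWith('>'))
    .join('');
  return { description, sequence };
}

// Toy input: a real header line followed by two made-up short chunks.
const example = [
  '>NC_000015.9 Homo sapiens chromosome 15, GRCh37.p13 Primary Assembly',
  'CCTTGTAGAGG',
  'TGCCTGAGGGC',
].join('\n');

const { description, sequence } = parseFasta(example);
console.log(description);
console.log(sequence.length);  // 22 bases in this toy example
```

Joining everything into one string is probably fine for a single chromosome, but a streaming approach is gentler on memory for anything bigger.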

The file that 23andMe gave me is not a full DNA sequence. Instead, each line is an “SNP”, or “single-nucleotide polymorphism”: “a substitution of a single nucleotide that occurs at a specific position in the genome”. My genome file is like a sparse “patch” against the reference genome: it records what I have at each tested position, rather than spelling out my full genome.

Why does 23andMe give me a list of SNPs, rather than a raw sequence, like a FASTA file? One reason is size. The raw reference chromosome 1 is 220 megabytes, but the list of SNPs for my chromosome 1 is only 1.3 megabytes.

But a more fundamental reason that 23andMe does not give me a raw sequence is that 23andMe does not sequence DNA! Instead, 23andMe does SNP genotyping, using an “SNP array”, a piece of hardware that detects the presence or absence of a chosen set of SNPs. In particular, 23andMe says that my DNA was tested with their “Version 5” genotyping chip, which is the “Global Screening Array” product sold by Illumina. This chip detects “around 650,000” SNPs. Indeed, this is roughly the number of lines in my genome file:

$ wc -l genome_James_Fisher_v5_Full_20191230024357.txt
  638593 genome_James_Fisher_v5_Full_20191230024357.txt

Let’s take a single example from that file. 23andMe reports that I am “likely to have blue or green eyes.” This is correct - I’d say my eyes are blue. 23andMe bases this report on one SNP: rs12913832. We can see my result in my file:

$ grep rs12913832 genome_James_Fisher_v5_Full_20191230024357.txt
rs12913832	15	28365618	GG

This line has the following meaning: take my chromosome 15. Or rather, my two copies of chromosome 15, since most chromosomes come in pairs. On both, look at position 28,365,618. The report above says that, on both of my chromosomes at this position, you will find guanine.
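To make that reading concrete, here is a small sketch that parses one line of the file. The parseSnpLine function and its field names are my own, not an official schema. The only things taken from 23andMe are the tab-separated layout described in the file header and the sample lines above. I'm assuming that -- means the chip made no call at that position.

```javascript
// Parse one data line of the 23andMe file into its four tab-separated fields.
// Field names here are illustrative, not an official schema.
function parseSnpLine(line) {
  const [rsid, chromosome, position, genotype] = line.trim().split('\t');
  return {
    rsid,
    chromosome,                  // kept as a string: the file also uses 'MT'
    position: Number(position),  // 1-indexed position on the reference
    // One letter per copy of the chromosome; '--' is assumed to mean "no call".
    alleles: genotype === '--' ? [] : genotype.split(''),
  };
}

const snp = parseSnpLine('rs12913832\t15\t28365618\tGG');
console.log(snp.chromosome, snp.position, snp.alleles);  // 15 28365618 [ 'G', 'G' ]
```

Two alleles means two copies of the chromosome. The MT lines at the end of the file carry a single letter, so they would come back with one allele.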

The string rs12913832 is an RSID, or “Reference SNP cluster ID”. You can see this SNP in the official SNP database managed by the NCBI. There, you can see that around 82% of chromosome 15s globally have adenine at position 28,365,618.

Let’s see that for ourselves. Here’s some JavaScript that will highlight the position in our downloaded sequence.fasta for chromosome 15:

const fs = require('fs');
const es = require('event-stream');

const target = 28365618;

const startPrintingAt = target-140;
const stopPrintingAt = target+140;

let position = 1;  // positions are 1-indexed!

fs.createReadStream('sequence.fasta')
  .pipe(es.split())
  .pipe(es.mapSync(line => {
    if (line.startsWith('>')) return;  // ignore FASTA comments
    const nextPosition = position+line.length;
    if (position >= startPrintingAt && position < stopPrintingAt) {
      console.log(line);
    }
    if (position <= target && target < nextPosition) {
      console.log(' '.repeat(target-position) + '^');
    }
    position = nextPosition;
  }));

And here it is in action:

$ node highlightPosition.js
ACAGGAACAAAGAATTTGTTCTTCATGGCTCTCTGTGTCTGATCCAAGAGGCGAGGCCAGTTTCATTTGA
GCATTAAATGTCAAGTTCTGCACGCTATCATCATCAGGGGCCGAGGCTTCTCTTTGTTTTTAATTAATTG
       ^
TTTTTAACTGTGAGTTTATATACACTTGAAGCAGTATACATTTAGAAATGGTCTACTTGTCGTTTCTTTG
ATTACTACCCATGAGACAGTATTAGTAATTCTGGCCTATGAAATTGGCAAAGAAAACTACCAGTGGTGGG

Sure enough, it highlights the A, adenine, that is most commonly found at this position.
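Because the sequence lines in this FASTA file are 70 characters long, you can also work out where the caret should land with plain arithmetic rather than streaming the file. A sketch, where locate is a hypothetical helper and the fixed width of 70 is an assumption read off the output above:

```javascript
// Map a 1-indexed genome position to a 0-indexed (line, column) pair,
// assuming every sequence line holds exactly `width` bases.
function locate(position, width = 70) {
  const offset = position - 1;  // convert to 0-indexed
  return {
    line: Math.floor(offset / width),  // counted from the first sequence line
    column: offset % width,
  };
}

console.log(locate(28365618));  // { line: 405223, column: 7 }
```

The column of 7 agrees with the seven spaces before the caret in the output above.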



This page copyright James Fisher 2019. Content is not associated with my employer.
