Browsing my genome

Back in Feb 2018, I spat in a tube and put it in the post. On April 12, 2018, 23andMe looked at the DNA in that tube. Recently, I discovered that I could download their raw report. Follow along with me as I browse my genome.

23andMe gave me a zip file, which contains a text file, genome_James_Fisher_v5_Full_20191230024357.txt. The file looks like this:

# This data file generated by 23andMe at: Mon Dec 30 02:43:57 2019
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been
# individually validated for accuracy. As such, this data is suitable only for research,
# educational, and informational use and not for medical or other use.
#
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://you.23andme.com/p/0123456789abcdef/tools/data/download/
#
# More information on reference human assembly builds:
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/
#
# rsid       chromosome  position  genotype
rs548049170  1           69869     TT
rs13328684   1           74792     --
rs9283150    1           565508    AA
i713426      1           726912    --
...
...
...
i704756      MT          16524     A
i705255      MT          16525     A
i4000757     MT          16526     G
i701671      MT          16526     G

The first thing I noticed was that this file didn’t “look like DNA”! I was expecting to see strings like CTCATCTCTCTTG.... Where are the famous nucleotides?

The file links me to the Genome Reference Consortium’s Human Build 37. Sounds awfully sci-fi! From that site, you can download references for each chromosome. For example, chromosome 15 is the reference sequence NC_000015.9. From that page, select “Send to”, choose “File”, with the format “FASTA”, and it will serve you a file sequence.fasta. This 99-megabyte file looks like:

>NC_000015.9 Homo sapiens chromosome 15, GRCh37.p13 Primary Assembly
...
CCTTGTAGAGGCCCCCTGGATGGCACCAAGATCGGCCCTGGCAAGTAGGTGACCCTGACTTCAGAGCCCT
TGCCTGAGGGCCTGGCCTGGCAGCTCTGCTGTTAGAAGCAGGAGGTGTGCAGGGGGTGGGGAGCAGCCCA
GCCTCTGTGATCTTCTCCATGGCAGGATCTCCCAGCAGGTAGAGCAGAGCCGGAGCCAGGTGCAGGCCAT
TGGAGAGAAGGTCTCCTTGGCCCAGGCCAAGATTGAGAAGATCAAGGGCAGCAAGAAGGCCATCAAGGTA
GTCCCCATACCCCTGTGTCCTGAGGCTACTGGGCAGTCCCTCCATTTCCCCGTGCCTCTGAGGCTGCCCA
GTCTCTGCCCTGCTGCCCACCTGTACCTTGAGCTTTCTTCTCGCCCAGGCTTCCAACTCCACCCTCTCCT
...

This looks more like DNA! This file is in FASTA format, which stores a sequence as plain text: a header line starting with >, followed by the sequence itself, one letter per nucleotide.
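To make the format concrete, here is a minimal sketch of a FASTA reader. It assumes only what the excerpt shows: a header line starting with >, followed by lines of sequence. The function name parseFasta and the toy input are my own, for illustration.

```javascript
// Split FASTA text into records: a '>' header line followed by sequence lines.
function parseFasta(text) {
  const records = [];
  for (const line of text.split(/\r?\n/)) {
    if (line.startsWith('>')) {
      // A new record begins; drop the leading '>' from the description
      records.push({ description: line.slice(1), sequence: '' });
    } else if (records.length > 0) {
      // Sequence lines are just joined end to end
      records[records.length - 1].sequence += line.trim();
    }
  }
  return records;
}

// Toy input, not real genome data
const toyRecords = parseFasta('>seq1 toy example\nACGT\nTTGA\n');
console.log(toyRecords);
```

This holds the whole sequence in memory, which is fine for a toy input. For a file the size of a whole chromosome, reading it as a stream, line by line, is probably the better choice.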

The file that 23andMe gave me is not a full DNA sequence. Instead, each line is an “SNP”, or “single-nucleotide polymorphism”: “a substitution of a single nucleotide that occurs at a specific position in the genome”. You can think of my genome file as a sparse “patch” against the reference genome: at each listed position, it says which nucleotides I have.

Why does 23andMe give me a list of SNPs, rather than a raw sequence, like a FASTA file? One reason is size. The raw reference chromosome 1 is 220 megabytes, but the list of SNPs for my chromosome 1 is only 1.3 megabytes.

But a more fundamental reason that 23andMe does not give me a raw sequence is that 23andMe does not sequence DNA! Instead, 23andMe does SNP genotyping, using an “SNP array”, a piece of hardware that detects the presence or absence of a chosen set of SNPs. In particular, 23andMe say that my DNA was tested with their “Version 5” genotyping chip, which is the “Global Screening Array” product sold by Illumina. This chip detects “around 650,000” SNPs. Indeed, this is roughly the number of lines in my genome file:

$ wc -l genome_James_Fisher_v5_Full_20191230024357.txt
  638593 genome_James_Fisher_v5_Full_20191230024357.txt
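Note that wc -l also counts the commented header lines. Here is a minimal sketch that counts only the SNP rows and groups them by chromosome. It assumes the layout shown earlier (header lines start with #, fields are TAB-separated). The function name countSnpsByChromosome and the three-row sample are my own, for illustration.

```javascript
// Count SNP rows per chromosome in the text of a 23andMe raw data file.
// Assumes '#'-prefixed header lines and TAB-separated fields, as in the sample.
function countSnpsByChromosome(text) {
  const counts = new Map();
  for (const line of text.split(/\r?\n/)) {
    if (line === '' || line.startsWith('#')) continue;  // skip header and blanks
    const chromosome = line.split('\t')[1];             // second column
    counts.set(chromosome, (counts.get(chromosome) || 0) + 1);
  }
  return counts;
}

// A tiny stand-in for the real file, built from rows shown earlier
const sample = [
  '# rsid\tchromosome\tposition\tgenotype',
  'rs548049170\t1\t69869\tTT',
  'rs13328684\t1\t74792\t--',
  'i704756\tMT\t16524\tA',
].join('\n');

console.log(countSnpsByChromosome(sample));
```

To run it on the real file, pass in the result of fs.readFileSync(path, 'utf8') instead of the sample.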

Let’s take a single example from that file. 23andMe reports that I am “likely to have blue or green eyes.” This is correct - I’d say my eyes are blue. 23andMe bases this report on one SNP: rs12913832. We can see my result in my file:

$ grep rs12913832 genome_James_Fisher_v5_Full_20191230024357.txt
rs12913832	15	28365618	GG

This line has the following meaning: take my chromosome 15. Or rather, my two copies of chromosome 15, since most chromosomes come in pairs. On both, look at position 28,365,618. The line above says that, on both of my copies, you will find guanine at this position.
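The same reading can be sketched in code. This assumes the four TAB-separated columns named in the file header, and it assumes that a genotype of -- means the chip made no call at that position. The function name parseSnpLine is my own, for illustration.

```javascript
// Parse one data row of the 23andMe file into named fields.
// Assumes four TAB-separated columns: rsid, chromosome, position, genotype.
function parseSnpLine(line) {
  const [rsid, chromosome, position, genotype] = line.trim().split('\t');
  return {
    rsid,
    chromosome,
    position: Number(position),  // 1-indexed position on the reference
    // One letter per chromosome copy; '--' is assumed to mean "no call"
    alleles: genotype === '--' ? [] : genotype.split(''),
  };
}

const snp = parseSnpLine('rs12913832\t15\t28365618\tGG');
console.log(snp);
```

Rows for MT in the sample carry a single letter rather than two, so the alleles array is left at whatever length the file gives.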

The string rs12913832 is an RSID, or “Reference SNP cluster ID”. You can see this SNP in the official SNP database managed by the NCBI. There, you can see that around 82% of chromosome 15s globally have adenine at position 28,365,618.

Let’s see that for ourselves. Here’s some JavaScript that will highlight the position in our downloaded sequence.fasta for chromosome 15:

const fs = require('fs');
const es = require('event-stream');

const target = 28365618;

const startPrintingAt = target-140;
const stopPrintingAt = target+140;

let position = 1;  // positions are 1-indexed!

fs.createReadStream('sequence.fasta')
  .pipe(es.split())
  .pipe(es.mapSync(line => {
    if (line.startsWith('>')) return;  // skip the FASTA header line
    const nextPosition = position+line.length;
    if (position >= startPrintingAt && position < stopPrintingAt) {
      console.log(line);
    }
    if (position <= target && target < nextPosition) {
      console.log(' '.repeat(target-position) + '^');
    }
    position = nextPosition;
  }));

And here it is in action:

$ node highlightPosition.js
ACAGGAACAAAGAATTTGTTCTTCATGGCTCTCTGTGTCTGATCCAAGAGGCGAGGCCAGTTTCATTTGA
GCATTAAATGTCAAGTTCTGCACGCTATCATCATCAGGGGCCGAGGCTTCTCTTTGTTTTTAATTAATTG
       ^
TTTTTAACTGTGAGTTTATATACACTTGAAGCAGTATACATTTAGAAATGGTCTACTTGTCGTTTCTTTG
ATTACTACCCATGAGACAGTATTAGTAATTCTGGCCTATGAAATTGGCAAAGAAAACTACCAGTGGTGGG

Sure enough, it highlights the A, adenine, that is most commonly found at this position.
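The caret’s column can also be worked out with plain arithmetic, without scanning the file. This sketch assumes every sequence line holds 70 bases, as the excerpts of sequence.fasta suggest. The name locate and the constant BASES_PER_LINE are my own, for illustration.

```javascript
const BASES_PER_LINE = 70;  // assumption: matches the sequence.fasta excerpts

// Convert a 1-indexed genome position into a 0-indexed (line, column) pair
// within the sequence lines of the FASTA file (header line excluded).
function locate(position) {
  const offset = position - 1;
  return {
    line: Math.floor(offset / BASES_PER_LINE),
    column: offset % BASES_PER_LINE,
  };
}

const { line, column } = locate(28365618);

// The line that the script above printed just before the caret
const printedLine =
  'GCATTAAATGTCAAGTTCTGCACGCTATCATCATCAGGGGCCGAGGCTTCTCTTTGTTTTTAATTAATTG';
const referenceBase = printedLine[column];

// My genotype at rs12913832, from the grep output earlier
const myAlleles = ['G', 'G'];
const copiesDifferingFromReference =
  myAlleles.filter(allele => allele !== referenceBase).length;

console.log(line, column, referenceBase, copiesDifferingFromReference);
```

The column comes out as 7, which matches the seven spaces before the caret in the output above. Both of my copies differ from the reference at this position.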

Tagged #programming, #bioinformatics.


This page copyright James Fisher 2019. Content is not associated with my employer.