Population genetic study of repetitive DNA regions reveals new biological mechanisms in the general population and uncovers harmful mutations associated with Autism

Ileena Mitra
Jan 14, 2021

The advent of DNA sequencing has created a boom of large-scale genomic data. DNA sequencing has become relatively inexpensive, close to only $100 per genome. Thousands of individuals are having their genomes sequenced for research and clinical use. Yet, how do we make sense of all this information? What does it mean for our health? Can we learn any new biology from all this data?

Current technological limitations make it difficult to inspect the complete spectrum of human genetic variation, therefore leaving large sections of generated sequencing data sets uninterpretable. This limitation has been a barrier to studying certain mutations in DNA and how these mutations may contribute to human diseases.

Our new study, published on January 13th, 2021 in Nature (1), makes substantial progress toward removing this barrier in population genomic studies. The Gymrek lab developed new software to analyze and interpret whole-genome sequencing (WGS) data. We applied our novel methods to thousands of individuals’ genomes to learn more about the mechanisms of DNA mutation and how mutations might lead to complex human disorders.

Tandem repeats, a category of DNA variants, are responsible for many neurological disorders and cancers, yet they have been difficult to study on a large population scale. A tandem repeat is a pattern of 1 to 20 nucleotides (where each nucleotide is one of “A”, “C”, “G”, or “T”) that is repeated two or more times in a row at a given location in the DNA. The number of times the pattern (e.g. “CAG”) is repeated often varies from one individual to another.
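To make the definition concrete, here is a minimal Python sketch that counts how many consecutive copies of a pattern start at a given position in a DNA string. The function name, the toy sequence, and the start position are all made up for illustration; this is not how our software genotypes repeats from real sequencing reads.

```python
def count_tandem_repeats(sequence: str, motif: str, start: int = 0) -> int:
    """Count consecutive copies of `motif` in `sequence`, beginning at `start`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    count = 0
    pos = start
    # Keep stepping forward one motif length while the motif still matches.
    while sequence.startswith(motif, pos):
        count += 1
        pos += len(motif)
    return count


# Toy sequence (hypothetical): 3 flanking bases, 5 copies of "CAG", 3 flanking bases.
toy_sequence = "TTA" + "CAG" * 5 + "GGT"
repeat_count = count_tandem_repeats(toy_sequence, "CAG", start=3)
print(repeat_count)
```

In this toy example the pattern “CAG” is repeated 5 times; another person could carry 4 or 6 copies at the same location.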

Mutations can arise spontaneously as DNA is passed from parents to children, such that the child has a DNA variant that is not found in the genome of either biological parent. For example, if a specific region is repeated 8 times in a parent’s DNA, it may be repeated fewer times (contraction, e.g. 7x) or more times (expansion, e.g. 10x) in the child. Occasionally these genetic mutations can have a severely detrimental impact and lead to disease.

Scientific illustration of a spontaneous tandem repeat mutation of +2 repeats in a child.
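The idea of a spontaneous repeat mutation can be sketched in a few lines of Python. The example below assumes we already know the repeat counts of both copies of a region in the mother, father, and child (the numbers are invented). It first checks whether the child's two copies could simply have been inherited, one from each parent, and then labels a change in repeat count as a contraction or an expansion. The function names are hypothetical, and this simple check is only an illustration; it ignores the sequencing and genotyping errors that a real method has to model.

```python
from itertools import product


def is_mendelian(mother, father, child):
    """Return True if the child's two alleles can be explained by
    inheriting one allele from the mother and one from the father."""
    a, b = child
    return any(
        (m == a and f == b) or (m == b and f == a)
        for m, f in product(mother, father)
    )


def classify_repeat_change(parent_count, child_count):
    """Label the change in repeat count between a parent and child allele."""
    diff = child_count - parent_count
    if diff > 0:
        return f"expansion (+{diff})"
    if diff < 0:
        return f"contraction ({diff})"
    return "no change"


# Hypothetical repeat counts for one region (two alleles per person).
mother = (8, 8)
father = (8, 9)
child = (8, 10)

print(is_mendelian(mother, father, child))   # the 10-repeat allele is in neither parent
print(classify_repeat_change(8, 10))         # 8 repeats in the parent, 10 in the child
print(classify_repeat_change(8, 7))          # 8 repeats in the parent, 7 in the child
```

Here the child's 10-repeat copy cannot be explained by inheritance alone, so it is a candidate spontaneous mutation.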

Our research is the first to examine spontaneous tandem repeat mutations across the whole human genome in the general population, and this uniquely positioned us to study novel tandem repeat mutations in Autism Spectrum Disorder (autism). Autism is characterized by a clinically heterogeneous set of neurodevelopmental symptoms and is recognized primarily as a genetic condition (2). However, currently, only a small proportion of autism diagnoses can be explained by known genes associated with autism. Therefore, a large proportion of genes contributing to the risk of developing autism remains to be discovered.

CDC estimates 1 in 68 children has been identified with autism spectrum disorder.

We assessed over 1 million tandem repeat positions across the full genome of over 6000 individuals, which included healthy parents, healthy children, and children diagnosed with autism. Our research group developed new statistical methods to identify step-wise tandem repeat contraction or expansion mutations (MonSTR) and score the mutations based on their likelihood to cause diseases (SISTR).

Graph illustrating a tandem repeat mutation of -2 repeats in a child (1).

We found several interesting patterns in our results that inform us about the biology of DNA mutational mechanisms. In general, it is well known that a higher number of spontaneous mutations arise from the father’s DNA, and our results support this. A new finding is that tandem repeat mutations derived from the mother’s DNA are, on average, larger in size than those from the father’s DNA, lending new insights into the different biological processes (e.g. DNA repair mechanisms) in egg versus sperm cells.

Graph illustrating the average size of tandem repeat mutations from fathers versus mothers (1).

We also found several significant differences between children diagnosed with autism and their healthy siblings. Interestingly, children with autism had, on average, both more and larger tandem repeat contraction and expansion mutations than their healthy siblings. In addition, mutations in children with autism were enriched in brain-related regulatory regions, affecting genes expressed during prenatal brain development.

Graph illustrating the expression of genes affected by tandem repeat mutations in autistic versus healthy children (1).

We found that there is an average of 54 spontaneous tandem repeat mutations across all individuals. Generally, most spontaneous mutations that arise are harmless, and they are a critical source of human evolution. Therefore, we wanted to determine which of these mutations are actually harmful to individuals with autism. However, finding specific disease-related mutations is like finding a needle in a haystack.

Graph illustrating the distribution of tandem repeat mutations in autistic versus healthy children (1).

We addressed this problem by assessing tandem repeat mutations in the context of natural selection. Most disease-causing mutations are uncommon in the general population because over time natural selection has removed harmful mutations.
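The intuition that harmful variants tend to be rare can be illustrated with a short Python sketch. It computes how common each repeat count is in a population and flags the uncommon ones. Everything here is an assumption made for illustration: the population is invented, the 1% cutoff is arbitrary, and the function names are hypothetical. Our actual scoring method (SISTR) is described in the paper and is more involved than a frequency cutoff.

```python
from collections import Counter


def allele_frequencies(repeat_counts):
    """Return the fraction of alleles in the population carrying each repeat count."""
    if not repeat_counts:
        raise ValueError("repeat_counts must not be empty")
    total = len(repeat_counts)
    return {allele: n / total for allele, n in Counter(repeat_counts).items()}


def is_rare(allele, frequencies, threshold=0.01):
    """Flag an allele as rare if its frequency is below `threshold`.
    Alleles never seen in the population have frequency 0 and count as rare."""
    return frequencies.get(allele, 0.0) < threshold


# Invented population of 200 alleles at one tandem repeat region.
toy_population = [8] * 150 + [9] * 49 + [14] * 1
freqs = allele_frequencies(toy_population)

print(is_rare(14, freqs))  # seen once in 200 alleles
print(is_rare(8, freqs))   # the most common allele
```

A rare allele is not necessarily harmful, but rarity in the general population is one clue that natural selection may be acting against it.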

We found about 2.5x more rare and pathogenic tandem repeat mutations in children with autism than in healthy children. The tandem repeat mutations most likely to be harmful implicate potential new genes associated with autism and also confirm known autism-associated genes. Overall, the results of our new study suggest that tandem repeat mutations make a significant contribution to autism.

So, how do we make sense of all this information? What does it mean for our health? Can we learn any new biology from all this data?

The information gained from our research helps scientists learn new biology. We have a slightly better understanding of the mutational mechanisms of tandem repeat variation in the human genome. This work also sheds light for scientists on the role of tandem repeat DNA mutations contributing to the risk of autism. Understanding the biological cause of autism may, in turn, allow for developing biomarker-based diagnostic tests and personalized treatments.

The new open-source algorithms from the Gymrek Lab can be applied to new genomic datasets as DNA sequencing continues to expand. They allow other scientists to uncover new insights into tandem repeat mutations related to other complex human diseases.

This work would not be possible without the generosity of all research participants. Thank you!

Read my full research study “Patterns of de novo tandem repeat mutations and their role in autism” on Nature at https://www.nature.com/articles/s41586-020-03078-7.

Read an independent summary of the research in the context of the broader genomics field, “Repeat DNA expands our understanding of autism spectrum disorder” on Nature at https://www.nature.com/articles/d41586-020-03658-7.


  1. I. Mitra, et al. Nature (2021).
  2. P.F. Sullivan and D.H. Geschwind. Cell (2019).