# GENOME 541: population genetic inference

## Overview

Population genetics is the study of genetic variation shared among individuals, populations and species. It is a powerful lens on the history of humans and other species, revealing a complex past of migration, replacement, admixture with archaic populations, and dramatic population size changes during and after range expansions. Germline mutation events in the history of a population give rise to variant alleles, and demographic histories are recorded in patterns of allele segregation within and between current-day populations.

**Essay:**
K. Harris, *Reading the genome like a history book*. Science 358, 1265 (2017).

*Genetic drift* is the stochastic change in allele frequencies over time due to random outcomes of mutation and survival in a finite population.
Genetic drift is shared when populations mix, it is private during periods of isolation, and the strength of drift is universally influenced by the population size history.
Statistical methods for demographic inference—expressed in the framework of Kimura’s diffusion or Kingman’s coalescent—attempt to recover the effective population size history $N(t)$, where time $t$ is measured in years or generations before the present.

**Review:**
B. Charlesworth, *Fundamental concepts in genetics: effective population size and patterns of molecular evolution and variation*. Nat. Rev. Genet. 10, 195–205 (2009).

In the simplest settings, *demographic inference* means inferring the effective population size history $N(t)$ for a single population, and thus identifying important events such as bottlenecks and range expansions.
More complex models aim to jointly infer the histories of several populations, and include divergence, admixture, and migration between them.
Demographic inference is a rapidly growing area of computational biology.
The field is theoretically rooted in probabilistic population models that predate knowledge of the structure of DNA, but have become applicable—and worth improving—by the need to understand population-scale genome sequencing data.

**Review:**
J. G. Schraiber, J. M. Akey, *Methods and models for unravelling human evolutionary history*. Nat. Rev. Genet. 16, 727–740 (2015).

What can we do with demographic inference?

- Reconstruct the natural history of our favorite species (if we can get our friends to sequence it)
- Elucidate evolutionary forces that act in general cases
- Formulate richer null models for adaptive hypotheses
- Learn about epidemic spread (as you saw last week)
- Inform species conservation efforts
- Clarify interactions of human genomic variation and human health

### Goals

Our focus will be on how the effective population size history $N(t)$ shapes population genomic data.

**Specifically:**

- extend the coalescent theory you learned in Trevor Bedford’s unit to accommodate recombining genomes
- build intuition using simple calculations and simulations
- learn about three statistical paradigms for inferring demographic history:
- Allele frequency spectra
- Distributions of shared haplotype lengths
- Coalescent HMMs

## Lectures

### I: The coalescent (again) (html, pdf)

### II: Coalescent inference for recombining genomes (html, pdf)

## Homework

**Please submit either:**

- a Jupyter notebook for each problem (preferably exported to
`html`

or`pdf`

) - scripts, output images, and text files that are clearly named.

### Problem 1: variance in the coalescent

Suppose you are studying a diploid population with a given constant population size $N = 10,000$. You sequence 10 loci from each of 50 diploid individuals, so $n = 100$ samples for each locus. A “locus” is just a contiguous chunk of DNA sequence. Assume these 10 loci are unlinked, and no recombination happens within each locus. This means that there is an independent coalescent genealogy for each locus. Also suppose that each locus is about the same size, so they share a common mutatation rate $\mu=0.003$ mutations per locus per generation.

**1.1** *(1 point)* What is the expected TMRCA for any given locus?

**1.2** *(1 point)* What is the expected number of segregating sites $S$ arising in one locus? Assume the loci are large enough so that the infinite sites approximation is reasonable (i.e. the number of segregating sites is equal to the number of historical mutation events).

**1.3** *(8 points)* Based on our sequencing, we can measure $S$ for each locus.
Suppose that we find one unusually high diversity locus with $S=1000$.
Your collaborator suggests this high diversity is indicative of diversifying selection at this locus.

To assess this claim, estimate the probability that at least one of the 10 measured $S$ values would be at least this high (i.e. a $p$-value), under the null hypothesis that these are just outcomes from the neutral coalescent process.

Do you agree with your collaborator?

**Hints**

- You will estimate this $p$-value by simulation.
- One round of simulation means simulating a value of $S$ for each of the 10 loci.
- Across many rounds of simulation, we can ask what fraction have a locus with $S\ge1000$.
- You don’t need to simulate the topology of the coalescent trees, only the (exponentially distributed) time until the first event, then until second event, etc., and the (Poisson-distributed) number of mutations that arises conditioned on the intercoalescent times you simulated. See software notes below.

#### Software notes

##### Exponential and Poisson random variables in Python

First import

```
from scipy.stats import expon, poisson
```

To simulate an exponential random variable (rv) with with rate `r`

use

```
expon.rvs(scale=1/r)
```

To simulate a Poisson rv with mean `m`

use

```
poisson.rvs(m)
```

**Warning:**
Note the reciprocal rate parameter above (called a “scale” parameter) used in `scipy`

’s implementation
\[\nonumber{\rm scale}\equiv \frac{1}{\rm rate}\]

##### Exponential and Poisson random variables in R

For exponential

```
rexp(1, r)
```

and for Poisson

```
rpois(1, m)
```

The first argument `1`

indicates that you want one such rv.

#### References

The lectures will contain everything you need to solve the homework problems, but here are a few optional resources.

**Textbook:**
J. Wakeley, *Coalescent Theory: an Introduction*. First Edition. Macmillan. (2009). ISBN:9780974707754

**Review:**
N. A. Rosenberg, M. Nordborg, *Genealogical trees, coalescent theory and the analysis of genetic polymorphisms*. Nat. Rev. Genet. 3, 380–390 (2002).

### Problem 2: inferring demographic history with a coalescent HMM

This problem leverages variability in coalescent time across regions of two haplotypes from a recombining population to estimate demographic history $N(t)$.

Download this VCF file, which contains variant calls for a 10Mb chromosome from 100 diploid individuals ($n = 200$).

**VCF file format:**
The `POS`

field indicates the (0-based) position of each variant in each chromosome.
The individuals are named `tsk_0 tsk_1 ...`

, and their phased genotypes for each variant are indicated as

`0|0`

: homozygous for ancestral (i.e. non-mutant) allele`1|1`

: homozygous for derived (i.e. mutant) allele`0|1`

or`1|0`

: heterozygous

Note that the VCF file does not contain entries for non-segregating sites.

**2.1** *(1 point)* Compute and Plot the sample frequency spectrum (SFS).
Use log scaling for both x and y axes.
Does this look (qualitatively) like the SFS for a constant population history?
Briefly explain why or why not.

**2.2** *(2 points)* Compute the sequence divergence between the two haplotypes from individual `tsk_0`

in 100 kB windows across the chromosome, and plot this.
Produce a similar plot for the first 10 individuals `tsk_0 ... tsk_9`

**2.3** *(7 points)* Use the Pairwise Sequential Markov Coalescent (PSMC) algorithm (see software notes below) to infer the demographic history $N(t)$ from each of the first 10 individuals `tsk_0 ... tsk_9`

.
Plot these 10 histories on the same axes.
Briefly describe the event in the history of this population that gave rise to the SFS distortion.

#### Software notes

##### Pairwise Sequential Markov Coalescent (PSMC)

**Install PSMC:**
Click the green “Clone or download” button, then “Download ZIP”.
This will download `psmc-master`

, which you might want to move to your current working directory, and then unzip.
Navigate to the downloaded directory:

```
cd psmc-master
```

Build the software with:

```
make
cd utils
make
```

Navigate back to your working directory:

```
cd ../..
```

To test the the program runs, the following should print a usage message:

```
psmc-master/psmc
```

**PSMC input format:**
PSMC takes a FASTA-like input format for one diploid individual, where the sequence has a character for all sites in the contig, the character `T`

indicates the two haplotypes are identical at that site (`0|0`

or `1|1`

), and the character `K`

indicates they differ (`0|1`

or `1|0`

), e.g.:

```
>tsk_0
TKTT...
```

The FASTA header name (`tsk_0`

here) is not relevant.
For each individual you want to run PSMC on, you will need to create such a FASTA file by parsing genotypes in the VCF.

**Run PSMC:**
Use the following command to run PSMC on an input file in T/K format `diploid_0.fa`

, and produce an output file `diploid_0.psmc`

:

```
psmc-master/psmc -t 10 -N 30 -r 2 -o diploid_0.psmc diploid_0.fa
```

This may take several minutes to return.

**PSMC output format:**
The first few lines of the output file contain comments that explain its contents.

```
CC
CC Brief Description of the file format:
CC CC comments
CC MM useful-messages
CC RD round-of-iterations
CC LL \log[P(sequence)]
CC QD Q-before-opt Q-after-opt
CC TR \theta_0 \rho_0
CC RS k t_k \lambda_k \pi_k \sum_{l\not=k}A_{kl} A_{kk}
CC DC begin end best-k t_k+\Delta_k max-prob
CC
```

The values `t_k`

and `\lambda_k`

on the `RS`

lines correspond to times (into the past) and population size, respectively—this represents the $N(t)$ that we’re after.
Iterations of the algorithm print output in blocks separated by a line like

```
//
```

Use only the last such block (the final iteration).
The values of `t_k`

and `\lambda_k`

must be rescaled using an estimate of the mutation rate to get them in units of generations and individuals, respectively.
This is described in the section `APPENDIX I: Scaling the PSMC output`

of the PSMC README file.
Assume a mutation rate of `mu = 1e-8`

, and a bin size `s = 1`

, so the equation in that section reads

```
N_0 = theta_0 / (4 * mu)
```

(note the backslashes in the PSMC equation are not division symbols).

##### Terminal commands in Jupyter cells

You may find it convenient to execute terminal commands (i.e. PSMC) in Jupyter notebooks by beginning the cell with the `!`

character, e.g.

```
!echo foo > bar.txt
```

You can even loop them, mixing Python and terminal commands, e.g.

```
for i in range(10):
!echo {i} >> baz.txt
```

Note the curly braces grab the python variable `i`

for the terminal command.

#### References

**Original PSMC paper:**
H. Li, R. Durbin, *Inference of human population history from individual whole-genome sequences*. Nature 475, 493–496 (2011).

There are several more cutting-edge coalescent HMM methods, such as MSMC and SMC++. Some might call PSMC deprecated; I call it classic.

**Review of modern coalescent HMMs:**
J. P. Spence, M. Steinrücken, J. Terhorst, Y. S. Song, *Inference of population history using coalescent HMMs: review and outlook*. Curr Opin Genet Dev 53, 70–76 (2018).