get_diversity#

Once we have created a population by specifying its transmission history and performed a simulation of coalescence times using get_coalescent, we can estimate levels of nucleotide diversity and haplotype homozygosity using get_diversity.

!pip install coalestr
from coalestr import cs
# Specify the transmission history of a population
# Each inner list gives [duration, N, Q, X, M] for a period of history
my_history = [[1000, 30, 10, 0, 0], [1000, 10, 5, 0.5, 0]]

# Create the population
my_population = cs.Population(my_history)

# Obtain coalescence times
my_population.get_coalescent()
Observation time.    Events captured.   Mean coalescence time
                      beho      wiho        beho     wiho
        0             100.0     100.0        25.1     18.9
# Once we have run get_coalescent we can run get_diversity
my_population.get_diversity()
Observation time.  Nucleotide diversity     Haplotype homozygosity
                      beho       wiho           beho       wiho
        0           5.52e-07   4.16e-07       7.52e-01   8.12e-01

beho means a between-host sample and wiho means a within-host sample.

This tells us that the local parasite population has nucleotide diversity of 5.52 x 10^-7 and haplotype homozygosity of 0.752.

The mean within-host nucleotide diversity is 4.16 x 10^-7 and the mean within-host haplotype homozygosity is 0.812.

Haplotype homozygosity by default refers to a locus spanning 2 centimorgans. In the Plasmodium falciparum genome, one centimorgan is approximately 13.5kb, so this is equivalent to a 27kb haplotype locus. We can alter these values as shown below.

# Check the current length of a haplotype locus in kilobases
my_population.locus_kb
27
# Change the length of a haplotype locus to 135kb (~10 centimorgans)
my_population.locus_kb = 135

# Re-estimate haplotype homozygosity for this larger locus
my_population.get_diversity()
Observation time.  Nucleotide diversity     Haplotype homozygosity
                      beho       wiho           beho       wiho
        0           5.52e-07   4.16e-07       3.57e-01   4.97e-01
# If we specify multiple observation times for get_coalescent
# .. these are automatically passed to get_diversity

my_observations = [0, 500, 1000, 1500]

my_population.locus_kb = 27

my_population.get_coalescent(observe = my_observations, show = False)

my_population.get_diversity()
Observation time.  Nucleotide diversity     Haplotype homozygosity
                      beho       wiho           beho       wiho
     1500           8.58e-07   2.20e-07       6.45e-01   8.85e-01
     1000           8.58e-07   2.20e-07       7.17e-01   9.13e-01
      500           5.52e-07   4.16e-07       8.25e-01   8.68e-01
        0           5.52e-07   4.16e-07       7.94e-01   8.44e-01

This simulation with multiple observation times gives the haplotype homozygosity of a 27kb locus as 0.794 (between-host) and 0.844 (within-host) at observation time 0. This is slightly different from the values obtained earlier (0.752 and 0.812) when we ran the same simulation with a single observation time.

The reason for this discrepancy is variation in the effective recombination parameter \(\phi_t\). \(\phi_t\) is determined by haplotype heterozygosity and thus can vary during the course of a simulation. At the outset of a simulation we do not know haplotype heterozygosity and so we make an initial guess at \(\phi_t\) - this is the population attribute phi_seed with a default value of 0.2.

When we make multiple observations, get_diversity estimates the level of haplotype heterozygosity at each observation timepoint and adjusts the value of \(\phi_t\) accordingly. This means that our estimate of haplotype homozygosity at any given timepoint becomes increasingly accurate if we make multiple observations prior to that timepoint.

Data objects#

get_diversity produces an array called diversity that gives nucleotide diversity, haplotype homozygosity and the inbreeding coefficient \(F_{WS}\) for different observation times. If this is a subpopulation embedded within a larger metapopulation, it also gives the population structure metric \(F_{ST}\).

diversity is used for internal calculations and to generate figures and reports. For most purposes it can be ignored but the data can if necessary be retrieved as population attributes.The array is structured like this:

  • axis 0: observation time[i]

  • axis 1: measures of diversity at different observation times

    • 0: observation time

    • 1: between-host nucleotide diversity

    • 2: within-host nucleotide diversity

    • 3: between-host haplotype homozygosity

    • 4: within-host haplotype homozygosity

    • 5: Fws

    • 6: Fst (if this is a subpopulation)

# Example of accessing the diversity array

print("Mean levels of genetic diversity in a within-host sample:")

for i in range(len(my_population.observation_times)):
    
    print("  Obs time {0:1.0f}: nucleotide diversity {1:.2e}, haplotype homozygosity {2:.2f}, Fws {3:.2f}"
        .format(
        my_population.diversity[i, 0],
        my_population.diversity[i, 2],
        my_population.diversity[i, 4],
        my_population.diversity[i, 5]))
Mean levels of genetic diversity in a within-host sample:
  Obs time 0: nucleotide diversity 4.16e-07, haplotype homozygosity 0.84, Fws 0.25
  Obs time 500: nucleotide diversity 4.16e-07, haplotype homozygosity 0.87, Fws 0.25
  Obs time 1000: nucleotide diversity 2.20e-07, haplotype homozygosity 0.91, Fws 0.74
  Obs time 1500: nucleotide diversity 2.20e-07, haplotype homozygosity 0.89, Fws 0.74