Within-host \(\pi_W\) sanity checks#

On this page we perform sanity checks on our estimates of within-host nucleotide diversity \(\pi_W\) from deep genome sequencing data. We observe that \(\pi_W\) has a strikingly bimodal distribution, with a high-\(\pi_W\) peak and a low-\(\pi_W\) peak.

We postulate that the low-\(\pi_W\) peak represents transmission chains where there has been no recent superinfection or cotransmission. We are particularly interested in this value because we can use it to estimate the quantum of transmission \(Q\).

We would like to be sure that the low-\(\pi_W\) peak is not an artefact of genotyping errors, which could potentially arise either during genome sequencing or in the subsequent data pipeline. Ideally we would like to see data from replicated experiments but these are not currently available. Here we perform some basic analyses to reassure ourselves that there is no obvious systematic artefact in our low-\(\pi_W\) estimates.

Source of the data#

  • Data are from the MalariaGEN Pf6 dataset which has data on 6,051,696 variants in 7,113 samples from 30 countries.

  • We select variants that are high-quality biallelic coding SNPs (n = 502,221). High quality means ‘filter_pass == True’ & ‘vqslod > 3’

  • We select samples that passed QC (n = 5,970)

  • We use the data on allele depth for reference and non-reference forms of these biallelic SNPs in individual samples, and from this we calculate within-host heterozygosity for each SNP in each sample.

We will use these files#

wiho_het.npy

  • Gives within-host heterozygosity for each SNP (n = 502,221) in each sample (n = 5,970).

  • As we have selected biallelic SNPs this is easy to calculate from allele depth data.

  • Filename: 230318_wiho_het.npy

  • This is a large file - 11.7GB

region

  • A list of geographical regions for 5,970 selected samples

  • ‘SAM’=South America, ‘WAF’=West Africa, ‘CAF’=Central Africa, ‘EAF’=East African, ‘SAS’=South Asia, ‘WSEA’=Western SE Asia, ‘ESEA’=Eastern SE Asia, ‘OCE’=Oceania

  • Filename: 230316_region

het_per_snp

  • Mean within-host heterozygosity for each SNP, calculated from wiho_het.npy

  • Filename: 230317_het_per_snp

het_per_sample

  • Mean within-host heterozygosity for each sample after filtering by SNP heterozygosity (typically <0.02) and by geographical region.

  • Filenames:

    • All regions (5970 samples): 230317_het_per_sample_ALL

    • Southeast Asia: 230316_het_per_sample_SEA

    • Africa (3314 samples): 230317_het_per_sample_Africa

allele_depth_2.npy

  • dimension 0 = SNPs (n = 512,220)

  • dimension 1 = samples (n = 5,970)

  • dimension 2 = read depth for ref allele [0] and alt allele [1]

  • note that these are all biallelic coding SNPs

  • Filename: 221221_selected_gt_2.npy

# FOR RUNNING ON COLAB

#from google.colab import drive

#drive.mount('/content/drive')

#data_path = "/content/drive/My Drive/Colab Data/Pf6/"
# FOR RUNNING ON COLAB

#import numpy as np
#import json
#import math
#import matplotlib.pyplot as plt
#import statistics as stat
# FOR RUNNING LOCALLY

data_path = 'G:\\My Drive\\Colab Data\\Pf6\\'

import numpy as np
import json
import math
import matplotlib.pyplot as plt
import statistics as stat
with open(data_path + "230317_het_per_sample_ALL", "r") as filepath:
    het_per_sample = json.load(filepath)
    
wiho_het = np.load(data_path + '230318_wiho_het.npy', mmap_mode = 'r')

allele_depth = np.load(data_path + '221221_selected_gt_2.npy', mmap_mode = 'r')
# All regions - only SNPs with het < 0.02

log_het_per_sample = [math.log(x, 10) for x in het_per_sample]
plt.hist(log_het_per_sample, bins = 100)
plt.xlabel('Het per sample')
plt.ylabel('Number of samples')
plt.grid()
plt.show()
_images/aa8e5ca06b7efcf2cb08c09c7fae9cedf36ee160f1f4d9013c4c7d042cfa4fb2.png
snp_count = wiho_het.shape[0]
sample_count = wiho_het.shape[1]

sample_het = np.array(het_per_sample)

lo_het = sample_het < 3e-5
hi_het = sample_het > 3e-4
med_het = (sample_het >= 3e-5) & (sample_het <= 3e-4)
allele_depth_lo = []
allele_depth_hi = []

for i in range(sample_count):
    
    if lo_het[i] == True:
    
        depth = np.mean(allele_depth[:, i, 0]) + np.mean(allele_depth[:, i, 1])

        allele_depth_lo.append(depth)

    if hi_het[i] == True:
    
        depth = np.mean(allele_depth[:, i, 0]) + np.mean(allele_depth[:, i, 1])

        allele_depth_hi.append(depth)
        
mean_depth_lo = stat.mean(allele_depth_lo)

mean_depth_hi = stat.mean(allele_depth_hi)

print("Mean allele depth is {0:4.1f} in low het samples and \
{1:4.1f} in high het samples".format(mean_depth_lo, mean_depth_hi))
Mean allele depth is 75.7 in low het samples and 80.3 in high het samples
variants = []

for i in range(sample_count):
    
    if lo_het[i] == True:
        
        x = np.count_nonzero(wiho_het[: , i])
        
        variants.append(x)
        
mean_variants = stat.mean(variants)

print("The mean number of heterozygous SNPs in a low-het sample is {0:5.1f}".format(mean_variants))

log_variants = [math.log(x, 10) for x in variants]
plt.hist(log_variants, bins = 100)
plt.xlabel('Number of variants')
plt.ylabel('Number of samples')
plt.grid()
plt.show()
The mean number of heterozygous SNPs in a low-het sample is 412.4
_images/1da3fa7c904728ae4b69f3c6d2043258e98b403b912c29a255fdf470aee4d8f0.png
minor_alleles = []

for i in range(50):
    
    alleles = []

    if lo_het[i] == True:
        
        for j in range(snp_count):
        
            x = allele_depth[j, i, 0]
            
            y = allele_depth[j, i, 1]
            
            z = min(x, y)
        
            if z != 0:
                
                alleles.append(z)
                
        minor_alleles.append(sum(alleles))
        
mean_minor_alleles = stat.mean(minor_alleles)

print("The mean number of minor allele reads in a low het \
sample is", str(mean_minor_alleles))
The mean number of minor allele reads in a low het sample is 1684

Understanding the difference between high het and low het samples#

Look at SNPs that are present in both high and low het samples

  • What is their heterozygosity in low het samples vs high het samples?

How does this compare with heterozygosity of SNPs that are found only in low het samples?

# SNPs in both high and low het samples
#    what is their het in low het samples
#    what is their het in high het samples

snp_both_lo = []
snp_both_hi = []
snp_only_lo = []

for i in range(snp_count):
    
    if np.any(wiho_het[i, :][lo_het]) and np.any(wiho_het[i, :][hi_het]):
        
        mean_snp_het = stat.mean(wiho_het[i, :][lo_het]) 
        snp_both_lo.append(mean_snp_het)
        
        mean_snp_het = stat.mean(wiho_het[i, :][hi_het]) 
        snp_both_hi.append(mean_snp_het)
        
    if np.any(wiho_het[i, :][lo_het]) and not np.any(wiho_het[i, :][hi_het]):
        
        mean_snp_het = stat.mean(wiho_het[i, :][lo_het]) 
        snp_only_lo.append(mean_snp_het)
# SNPs present in low not high het samples
# looking at their het in low het samples
log_het_per_snp = [math.log(x, 10) for x in snp_only_lo]
plt.hist(log_het_per_snp, bins = 100)
plt.xlabel('Mean SNP het')
plt.ylabel('Number of SNPs')
plt.grid()
plt.show()
_images/09f9788a194ebc116b9bab0d43f055dd079332ee5bf51af9a1cb5d87aedf61be.png
# SNPs present in both high and low het samples
# looking at their het in low het samples
log_het_per_snp = [math.log(x, 10) for x in snp_both_lo]
plt.hist(log_het_per_snp, bins = 100)
plt.xlabel('Mean SNP het')
plt.ylabel('Number of SNPs')
plt.grid()
plt.show()
_images/f251036ee866c758d3e1afedbb975093be2110d5b314a89d0e8d12b8e7461d30.png
# SNPs present in both high and low het samples
# looking at their het in high het samples
log_het_per_snp = [math.log(x, 10) for x in snp_both_hi]
plt.hist(log_het_per_snp, bins = 100)
plt.xlabel('Mean SNP het')
plt.ylabel('Number of SNPs')
plt.grid()
plt.show()
_images/f05d0df694ad5bea5e05106b183f7224109866f610b5a5f4ae29bf5618ecbdba.png
maf = []

for i in range(sample_count):

    if lo_het[i] == True:
    
        for j in range(snp_count):
        
            x = allele_depth[j, i, 0]
            
            y = allele_depth[j, i, 1]
            
            z = min(x, y)
            
            if z > 0:
    
                a = x + y
    
                if a != 0:
        
                    b = z/a
    
                    maf.append(b)
        
plt.hist(maf, bins = 100)
plt.xlabel('Minor allele frequency')
plt.ylabel('Number of SNPs')
plt.yscale("log")
plt.grid()
plt.show()
_images/3a7b8e1a20c29379ab44b22e9a970945376bcae03ab7fa922eb12c4a583d5aae.png