Within-host \(\pi_W\)#

This code examines the frequency distribution of within-host nucleotide diversity \(\pi_W\) in samples taken from malaria-infected individuals around the world.

  • We find that \(\pi_W\) has a strikingly bimodal distribution (scroll down to see plots showing this).

  • We postulate that one of the peaks (with high \(\pi_W\)) represents hosts that have been superinfected, or who lie on a transmission chain where there is cotransmission following a recent episode of superinfection.

  • We postulate that the other peak (with low \(\pi_W\)) represents transmission chains where there has been no recent superinfection or cotransmission.

  • We can use the low \(\pi_W\) peak to estimate the quantum of transmission with the formula \(Q = \pi_W / 2 \mu\).

About the data#

Data are from the MalariaGEN Pf6 dataset which has data on 6,051,696 variants in 7,113 samples from 30 countries. This notebook uses files that we have created based on:

  • Selected variants that are high-quality biallelic coding SNPs (n = 502,221). High quality means ‘filter_pass == True’ & ‘vqslod > 3’

  • Selected samples that passed QC (n = 5,970)

  • Data on allele depth for reference and non-reference forms of these biallelic SNPs in individual samples, from which we have calculated within-host heterozygosity for each SNP in each sample.

Elsewhere we show an analysis of the SNPs that are heterozygous in samples with low \(\pi_W\) as a sanity check to exclude the possibility that the low \(\pi_W\) peak is a technical artefact due to systematic genotyping errors.

We will use these files#

These files are not currently available via Google Colab so this notebook is for info only

wiho_het.npy

  • Gives within-host heterozygosity for each SNP (n = 502,221) in each sample (n = 5,970).

  • As we have selected biallelic SNPs this is easy to calculate from allele depth data.

  • Filename: 230318_wiho_het.npy

  • This is a large file - 11.7GB - so it takes a while to process

region

  • A list of geographical regions for 5,970 selected samples

  • ‘SAM’=South America, ‘WAF’=West Africa, ‘CAF’=Central Africa, ‘EAF’=East African, ‘SAS’=South Asia, ‘WSEA’=Western SE Asia, ‘ESEA’=Eastern SE Asia, ‘OCE’=Oceania

  • Filename: 230316_region

het_per_snp

  • Mean within-host heterozygosity for each SNP, calculated from wiho_het.npy

  • Filename: 230317_het_per_snp

het_per_sample

  • Mean within-host heterozygosity for each sample after filtering by SNP heterozygosity (typically <0.02) and by geographical region.

  • Filenames:

    • All regions (5970 samples): 230317_het_per_sample_ALL

    • Southeast Asia: 230316_het_per_sample_SEA

    • Africa (3314 samples): 230317_het_per_sample_Africa

# This code currently requires the above files to be mounted on the user's Google Drive

from google.colab import drive

drive.mount('/content/drive')

data_path = "/content/drive/My Drive/Colab Data/Pf6/"
Mounted at /content/drive
import numpy as np
import json
import math
import matplotlib.pyplot as plt
import statistics as stat
selected_snps = 502221
selected_samples = 5970

Create snp_filter to filter out possible hyperhets#

  • Here we use het < 0.02 as our criterion for inclusion

with open(data_path + "230317_het_per_snp", "r") as fp:
    het_per_snp = json.load(fp)

snp_filter = []

for i in range(selected_snps):
    
    if het_per_snp[i] < 0.02:
        
        snp_filter.append(True)
    
    else:
        snp_filter.append(False)
        
filtered_snps = snp_filter.count(True)
        
print('This includes', str(filtered_snps), 'SNPs')
This includes 496459 SNPs

Create sample_filter to select a particular geographical region#

  • ‘SAM’=South America, ‘WAF’=West Africa, ‘CAF’=Central Africa, ‘EAF’=East African, ‘SAS’=South Asia, ‘WSEA’=Western SE Asia, ‘ESEA’=Eastern SE Asia, ‘OCE’=Oceania

with open(data_path + "230316_region", "r") as fp:
    region = json.load(fp)

sample_filter = []

for i in range(selected_samples):
    
    if region[i] == "CAF" or region[i] == "EAF" or region[i] == "WAF":
        
        sample_filter.append(True)
        
    else:
        
        sample_filter.append(False)
        
filtered_samples = sample_filter.count(True)
    
print('This includes', str(filtered_samples), 'samples')
This includes 3314 samples

Calculate mean heterozygosity per sample after filtering samples and SNPs#

  • Creates a list of het values, one for each selected & filtered sample

  • This is the mean het for selected & filtered SNPs and not for all nucleotide positions

  • On Colab this should take approximately 15 minutes depending on the selection of regions

  • Save the output for future use

wiho_het = np.load(data_path + '230318_wiho_het.npy', mmap_mode = 'r')

het_per_sample = []

for i in range(selected_samples):
    
    if sample_filter[i] == True:

        het = []
    
        for j in range(selected_snps):
    
            if snp_filter[j] == True: 
            
                het.append(wiho_het[j, i])
    
        y = stat.mean(het)
    
        het_per_sample.append(y)
    
    if i % 10 == 0:
        
        print(i)
        
with open(data_path + "230317_het_per_sample_Africa", "w") as filepath:
    json.dump(het_per_sample, filepath)

Plot histogram of within-host nucleotide diversity#

# All regions - only SNPs with het < 0.02

with open(data_path + "230317_het_per_sample_ALL", "r") as filepath:
    het_per_sample = json.load(filepath)
    
selected_snps = 496459 # this is <0.02 het, total SNPs is 502221
coding_positions = 12028350 # from Gardner et al 2002

convert_het_to_pi = selected_snps / coding_positions

het_per_sample_nonzero = [convert_het_to_pi * x for x in het_per_sample if x != 0]
log_het_per_sample = [math.log(x, 10) for x in het_per_sample_nonzero]
plt.hist(log_het_per_sample, bins = 100)
plt.xlabel('Within-host nucleotide diversity \u03C0')
plt.ylabel('Number of samples')
plt.grid()
plt.show()
_images/d4803d47a203c83060e2099366680216c64dbed9b9e6a34934d684d924e10f20.png
# Africa - only SNPs with het < 0.02

with open(data_path + "230317_het_per_sample_Africa", "r") as filepath:
    het_per_sample = json.load(filepath)
    
selected_snps = 496459 # this is <0.02 het, total SNPs is 502221
coding_positions = 12028350 # from Gardner et al 2002

convert_het_to_pi = selected_snps / coding_positions

het_per_sample_nonzero = [convert_het_to_pi * x for x in het_per_sample if x != 0]
log_het_per_sample = [math.log(x, 10) for x in het_per_sample_nonzero]
plt.hist(log_het_per_sample, bins = 100)
plt.xlabel('Within-host nucleotide diversity \u03C0')
plt.ylabel('Number of samples')
plt.grid()
plt.show()
_images/75f9e2c5e1abc606c4674e3a1099fb42004f13f483ad404e6bf33a9f38be465f.png
# Southeast Asia - only SNPs with het < 0.02

with open(data_path + "230316_het_per_sample_SEA", "r") as filepath:
    het_per_sample = json.load(filepath)
    
selected_snps = 496459 # this is <0.02 het, total SNPs is 502221
coding_positions = 12028350 # from Gardner et al 2002

convert_het_to_pi = selected_snps / coding_positions

het_per_sample_nonzero = [convert_het_to_pi * x for x in het_per_sample if x != 0]
log_het_per_sample = [math.log(x, 10) for x in het_per_sample_nonzero]
plt.hist(log_het_per_sample, bins = 100)
plt.xlabel('Within-host nucleotide diversity \u03C0')
plt.ylabel('Number of samples')
plt.grid()
plt.show()
_images/c6269f1f7029407a5eba23a7a2d49c4f509c05e85add098e70792842af6cfe7b.png