A metagenomic DNA sequencing assay that is robust against

SIFT-seq working principle

For the practical implementation of SIFT-seq, we tag DNA by bisulfite salt-induced conversion of unmethylated cytosines to uracils (Fig. 1a). Uracils created by bisulfite treatment are converted to thymines in subsequent DNA synthesis steps that are part of DNA sequencing library preparation. After DNA sequencing, contaminating DNA introduced after tagging can then be directly identified based on the lack of cytosine conversion. Bisulfite conversion does not require the use of commercial enzymes or oligos that are a frequent source of DNA contamination, and we found that it can be applied directly to the original sample, before DNA isolation. We developed a bioinformatics procedure to differentiate sample-intrinsic microbial DNA, contaminant microbial DNA, and host-specific DNA after SIFT-seq tagging (Fig. 1b, Methods). This procedure consists of three steps. First, host cfDNA is removed via mapping and k-mer matching. Given that CpG dinucleotides are heavily methylated in the human genome but rarely methylated in microbial genomes, sequences containing CG dinucleotides are also removed. Second, remaining sequences that contain more than three cytosines or a cytosine-guanine dinucleotide are flagged and removed as likely contaminants. Last, a species-level filtering step is performed to remove any remaining reads that primarily originate from C-poor regions in the reference genome (Fig. 1c, Methods).
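The read-level rule in the second step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the host-removal and species-level steps are omitted, and reads are assumed to be oriented so that bisulfite conversion appears as C to T (reads from the opposite strand would show G to A and would need to be reverse-complemented first). The thresholds are the ones stated in the text.

```python
def is_likely_contaminant(read: str, max_cytosines: int = 3) -> bool:
    """Flag a read that looks unconverted, i.e. not bisulfite-tagged.

    Assumption: the read is oriented so that conversion shows up as C->T.
    A read is flagged if it contains more than `max_cytosines` cytosines
    or any CG dinucleotide, following the thresholds given in the text.
    """
    seq = read.upper()
    return seq.count("C") > max_cytosines or "CG" in seq


def filter_reads(reads):
    """Keep only reads that are consistent with bisulfite tagging."""
    return [r for r in reads if not is_likely_contaminant(r)]


if __name__ == "__main__":
    # Hypothetical reads, for illustration only.
    reads = [
        "TTGATTTAGGTTAATG",  # fully converted: kept
        "ACGTTTGATT",        # contains a CG dinucleotide: flagged
        "TCTCTCTCTT",        # four cytosines: flagged
        "TTCTTCTTCTT",       # three cytosines, no CG: kept
    ]
    print(filter_reads(reads))
```

A rule this simple will also keep unconverted reads that happen to come from C-poor regions, which is presumably why the procedure described above adds a species-level filtering step.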

Fig. 1: SIFT-seq proof-of-principle.

a Experimental workflow. Tagging of sample-intrinsic DNA by bisulfite DNA treatment is performed directly on urine or plasma. Contaminating DNA introduced after the tagging step is identified based on lack of cytosine conversion. b Bioinformatics workflow. c Representative example of the cytosine fraction of mapped reads in an unfiltered (top) dataset, a read-level filtered dataset (middle) and a fully filtered dataset (bottom). d Number of reads assigned to Cutibacterium acnes (common environmental DNA contaminant) in ΦX174 DNA after conventional sequencing (green) and SIFT-seq (purple). e Deliberate contamination assay. Detection of known contaminants before (top) and after (bottom) filtering. f Number of reads assigned to contaminants. Boxes in the boxplots indicate the 25th and 75th percentiles, the band in the box indicates the median, and whiskers extend to 1.5 × Interquartile Range (IQR) of the hinge. Outliers (beyond 1.5 × IQR) are plotted individually. Source data for (d–f) are provided as a Source Data file.

We devised two assays to test the principle of SIFT-seq. First, we applied SIFT-seq and conventional DNA sequencing to samples of sheared ΦX174 DNA (New England Biolabs, #N3021S) with variable biomass (0.0025 ng, 0.025 ng, 0.25 ng, 2.5 ng, 26 ng, and 155 ng for SIFT-seq; 0.004 ng, 0.04 ng, 0.4 ng, 4 ng, 35 ng, and 240 ng for standard cfDNA sequencing). We first quantified the abundance of Cutibacterium acnes (C. acnes), which is a frequent member of the normal skin flora and is routinely identified as a contaminant in DNA sequencing7. We observed an increase in C. acnes abundance with decreasing input biomass, as expected given that samples with a lower biomass are more susceptible to environmental contamination (Fig. 1d). We found that despite a ~30% lower biomass at the beginning of library preparation for the SIFT-seq samples, far fewer C. acnes reads were present after SIFT-seq filtering (4223.8 and 119.5 MPM in the highest biomass samples, 1.48 and 0 MPM in the lowest biomass samples, before and after SIFT-seq filtering respectively; Fig. 1d).

Second, we performed SIFT-seq on sheared ΦX174 DNA samples with variable biomass (0.0025–155 ng; Fig. 1e), which we spiked after SIFT-seq tagging with 1 ng of sheared DNA from a well-characterized community of microbes to simulate microbial DNA contamination (10 species; Zymo Research, #D6305). Before applying the SIFT-seq bioinformatics filter, we observed a negative correlation between the ΦX174 DNA input biomass and the relative number of reads from the spike-in community, as expected (Pearson’s R = −0.54, p value = 6.5 × 10⁻⁶; Spearman’s ρ = −0.82, p value = 6.3 × 10⁻¹⁶; Fig. 1e). After applying the SIFT-seq filter, we observed an average decrease of 99.8% in molecules mapping to species of the spike-in community (Fig. 1f). Sequences mapping to Escherichia coli (E. coli) were the most abundant after filtering (58.89%). Given that ΦX174 genomic DNA is isolated after phage propagation in E. coli culture, we reasoned that these remaining reads were likely intrinsic to the original sample. Together, these experiments demonstrate the effectiveness of SIFT-seq for the detection and removal of DNA contaminants without removing species originally present in the sample.
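The gap between the Pearson and Spearman coefficients reported above is what one would expect when a monotone relationship spans several orders of magnitude of input biomass. The following Python sketch illustrates this with the six input amounts from the text and entirely hypothetical spike-in read fractions; the helper functions are written out for illustration and are not the authors' analysis code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Input biomass in ng (values from the text).
biomass_ng = [0.0025, 0.025, 0.25, 2.5, 26, 155]
# Hypothetical fractions of reads from the spike-in community.
spike_fraction = [0.9, 0.8, 0.5, 0.2, 0.05, 0.01]

if __name__ == "__main__":
    print("Pearson:", pearson(biomass_ng, spike_fraction))
    print("Spearman:", spearman(biomass_ng, spike_fraction))
```

Because the hypothetical fractions decrease strictly with biomass, the rank correlation is −1, while the linear coefficient is weaker because the largest input dominates the linear fit.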

Application of SIFT-seq to cell-free DNA in blood and urine

Cell-free DNA (cfDNA) in blood and urine has emerged as a useful analyte for the diagnosis of infection8,9,10,11,12,13,14,15. Metagenomic cfDNA sequencing can identify a broad range of potential pathogens with high sensitivity. Yet, because of the low biomass of microbial-derived cfDNA in blood and urine, metagenomic cfDNA sequencing is highly influenced by environmental contamination, limiting the specificity of metagenomic cfDNA sequencing for pathogen identification.

To assess the performance of SIFT-seq in metagenomic cfDNA sequencing, we assayed a total of 196 cfDNA samples (154 plasma, 42 urine) collected from six groups of subjects: (1) 30 plasma samples from a cohort of 14 patients hospitalized with COVID-19 (“COVID19 cohort”), (2) 53 plasma samples from a cohort of 44 patients seeking treatment for IBD (4 patients without IBD, 19 patients with Crohn’s disease, 21 patients with ulcerative colitis; “IBD cohort”), (3) 56 plasma samples from a cohort of 44 patients presenting with respiratory symptoms at outpatient clinics in Uganda (“Uganda cohort”), (4) 15 plasma samples from a cohort of 15 patients (10 patients with sepsis, 5 patients without sepsis but in the ICU; “sepsis cohort”), (5) 26 urine samples from a cohort of kidney transplant patients with and without urine culture-confirmed UTIs (16 positive urine culture, 10 negative urine culture; “kidney transplant cohort”), and (6) 16 urine samples collected early after transplantation from 10 kidney transplant patients that received a ureteral stent at the time of transplantation (samples were collected pre-stent and post-stent removal for 5 of the 10 patients; “early post-transplant cohort”; see Supplementary Table 1 and Supplementary Information for details on the patients and samples included).

We performed SIFT-seq for all samples and obtained an average of 48.5 ± 23.4 million paired-end reads per sample. We detected and quantified the abundance of 68 genera that have been reported as frequent DNA contaminants in multiple independent studies (summarized in Ref. 4; Fig. 2a; 49 of these genera detected in at least one sample). We found that 77% of these genera were completely removed from all samples after SIFT-seq filtering. We calculated the total number of molecules from all contaminant genera and observed a reduction of up to three orders of magnitude after SIFT-seq filtering (reduced by a factor of 7.5, 1,711.2, 177.6, 608.8, 215.4, 547.2; two-tailed, Wilcoxon signed-rank test, p values < 0.001 for all cohorts; Fig. 2b). We investigated the impact of SIFT-seq filtering on removing reads originating from the skin contaminant C. acnes (Fig. 2c). C. acnes was detected in all samples and completely removed from 62 samples by SIFT-seq filtering. In the remaining samples, we observed a reduction of up to two orders of magnitude in C. acnes reads (two-tailed, Wilcoxon signed-rank test, p values < 0.001 for all cohorts).
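As a minimal sketch of the arithmetic behind these comparisons, the Python snippet below normalizes a read count to molecules per million (MPM) and computes a fold reduction. Both the function names and the choice of denominator (all sequenced molecules in the sample) are assumptions made for illustration; the exact normalization is defined in the Methods.

```python
def molecules_per_million(count: int, total_molecules: int) -> float:
    """Normalize a molecule count to molecules per million (MPM).

    Assumption: the denominator is the total number of sequenced
    molecules in the sample.
    """
    if total_molecules <= 0:
        raise ValueError("total_molecules must be positive")
    return count / total_molecules * 1_000_000


def fold_reduction(before: float, after: float) -> float:
    """Ratio of abundance before filtering to abundance after filtering.

    Returns infinity when filtering removes every molecule.
    """
    if after == 0:
        return float("inf")
    return before / after


if __name__ == "__main__":
    # Hypothetical counts for one sample, for illustration only.
    before = molecules_per_million(750, 50_000_000)  # 15.0 MPM
    after = molecules_per_million(100, 50_000_000)   # 2.0 MPM
    print(fold_reduction(before, after))
```

Samples in which a contaminant is removed completely have an undefined ratio, which the sketch reports as infinity; such samples are better summarized as counts of complete removal, as is done for C. acnes above.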

Fig. 2: SIFT-seq applied to cell-free DNA in urine and plasma.

a Microbial abundance of the 25 most abundant common contaminant genera (selected from the 68 genera4) before and after SIFT-seq filtering in plasma and urine from six independent subject cohorts (Tx = transplant). Total abundance of all contaminant genera (b) and C. acnes (c) before and after SIFT-seq filtering (KUCP = Kidney Transplant cohort with positive urine culture, KUCN = Kidney Transplant cohort with negative urine culture, EPTx = Early Post Transplant cohort). Bray–Curtis dissimilarity index before (d) and after (e) filtering. Samples are organized by: sequencing batch, researcher performing the experiment, cohort, and biofluid. Boxes in the boxplots indicate the 25th and 75th percentiles, the band in the box indicates the median, and whiskers extend to 1.5 × Interquartile Range (IQR) of the hinge. Outliers (beyond 1.5 × IQR) are plotted individually. ***p value < 0.001. Source data are provided as a source data file.

We next evaluated the utility of SIFT-seq to correct for batch effects and to reveal true differences in microbiome profiles for different patient groups. To this end, we calculated the Bray–Curtis Dissimilarity Index for all clinical samples included in this study and sorted the datasets based on the following parameters: (1) sequencing run, (2) operator, (3) urine culture test, (4) study cohort, and (5) biofluid type. Before SIFT-seq filtering, we observed a high similarity for samples assayed in the same experimental batches (Fig. 2d). SIFT-seq filtering removed these batch effects and revealed distinct cohort-specific microbiome profiles. Most notably, we observed distinct plasma microbiome profiles for plasma samples from the Uganda cohort (Fig. 2e). These results demonstrate that SIFT-seq directly applied to biofluids leads to a dramatic decrease in experimental noise and bias due to DNA contamination.
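For readers unfamiliar with the metric, the Bray–Curtis dissimilarity between two abundance profiles is one minus twice the summed per-taxon minimum abundance divided by the total abundance of both profiles. The Python sketch below implements this standard definition on dictionaries that map taxon names to abundances; the function name and the example profiles are hypothetical and do not come from the study data.

```python
def bray_curtis_dissimilarity(a: dict, b: dict) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles.

    `a` and `b` map taxon names to non-negative abundances (for example
    MPM values). Taxa missing from a profile count as zero. The result is
    0.0 for identical profiles and 1.0 for profiles with no shared taxa.
    """
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both profiles are empty")
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    # Hypothetical profiles: two samples dominated by the same contaminant
    # look similar before filtering and dissimilar once it is removed.
    unfiltered_1 = {"Cutibacterium": 90, "Escherichia": 10}
    unfiltered_2 = {"Cutibacterium": 90, "Klebsiella": 10}
    filtered_1 = {"Escherichia": 10}
    filtered_2 = {"Klebsiella": 10}
    print(bray_curtis_dissimilarity(unfiltered_1, unfiltered_2))
    print(bray_curtis_dissimilarity(filtered_1, filtered_2))
```

The toy profiles show how a shared contaminant can make unrelated samples from the same batch appear similar: the two unfiltered profiles differ only slightly, while the filtered profiles share no taxa at all.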

SIFT-seq enables screening for UTI and characterization of the urine microbiome

The healthy urinary tract was long believed to be sterile16,17, but this picture was challenged with recent advances in urine culture techniques that have identified bacteria in the urinary tract of both males and females18. Yet many microbes are difficult to cultivate in vitro, and bacterial culture can also be sensitive to contamination19. Therefore, comprehensive and accurate characterization of species colonizing the urinary microbiome is still lacking.

We reasoned that SIFT-seq could provide insight into the composition of the urine microbiome with both high sensitivity and specificity. We first applied SIFT-seq to 26 urine samples from 23 kidney transplant patients with and without infection of the urinary tract as determined by conventional urine culture (16 positive urine culture [Enterococcus faecalis: n = 3; Enterococcus faecium: n = 1; E. coli: n = 10; Klebsiella pneumoniae: n = 1; Pseudomonas aeruginosa: n = 1] and 10 negative urine culture). SIFT-seq consistently identified microbial cfDNA from species reported by urine culture (16/16 urine culture positive samples; Fig. 3a). SIFT-seq also identified two Corynebacterium species (Corynebacterium jeikeium and Corynebacterium urealyticum) in one sample from a urine culture positive patient (E. coli) with culture-confirmed Corynebacterium co-infection. In addition, we found that samples from positive urine culture patients had a significantly higher burden of total microbial DNA compared to samples from negative urine culture patients (1451.8 ± 3024.7 MPM and 12.8 ± 17.6 MPM, respectively, in the filtered samples; p value = 7.1 × 10⁻⁴, two-tailed, Wilcoxon rank-sum test, Fig. 3b). Conventional metagenomic sequencing (without SIFT-seq filtering) detected uropathogens with equal sensitivity but was not robust against environmental contamination: DNA from common uropathogens not identified by culture was detected in many samples, albeit with low abundance, including in samples from patients without urine culture-confirmed UTIs. We conclude that the improved specificity of SIFT-seq allows for more accurate characterization of co-infection networks in the scope of UTIs, and more accurate characterization of the normal urine microbiome in the absence of UTIs. It is important to note that two common skin microbes, C. acnes and Staphylococcus epidermidis, were found in most samples (23/26 samples). While these two species have been shown to cause UTIs20,21, they may also have been introduced as contaminants at the time of urine collection, which underscores an important limitation of SIFT-seq: SIFT-seq is not robust against contamination that occurs before the tagging step.

Fig. 3: Application of SIFT-seq to urine.

a Heatmap of abundance of species (molecules per million, MPM, species with at least one read detected by BLAST) identified in patients with and without urine culture-confirmed UTIs, before and after application of SIFT-seq filter (black * indicates agreement with urine culture). b Boxplot of the relative number of microbe-derived molecules (MPM) in samples from patients with and without urine culture-confirmed UTIs, before and after SIFT-seq filtering. c (i) Sample collection timepoints after transplantation for 5 patients. (ii) Boxplot showing Bray–Curtis similarity index (as defined in c (i)) of the urine microbiome within individual patients and between patients before and after stent removal. Boxes in the boxplots indicate the 25th and 75th percentiles, the band in the box indicates the median, and whiskers extend to 1.5 × Interquartile Range (IQR) of the hinge. Outliers (beyond 1.5 × IQR) are plotted individually. (* p value < 0.05, ** p value < 0.01, *** p value < 0.001). Source data for (a–c(ii)) are provided as a source data file.

Studies investigating the temporal dynamics of the urine microbiome in individuals can benefit from the high sensitivity and specificity achieved with our assay. We applied SIFT-seq to paired urine samples obtained from five kidney transplant patients collected at two timepoints before and after ureteral stent removal (Fig. 3c(i)). We compared the similarity of microbial composition between samples from the same patient (intra-individual) and between different patients (inter-individual) at different sampling points. Using the filtered, but not the unfiltered, datasets, we observed that the microbial composition remained more similar within the same patient than between different patients (Fig. 3c(ii)), supporting the utility of SIFT-seq to measure subtle dynamics in urine microbiome composition (mean Bray–Curtis similarity: 0.41 ± 0.06 and 0.317 ± 0.09, respectively; p value = 2.8 × 10⁻², two-tailed, Wilcoxon rank-sum test; Supplementary Fig. 1).

To compare the performance of SIFT-seq with existing bioinformatic techniques for eliminating environmental DNA contamination, we benchmarked SIFT-seq against Low Biomass Background Correction (LBBC)6, a bioinformatics noise filtering tool. LBBC identifies and removes two types of noise: (1) digital cross talk stemming from alignment errors and (2) physical noise arising from environmental DNA contamination present in reagents required for DNA isolation and sequencing library preparation. We compared SIFT-seq-filtered and LBBC-filtered data for samples from the kidney transplant cohort (n = 26). On average, LBBC filtering resulted in a 1.4-fold reduction of reads originating from contaminant genera, while SIFT-seq achieved a 7.5-fold reduction (p value < 0.001 for both SIFT-seq and LBBC, two-tailed, Wilcoxon signed-rank test; Supplementary Fig. 2a). SIFT-seq identified all species detected by conventional urine culture (16/16), while LBBC detected only 10/16 species reported by culture (Supplementary Fig. 2b). The decrease in false positive rate after LBBC filtering thus occurred at the expense of a decreased true positive rate. We also performed SIFT-seq on negative controls included in 32/33 experimental batches (see Methods). We quantified the reads originating from contaminant genera before and after SIFT-seq filtering and found that SIFT-seq removed 95.8% of all contaminant genera detected in the negative controls (506.7 ± 827.53 versus 0.4 ± 0.6 MPM before and after SIFT-seq filtering, respectively; p value < 0.001, two-tailed, Wilcoxon signed-rank test; Supplementary Fig. 3).

SIFT-seq identifies bacterial and viral co-infection of COVID-19 from blood

The COVID-19 pandemic is an unprecedented human health crisis. Viral or bacterial co-infection occurs in roughly 4% of hospitalized COVID-19 patients but can occur in up to 30% of COVID-19 patients admitted to the intensive care unit22. Co-infection has been associated with longer fever duration, and increased risk of intensive care unit admission and need for mechanical ventilation23. We reasoned that SIFT-seq may offer sensitive detection of bacterial and viral co-infection in COVID-19 patients with improved specificity over conventional metagenomic sequencing assays.

We applied SIFT-seq to 30 plasma samples from 14 patients with COVID-19 collected as part of a clinical study aimed at identifying predictors of disease severity. Respiratory and blood cultures were obtained as part of standard clinical care. Three patients (P16, P24, P39) tested positive for bloodstream infection and respiratory tract infection, while all other patients were not diagnosed with COVID-19 co-infection. SIFT-seq identified the causative pathogen in 3/3 bloodstream infection cases and 8/8 respiratory infection cases (Fig. 4a, b). Conventional metagenomic sequencing (without SIFT-seq filtering) was equally sensitive to these pathogens but was limited by specificity. Of interest, while we did not obtain plasma collected the day of infection for P24, we identified cfDNA originating from K. pneumoniae and Haemophilus influenzae, for which the patient tested positive 4 days later. These results suggest that SIFT-seq may be able to identify cases of infection earlier than traditional culture methods, and with improved specificity compared to conventional metagenomic sequencing techniques.

Fig. 4: Application of SIFT-seq to plasma.

Heatmaps of the abundance of species identified in plasma from COVID-19 patients with and without culture confirmed (a) lung and (b) blood infection, before and after application of SIFT-seq filter (black * indicates agreement with culture; HCMV: Human cytomegalovirus, HSV-1: Herpes simplex virus 1). c A heatmap of abundance of species identified in the sepsis cohort before and after SIFT-seq filtering (black * indicates species identified by blood culture). d Barplot of the prevalence of Epstein-Barr Virus (EBV), Torque teno virus (TTV), malaria-causing, or shigellosis-causing microorganisms in different patient cohorts. e Heatmap of the abundance of species identified in matched stool and plasma cfDNA samples in patients diagnosed with Crohn’s disease or ulcerative colitis. f Schematic for matched stool and plasma samples from individuals before and after medical therapy. g Heatmap of the change in abundance of gut-specific bacteria before and after treatment. Source data are provided as a source data file.

SIFT-seq identifies infection-causing pathogens in sepsis patients

Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to a bacterial, viral, fungal or parasitic infection24. According to the World Health Organization, in 2017 there were 48.9 million sepsis cases and 11 million sepsis-related deaths worldwide. When sepsis is suspected, broad-spectrum empiric antibiotics are administered, and tests are performed to identify the infection-causing pathogens. Blood culture is the gold standard method to detect infectious pathogens in the bloodstream; however, this method is time consuming and limited to a few culturable microbes. Though other molecular tests can shorten time to results when performed directly on blood, the low microbial burden in blood leads to low sensitivity, low negative predictive values, and detection of only a few specific pathogens25. Thus, conventional metagenomic cfDNA sequencing holds promise in identifying sepsis-causing pathogens26,27.

We tested the utility of SIFT-seq to identify sepsis-causing pathogens in patients with sepsis. For this, we performed a blinded analysis on 15 plasma samples (n = 10 from septic patients and n = 5 from non-septic patients). 9/15 patients had a positive blood culture result (9/10 patients with sepsis), 3/15 had a negative blood culture, and for 3/15 patients blood culture was not performed. A total of 10 pathogens were identified in the 9 blood-culture positive samples. After unblinding, we found a strong agreement between pathogens that were identified by blood culture and those that were identified by SIFT-seq: SIFT-seq detected 10 out of 10 pathogens reported by blood culture. Importantly, for only 2/9 patients with positive blood culture, a plasma sample was collected at the time of the positive blood culture (E. faecalis identified by blood culture and SIFT-seq, Fig. 4c). For 7/9 patients, the plasma sample for SIFT-seq was collected after the initial positive blood culture and after initiation of antibiotic treatment. Blood cultures corresponding to the time of plasma sample collection for those 7/9 samples were all negative, while SIFT-seq correctly identified the pathogen identified by culture in the sample before initiation of antibiotic treatment. This experiment demonstrates the utility of SIFT-seq to identify blood-borne pathogens in the setting of sepsis, even after initiation of antibiotic treatment when blood cultures frequently fail.

SIFT-seq identifies clinically-relevant bacterial and viral microorganisms with low prevalence and low microbial burden

Neglected tropical diseases significantly impact the public health and economies of low-income countries. Treatments exist for many of these diseases, but development and deployment of reliable diagnostic tests has been slow28. We reasoned that SIFT-seq could be used to screen for infections with low prevalence and low microbial burden.

We applied SIFT-seq to 56 plasma samples from 44 individuals who presented with symptoms of respiratory illness at outpatient clinics in Uganda. Nine of these individuals were HIV positive at the time of sample collection. We mined the data to determine the prevalence of clinically-relevant bacterial and viral microorganisms endemic to Uganda and compared with results obtained for plasma samples collected from subjects that live in North America (53 plasma samples from the IBD cohort; 30 plasma samples from the COVID-19 cohort). We screened the samples for Epstein-Barr virus, Torque Teno virus, and pathogens associated with malaria (Plasmodium vivax and P. falciparum), and shigellosis (Shigella sonnei, S. dysenteriae, S. boydii, and S. flexneri) before (Supplementary Fig. 4) and after SIFT-seq filtering (Fig. 4d). After SIFT-seq filtering, these microorganisms were found at varying rates in samples from the Uganda cohort: malaria (3/44), Epstein-Barr virus (1/44), shigellosis (19/44), and torque teno virus (1/44), but not in the IBD cohort. Torque teno virus, which has previously been reported to be elevated in immunocompromised patients8, was identified in 3/30 COVID-19 patient samples, all from patients who had received a bone marrow transplant prior to sample acquisition.

SIFT-seq identifies signatures of bacterial translocation from the gastrointestinal tract

Bacterial translocation of intestinal microbes through mucosal membranes is believed to be a normal phenomenon, but has been found to occur more frequently in patients experiencing gut flora disruption29,30. In patients with IBD, gut vascular barrier disruption has been linked to increased intestinal permeability and subsequent microbial translocation across the mucosal membrane31,32. The translocation of gut bacteria and their products to extraintestinal sites can result in systemic inflammation, resulting in autoimmune or other non-infectious diseases31,33. Detecting signatures of translocation is therefore important but difficult in view of the low abundance of microbial DNA due to translocation in blood.

To identify signatures of bacterial translocation, we compared whole genome shotgun sequencing of fecal samples from 44 patients (non-IBD n = 4, Crohn’s n = 19, ulcerative colitis, n = 21) to matched plasma cfDNA samples assayed using SIFT-seq. We first quantified bacterial species identified in matched fecal and plasma samples (Fig. 4e). We identified cfDNA derived from gut-specific microbes in all patient samples, though to a much greater extent in individuals with ulcerative colitis (0.57 ± 0.65, 1.22 ± 1.38, and 5.55 ± 9.46 MPM of gut-specific bacteria for non-IBD, Crohn’s disease, and ulcerative colitis samples, respectively). To investigate the effects of treatment on bacterial translocation, we collected additional stool and plasma samples from nine patients (Crohn’s n = 3, ulcerative colitis n = 6) after treatment initiation and performed whole genome shotgun sequencing of stool and SIFT-seq on plasma cfDNA (Fig. 4f). We quantified the relative abundance of gut-specific bacterial species before and after treatment and found that the burden of cfDNA decreased for most bacterial species (28/36) following treatment, which may be explained by a reduction in the degree of bacterial translocation with treatment (Fig. 4g). Of interest, out of seven subjects for which we detected Lactobacillus before treatment, five displayed an increase in Lactobacillus species burden in blood after treatment (up to 12.7-fold increase after treatment and an average of 3.36-fold MPM increase after treatment across all samples). Lactobacillus has been shown to promote gastrointestinal barrier function, protecting the gut from pathogenic bacteria and preventing inflammation32. For bacterial species besides Lactobacillus, we find an average of 70% reduction in MPM after treatment. These preliminary results support the use of SIFT-seq to identify subtle signatures of bacterial translocation in the blood.
