1
Lorv JSH, McConkey BJ. Kastor: a reference-based comparative approach for assessment and correction of gene-fragmenting errors in long-read assemblies of small genomes. BMC Genomics 2025; 26:388. [PMID: 40251490; PMCID: PMC12007338; DOI: 10.1186/s12864-025-11569-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 04/04/2025] [Indexed: 04/20/2025] Open
Abstract
Long-read sequencing technologies provide an efficient approach to generating highly contiguous and informative assemblies. However, their higher relative error rates can introduce frameshifts and premature stop codons that pseudogenize genes, hindering downstream analyses. We developed a software tool that detects gene-fragmenting errors in draft assemblies of small genomes through comparison with a curated set of reference genome sequences and raw read information. In our presented example, detected errors represent less than 0.05% of the genome, but correcting them reduced the rate of pseudogenes from 23.3% to 5.6% in example long-read assemblies, comparable to the rate of pseudogenes in short-read assemblies. We demonstrate that this software can detect assembly errors in long-read assemblies generated from small genomes and correct them to de-fragment genes.
Affiliation(s)
- Janet S H Lorv
  - Department of Biology, University of Waterloo, Waterloo, ON, Canada
2
Mariene GM, Wasmuth JD. Genome assembly variation and its implications for gene discovery in nematodes. Int J Parasitol 2025; 55:239-252. [PMID: 39832614; DOI: 10.1016/j.ijpara.2025.01.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Revised: 10/10/2024] [Accepted: 01/12/2025] [Indexed: 01/22/2025]
Abstract
Genome assemblers are a critical component of genome science, but the choice of assembly software and protocols can be daunting. Here, we investigate genome assembly variation and its implications for gene discovery across three nematode species (Caenorhabditis bovis, Haemonchus contortus, and Heligmosomoides bakeri), highlighting the critical interplay between assembly choice and downstream genomic analysis. Selecting commonly used genome assemblers, we generated multiple assemblies for each species, analyzing their structure, completeness, and effect on gene family analysis. Our findings demonstrate that assembly variations can significantly affect gene family composition, with notable differences in gene families important in anthelmintic discovery and immunomodulation. Despite broadly similar performance using various assembly metrics, comparisons of assemblies within a single species revealed underlying structural rearrangements and inconsistencies in gene content, which would affect downstream analyses. This emphasizes the need for continuous refinement of genome assemblies and their annotations.
Affiliation(s)
- Grace M Mariene
  - Faculty of Veterinary Medicine, University of Calgary, Calgary, Alberta, Canada; Host-Parasite Interactions Research Training Network, University of Calgary, Calgary, Alberta, Canada
- James D Wasmuth
  - Faculty of Veterinary Medicine, University of Calgary, Calgary, Alberta, Canada; Host-Parasite Interactions Research Training Network, University of Calgary, Calgary, Alberta, Canada
3
Zhang E, Coombe L, Wong J, Warren RL, Birol I. GoldPolish-target: targeted long-read genome assembly polishing. BMC Bioinformatics 2025; 26:78. [PMID: 40055584; PMCID: PMC11887200; DOI: 10.1186/s12859-025-06091-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Accepted: 02/19/2025] [Indexed: 03/12/2025] Open
Abstract
BACKGROUND: Advanced long-read sequencing technologies, such as those from Oxford Nanopore Technologies and Pacific Biosciences, are finding wide use in de novo genome sequencing projects. However, long reads typically have higher error rates relative to short reads. If left unaddressed, subsequent genome assemblies may exhibit high base error rates that compromise the reliability of downstream analysis. Several specialized error correction tools for genome assemblies have since emerged, employing a range of algorithms and strategies to improve base quality. However, despite these efforts, many genome assembly workflows still produce regions with elevated error rates, such as gaps filled with unpolished or ambiguous bases. To address this, we introduce GoldPolish-Target, a modular targeted sequence polishing pipeline. Coupled with GoldPolish, a linear-time genome assembly algorithm, GoldPolish-Target isolates and polishes user-specified assembly loci, offering a resource-efficient means for polishing targeted regions of draft genomes.
RESULTS: Experiments using Drosophila melanogaster and Homo sapiens datasets demonstrate that GoldPolish-Target can reduce insertion/deletion (indel) and mismatch errors by up to 49.2% and 55.4%, respectively, achieving base accuracy values upwards of 99.9% (Phred score Q > 30). This polishing accuracy is comparable to the current state-of-the-art, Medaka, while exhibiting up to 27-fold shorter run times and consuming 95% less memory, on average.
CONCLUSION: GoldPolish-Target, in contrast to most other polishing tools, offers the ability to target specific regions of a genome assembly for polishing, providing a computationally lightweight and highly scalable solution for base error correction.
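The link between the 99.9% base accuracy and the Phred score quoted in the abstract above follows from the standard definition Q = -10 log10(p), where p is the per-base error probability. The following is a minimal sketch of that conversion only; the function names are illustrative assumptions and are not part of GoldPolish-Target.

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Convert a per-base accuracy (strictly between 0 and 1) to a Phred score."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)


def accuracy_from_phred(q: float) -> float:
    """Convert a Phred score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Hypothetical accuracies chosen only to illustrate the scale.
    for acc in (0.99, 0.999, 0.9999):
        print(f"accuracy {acc} -> Q = {phred_from_accuracy(acc):.1f}")
```

Under this definition an accuracy of 99.9% corresponds to Q30, so accuracy values above 99.9% correspond to Q > 30.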
Affiliation(s)
- Emily Zhang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
4
Park G, An H, Luo H, Park J. NanoMnT: an STR analysis tool for Oxford Nanopore sequencing data driven by a comprehensive analysis of error profile in STR regions. Gigascience 2025; 14:giaf013. [PMID: 40094553; PMCID: PMC11912559; DOI: 10.1093/gigascience/giaf013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Revised: 12/29/2024] [Accepted: 02/02/2025] [Indexed: 03/19/2025] Open
Abstract
Oxford Nanopore Technology (ONT) sequencing is a third-generation sequencing technology that enables cost-effective long-read sequencing, with broad applications in biological research. However, its high sequencing error rate in low-complexity regions hampers its applications in short tandem repeat (STR)-related research. To address this, we generated a comprehensive STR error profile of ONT by analyzing publicly available Nanopore sequencing datasets. We show that the sequencing error rate is influenced not only by STR length but also by the repeat unit and the flanking sequences of STR regions. Interestingly, certain flanking sequences were associated with higher sequencing accuracy, suggesting that certain STR loci are more suitable for Nanopore sequencing than other loci. While base quality scores of substitution errors within the STR regions were lower than those of correctly sequenced bases, such patterns were not observed for indel errors. Furthermore, choosing the most recent basecaller version and using the super accuracy model significantly improved STR sequencing accuracy. Finally, we present NanoMnT, a lightweight Python tool that corrects STR sequencing errors in sequencing data and estimates STR allele sizes. NanoMnT leverages the characteristics of ONT when estimating STR allele size and exhibits superior results for 1-bp and 2-bp repeat STRs compared to existing tools. By integrating our findings, we improved STR allele estimation accuracy for Ax10 repeats from 55% to 78% and up to 85% when excluding loci with unfavorable flanking sequences. Using NanoMnT, we present the utility of our findings by identifying microsatellite instability status in cancer sequencing data. NanoMnT is publicly available at https://github.com/18parkky/NanoMnT.
Affiliation(s)
- Gyumin Park
  - School of Life Sciences, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Republic of Korea
- Hyunsu An
  - School of Life Sciences, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Republic of Korea
- Han Luo
  - Department of Thyroid and Parathyroid Surgery, Laboratory of thyroid and parathyroid disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, Sichuan 61005, China
- Jihwan Park
  - School of Life Sciences, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Republic of Korea
5
Sun B, Guo J, Jin H, Jin Z, Sun Y, Mao Y, Xie F, He Y, Sun Z, Li W, Ivanov I, Tian H. MetaCONNET: A metagenomic polishing tool for long-read assemblies. PLoS One 2024; 19:e0313515. [PMID: 39625881; PMCID: PMC11614293; DOI: 10.1371/journal.pone.0313515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 10/25/2024] [Indexed: 12/06/2024] Open
Abstract
Accurate, high-coverage genome assemblies are the basis for downstream analysis in metagenomic studies. Long-read sequencing technology is an ideal tool for metagenome assembly, apart from the drawback that it usually produces reads with a high sequencing error rate. Many polishing tools have been developed to correct sequencing errors, but most were designed on the basis of one or two species. Considering the complexity and uneven depth of metagenomic studies, we present a novel deep-learning polishing tool named MetaCONNET for polishing metagenomic assemblies. We evaluate MetaCONNET against Medaka, CONNET and NextPolish in accuracy, coverage, contiguity and resource consumption. Our results demonstrate that MetaCONNET provides a valuable polishing tool and can be applied to many metagenomic studies.
Affiliation(s)
- Bingru Sun
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Jian Guo
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Hao Jin
  - Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, China
- Zijie Jin
  - Peking University International Cancer Institute, Health Science Center, Peking University, Beijing, China
- Yaping Sun
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Yuanchen Mao
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Fuli Xie
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Yun He
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Zhihong Sun
  - Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, China
- Wei Li
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Igor Ivanov
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
- Hui Tian
  - Axbio Biotechnology (Shenzhen) Co., Ltd., Shenzhen, China
6
Smith GJ, van Alen TA, van Kessel MA, Lücker S. Simple, reference-independent assessment to empirically guide correction and polishing of hybrid microbial community metagenomic assembly. PeerJ 2024; 12:e18132. [PMID: 39529629; PMCID: PMC11552494; DOI: 10.7717/peerj.18132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 08/29/2024] [Indexed: 11/16/2024] Open
Abstract
Hybrid metagenomic assembly of microbial communities, leveraging both long- and short-read sequencing technologies, is becoming an increasingly accessible approach, yet its widespread application faces several challenges. High-quality references may not be available for the assembly accuracy comparisons common in benchmarking, and certain aspects of hybrid assembly may benefit from dataset-dependent, empirical guidance rather than the application of a uniform approach. In this study, several simple, reference-free characteristics, particularly coding gene content and read recruitment profiles, were hypothesized to be reliable indicators of assembly quality improvement during iterative error-fixing processes. These characteristics were compared to reference-dependent genome- and gene-centric analyses common for microbial community metagenomic studies. Two laboratory-scale bioreactors were sequenced with short- and long-read platforms, and assembled with commonly used software packages. Following long-read assembly, long-read correction and short-read polishing were iterated up to ten times to resolve errors. These iterative processes were shown to have a substantial effect on gene- and genome-centric community compositions. Simple, reference-free assembly characteristics, specifically changes in gene fragmentation and short-read recruitment, were robustly correlated with advanced analyses common in published comparative studies, and are therefore suitable proxies for hybrid metagenome assembly quality that simplify the identification of the optimal number of correction and polishing iterations. As hybrid metagenomic sequencing approaches will likely remain relevant due to the low added cost of short-read sequencing for differential coverage binning or the ability to access lower abundance community members, it is imperative that users are equipped to estimate assembly quality prior to downstream analyses.
Affiliation(s)
- Garrett J. Smith
  - Department of Microbiology, The Ohio State University, Columbus, OH, United States of America
  - Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
- Theo A. van Alen
  - Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
- Maartje A.H.J. van Kessel
  - Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
- Sebastian Lücker
  - Department of Microbiology, Radboud Institute for Biological and Environmental Sciences, Radboud University, Nijmegen, Netherlands
7
Cui FJ, Fu X, Sun L, Zan XY, Meng LJ, Sun WJ. Recent insights into glucans biosynthesis and engineering strategies in edible fungi. Crit Rev Biotechnol 2024; 44:1262-1279. [PMID: 38105513; DOI: 10.1080/07388551.2023.2289341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2022] [Revised: 02/28/2023] [Accepted: 04/21/2023] [Indexed: 12/19/2023]
Abstract
Fungal α/β-glucans are important in cellular functions including cell wall structure, host-pathogen interactions and energy storage, and are widely applied in high-profile fields including food, nutrition, and pharmaceuticals. Fungal species and their growth/developmental stages result in a diversity of glucan contents, structures and bioactivities. Substantial progress has been made in elucidating the fine structures and functions, and in revealing the potential molecular synthesis pathway, of fungal α/β-glucans. Herein, we review the current knowledge about the biosynthetic machinery, including precursor UDP-glucose synthesis; initiation, elongation/termination and remodeling of α/β-glucan chains; and molecular regulation to maximally produce glucans in edible fungi. This review provides future perspectives for biosynthesizing targeted glucans and revealing the catalytic mechanisms of enzymes associated with glucan synthesis, including UDP-glucose pyrophosphorylases (UGP), glucan synthases, and glucanosyltransferases in edible fungi.
Affiliation(s)
- Feng-Jie Cui
  - School of Food and Biological Engineering, Jiangsu University, Zhenjiang, P. R. China
  - Jiangxi Provincial Engineering and Technology Center for Food Additives Bio-production, Dexing, P. R. China
- Xin Fu
  - School of Food and Biological Engineering, Jiangsu University, Zhenjiang, P. R. China
- Lei Sun
  - School of Food and Biological Engineering, Jiangsu University, Zhenjiang, P. R. China
- Xin-Yi Zan
  - School of Food and Biological Engineering, Jiangsu University, Zhenjiang, P. R. China
- Li-Juan Meng
  - School of Food and Biological Engineering, Jiangsu University, Zhenjiang, P. R. China
- Wen-Jing Sun
  - School of Food and Biological Engineering, Jiangsu University, Zhenjiang, P. R. China
  - Jiangxi Provincial Engineering and Technology Center for Food Additives Bio-production, Dexing, P. R. China
8
Sierra R, Roch M, Moraz M, Prados J, Vuilleumier N, Emonet S, Andrey DO. Contributions of Long-Read Sequencing for the Detection of Antimicrobial Resistance. Pathogens 2024; 13:730. [PMID: 39338921; PMCID: PMC11434816; DOI: 10.3390/pathogens13090730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Revised: 08/20/2024] [Accepted: 08/22/2024] [Indexed: 09/30/2024] Open
Abstract
BACKGROUND: In the context of increasing antimicrobial resistance (AMR), whole-genome sequencing (WGS) of bacteria is considered a highly accurate and comprehensive surveillance method for detecting and tracking the spread of resistant pathogens. Two primary sequencing technologies exist: short-read sequencing (50-300 base pairs) and long-read sequencing (thousands of base pairs). The former, based on Illumina sequencing platforms (ISPs), provides extensive coverage and high accuracy for detecting single nucleotide polymorphisms (SNPs) and small insertions/deletions, but is limited by its read length. The latter, based on platforms such as Oxford Nanopore Technologies (ONT), enables the assembly of genomes, particularly those with repetitive regions and structural variants, although its accuracy has historically been lower.
RESULTS: We performed a head-to-head comparison of these techniques to sequence the K. pneumoniae VS17 isolate, focusing on blaNDM resistance gene alleles in the context of a surveillance program. Discrepancies between the ISP (blaNDM-4 allele identified) and ONT (blaNDM-1 and blaNDM-5 alleles identified) were observed. Conjugation assays and Sanger sequencing, used as the gold standard, confirmed the validity of ONT results. This study demonstrates the importance of long-read or hybrid assemblies for accurate carbapenemase resistance gene identification and highlights the limitations of short reads in the context of gene duplications or multiple alleles.
CONCLUSIONS: In this proof-of-concept study, we conclude that recent long-read sequencing technology may outperform standard short-read sequencing for the accurate identification of carbapenemase alleles. Such information is crucial given the rising prevalence of strains producing multiple carbapenemases, especially as WGS is increasingly used for epidemiological surveillance and infection control.
Affiliation(s)
- Roberto Sierra
  - Infectious Diseases Division, Department of Medicine, Geneva University Hospitals and Faculty of Medicine, 1205 Geneva, Switzerland
  - Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, 1206 Geneva, Switzerland
  - Division of Laboratory Medicine, Diagnostics Department, Geneva University Hospitals and Faculty of Medicine, 1205 Geneva, Switzerland
- Mélanie Roch
  - Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, 1206 Geneva, Switzerland
- Milo Moraz
  - Infectious Diseases Division, Institut Central des Hôpitaux (ICH), Valais Hospital, 1951 Sion, Switzerland
- Julien Prados
  - Bioinformatics Support Platform, Faculty of Medicine, University of Geneva, 1206 Geneva, Switzerland
- Nicolas Vuilleumier
  - Division of Laboratory Medicine, Diagnostics Department, Geneva University Hospitals and Faculty of Medicine, 1205 Geneva, Switzerland
- Stéphane Emonet
  - Infectious Diseases Division, Department of Medicine, Geneva University Hospitals and Faculty of Medicine, 1205 Geneva, Switzerland
  - Infectious Diseases Division, Institut Central des Hôpitaux (ICH), Valais Hospital, 1951 Sion, Switzerland
- Diego O. Andrey
  - Infectious Diseases Division, Department of Medicine, Geneva University Hospitals and Faculty of Medicine, 1205 Geneva, Switzerland
  - Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, 1206 Geneva, Switzerland
  - Division of Laboratory Medicine, Diagnostics Department, Geneva University Hospitals and Faculty of Medicine, 1205 Geneva, Switzerland
9
Kasianova AM, Penin AA, Schelkunov MI, Kasianov AS, Logacheva MD, Klepikova AV. Trans2express - de novo transcriptome assembly pipeline optimized for gene expression analysis. Plant Methods 2024; 20:128. [PMID: 39152473; PMCID: PMC11330051; DOI: 10.1186/s13007-024-01255-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 08/01/2024] [Indexed: 08/19/2024]
Abstract
BACKGROUND: As the genomes of many eukaryotic species, especially plants, are large and complex, their de novo sequencing and assembly is still a difficult task despite progress in sequencing technologies. An alternative to genome assembly is the assembly of the transcriptome, the set of RNA products of the expressed genes. While a number of de novo transcriptome assemblers exist, the challenges of transcriptomes (the existence of isoforms, the uneven expression levels across genes) complicate the generation of high-quality assemblies suitable for downstream analyses.
RESULTS: We developed Trans2express, a web-based tool and a pipeline for de novo hybrid transcriptome assembly and postprocessing based on rnaSPAdes with a set of subsequent filtrations. The pipeline was tested on Arabidopsis thaliana cDNA sequencing data obtained using Illumina and Oxford Nanopore Technologies platforms and on three non-model plant species. Comparison of the structural characteristics of the transcriptome assembly with the reference Arabidopsis genome revealed the high quality of the assembled transcriptome, with 86.1% of Arabidopsis expressed genes assembled as a single contig. We tested the applicability of the transcriptome assembly for gene expression analysis. For both Arabidopsis and the non-model species, the results showed high congruence of gene expression levels and sets of differentially expressed genes between analyses based on the genome and those based on the transcriptome assembly.
CONCLUSIONS: We present Trans2express, a protocol for de novo hybrid transcriptome assembly aimed at recovering a single transcript per gene. We expect this protocol to promote the characterization of transcriptomes and gene expression analysis in non-model plants, and the web-based tool to be of use to a wide range of plant biologists.
Affiliation(s)
- Aleksandra M Kasianova
  - Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia
  - Skolkovo Institute of Science and Technology, Moscow, Russia
- Aleksey A Penin
  - Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia
- Mikhail I Schelkunov
  - Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia
  - Skolkovo Institute of Science and Technology, Moscow, Russia
- Artem S Kasianov
  - Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia
- Maria D Logacheva
  - Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia
  - Skolkovo Institute of Science and Technology, Moscow, Russia
- Anna V Klepikova
  - Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia
10
Valentin-Alvarado LE, Appler KE, De Anda V, Schoelmerich MC, West-Roberts J, Kivenson V, Crits-Christoph A, Ly L, Sachdeva R, Greening C, Savage DF, Baker BJ, Banfield JF. Asgard archaea modulate potential methanogenesis substrates in wetland soil. Nat Commun 2024; 15:6384. [PMID: 39085194; PMCID: PMC11291895; DOI: 10.1038/s41467-024-49872-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Accepted: 06/20/2024] [Indexed: 08/02/2024] Open
Abstract
The roles of Asgard archaea in eukaryogenesis and marine biogeochemical cycles are well studied, yet their contributions in soil ecosystems remain unknown. Of particular interest are Asgard archaeal contributions to methane cycling in wetland soils. To investigate this, we reconstructed two complete genomes for soil-associated Atabeyarchaeia, a new Asgard lineage, and a complete genome of Freyarchaeia, and predicted their metabolism in situ. Metatranscriptomics reveals expression of genes for [NiFe]-hydrogenases, pyruvate oxidation and carbon fixation via the Wood-Ljungdahl pathway. Also expressed are genes encoding enzymes for amino acid metabolism, anaerobic aldehyde oxidation, hydrogen peroxide detoxification and carbohydrate breakdown to acetate and formate. Overall, soil-associated Asgard archaea are predicted to include non-methanogenic acetogens, highlighting their potential role in carbon cycling in terrestrial environments.
Affiliation(s)
- Luis E Valentin-Alvarado
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
- Kathryn E Appler
  - Department of Marine Science, University of Texas at Austin; Marine Science Institute, Port Aransas, TX, USA
- Valerie De Anda
  - Department of Marine Science, University of Texas at Austin; Marine Science Institute, Port Aransas, TX, USA
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
- Marie C Schoelmerich
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
  - Department of Environmental Systems Sciences; ETH Zürich, Zürich, Switzerland
- Jacob West-Roberts
  - Environmental Science, Policy and Management, University of California, Berkeley, CA, USA
- Veronika Kivenson
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
- Alexander Crits-Christoph
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
  - Cultivarium, Watertown, MA, USA
- Lynn Ly
  - Oxford Nanopore Technologies Inc, New York, NY, USA
- Rohan Sachdeva
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
- Chris Greening
  - Department of Microbiology, Biomedicine Discovery Institute; Monash University, Clayton, VIC, Australia
  - Securing Antarctica's Environmental Future, Monash University, Clayton, VIC, Australia
- David F Savage
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, California, USA
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, USA
- Brett J Baker
  - Department of Marine Science, University of Texas at Austin; Marine Science Institute, Port Aransas, TX, USA
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
- Jillian F Banfield
  - Innovative Genomics Institute, University of California, Berkeley, California, USA
  - Environmental Science, Policy and Management, University of California, Berkeley, CA, USA
  - Department of Microbiology, Biomedicine Discovery Institute; Monash University, Clayton, VIC, Australia
  - Earth and Planetary Science, University of California, Berkeley, CA, USA
11
Liu C, Wu P, Wu X, Zhao X, Chen F, Cheng X, Zhu H, Wang O, Xu M. AsmMix: an efficient haplotype-resolved hybrid de novo genome assembling pipeline. Front Genet 2024; 15:1421565. [PMID: 39130747; PMCID: PMC11310137; DOI: 10.3389/fgene.2024.1421565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 07/05/2024] [Indexed: 08/13/2024] Open
Abstract
Accurate haplotyping facilitates distinguishing allele-specific expression, identifying cis-regulatory elements, and characterizing genomic variations, which enables more precise investigations into the relationship between genotype and phenotype. Recent advances in third-generation single-molecule long-read and synthetic co-barcoded read sequencing techniques have harnessed long-range information to simplify the assembly graph and improve the assembled genomic sequence. However, it remains methodologically challenging to reconstruct complete haplotypes due to the high sequencing error rates of long reads and the limited capturing efficiency of co-barcoded reads. Here we present a pipeline, AsmMix, for generating diploid genomes that are both contiguous and accurate. It first assembles co-barcoded reads to generate accurate haplotype-resolved assemblies that may contain many gaps, while the long-read assembly is contiguous but susceptible to errors. The two assembly sets are then integrated into haplotype-resolved assemblies with fewer misassemblies. Through extensive evaluation on multiple synthetic datasets, AsmMix consistently demonstrates high precision and recall rates for haplotyping across diverse sequencing platforms, coverage depths, read lengths, and read accuracies, significantly outperforming other existing tools in the field. Furthermore, we validate the effectiveness of our pipeline using a human whole genome dataset (HG002), and produce highly contiguous, accurate, and haplotype-resolved assemblies. These assemblies are evaluated using the GIAB benchmarks, confirming the accuracy of variant calling. Our results demonstrate that AsmMix offers a straightforward yet highly efficient approach that effectively leverages both long reads and co-barcoded reads for haplotype-resolved assembly.
Affiliation(s)
- Chao Liu
  - BGI, Tianjin, China
  - BGI Research, Shenzhen, China
- Pei Wu
  - BGI, Tianjin, China
  - BGI Research, Shenzhen, China
- Xue Wu
  - BGI Research, Shenzhen, China
- Hongmei Zhu
  - BGI, Tianjin, China
  - BGI Research, Shenzhen, China
- Ou Wang
  - BGI Research, Shenzhen, China
- Mengyang Xu
  - BGI Research, Shenzhen, China
  - BGI Research, Qingdao, China
12
|
Luan T, Commichaux S, Hoffmann M, Jayeola V, Jang JH, Pop M, Rand H, Luo Y. Benchmarking short and long read polishing tools for nanopore assemblies: achieving near-perfect genomes for outbreak isolates. BMC Genomics 2024; 25:679. [PMID: 38978005 PMCID: PMC11232133 DOI: 10.1186/s12864-024-10582-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 07/01/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND Oxford Nanopore provides high throughput sequencing platforms able to reconstruct complete bacterial genomes with 99.95% accuracy. However, even small levels of error can obscure the phylogenetic relationships between closely related isolates. Polishing tools have been developed to correct these errors, but it is uncertain if they obtain the accuracy needed for the high-resolution source tracking of foodborne illness outbreaks.
RESULTS We tested 132 combinations of assembly and short- and long-read polishing tools to assess their accuracy for reconstructing the genome sequences of 15 highly similar Salmonella enterica serovar Newport isolates from a 2020 onion outbreak. While long-read polishing alone improved accuracy, near perfect accuracy (99.9999% accuracy or ~5 nucleotide errors across the 4.8 Mbp genome, excluding low confidence regions) was only obtained by pipelines that combined both long- and short-read polishing tools. Notably, medaka was a more accurate and efficient long-read polisher than Racon. Among short-read polishers, NextPolish showed the highest accuracy, but Pilon, Polypolish, and POLCA performed similarly. Among the 5 best performing pipelines, polishing with medaka followed by NextPolish was the most common combination. Importantly, the order of polishing tools mattered, i.e., using less accurate tools after more accurate ones introduced errors. Indels in homopolymers and repetitive regions, where the short reads could not be uniquely mapped, remained the most challenging errors to correct.
CONCLUSIONS Short reads are still needed to correct errors in nanopore sequenced assemblies to obtain the accuracy required for source tracking investigations. Our granular assessment of the performance of the polishing pipelines allowed us to suggest best practices for tool users and areas for improvement for tool developers.
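As a rough check on the accuracy figures quoted in this abstract, the sketch below converts a per-base accuracy into an expected error count for a 4.8 Mbp genome and into a Phred-scaled quality value. The function names are illustrative and are not taken from the paper, and the calculation assumes errors are counted per base over the whole genome, whereas the authors excluded low-confidence regions.

```python
import math


def expected_errors(genome_size_bp: int, accuracy: float) -> float:
    """Expected number of erroneous bases for a given per-base accuracy."""
    return genome_size_bp * (1.0 - accuracy)


def phred_quality(accuracy: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(error rate)."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite Q value")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    genome_size = 4_800_000  # 4.8 Mbp, the genome size given in the abstract
    for accuracy in (0.9995, 0.999999):
        errors = expected_errors(genome_size, accuracy)
        q_value = phred_quality(accuracy)
        print(f"accuracy={accuracy:.6f}  errors={errors:.1f}  Q={q_value:.1f}")
```

Under these assumptions, 99.95% accuracy corresponds to roughly 2,400 erroneous bases across the genome, while 99.9999% corresponds to about 4.8, in line with the "~5 nucleotide errors" stated above.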
Affiliation(s)
- Tu Luan
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | - Seth Commichaux
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, Laurel, MD, 20708, USA.
| | - Maria Hoffmann
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, 20740, USA
| | - Victor Jayeola
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, 20740, USA
| | - Jae Hee Jang
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, 20740, USA
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | - Hugh Rand
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, 20740, USA
| | - Yan Luo
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, 20740, USA
| |
|
13
|
Agustinho DP, Fu Y, Menon VK, Metcalf GA, Treangen TJ, Sedlazeck FJ. Unveiling microbial diversity: harnessing long-read sequencing technology. Nat Methods 2024; 21:954-966. [PMID: 38689099 PMCID: PMC11955098 DOI: 10.1038/s41592-024-02262-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 09/08/2022] [Accepted: 03/29/2024] [Indexed: 05/02/2024]
Abstract
Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements stem not only from improvements in sequencing accuracy but also from rapidly changing analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.
Affiliation(s)
- Daniel P Agustinho
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Yilei Fu
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Vipin K Menon
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Senior research project manager, Human Genetics, Genentech, South San Francisco, CA, USA
| | - Ginger A Metcalf
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Department of Bioengineering, Rice University, Houston, TX, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
|
14
|
Kim J, Steinegger M. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA. Nat Methods 2024; 21:971-973. [PMID: 38769467 DOI: 10.1038/s41592-024-02273-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Accepted: 04/11/2024] [Indexed: 05/22/2024]
Abstract
Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli (https://metabuli.steineggerlab.com), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.
Affiliation(s)
- Jaebeom Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
| | - Martin Steinegger
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea.
- Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea.
| |
|
15
|
Liu-Wei W, van der Toorn W, Bohn P, Hölzer M, Smyth RP, von Kleist M. Sequencing accuracy and systematic errors of nanopore direct RNA sequencing. BMC Genomics 2024; 25:528. [PMID: 38807060 PMCID: PMC11134706 DOI: 10.1186/s12864-024-10440-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2024] [Accepted: 05/21/2024] [Indexed: 05/30/2024] Open
Abstract
BACKGROUND Direct RNA sequencing (dRNA-seq) on the Oxford Nanopore Technologies (ONT) platforms can produce reads covering up to full-length gene transcripts, while containing decipherable information about RNA base modifications and poly-A tail lengths. Although many published studies have been expanding the potential of dRNA-seq, its sequencing accuracy and error patterns remain understudied.
RESULTS We present the first comprehensive evaluation of sequencing accuracy and characterisation of systematic errors in dRNA-seq data from diverse organisms and synthetic in vitro transcribed RNAs. We found that for sequencing kits SQK-RNA001 and SQK-RNA002, the median read accuracy ranged from 87% to 92% across species, and deletions significantly outnumbered mismatches and insertions. Due to their high abundance in the transcriptome, heteropolymers and short homopolymers were the major contributors to the overall sequencing errors. We also observed systematic biases across all species at the levels of single nucleotides and motifs. In general, cytosine/uracil-rich regions were more likely to be erroneous than guanines and adenines. By examining raw signal data, we identified the underlying signal-level features potentially associated with the error patterns and their dependency on sequence contexts. While read quality scores can be used to approximate error rates at base and read levels, failure to detect DNA adapters may be a source of errors and data loss. By comparing distinct basecallers, we reason that some sequencing errors are attributable to signal insufficiency rather than algorithmic (basecalling) artefacts. Lastly, we generated dRNA-seq data using the latest SQK-RNA004 sequencing kit released at the end of 2023 and found that although the overall read accuracy increased, the systematic errors remain largely identical compared to the previous kits.
CONCLUSIONS As the first systematic investigation of dRNA-seq errors, this study offers a comprehensive overview of reproducible error patterns across diverse datasets, identifies potential signal-level insufficiency, and lays the foundation for error correction methods.
Affiliation(s)
- Wang Liu-Wei
- Systems Medicine of Infectious Disease (P5), Robert Koch Institute, Berlin, Germany.
- International Max-Planck Research School 'Biology and Computation', Max-Planck Institute for Molecular Genetics, Berlin, Germany.
- Department of Mathematics and Computer Science, Freie Universität, Berlin, Germany.
| | - Wiep van der Toorn
- Systems Medicine of Infectious Disease (P5), Robert Koch Institute, Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität, Berlin, Germany
| | - Patrick Bohn
- Helmholtz Institute for RNA-based Infection Research, Helmholtz Centre for Infection Research, Würzburg, Germany
| | - Martin Hölzer
- Genome Competence Center (MF1), Robert Koch Institute, Berlin, Germany
| | - Redmond P Smyth
- Helmholtz Institute for RNA-based Infection Research, Helmholtz Centre for Infection Research, Würzburg, Germany
- Faculty of Medicine, University of Würzburg, Würzburg, Germany
| | - Max von Kleist
- Systems Medicine of Infectious Disease (P5), Robert Koch Institute, Berlin, Germany.
- Department of Mathematics and Computer Science, Freie Universität, Berlin, Germany.
| |
|
16
|
de Jong TV, Pan Y, Rastas P, Munro D, Tutaj M, Akil H, Benner C, Chen D, Chitre AS, Chow W, Colonna V, Dalgard CL, Demos WM, Doris PA, Garrison E, Geurts AM, Gunturkun HM, Guryev V, Hourlier T, Howe K, Huang J, Kalbfleisch T, Kim P, Li L, Mahaffey S, Martin FJ, Mohammadi P, Ozel AB, Polesskaya O, Pravenec M, Prins P, Sebat J, Smith JR, Solberg Woods LC, Tabakoff B, Tracey A, Uliano-Silva M, Villani F, Wang H, Sharp BM, Telese F, Jiang Z, Saba L, Wang X, Murphy TD, Palmer AA, Kwitek AE, Dwinell MR, Williams RW, Li JZ, Chen H. A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats. CELL GENOMICS 2024; 4:100527. [PMID: 38537634 PMCID: PMC11019364 DOI: 10.1016/j.xgen.2024.100527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 12/26/2023] [Accepted: 02/29/2024] [Indexed: 04/09/2024]
Abstract
The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors by approximately 9-fold, and increases contiguity by 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomics datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.
Affiliation(s)
- Tristan V de Jong
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Yanchao Pan
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Pasi Rastas
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Daniel Munro
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA; Department of Integrative Structural and Computational Biology, Scripps Research, San Diego, CA, USA
| | - Monika Tutaj
- Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Huda Akil
- Michigan Neuroscience Institute, University of Michigan, Ann Arbor, MI, USA
| | - Chris Benner
- Department of Medicine, University of California San Diego, San Diego, CA, USA
| | - Denghui Chen
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA
| | - Apurva S Chitre
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA
| | - William Chow
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Vincenza Colonna
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy; Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Clifton L Dalgard
- Department of Anatomy, Physiology & Genetics, The American Genome Center, Uniformed Services University of the Health Sciences, Bethesda, MD, USA
| | - Wendy M Demos
- Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Peter A Doris
- The Brown Foundation Institute of Molecular Medicine, Center for Human Genetics, University of Texas Health Science Center, Houston, TX, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Aron M Geurts
- Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Hakan M Gunturkun
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Victor Guryev
- Genome Structure and Ageing, University of Groningen, UMC, Groningen, the Netherlands
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
| | - Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | - Jun Huang
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ted Kalbfleisch
- Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Louisville, KY, USA
| | - Panjun Kim
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ling Li
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA; Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, USA
| | - Spencer Mahaffey
- Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
| | - Pejman Mohammadi
- Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA; Department of Pediatrics, University of Washington School of Medicine, Seattle, WA, USA
| | - Ayse Bilge Ozel
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Oksana Polesskaya
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA
| | - Michal Pravenec
- Institute of Physiology, Czech Academy of Sciences, Prague, Czechia
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jonathan Sebat
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA
| | - Jennifer R Smith
- Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Leah C Solberg Woods
- Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, USA
| | - Boris Tabakoff
- Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Alan Tracey
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
| | | | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Hongyang Wang
- Department of Animal Sciences, Washington State University, Pullman, WA, USA
| | - Burt M Sharp
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Francesca Telese
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA
| | - Zhihua Jiang
- Department of Animal Sciences, Washington State University, Pullman, WA, USA
| | - Laura Saba
- Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Xusheng Wang
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA; Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Abraham A Palmer
- Department of Psychiatry, University of California San Diego, San Diego, CA, USA; Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Anne E Kwitek
- Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Melinda R Dwinell
- Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Robert W Williams
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Jun Z Li
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
| | - Hao Chen
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA.
| |
|
17
|
Cook R, Telatin A, Hsieh SY, Newberry F, Tariq MA, Baker DJ, Carding SR, Adriaenssens EM. Nanopore and Illumina sequencing reveal different viral populations from human gut samples. Microb Genom 2024; 10:001236. [PMID: 38683195 PMCID: PMC11092197 DOI: 10.1099/mgen.0.001236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 03/18/2024] [Indexed: 05/01/2024] Open
Abstract
The advent of viral metagenomics, or viromics, has improved our knowledge and understanding of global viral diversity. High-throughput sequencing technologies enable explorations of the ecological roles, contributions to host metabolism, and the influence of viruses in various environments, including the human intestinal microbiome. However, bacterial metagenomic studies frequently have the advantage. The adoption of advanced technologies like long-read sequencing has the potential to be transformative in refining viromics and metagenomics. Here, we examined the effectiveness of long-read and hybrid sequencing by comparing Illumina short-read and Oxford Nanopore Technology (ONT) long-read sequencing technologies and different assembly strategies on recovering viral genomes from human faecal samples. Our findings showed that if a single sequencing technology is to be chosen for virome analysis, Illumina is preferable due to its superior ability to recover fully resolved viral genomes and minimise erroneous genomes. While ONT assemblies were effective in recovering viral diversity, the challenges related to input requirements and the necessity for amplification made it less ideal as a standalone solution. However, using a combined, hybrid approach enabled a more authentic representation of viral diversity to be obtained within samples.
Affiliation(s)
- Ryan Cook
- Quadram Institute Bioscience, Norwich, NR4 7UQ, UK
| | | | | | - Fiona Newberry
- Department of Biosciences, Nottingham Trent University, Nottingham, NG11 8NS, UK
| | - Mohammad A. Tariq
- Faculty of Health and Life Sciences, University of Northumbria, Newcastle upon Tyne, NE1 8ST, UK
| | | | - Simon R. Carding
- Quadram Institute Bioscience, Norwich, NR4 7UQ, UK
- Norwich Medical School, University of East Anglia, Norwich, NR4 7TJ, UK
| | | |
|
18
|
Menzel P. Snakemake workflows for long-read bacterial genome assembly and evaluation. GIGABYTE 2024; 2024:gigabyte116. [PMID: 38591001 PMCID: PMC11000499 DOI: 10.46471/gigabyte.116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 03/22/2024] [Indexed: 04/10/2024] Open
Abstract
With the advancement of long-read sequencing technologies and their increasing use for bacterial genomics, several methods for generating genome assemblies from error-prone long reads have been developed. These are complemented by various tools for assembly polishing using either long reads, short reads, or reference genomes. End users are therefore left with a plethora of possible combinations of programs for obtaining a final trusted assembly. Hence, there is also a need to measure the completeness and accuracy of such assemblies, for which, again, several evaluation methods implemented in various programs are available. In order to automatically run multiple genome assembly and evaluation programs at once, I developed two workflows for the workflow management system Snakemake, which provide end users with an easy-to-run solution for testing various genome assemblies from their sequencing data. Both workflows use the conda packaging system, so there is no need for manual installation of each program. Availability & Implementation: The workflows are available as open source software under the MIT license at github.com/pmenzel/ont-assembly-snake and github.com/pmenzel/score-assemblies.
Affiliation(s)
- Peter Menzel
- Labor Berlin - Charité Vivantes GmbH, Sylter Str. 2, 13353, Berlin, Germany
| |
|
19
|
Ángeles-Argáiz RE, Aguirre-Beltrán LFL, Hernández-Oaxaca D, Quintero-Corrales C, Trujillo-Roldán MA, Castillo-Ramírez S, Garibay-Orijel R. Assembly collapsing versus heterozygosity oversizing: detection of homokaryotic and heterokaryotic Laccaria trichodermophora strains by hybrid genome assembly. Microb Genom 2024; 10:001218. [PMID: 38529901 PMCID: PMC10995626 DOI: 10.1099/mgen.0.001218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 03/01/2024] [Indexed: 03/27/2024] Open
Abstract
Genome assembly and annotation using short paired reads is challenging for eukaryotic organisms due to their large size, variable ploidy and large number of repetitive elements. The use of single-molecule long reads improves assembly quality (completeness and contiguity), but haplotype duplications still pose assembly challenges. To address the effect of read length on genome assembly quality, gene prediction and annotation, we compared genome assemblers and sequencing technologies with four strains of the ectomycorrhizal fungus Laccaria trichodermophora. By analysing the predicted repertoire of carbohydrate enzymes, we investigated the effects of assembly quality on functional inferences. Libraries were generated using three different sequencing platforms (Illumina Next-Seq, Mi-Seq and PacBio Sequel), and genomes were assembled using single and hybrid assemblies/libraries. Long reads or hybrid assembly resolved the collapsing of repeated regions, but the nuclear heterozygous versions remained unresolved. In dikaryotic fungi, each cell includes two nuclei and each nucleus has differences not only in allelic gene version but also in gene composition and synteny. These heterokaryotic cells produce fragmentation and size overestimation of the genome assembly of each nucleus. Hybrid assembly revealed a wider functional diversity of genomes. Here, several predicted oxidizing activities on glycosyl residues of oligosaccharides and several chitooligosaccharide acetylase activities would have passed unnoticed in short-read assemblies. Also, the size and fragmentation of the genome assembly, in combination with heterozygosity analysis, allowed us to distinguish homokaryotic and heterokaryotic strains isolated from L. trichodermophora fruit bodies.
Affiliation(s)
- Rodolfo Enrique Ángeles-Argáiz
- Posgrado en Ciencias Biológicas, Universidad Nacional Autónoma de México, Circuito de los Posgrados s/n, Ciudad Universitaria, Delegación Coyoacán, Ciudad de México, México, C.P. 04510, Mexico
- Instituto de Biología, Universidad Nacional Autónoma de México, Tercer Circuito s/n, Ciudad Universitaria, Delegación Coyoacán, Ciudad de México, México, C.P. 04510, Mexico
- Red de Manejo Biotecnológico de Recursos, Instituto de Ecología A. C. Carretera antigua a Coatepec 351, Col. El Haya, Xalapa, Veracruz, México, C.P. 91612, Mexico
| | - Luis Fernando Lozano Aguirre-Beltrán
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos, México, C.P. 62210, Mexico
| | - Diana Hernández-Oaxaca
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos, México, C.P. 62210, Mexico
- Red de Biodiversidad y Sistemática, Instituto de Ecología A. C. Carretera antigua a Coatepec 351, Col. El Haya, Xalapa, Veracruz, México, C.P. 91073, Mexico
| | - Christian Quintero-Corrales
- Posgrado en Ciencias Biológicas, Universidad Nacional Autónoma de México, Circuito de los Posgrados s/n, Ciudad Universitaria, Delegación Coyoacán, Ciudad de México, México, C.P. 04510, Mexico
- Instituto de Biología, Universidad Nacional Autónoma de México, Tercer Circuito s/n, Ciudad Universitaria, Delegación Coyoacán, Ciudad de México, México, C.P. 04510, Mexico
| | - Mauricio A. Trujillo-Roldán
- Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Tercer Circuito s/n, Ciudad Universitaria, Delegación Coyoacán, Ciudad de México, México, C.P. 04510, Mexico
- Centro de Nanociencias y Nanotecnología, Universidad Nacional Autónoma de México, Km 107 carretera Tijuana-Ensenada, Ensenada, Baja California, Mexico, C.P. 22860, Mexico
| | - Santiago Castillo-Ramírez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos, México, C.P. 62210, Mexico
| | - Roberto Garibay-Orijel
- Instituto de Biología, Universidad Nacional Autónoma de México, Tercer Circuito s/n, Ciudad Universitaria, Delegación Coyoacán, Ciudad de México, México, C.P. 04510, Mexico
| |
|
20
|
Li O, Hackney JA, Choy DF, Chang D, Nersesian R, Staton TL, Cai F, Toghi Eshghi S. A targeted amplicon next-generation sequencing assay for tryptase genotyping to support personalized therapy in mast cell-related disorders. PLoS One 2024; 19:e0291947. [PMID: 38335181 PMCID: PMC10857577 DOI: 10.1371/journal.pone.0291947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Accepted: 09/09/2023] [Indexed: 02/12/2024] Open
Abstract
Tryptase, the most abundant mast cell granule protein, is elevated in severe asthma patients independent of type 2 inflammation status. Higher active β tryptase allele counts are associated with higher levels of peripheral tryptase and lower clinical benefit from anti-IgE therapies. Tryptase is a therapeutic target of interest in severe asthma and chronic spontaneous urticaria. Active and inactive allele counts may enable stratification to assess response to therapies in asthmatic patient subpopulations. Tryptase gene loci TPSAB1 and TPSB2 have high levels of sequence identity, which makes genotyping a challenging task. Here, we report a targeted next-generation sequencing (NGS) assay and downstream bioinformatics analysis for determining polymorphisms at tryptase TPSAB1 and TPSB2 loci. Machine learning modeling using multiple polymorphisms in the tryptase loci was used to improve the accuracy of genotyping calls. The assay was tested and qualified on DNA extracted from whole blood of healthy donors and asthma patients, achieving accuracy of 96%, 96% and 94% for estimation of inactive α and βIIIFS tryptase alleles and α duplication on TPSAB1, respectively. The reported NGS assay is a cost-effective method that is more efficient than Sanger sequencing and provides coverage to evaluate known as well as unreported tryptase polymorphisms.
Affiliation(s)
- Olga Li
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - Jason A. Hackney
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - David F. Choy
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - Diana Chang
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - Rhea Nersesian
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - Tracy L. Staton
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - Fang Cai
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| | - Shadi Toghi Eshghi
- Genentech Research and Early Development, Genentech, Inc, South San Francisco, CA, United States of America
| |
|
21
|
Silva-Pereira TT, Soler-Camargo NC, Guimarães AMS. Diversification of gene content in the Mycobacterium tuberculosis complex is determined by phylogenetic and ecological signatures. Microbiol Spectr 2024; 12:e0228923. [PMID: 38230932 PMCID: PMC10871547 DOI: 10.1128/spectrum.02289-23] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/12/2023] [Accepted: 12/19/2023] [Indexed: 01/18/2024] Open
Abstract
We analyzed the pan-genome and gene content modulation of the most diverse genome data set of the Mycobacterium tuberculosis complex (MTBC) gathered to date. The closed pan-genome of the MTBC was characterized by reduced accessory and strain-specific genomes, compatible with its clonal nature. However, significantly fewer gene families were shared between MTBC genomes as their phylogenetic distance increased. This effect was only observed in inter-species comparisons, not within-species, which suggests that species-specific ecological characteristics are associated with changes in gene content. Gene loss, resulting from genomic deletions and pseudogenization, was found to drive the variation in gene content. This gene erosion differed among MTBC species and lineages, even within M. tuberculosis, where L2 showed more gene loss than L4. We also show that phylogenetic proximity is not always a good proxy for gene content relatedness in the MTBC, as the gene repertoire of Mycobacterium africanum L6 deviated from its expected phylogenetic niche conservatism. Gene disruptions of virulence factors, represented by pseudogene annotations, are mostly not conserved, being poor predictors of MTBC ecotypes. Each MTBC ecotype carries its own accessory genome, likely influenced by distinct selective pressures such as host and geography. It is important to investigate how gene loss confers new adaptive traits to MTBC strains; the detected heterogeneous gene loss poses a significant challenge in elucidating genetic factors responsible for the diverse phenotypes observed in the MTBC. By detailing specific gene losses, our study serves as a resource for researchers studying the MTBC phenotypes and their immune evasion strategies.
IMPORTANCE
In this study, we analyzed the gene content of different ecotypes of the Mycobacterium tuberculosis complex (MTBC), the pathogens that cause tuberculosis. We found that changes in their gene content are associated with their ecological features, such as host preference. Gene loss was identified as the primary driver of these changes, which can vary even among different strains of the same ecotype. Our study also revealed that the gene content relatedness of these bacteria does not always mirror their evolutionary relationships. In addition, some virulence genes can be variably lost among strains of the same MTBC ecotype, likely helping them to evade the immune system. Overall, our study highlights the importance of understanding how gene loss can lead to new adaptations in these bacteria and how different selective pressures may influence their genetic makeup.
Affiliation(s)
- Taiana Tainá Silva-Pereira
- Laboratory of Applied Research in Mycobacteria, Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - Naila Cristina Soler-Camargo
- Laboratory of Applied Research in Mycobacteria, Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Department of Preventive Veterinary Medicine and Animal Health, School of Veterinary Medicine and Animal Sciences, University of São Paulo, São Paulo, Brazil
| | - Ana Marcia Sá Guimarães
- Laboratory of Applied Research in Mycobacteria, Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| |
|
22
|
Cook R, Brown N, Rihtman B, Michniewski S, Redgwell T, Clokie M, Stekel DJ, Chen Y, Scanlan DJ, Hobman JL, Nelson A, Jones MA, Smith D, Millard A. The long and short of it: benchmarking viromics using Illumina, Nanopore and PacBio sequencing technologies. Microb Genom 2024; 10:001198. [PMID: 38376377 PMCID: PMC10926689 DOI: 10.1099/mgen.0.001198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2023] [Accepted: 01/25/2024] [Indexed: 02/21/2024] Open
Abstract
Viral metagenomics has fuelled a rapid change in our understanding of global viral diversity and ecology. Long-read sequencing and hybrid assembly approaches that combine long- and short-read technologies are now being widely implemented in bacterial genomics and metagenomics. However, the use of long-read sequencing to investigate viral communities is still in its infancy. While Nanopore and PacBio technologies have been applied to viral metagenomics, it is not known to what extent different technologies will impact the reconstruction of the viral community. Thus, we constructed a mock bacteriophage community of previously sequenced phage genomes, sequenced it using Illumina, Nanopore and PacBio sequencing technologies, and tested a number of different assembly approaches. When using a single sequencing technology, Illumina assemblies were the best at recovering phage genomes. Nanopore- and PacBio-only assemblies performed poorly in comparison to Illumina in both genome recovery and error rates, which both varied with the assembler used. The best Nanopore assembly had errors that manifested as SNPs and INDELs at frequencies 41 and 157 % higher than found in Illumina-only assemblies, respectively, while the best PacBio assemblies had SNPs and INDELs at frequencies 12 and 78 % higher than found in Illumina-only assemblies, respectively. Despite high read coverage, long-read-only assemblies recovered a maximum of one complete genome from any assembly, unless reads were down-sampled prior to assembly. Overall, the best approach was assembly by a combination of Illumina and Nanopore reads, which reduced error rates to levels comparable with short-read-only assemblies. When using a single technology, Illumina only was the best approach. The differences in genome recovery and error rates between technology and assembler had downstream impacts on gene prediction, viral prediction, and subsequent estimates of diversity within a sample. These findings will provide a starting point for others in the choice of reads and assembly algorithms for the analysis of viromes.
Affiliation(s)
- Ryan Cook
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, College Road, Loughborough, Leicestershire, LE12 5RD, UK
| | - Nathan Brown
- Centre for Phage Research, Dept Genetics and Genome Biology, University of Leicester, University Road, Leicester, Leicestershire, LE1 7RH, UK
| | - Branko Rihtman
- School of Life Sciences, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK
| | - Slawomir Michniewski
- Warwick Medical School, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK
| | - Tamsin Redgwell
- COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Ledreborg Alle 34, 2820, Gentofte, Denmark
| | - Martha Clokie
- Centre for Phage Research, Dept Genetics and Genome Biology, University of Leicester, University Road, Leicester, Leicestershire, LE1 7RH, UK
| | - Dov J. Stekel
- School of Biosciences, University of Nottingham, Sutton Bonington Campus, College Road, Loughborough, Leicestershire, LE12 5RD, UK
- Department of Mathematics and Applied Mathematics, University of Johannesburg, Rossmore 2029, South Africa
| | - Yin Chen
- School of Life Sciences, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK
| | - David J. Scanlan
- School of Life Sciences, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL, UK
| | - Jon L. Hobman
- School of Biosciences, University of Nottingham, Sutton Bonington Campus, College Road, Loughborough, Leicestershire, LE12 5RD, UK
| | - Andrew Nelson
- Faculty of Health and Life Sciences, University of Northumbria, Newcastle upon Tyne, NE1 8ST, UK
| | - Michael A. Jones
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, College Road, Loughborough, Leicestershire, LE12 5RD, UK
| | - Darren Smith
- Faculty of Health and Life Sciences, University of Northumbria, Newcastle upon Tyne, NE1 8ST, UK
| | - Andrew Millard
- Centre for Phage Research, Dept Genetics and Genome Biology, University of Leicester, University Road, Leicester, Leicestershire, LE1 7RH, UK
| |
|
23
|
Cerk K, Ugalde-Salas P, Nedjad CG, Lecomte M, Muller C, Sherman DJ, Hildebrand F, Labarthe S, Frioux C. Community-scale models of microbiomes: Articulating metabolic modelling and metagenome sequencing. Microb Biotechnol 2024; 17:e14396. [PMID: 38243750 PMCID: PMC10832553 DOI: 10.1111/1751-7915.14396] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/09/2023] [Revised: 11/27/2023] [Accepted: 12/20/2023] [Indexed: 01/21/2024] Open
Abstract
Building models is essential for understanding the functions and dynamics of microbial communities. Metabolic models built on genome-scale metabolic network reconstructions (GENREs) are especially relevant as a means to decipher the complex interactions occurring among species. Model reconstruction increasingly relies on metagenomics, which permits direct characterisation of naturally occurring communities that may contain organisms that cannot be isolated or cultured. In this review, we provide an overview of the field of metabolic modelling and its increasing reliance on and synergy with metagenomics and bioinformatics. We survey the means of assigning functions and reconstructing metabolic networks from (meta-)genomes, and present the variety and mathematical fundamentals of metabolic models that foster the understanding of microbial dynamics. We emphasise the characterisation of interactions and the scaling of model construction to large communities, two important bottlenecks in the applicability of these models. We give an overview of the current state of the art in metagenome sequencing and bioinformatics analysis, focusing on the reconstruction of genomes in microbial communities. Metagenomics benefits tremendously from third-generation sequencing, and we discuss the opportunities of long-read sequencing, strain-level characterisation and eukaryotic metagenomics. We aim at providing algorithmic and mathematical support, together with tool and application resources, that permit bridging the gap between metagenomics and metabolic modelling.
Affiliation(s)
- Klara Cerk
- Quadram Institute Bioscience, Norwich, UK
- Earlham Institute, Norwich, UK
| | | | - Chabname Ghassemi Nedjad
- Inria, University of Bordeaux, INRAE, Talence, France
- University of Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, Talence, France
| | - Maxime Lecomte
- Inria, University of Bordeaux, INRAE, Talence, France
- INRAE STLO, University of Rennes, Rennes, France
| | | | | | - Falk Hildebrand
- Quadram Institute Bioscience, Norwich, UK
- Earlham Institute, Norwich, UK
| | - Simon Labarthe
- Inria, University of Bordeaux, INRAE, Talence, France
- INRAE, University of Bordeaux, BIOGECO, UMR 1202, Cestas, France
| | | |
|
24
|
Nestor BJ, Bayer PE, Fernandez CGT, Edwards D, Finnegan PM. Approaches to increase the validity of gene family identification using manual homology search tools. Genetica 2023; 151:325-338. [PMID: 37817002 PMCID: PMC10692271 DOI: 10.1007/s10709-023-00196-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 10/01/2023] [Indexed: 10/12/2023]
Abstract
Identifying homologs is an important process in the analysis of genetic patterns underlying traits and evolutionary relationships among species. Analysis of gene families is often used to form and support hypotheses on genetic patterns, such as gene presence, absence, or functional divergence, which underlie traits examined in functional studies. These analyses often require precise identification of all members in a targeted gene family. Manual pipelines, in which homology search and orthology assignment tools are used separately, are the most common approach for identifying small gene families where accurate identification of all members is important. The ability to curate sequences between steps in manual pipelines allows for simple and precise identification of all possible gene family members. However, the validity of such manual pipeline analyses is often decreased by inappropriate approaches to homology searches, including overly relaxed or overly stringent statistical thresholds, inappropriate query sequences, homology classification based on sequence similarity alone, and low-quality proteome or genome sequences. In this article, we propose several approaches to mitigate these issues and allow for precise identification of gene family members and support for hypotheses linking genetic patterns to functional traits.
Affiliation(s)
- Benjamin J Nestor
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia.
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia.
| | - Philipp E Bayer
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
| | - Cassandria G Tay Fernandez
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
| | - David Edwards
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
| | - Patrick M Finnegan
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
| |
|
25
|
Koo H, Lee GW, Ko SR, Go S, Kwon SY, Kim YM, Shin AY. Two long read-based genome assembly and annotation of polyploidy woody plants, Hibiscus syriacus L. using PacBio and Nanopore platforms. Sci Data 2023; 10:713. [PMID: 37853021 PMCID: PMC10584963 DOI: 10.1038/s41597-023-02631-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/16/2023] [Accepted: 10/11/2023] [Indexed: 10/20/2023] Open
Abstract
Improvements in long-read DNA sequencing and related techniques have facilitated the assembly of complex eukaryotic genomes. Despite these advances, the quality of constructed plant reference genomes remains relatively poor due to the large size of the genomes, their high content of repetitive sequences, and their wide variety of ploidy. Here, we performed de novo sequencing and assembly of the highly polyploid genome of Hibiscus syriacus, a flowering plant species of the Malvaceae family, using the Oxford Nanopore Technologies and Pacific Biosciences Sequel sequencing platforms. We investigated an efficient combination of a high-quality, high-molecular-weight DNA isolation procedure and a suitable assembler to achieve optimal results using long-read sequencing data. We found that abundant ultra-long reads allow for large and complex polyploid plant genome assemblies with great recovery of repetitive sequences and error correction, even with relatively low-depth Nanopore sequencing data and polishing compared to previous studies. Collectively, our combination provides a cost-effective method to improve genome continuity and quality compared to the previously reported reference genome by accessing highly repetitive regions. The application of this combination may enable genetic research and breeding of polyploid crops, thus leading to improvements in crop production.
Affiliation(s)
- Hyunjin Koo
- Plant Systems Engineering Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
| | - Gir-Won Lee
- SML Genetree Co. Ltd., Seoul, 05855, Republic of Korea
| | - Seo-Rin Ko
- Plant Systems Engineering Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Biosystems and Bioengineering Program, University of Science and Technology, Daejeon, 34113, Korea
| | - Sangjin Go
- Plant Systems Engineering Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Biosystems and Bioengineering Program, University of Science and Technology, Daejeon, 34113, Korea
| | - Suk-Yoon Kwon
- Plant Systems Engineering Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Biosystems and Bioengineering Program, University of Science and Technology, Daejeon, 34113, Korea
| | - Yong-Min Kim
- Plant Systems Engineering Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea.
- Department of Bioinformatics, KRIBB School of Bioscience, Korea University of Science and Technology (UST), Daejeon, 34141, Republic of Korea.
- Digital Biotech Innovation Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea.
| | - Ah-Young Shin
- Plant Systems Engineering Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea.
- Department of Bioinformatics, KRIBB School of Bioscience, Korea University of Science and Technology (UST), Daejeon, 34141, Republic of Korea.
| |
|
26
|
Li K, Xu P, Wang J, Yi X, Jiao Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat Commun 2023; 14:6556. [PMID: 37848433 PMCID: PMC10582259 DOI: 10.1038/s41467-023-42336-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Accepted: 10/05/2023] [Indexed: 10/19/2023] Open
Abstract
Assembly of a high-quality genome is important for downstream comparative and functional genomic studies. However, most tools for genome assembly assessment only give qualitative reports, which do not pinpoint assembly errors at specific regions. Here, we develop a new reference-free tool, Clipping information for Revealing Assembly Quality (CRAQ), which maps raw reads back to assembled sequences to identify regional and structural assembly errors based on effective clipped alignment information. Error counts are transformed into corresponding assembly evaluation indexes to reflect the assembly quality at single-nucleotide resolution. Notably, CRAQ distinguishes assembly errors from heterozygous sites or structural differences between haplotypes. This tool can clearly indicate low-quality regions and potential structural error breakpoints; thus, it can identify misjoined regions that should be split for further scaffold building and improvement of the assembly. We have benchmarked CRAQ on multiple genomes assembled using different strategies, and demonstrated the misjoin correction for improving the constructed pseudomolecules.
Affiliation(s)
- Kunpeng Li
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Peng Xu
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jinpeng Wang
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Xin Yi
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- China National Botanical Garden, Beijing, China
| | - Yuannian Jiao
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
- China National Botanical Garden, Beijing, China.
| |
|
27
|
Mochizuki T, Sakamoto M, Tanizawa Y, Nakayama T, Tanifuji G, Kamikawa R, Nakamura Y. A practical assembly guideline for genomes with various levels of heterozygosity. Brief Bioinform 2023; 24:bbad337. [PMID: 37798248 PMCID: PMC10555665 DOI: 10.1093/bib/bbad337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Revised: 08/06/2023] [Accepted: 09/03/2023] [Indexed: 10/07/2023] Open
Abstract
Although current long-read sequencing technologies produce long reads that facilitate genome assembly, they have high sequencing error rates. While various assemblers with different perspectives have been developed, no systematic evaluation of assemblers with long reads for diploid genomes with varying heterozygosity has been performed. Here, we evaluated a series of processes, including the estimation of genome characteristics such as genome size and heterozygosity, de novo assembly, polishing, and removal of allelic contigs, using six genomes with various heterozygosity levels. We evaluated five long-read-only assemblers (Canu, Flye, miniasm, NextDenovo and Redbean) and five hybrid assemblers that combine short and long reads (HASLR, MaSuRCA, Platanus-allee, SPAdes and WENGAN) and proposed a concrete guideline for the construction of haplotype representation according to the degree of heterozygosity, followed by polishing and purging haplotigs, using stable and high-performance assemblers: Redbean, Flye and MaSuRCA.
Affiliation(s)
| | - Mika Sakamoto
- Genome Informatics Laboratory, National Institute of Genetics
| | | | - Takuro Nakayama
- Division of Life Sciences Center for Computational Sciences, University of Tsukuba, Japan
| | - Goro Tanifuji
- Department of Zoology, National Museum of Nature and Science
| | | | | |
|
28
|
Wang J, Veldsman WP, Fang X, Huang Y, Xie X, Lyu A, Zhang L. Benchmarking multi-platform sequencing technologies for human genome assembly. Brief Bioinform 2023; 24:bbad300. [PMID: 37594299 DOI: 10.1093/bib/bbad300] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/25/2023] [Revised: 07/12/2023] [Accepted: 07/26/2023] [Indexed: 08/19/2023] Open
Abstract
Genome assembly is a computational technique that involves piecing together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for comprehending human biology, and it is also vital for downstream genomic variation analysis. Many efforts have been made over the past few decades to create a complete and gapless reference genome for humans by using a diverse range of advanced sequencing technologies. Several available tools are aimed at enhancing the quality of haploid and diploid human genome assemblies, which include contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, although several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performances in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long-reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long-reads, but they may require further polishing to improve quality. We recommend using short-reads rather than long-reads themselves to improve the base accuracy of contigs from Oxford Nanopore long-reads. Hi-C is the best choice for chromosome-level scaffolding because it can capture the longest-range DNA connectedness compared to 10× linked-reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of genome assembly. For diploid assembly, hifiasm is the best tool when using PacBio HiFi and Hi-C data. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.
Affiliation(s)
- Jingjing Wang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Werner Pieter Veldsman
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | | | | | | | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
| |
|
29
|
Yu R, Abdullah SMU, Sun Y. HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses. Brief Bioinform 2023; 24:bbad264. [PMID: 37478372 PMCID: PMC10516367 DOI: 10.1093/bib/bbad264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2023] [Revised: 06/05/2023] [Accepted: 06/29/2023] [Indexed: 07/23/2023] Open
Abstract
Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite the promising results of available polishing tools, there is still room to improve error correction performance and thereby produce more accurate genome assemblies. Errors, particularly those in coding regions, can hamper analyses such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.
Affiliation(s)
- Runzhou Yu
- Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
| | | | - Yanni Sun
- Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
| |
|
30
|
Baker JL. Illuminating the oral microbiome and its host interactions: recent advancements in omics and bioinformatics technologies in the context of oral microbiome research. FEMS Microbiol Rev 2023; 47:fuad051. [PMID: 37667515 PMCID: PMC10503653 DOI: 10.1093/femsre/fuad051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 08/02/2023] [Accepted: 09/01/2023] [Indexed: 09/06/2023] Open
Abstract
The oral microbiota has an enormous impact on human health, with oral dysbiosis now linked to many oral and systemic diseases. Recent advancements in sequencing, mass spectrometry, bioinformatics, computational biology, and machine learning are revolutionizing oral microbiome research, enabling analysis at an unprecedented scale and level of resolution using omics approaches. This review provides a comprehensive perspective on the current state-of-the-art tools available to perform genomics, metagenomics, phylogenomics, pangenomics, transcriptomics, proteomics, metabolomics, lipidomics, and multi-omics analysis on (all) microbiomes, and then provides examples of how the techniques have been applied to research of the oral microbiome, specifically. Key findings of these studies and remaining challenges for the field are highlighted. Although the methods discussed here are placed in the context of their contributions to oral microbiome research specifically, they are pertinent to the study of any microbiome, and the intended audience of this review includes researchers who would simply like to get an introduction to microbial omics and/or an update on the latest omics methods. Continued research of the oral microbiota using omics approaches is crucial and will lead to dramatic improvements in human health, longevity, and quality of life.
Affiliation(s)
- Jonathon L Baker
- Department of Oral Rehabilitation & Biosciences, School of Dentistry, Oregon Health & Science University, 3181 Sam Jackson Park Road, Portland, OR 97202, United States
- Genomic Medicine Group, J. Craig Venter Institute, La Jolla, CA 92037, United States
- Department of Pediatrics, UC San Diego School of Medicine, La Jolla, CA 92093, United States
| |
|
31
|
Arredondo-Alonso S, Gladstone R, Pöntinen A, Gama J, Schürch A, Lanza V, Johnsen P, Samuelsen Ø, Tonkin-Hill G, Corander J. Mge-cluster: a reference-free approach for typing bacterial plasmids. NAR Genom Bioinform 2023; 5:lqad066. [PMID: 37435357 PMCID: PMC10331934 DOI: 10.1093/nargab/lqad066] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/31/2023] [Revised: 06/08/2023] [Accepted: 06/26/2023] [Indexed: 07/13/2023] Open
Abstract
Extrachromosomal elements of bacterial cells such as plasmids are notorious for their importance in evolution and adaptation to changing ecology. However, high-resolution population-wide analysis of plasmids has only become accessible recently with the advent of scalable long-read sequencing technology. Current typing methods for the classification of plasmids remain limited in their scope, which motivated us to develop a computationally efficient approach to simultaneously recognize novel types and classify plasmids into previously identified groups. Here, we introduce mge-cluster, which can easily handle thousands of input sequences that are compressed using a unitig representation in a de Bruijn graph. Our approach offers a faster runtime than existing algorithms, with moderate memory usage, and enables an intuitive visualization, classification and clustering scheme that users can explore interactively within a single framework. The mge-cluster platform for plasmid analysis can be easily distributed and replicated, enabling consistent labelling of plasmids across past, present, and future sequence collections. We underscore the advantages of our approach by analysing a population-wide plasmid data set obtained from the opportunistic pathogen Escherichia coli, studying the prevalence of the colistin resistance gene mcr-1.1 within the plasmid population, and describing an instance of resistance plasmid transmission within a hospital environment.
Affiliation(s)
- Anna K Pöntinen
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Norwegian National Advisory Unit on Detection of Antimicrobial Resistance, Department of Microbiology and Infection Control, University Hospital of North Norway, Tromsø, Norway
- João A Gama
- Department of Pharmacy, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
- Anita C Schürch
- Department of Medical Microbiology, UMC Utrecht, Utrecht, The Netherlands
- Val F Lanza
- CIBERINFEC, Madrid, Spain
- Bioinformatics Unit, University Hospital Ramón y Cajal, IRYCIS, Madrid, Spain
- Pål Jarle Johnsen
- Department of Pharmacy, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
- Ørjan Samuelsen
- Norwegian National Advisory Unit on Detection of Antimicrobial Resistance, Department of Microbiology and Infection Control, University Hospital of North Norway, Tromsø, Norway
- Department of Pharmacy, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
- Gerry Tonkin-Hill
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Jukka Corander
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Department of Mathematics and Statistics, Helsinki Institute of Information Technology (HIIT), FI-00014 University of Helsinki, Helsinki, Finland
32
Zhang Z, Li C, Li Q, Su X, Li J, Zhu L, Lin XJ, Shen J. Structure prediction of novel isoforms from uveal melanoma by AlphaFold. Sci Data 2023; 10:513. [PMID: 37542084 PMCID: PMC10403560 DOI: 10.1038/s41597-023-02429-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 07/28/2023] [Indexed: 08/06/2023] Open
Abstract
Alternative splicing is an important mechanism that enhances protein functional diversity. To date, our understanding of alternative splicing variants has been based on mRNA transcript data, but due to the difficulty in predicting protein structures, protein tertiary structures have been largely unexplored. However, with the release of AlphaFold, which predicts three-dimensional models of proteins, this challenge is rapidly being overcome. Here, we present a dataset of 315 predicted structures of abnormal isoforms in 18 uveal melanoma patients based on second- and third-generation transcriptome-sequencing data. This information comprises a high-quality set of structural data on recurrent aberrant isoforms that can be used in multiple types of studies, from those aimed at revealing potential therapeutic targets to those aimed at recognizing cancer neoantigens at the atomic level.
Affiliation(s)
- Zhe Zhang
- Department of Ophthalmology, Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Shanghai Key Laboratory of Orbital Diseases and Ocular Oncology, Shanghai, 200025, China
- Institute of Translational Medicine, National Facility for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
- Chen Li
- High Performance Computing Center, Shanghai Jiao Tong University, Shanghai, 200240, China
- Qian Li
- Department of Ophthalmology, Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Shanghai Key Laboratory of Orbital Diseases and Ocular Oncology, Shanghai, 200025, China
- Institute of Translational Medicine, National Facility for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
- Xiaoming Su
- High Performance Computing Center, Shanghai Jiao Tong University, Shanghai, 200240, China
- Jiayi Li
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Lili Zhu
- Songjiang Research Institute and Songjiang Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 201600, China
- Xinhua James Lin
- High Performance Computing Center, Shanghai Jiao Tong University, Shanghai, 200240, China
- Jianfeng Shen
- Department of Ophthalmology, Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Shanghai Key Laboratory of Orbital Diseases and Ocular Oncology, Shanghai, 200025, China
- Institute of Translational Medicine, National Facility for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
33
Ruiz JL, Reimering S, Escobar-Prieto JD, Brancucci NMB, Echeverry DF, Abdi AI, Marti M, Gómez-Díaz E, Otto TD. From contigs towards chromosomes: automatic improvement of long read assemblies (ILRA). Brief Bioinform 2023; 24:bbad248. [PMID: 37406192 PMCID: PMC10359078 DOI: 10.1093/bib/bbad248] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/10/2023] [Revised: 05/24/2023] [Accepted: 06/16/2023] [Indexed: 07/07/2023] Open
Abstract
Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes and they may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are used subsequently to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available at GitHub: https://github.com/ThomasDOtto/ILRA.
Affiliation(s)
- José Luis Ruiz
- Instituto de Parasitología y Biomedicina López-Neyra (IPBLN), Consejo Superior de Investigaciones Científicas, 18016, Granada, Spain
- Susanne Reimering
- Department for Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Nicolas M B Brancucci
- School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK
- Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, 4123 Allschwil, Switzerland
- University of Basel, 4001 Basel, Switzerland
- Diego F Echeverry
- Centro Internacional de Entrenamiento e Investigaciones Médicas (CIDEIM), Cali, Colombia
- Departamento de Microbiología, Facultad de Salud, Universidad del Valle, Cali, Colombia
- Matthias Marti
- School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK
- Elena Gómez-Díaz
- Instituto de Parasitología y Biomedicina López-Neyra (IPBLN), Consejo Superior de Investigaciones Científicas, 18016, Granada, Spain
- Thomas D Otto
- School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK
34
Vuruputoor VS, Monyak D, Fetter KC, Webster C, Bhattarai A, Shrestha B, Zaman S, Bennett J, McEvoy SL, Caballero M, Wegrzyn JL. Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes. Applications in Plant Sciences 2023; 11:e11533. [PMID: 37601314 PMCID: PMC10439824 DOI: 10.1002/aps3.11533] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 10/04/2022] [Revised: 02/04/2023] [Accepted: 02/10/2023] [Indexed: 08/22/2023]
Abstract
Premise: Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions.
Methods: The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.
Results: Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended.
Discussion: While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.
Affiliation(s)
- Vidya S. Vuruputoor
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Daniel Monyak
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Karl C. Fetter
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Cynthia Webster
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Akriti Bhattarai
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Bikash Shrestha
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Sumaira Zaman
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Jeremy Bennett
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Susan L. McEvoy
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Madison Caballero
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
- Jill L. Wegrzyn
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA
35
Velasco-Amo MP, Arias-Giraldo LF, Román-Écija M, Fuente LDL, Marco-Noales E, Moralejo E, Navas-Cortés JA, Landa BB. Complete Circularized Genome Resources of Seven Strains of Xylella fastidiosa subsp. fastidiosa Using Hybrid Assembly Reveals Unknown Plasmids. Phytopathology 2023; 113:1128-1132. [PMID: 36441872 DOI: 10.1094/phyto-10-22-0396-a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Xylella fastidiosa is a vascular plant pathogenic bacterium native to the Americas that is causing significant epidemics and economic losses in olive and almond in Europe, where it is a quarantine pathogen. Since its first detection in 2013 in Italy, mandatory surveys across Europe have also revealed the presence of the bacterium in France, Spain, and Portugal. Combining Oxford Nanopore Technologies and Illumina sequencing data, we assembled high-quality complete genomes of seven X. fastidiosa subsp. fastidiosa strains isolated from different plants in Spain, the United States, and Mexico. Comparative genomic analyses discovered differences in plasmid content among strains, including plasmids that had been overlooked previously when using the Illumina sequencing platform alone. Interestingly, in strain CFBP8073, intercepted in France from plants imported from Mexico, three plasmids were identified, including two (plasmids pXF-P1.CFBP8073 and pXF-P2.CFBP8073) not previously described in X. fastidiosa and one (pXF5823.CFBP8073) almost identical to a plasmid described in an X. fastidiosa strain from citrus. Plasmids found in the Spanish strains here were similar to those described previously in other strains from the same subspecies and ST1 isolated in the Balearic Islands and the United States. The genome resources from this work will assist in further studies on the role of plasmids in the epidemiology, ecology, and evolution of this plant pathogen.
Affiliation(s)
- María Pilar Velasco-Amo
- Instituto de Agricultura Sostenible, Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain
- Luis F Arias-Giraldo
- Instituto de Agricultura Sostenible, Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain
- Miguel Román-Écija
- Instituto de Agricultura Sostenible, Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain
- Leonardo De La Fuente
- Department of Entomology and Plant Pathology, Auburn University, Auburn, AL 36849, U.S.A.
- Ester Marco-Noales
- Centro de Protección Vegetal y Biotecnología, Instituto Valenciano de Investigaciones Agrarias (IVIA), Moncada, Spain
- Eduardo Moralejo
- Tragsa, Empresa de Transformación Agraria, Delegación de Baleares, 07005 Palma, Spain
- Juan A Navas-Cortés
- Instituto de Agricultura Sostenible, Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain
- Blanca B Landa
- Instituto de Agricultura Sostenible, Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain
36
Wagner GE, Dabernig-Heinz J, Lipp M, Cabal A, Simantzik J, Kohl M, Scheiber M, Lichtenegger S, Ehricht R, Leitner E, Ruppitsch W, Steinmetz I. Real-Time Nanopore Q20+ Sequencing Enables Extremely Fast and Accurate Core Genome MLST Typing and Democratizes Access to High-Resolution Bacterial Pathogen Surveillance. J Clin Microbiol 2023; 61:e0163122. [PMID: 36988494 PMCID: PMC10117118 DOI: 10.1128/jcm.01631-22] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 11/04/2022] [Accepted: 02/17/2023] [Indexed: 03/30/2023] Open
Abstract
Next-generation whole-genome sequencing is essential for high-resolution surveillance of bacterial pathogens, for example, during outbreak investigations or for source tracking and escape variant analysis. However, current global sequencing and bioinformatic bottlenecks and a long time to result with standard technologies demand new approaches. In this study, we investigated whether novel nanopore Q20+ long-read chemistry enables standardized and easily accessible high-resolution typing combined with core genome multilocus sequence typing (cgMLST). We set high requirements for discriminatory power by using the slowly evolving bacterium Bordetella pertussis as a model pathogen. Our results show that the increased raw read accuracy enables the description of epidemiological scenarios and phylogenetic linkages at the level of gold-standard short reads. The same was true for our variant analysis of vaccine antigens, resistance genes, and virulence factors, demonstrating that nanopore sequencing is a legitimate competitor in the area of next-generation sequencing (NGS)-based high-resolution bacterial typing. Furthermore, we evaluated the parameters for the fastest possible analysis of the data. By combining the optimized processing pipeline with real-time basecalling, we established a workflow that allows for highly accurate and extremely fast high-resolution typing of bacterial pathogens while sequencing is still in progress. Along with advantages such as low costs and portability, the approach suggested here might democratize modern bacterial typing, enabling more efficient infection control globally.
Affiliation(s)
- Gabriel E. Wagner
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Johanna Dabernig-Heinz
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Michaela Lipp
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Adriana Cabal
- Austrian Agency for Health and Food Safety, Vienna, Austria
- Jonathan Simantzik
- Medical and Life Sciences Faculty, Furtwangen University, Villingen-Schwenningen, Germany
- Matthias Kohl
- Medical and Life Sciences Faculty, Furtwangen University, Villingen-Schwenningen, Germany
- Martina Scheiber
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Sabine Lichtenegger
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Ralf Ehricht
- InfectoGnostics Research Campus, Centre for Applied Research, Jena, Germany
- Leibniz-Institute of Photonic Technology (Leibniz-IPHT), Jena, Germany
- Friedrich Schiller University Jena, Institute of Physical Chemistry, Jena, Germany
- Eva Leitner
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Ivo Steinmetz
- Diagnostic and Research Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
37
Xia Y, Li X, Wu Z, Nie C, Cheng Z, Sun Y, Liu L, Zhang T. Strategies and tools in illumina and nanopore-integrated metagenomic analysis of microbiome data. iMeta 2023; 2:e72. [PMID: 38868337 PMCID: PMC10989838 DOI: 10.1002/imt2.72] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 09/06/2022] [Revised: 11/10/2022] [Accepted: 11/28/2022] [Indexed: 06/14/2024]
Abstract
Metagenomic strategy serves as the foundation for the ecological exploration of novel bioresources (e.g., industrial enzymes and bioactive molecules) and biohazards (e.g., pathogens and antibiotic resistance genes) in natural and engineered microbial systems across multiple disciplines. Recent advancements in sequencing technology have fostered rapid development in the field of microbiome research, where an increasing number of studies have applied both Illumina short-read (SR) and Nanopore long-read (LR) sequencing in their metagenomic workflows. However, given the high complexity of an environmental microbiome data set and the bioinformatic challenges caused by the unique features of these sequencing technologies, integrating SRs and LRs is not as straightforward as one might assume. The fast renewal of existing tools and the growing diversity of new algorithms make access to this field even more difficult. Therefore, here we systematically summarize the complete workflow, from DNA extraction to data processing strategies, for applying Illumina and Nanopore-integrated metagenomics to the investigation of environmental microbiomes. Overall, this review aims to provide a timely knowledge framework for researchers who are interested in, or are struggling with, the integration of SRs and LRs in their metagenomic analysis. The discussions presented will facilitate improved ecological understanding of community functionalities and assembly of natural, engineered, and human microbiomes, benefiting researchers from multiple disciplines.
Affiliation(s)
- Yu Xia
- School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- State Environmental Protection Key Laboratory of Integrated Surface Water-Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, China
- Guangdong Provincial Key Laboratory of Soil and Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, China
- Xiang Li
- School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Ziqi Wu
- School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Cailong Nie
- School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Zhanwen Cheng
- School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Yuhong Sun
- School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Lei Liu
- Environmental Microbiome Engineering and Biotechnology Laboratory, The University of Hong Kong, Hong Kong SAR, China
- Tong Zhang
- Environmental Microbiome Engineering and Biotechnology Laboratory, The University of Hong Kong, Hong Kong SAR, China
38
Kovaka S, Ou S, Jenike KM, Schatz MC. Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing. Nat Methods 2023; 20:12-16. [PMID: 36635537 PMCID: PMC10068675 DOI: 10.1038/s41592-022-01716-8] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Indexed: 01/13/2023]
Abstract
The year 2022 will be remembered as the turning point for accurate long-read sequencing, which now establishes the gold standard for speed and accuracy at competitive costs. We discuss the key bioinformatics techniques needed to power long reads across application areas and close with our vision for long-read sequencing over the coming years.
Affiliation(s)
- Sam Kovaka
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Shujun Ou
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Molecular Genetics, Ohio State University, Columbus, OH, USA
- Katharine M Jenike
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
39
Arumugam K, Bessarab I, Haryono MAS, Williams RBH. Recovery and Analysis of Long-Read Metagenome-Assembled Genomes. Methods Mol Biol 2023; 2649:235-259. [PMID: 37258866 DOI: 10.1007/978-1-0716-3072-3_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
The development of long-read nucleic acid sequencing is beginning to make a very substantive impact on the conduct of metagenome analysis, particularly in relation to the problem of recovering the genomes of member species of complex microbial communities. Here we outline bioinformatics workflows for the recovery and characterization of complete genomes from long-read metagenome data, along with some complementary procedures for comparing cognate draft genomes and gene quality obtained from short-read and long-read sequencing.
Affiliation(s)
- Krithika Arumugam
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
- Irina Bessarab
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
- Mindia A S Haryono
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
- Rohan B H Williams
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
40
Jin H, Quan K, He Q, Kwok LY, Ma T, Li Y, Zhao F, You L, Zhang H, Sun Z. A high-quality genome compendium of the human gut microbiome of Inner Mongolians. Nat Microbiol 2023; 8:150-161. [PMID: 36604505 DOI: 10.1038/s41564-022-01270-1] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 12/16/2021] [Accepted: 10/13/2022] [Indexed: 01/07/2023]
Abstract
Metagenome-based resources have revealed the diversity and function of the human gut microbiome, but further understanding is limited by insufficient genome quality and a lack of samples from typically understudied populations. Here we used hybrid long-read PromethION and short-read HiSeq sequencing to characterize the faecal microbiota of 60 Inner Mongolian individuals (n = 180 samples over three time points) who were part of a probiotic yogurt intervention trial. We present the Inner Mongolian Gut Genome catalogue, comprising 802 closed and 5,927 high-quality metagenome-assembled genomes. This approach achieved high genome continuity and substantially increased the resolution of genomic elements, including ribosomal RNA operons, metabolic gene clusters, prophages and insertion sequences. Particularly, we report the ribosomal RNA operon copy numbers for uncultured species, over 12,000 previously undescribed gut prophages and the distribution of insertion sequence elements across gut bacteria. Overall, these data provide a high-quality, large-scale resource for studying the human gut microbiota.
Affiliation(s)
- Hao Jin
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Keyu Quan
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Qiuwen He
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Lai-Yu Kwok
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Teng Ma
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Yalin Li
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Feiyan Zhao
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Lijun You
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Heping Zhang
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| | - Zhihong Sun
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China.,Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
| |
|
41
|
Liu L, Yang Y, Deng Y, Zhang T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. MICROBIOME 2022; 10:209. [PMID: 36457010 PMCID: PMC9716684 DOI: 10.1186/s40168-022-01415-8] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 06/07/2022] [Accepted: 11/07/2022] [Indexed: 05/31/2023]
Abstract
BACKGROUND The accurate and comprehensive analyses of genome-resolved metagenomics largely depend on the reconstruction of reference-quality (complete and high-quality) genomes from diverse microbiomes. Closing gaps in draft genomes has become approachable with the inclusion of Nanopore long reads; however, genome quality improvement requires extensive and time-consuming high-accuracy short-read polishing. RESULTS Here, we introduce NanoPhase, an open-source tool to reconstruct reference-quality genomes from complex metagenomes using only Nanopore long reads. Using Kit 9 and Q20+ chemistries, we first evaluated the feasibility of NanoPhase using a ZymoBIOMICS gut microbiome standard (including 21 strains), then sequenced the complex activated sludge microbiome and reconstructed 275 MAGs with median completeness of ~ 90%. As a result, NanoPhase improved the MAG contiguity (median MAG N50: 735 Kb, 44-86X compared to conventional short-read-based methods) while maintaining high accuracy, allowing for a full and accurate investigation of target microbiomes. Additionally, leveraging these high-contiguity reference-quality genomes, we identified 165 prophages within 111 MAGs, with 5 as active prophages, indicating that prophages are a neglected source of genetic diversity within microbial populations and an influence in shaping microbial composition in the activated sludge microbiome. CONCLUSIONS Our results demonstrated that NanoPhase enables reference-quality genome reconstruction from complex metagenomes directly using only Nanopore long reads. Furthermore, besides the 16S rRNA genes and biosynthetic gene clusters, the generated high-accuracy and high-contiguity MAGs improved the host identification of critical mobile genetic elements, e.g., prophage, serving as a genomic blueprint to investigate the microbial potential and ecology in the activated sludge ecosystem.
Affiliation(s)
- Lei Liu
- Environmental Microbiome Engineering and Biotechnology Laboratory, Center for Environmental Engineering Research, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
| | - Yu Yang
- Environmental Microbiome Engineering and Biotechnology Laboratory, Center for Environmental Engineering Research, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
| | - Yu Deng
- Environmental Microbiome Engineering and Biotechnology Laboratory, Center for Environmental Engineering Research, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
| | - Tong Zhang
- Environmental Microbiome Engineering and Biotechnology Laboratory, Center for Environmental Engineering Research, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
| |
|
42
|
Iyengar BR, Wagner A. Bacterial Hsp90 predominantly buffers but does not potentiate the phenotypic effects of deleterious mutations during fluorescent protein evolution. Genetics 2022; 222:iyac154. [PMID: 36227141 PMCID: PMC9713429 DOI: 10.1093/genetics/iyac154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Accepted: 09/26/2022] [Indexed: 12/13/2022] Open
Abstract
Chaperones facilitate the folding of other ("client") proteins and can thus affect the adaptive evolution of these clients. Specifically, chaperones affect the phenotype of proteins via two opposing mechanisms. On the one hand, they can buffer the effects of mutations in proteins and thus help preserve an ancestral, premutation phenotype. On the other hand, they can potentiate the effects of mutations and thus enhance the phenotypic changes caused by a mutation. We study how the bacterial Hsp90 chaperone (HtpG) affects the evolution of green fluorescent protein. To this end, we performed directed evolution of green fluorescent protein under low and high cellular concentrations of Hsp90. Specifically, we evolved green fluorescent protein under both stabilizing selection for its ancestral (green) phenotype and directional selection toward a new (cyan) phenotype. While Hsp90 affected the rate of adaptive evolution only transiently, it did affect the phenotypic effects of mutations that occurred during adaptive evolution. Specifically, Hsp90 allowed strongly deleterious mutations to accumulate in evolving populations by buffering their effects. Our observations show that the role of a chaperone for adaptive evolution depends on the organism and the trait being studied.
Affiliation(s)
- Bharat Ravi Iyengar
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, 8057 Zurich, Switzerland
- Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, 1015 Lausanne, Switzerland
- Institute for Evolution and Biodiversity, Westfalian Wilhelms-University of Münster, 48149 Münster, Germany
| | - Andreas Wagner
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, 8057 Zurich, Switzerland
- Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, 1015 Lausanne, Switzerland
- The Santa Fe Institute, Santa Fe, NM 87501, USA
- Stellenbosch Institute for Advanced Study (STIAS), Wallenberg Research Centre at Stellenbosch University, 7600 Stellenbosch, South Africa
| |
|
43
|
Rayamajhi N, Cheng CHC, Catchen JM. Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki. G3 (BETHESDA, MD.) 2022; 12:jkac192. [PMID: 35904764 PMCID: PMC9635638 DOI: 10.1093/g3journal/jkac192] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/21/2022] [Accepted: 07/18/2022] [Indexed: 11/16/2022]
Abstract
For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least 3 phases: (1) short-read only, (2) short- and long-read hybrid, and (3) long-read only assemblies. Each of the phases has its own error model. We hypothesized that hidden short-read scaffolding errors and erroneous long-read contigs degrade the quality of short- and long-read hybrid assemblies. We assembled the genome of Trematomus borchgrevinki from data generated during each of the 3 phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer-based strategy improved short-read assemblies as measured by Benchmarking Universal Single-Copy Ortholog while mate-pair libraries introduced hidden scaffolding errors and perturbed Benchmarking Universal Single-Copy Ortholog scores. Furthermore, we found that although hybrid assemblies can generate higher contiguity they tend to suffer from lower quality. In addition, we found long-read-only assemblies can be optimized for contiguity by subsampling length-restricted raw reads. Our results indicate that long-read contig assembly is the current best choice and that assemblies from phase I and phase II were of lower quality.
Affiliation(s)
- Niraj Rayamajhi
- Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
| | - Chi-Hing Christina Cheng
- Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
| | - Julian M Catchen
- Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
| |
|
44
|
Holt GS, Batty LE, Alobaidi BKS, Smith HE, Oud MS, Ramos L, Xavier MJ, Veltman JA. Phasing of de novo mutations using a scaled-up multiple amplicon long-read sequencing approach. Hum Mutat 2022; 43:1545-1556. [PMID: 36047340 PMCID: PMC9826063 DOI: 10.1002/humu.24450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/06/2022] [Revised: 08/11/2022] [Accepted: 08/18/2022] [Indexed: 01/11/2023]
Abstract
De novo mutations (DNMs) play an important role in severe genetic disorders that reduce fitness. To better understand their role in disease, it is important to determine the parent-of-origin and timing of mutational events that give rise to these mutations, especially in sex-specific developmental disorders such as male infertility. However, currently available short-read sequencing approaches are not ideally suited for phasing, as this requires long continuous DNA strands that span both the DNM and one or more informative single-nucleotide polymorphisms. To overcome these challenges, we optimized and implemented a multiplexed long-read sequencing approach using the Oxford Nanopore Technologies MinION platform. We focused on improving target amplification, integrating long-read sequenced data with high-quality short-read sequence data, and developing an anchored phasing computational method. This approach handled the inherent phasing challenges of long-range target amplification and the normal accumulation of sequencing error associated with long-read sequencing. In total, 77 of 109 DNMs (71%) were successfully phased and parent-of-origin identified. The majority of phased DNMs were prezygotic (90%), the accuracy of which is highlighted by an average mutant allele frequency of 49.6% and standard error of 0.84%. This study demonstrates the benefits of employing an integrated short-read and long-read sequencing approach for large-scale DNM phasing.
Affiliation(s)
- Giles S. Holt
- Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
| | - Lois E. Batty
- Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
| | - Bilal K. S. Alobaidi
- Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
| | - Hannah E. Smith
- Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
| | - Manon S. Oud
- Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboudumc, Nijmegen, The Netherlands
| | - Liliana Ramos
- Department of Obstetrics and Gynecology, Division of Reproductive Medicine, Radboudumc, Nijmegen, The Netherlands
| | - Miguel J. Xavier
- Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
| | - Joris A. Veltman
- Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
| |
|
45
|
Srikakulam N, Sridevi G, Pandi G. High-quality reference transcriptome construction improves RNA-seq quantification in Oryza sativa indica. Front Genet 2022; 13:995072. [PMID: 36246658 PMCID: PMC9558114 DOI: 10.3389/fgene.2022.995072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 09/02/2022] [Indexed: 11/13/2022] Open
Abstract
The Reference Transcriptomic Dataset (RTD) is an accurate and comprehensive collection of transcripts originating from a given organism. It holds the key to precise transcript quantification and downstream analysis of differential expression and regulation. Currently, transcriptome annotations for most crop plants are far from complete. For example, Oryza sativa indica (O. sativa indica) is reported to have 40,759 transcripts in the Ensembl database without alternative transcript isoforms and alternative splicing (AS) events. To generate a high-quality RTD, we conducted RNA sequencing of rice leaf samples collected at various time points during Rhizoctonia solani infection. The obtained reads were analyzed by adopting a recently developed computational analysis pipeline to assemble the RTD with increased transcript and AS diversity for O. sativa indica (IndicaRTD). After stringent quality filtering, the newly constructed transcriptome annotation comprised 122,968 non-redundant transcripts from 53,695 genes. This study identified many novel transcripts compared to Ensembl deposited data that are important for regulating molecular and physiological processes in the plant system. The assembled IndicaRTD should allow fast quantification of transcript and gene expression with high precision.
Affiliation(s)
- Nagesh Srikakulam
- Laboratory of RNA Biology and Epigenomics, Department of Plant Biotechnology, School of Biotechnology, Madurai Kamaraj University, Madurai, India
- *Correspondence: Nagesh Srikakulam; Gopal Pandi
| | - Ganapathi Sridevi
- Department of Plant Biotechnology, School of Biotechnology, Madurai Kamaraj University, Madurai, India
| | - Gopal Pandi
- Laboratory of RNA Biology and Epigenomics, Department of Plant Biotechnology, School of Biotechnology, Madurai Kamaraj University, Madurai, India
- *Correspondence: Nagesh Srikakulam; Gopal Pandi
| |
|
46
|
Fu X, Zan XY, Sun L, Tan M, Cui FJ, Liang YY, Meng LJ, Sun WJ. Functional Characterization and Structural Basis of the β-1,3-Glucan Synthase CMGLS from Mushroom Cordyceps militaris. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY 2022; 70:8725-8737. [PMID: 35816703 DOI: 10.1021/acs.jafc.2c03410] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/15/2023]
Abstract
β-1,3-Glucan synthases play key roles in glucan synthesis, cell wall assembly, and growth of fungi. However, their multi-transmembrane domains (over 14 TMHs) and large molecular masses (over 100 kDa) significantly hamper understanding of their catalytic characteristics and mechanisms. In the present study, the 5841-bp gene CMGLS encoding the 221.7 kDa membrane-bound β-1,3-glucan synthase CMGLS in Cordyceps militaris was cloned, identified, and structurally analyzed. CMGLS was partially purified with a specific activity of 87.72 pmol/min/μg, a purification fold of 121, and a yield of 10.16% using a product-entrapment purification method. CMGLS showed a strict specificity to UDP-glucose with a Km value of 84.28 μM at pH 7.0 and synthesized β-1,3-glucan with a maximum degree of polymerization (DP) of 70. With the assistance of AlphaFold and molecular docking, the 3D structure of CMGLS and its binding features with substrate UDP-glucose were proposed for the first time to our knowledge. UDP-glucose potentially bound to at least 11 residues via hydrogen bonds, π-stacking, and salt bridges, and Arg 1436 was predicted as a key residue directly interacting with the moieties of glucose, phosphate, and the ribose ring on UDP-glucose. These findings would open an avenue to recognize and understand the glucan synthesis process and catalytic mechanism of β-1,3-glucan synthases in mushrooms.
Affiliation(s)
- Xin Fu
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
| | - Xin-Yi Zan
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
| | - Lei Sun
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
| | - Ming Tan
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
| | - Feng-Jie Cui
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
- Jiangxi Provincial Engineering and Technology Center for Food Additives Bio-production, Dexing 334221, P.R. China
| | - Ying-Ying Liang
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
| | - Li-Juan Meng
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
| | - Wen-Jing Sun
- School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, P.R. China
- Jiangxi Provincial Engineering and Technology Center for Food Additives Bio-production, Dexing 334221, P.R. China
| |
|
47
|
Zhang R, Kuo R, Coulter M, Calixto CPG, Entizne JC, Guo W, Marquez Y, Milne L, Riegler S, Matsui A, Tanaka M, Harvey S, Gao Y, Wießner-Kroh T, Paniagua A, Crespi M, Denby K, Hur AB, Huq E, Jantsch M, Jarmolowski A, Koester T, Laubinger S, Li QQ, Gu L, Seki M, Staiger D, Sunkar R, Szweykowska-Kulinska Z, Tu SL, Wachter A, Waugh R, Xiong L, Zhang XN, Conesa A, Reddy ASN, Barta A, Kalyna M, Brown JWS. A high-resolution single-molecule sequencing-based Arabidopsis transcriptome using novel methods of Iso-seq analysis. Genome Biol 2022; 23:149. [PMID: 35799267 PMCID: PMC9264592 DOI: 10.1186/s13059-022-02711-0] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 09/17/2021] [Accepted: 06/15/2022] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis. RESULTS We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts, twice that of the best current Arabidopsis transcriptome, including over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage. CONCLUSIONS AtRTD3 is currently the most comprehensive Arabidopsis transcriptome. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species.
Affiliation(s)
- Runxuan Zhang
- Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK.
| | - Richard Kuo
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, EH25 9RG, UK
| | - Max Coulter
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
| | - Cristiane P G Calixto
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
- Present address: Institute of Biosciences, University of São Paulo, São Paulo, 05508-090, Brazil
| | - Juan Carlos Entizne
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
| | - Wenbin Guo
- Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Yamile Marquez
- Centre for Genomic Regulation, C/ Dr. Aiguader 88, 08003, Barcelona, Spain
| | - Linda Milne
- Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Stefan Riegler
- Institute of Molecular Plant Biology, Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190, Vienna, Austria
- Present address: Institute of Science and Technology Austria, Am Campus 1, 3400, Klosterneuburg, Austria
| | - Akihiro Matsui
- Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Maho Tanaka
- Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Sarah Harvey
- Centre for Novel Agricultural Products (CNAP), Department of Biology, University of York Wentworth Way, York, YO10 5DD, UK
| | - Yubang Gao
- College of Forestry, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
| | - Theresa Wießner-Kroh
- Center for Plant Molecular Biology (ZMBP), University of Tübingen, Auf der Morgenstelle 32, 72076, Tübingen, Germany
| | - Alejandro Paniagua
- Institute for Integrative Systems Biology (CSIC-UV), Spanish National Research Council, Paterna, Valencia, Spain
| | - Martin Crespi
- French National Centre for Scientific Research | CNRS INRAE-Universities of Paris Saclay and Paris, Institute of Plant Sciences Paris Saclay IPS2, Rue de Noetzlin, 91192, Gif sur Yvette, France
| | - Katherine Denby
- Centre for Novel Agricultural Products (CNAP), Department of Biology, University of York Wentworth Way, York, YO10 5DD, UK
| | - Asa Ben Hur
- Department of Computer Science, Colorado State University, 1873 Campus Delivery, Fort Collins, CO, 80523-1873, USA
| | - Enamul Huq
- Department of Molecular Biosciences, University of Texas at Austin, 100 East 24th St., Austin, TX, 78712-1095, USA
| | - Michael Jantsch
- Department of Cell and Developmental Biology, Center for Anatomy and Cell Biology, Medical University of Vienna, Schwarzspanierstrasse 17 A-1090, Vienna, Austria
| | - Artur Jarmolowski
- Department of Gene Expression, Adam Mickiewicz University, Poznań, Poland
| | - Tino Koester
- RNA Biology and Molecular Physiology, Faculty for Biology, Bielefeld University, Universitaetsstrasse 25, 33615, Bielefeld, Germany
| | - Sascha Laubinger
- Institut für Biologie und Umweltwissenschaften (IBU), Carl von Ossietzky Universität Oldenburg, Carl von Ossietzky-Str. 9-11, 26111, Oldenburg, Germany
- Institute of Biology, Department of Genetics, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Qingshun Quinn Li
- Graduate College of Biomedical Sciences, Western University of Health Sciences, Pomona, CA, 91766, USA
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen, 361102, Fujian, China
| | - Lianfeng Gu
- College of Forestry, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
| | - Motoaki Seki
- Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Dorothee Staiger
- RNA Biology and Molecular Physiology, Faculty for Biology, Bielefeld University, Universitaetsstrasse 25, 33615, Bielefeld, Germany
| | - Ramanjulu Sunkar
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK, 74078, USA
| | | | - Shih-Long Tu
- Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
| | - Andreas Wachter
- Center for Plant Molecular Biology (ZMBP), University of Tübingen, Auf der Morgenstelle 32, 72076, Tübingen, Germany
- Present address: Institute for Molecular Physiology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 17, 55128, Mainz, Germany
| | - Robbie Waugh
- Cell and Molecular Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Liming Xiong
- Department of Biology, Hong Kong Baptist University, Hong Kong, China
| | - Xiao-Ning Zhang
- Biology Department, School of Arts and Sciences, St. Bonaventure University, 3261 West State Road, St. Bonaventure, NY, 14778, USA
| | - Ana Conesa
- Institute for Integrative Systems Biology (CSIC-UV), Spanish National Research Council, Paterna, Valencia, Spain
| | - Anireddy S N Reddy
- Department of Biology and Program in Cell and Molecular Biology, Colorado State University, Fort Collins, CO, 80523, USA
| | - Andrea Barta
- Max F. Perutz Laboratories, Medical University of Vienna, Center of Medical Biochemistry, Dr.-Bohr-Gasse 9/3, A-1030, Vienna, Austria
| | - Maria Kalyna
- Institute of Molecular Plant Biology, Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190, Vienna, Austria
| | - John W S Brown
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
- Cell and Molecular Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| |
|
48
|
Sereika M, Kirkegaard RH, Karst SM, Michaelsen TY, Sørensen EA, Wollenberg RD, Albertsen M. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat Methods 2022; 19:823-826. [PMID: 35789207 PMCID: PMC9262707 DOI: 10.1038/s41592-022-01539-7] [Citation(s) in RCA: 223] [Impact Index Per Article: 74.3] [Received: 11/10/2021] [Accepted: 05/24/2022] [Indexed: 12/26/2022]
Abstract
Long-read Oxford Nanopore sequencing has democratized microbial genome sequencing and enables the recovery of highly contiguous microbial genomes from isolates or metagenomes. However, to obtain near-finished genomes it has been necessary to include short-read polishing to correct insertions and deletions derived from homopolymer regions. Here, we show that Oxford Nanopore R10.4 can be used to generate near-finished microbial genomes from isolates or metagenomes without short-read or reference polishing.
Affiliation(s)
- Mantas Sereika
- Center for Microbial Communities, Aalborg University, Aalborg, Denmark
| | - Rasmus Hansen Kirkegaard
- Center for Microbial Communities, Aalborg University, Aalborg, Denmark.,Joint Microbiome Facility, University of Vienna, Vienna, Austria
| | | | | | | | | | - Mads Albertsen
- Center for Microbial Communities, Aalborg University, Aalborg, Denmark.
| |
|
49
|
Arévalo MT, Karavis MA, Katoski SE, Harris JV, Hill JM, Deshpande SV, Roth PA, Liem AT, Bernhards RC. A Rapid, Whole Genome Sequencing Assay for Detection and Characterization of Novel Coronavirus (SARS-CoV-2) Clinical Specimens Using Nanopore Sequencing. Front Microbiol 2022; 13:910955. [PMID: 35733956 PMCID: PMC9207459 DOI: 10.3389/fmicb.2022.910955] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/01/2022] [Accepted: 05/09/2022] [Indexed: 12/22/2022] Open
Abstract
A new human coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), emerged at the end of 2019 in Wuhan, China, and caused a range of disease severities, with symptoms including fever, shortness of breath, and coughing. This disease, now known as coronavirus disease 2019 (COVID-19), quickly spread throughout the world, and was declared a pandemic by the World Health Organization in March of 2020. As the disease continues to spread, providing rapid characterization has proven crucial to better inform the design and execution of control measures, such as decontamination methods, diagnostic tests, antiviral drugs, and prophylactic vaccines for long-term control. Our work at the United States Army’s Combat Capabilities Development Command Chemical Biological Center (DEVCOM CBC) is focused on engineering workflows to efficiently identify, characterize, and evaluate the threat level of any potential biological threat in the field and in more remote, lower-resource settings, such as forward operating bases. While we have successfully established untargeted sequencing approaches for detection of pathogens for rapid identification, our current work entails a more in-depth sequencing analysis for use in evolutionary monitoring. We are developing and validating a SARS-CoV-2 nanopore sequencing assay, based on the ARTIC protocol. The standard ARTIC, Illumina, and nanopore sequencing protocols for SARS-CoV-2 are elaborate and time consuming. The new protocol integrates Oxford Nanopore Technology’s Rapid Sequencing Kit following targeted RT-PCR of RNA extracted from human clinical specimens. This approach decreases sample manipulations and preparation times. Our current bioinformatics pipeline utilizes Centrifuge as the classifier for quick identification of SARS-CoV-2 and RAMPART software for verification and mapping of reads to the full SARS-CoV-2 genome. ARTIC rapid sequencing results from previously RT-PCR-confirmed patient samples showed that the modified protocol produces high quality data, with up to 98.9% genome coverage at >1,000x depth for samples with presumably higher viral loads. Furthermore, whole genome assembly and subsequent mutational analysis of six of these sequences identified existing mutations and mutations unique to this cluster, including three in the Spike protein: V308L, P521R, and D614G. This work suggests that an accessible, portable, and relatively fast sample-to-sequence process to characterize viral outbreaks is feasible and effective.
Affiliation(s)
- Maria T. Arévalo
  - Defense Threat Reduction Agency, Aberdeen Proving Ground, MD, United States
  - United States Army Combat Capabilities Development Command Chemical Biological Center, Aberdeen Proving Ground, MD, United States
  - *Correspondence: Maria T. Arévalo,
- Mark A. Karavis
  - United States Army Combat Capabilities Development Command Chemical Biological Center, Aberdeen Proving Ground, MD, United States
- Sarah E. Katoski
  - United States Army Combat Capabilities Development Command Chemical Biological Center, Aberdeen Proving Ground, MD, United States
- Jacquelyn V. Harris
  - United States Army Combat Capabilities Development Command Chemical Biological Center, Aberdeen Proving Ground, MD, United States
- Samir V. Deshpande
  - United States Army Combat Capabilities Development Command Chemical Biological Center, Aberdeen Proving Ground, MD, United States
- R. Cory Bernhards
  - United States Army Combat Capabilities Development Command Chemical Biological Center, Aberdeen Proving Ground, MD, United States
|
50
|
Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, Howe K, Jain C, Koren S, Logsdon GA, Miga KH, Mikheenko A, Paten B, Shumate A, Soto DC, Sović I, Wood JMD, Zook JM, Phillippy AM, Rhie A. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 2022; 19:687-695. [PMID: 35361931 PMCID: PMC9812399 DOI: 10.1038/s41592-022-01440-3] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 07/13/2021] [Accepted: 03/04/2022] [Indexed: 01/07/2023]
Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
Affiliation(s)
- Ann M Mc Cartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Kishwar Shafin
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Michael Alonge
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Andrey V Bzikadze
  - Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA
- Giulio Formenti
  - Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
- Chirag Jain
  - Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Glennis A Logsdon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Alaina Shumate
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Daniela C Soto
  - Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
- Ivan Sović
  - Pacific Biosciences, Menlo Park, CA, USA
  - Digital BioLogic d.o.o., Ivanić-Grad, Croatia
- Justin M Zook
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
|