1
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] [Open Access]
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filter errors, and data quality assurance is moved from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
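The TF-IDF feature extraction mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how reads are tokenized, so this example assumes each read is a "document" whose "terms" are its overlapping k-mers, and it uses the textbook weighting TF × log(N / df). The function names `kmers` and `tfidf` and the value `k=3` are hypothetical choices for illustration, not the authors' implementation; the resulting vectors would then be passed to a classifier such as Naive Bayes, which is not shown here.

```python
import math
from collections import Counter


def kmers(read, k):
    """Split a read into its overlapping k-mers (the 'terms' of the read)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def tfidf(reads, k=3):
    """Return one {k-mer: TF-IDF weight} dict per read.

    TF is the k-mer count divided by the number of k-mers in the read.
    IDF is log(N / df), where N is the number of reads and df is the
    number of reads that contain the k-mer (assumed textbook definition).
    """
    docs = [Counter(kmers(r, k)) for r in reads]
    n_docs = len(docs)

    # Document frequency: in how many reads does each k-mer occur?
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    features = []
    for doc in docs:
        total = sum(doc.values())
        features.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return features


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "ACGTTT", "GGGTAC"]
vectors = tfidf(reads, k=3)
```

With this weighting, a k-mer that occurs in every read receives a weight of zero, while k-mers unique to a single read are weighted most heavily.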
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt
- M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
2
Fernandez LA, Martin-Mayor V, Yllanes D. Phase transition in the computational complexity of the shortest common superstring and genome assembly. Phys Rev E 2024; 109:014133. [PMID: 38366408 DOI: 10.1103/physreve.109.014133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Accepted: 12/11/2023] [Indexed: 02/18/2024]
Abstract
Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the "easy" phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.
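To make the shortest common superstring formulation concrete, here is a minimal sketch of the classic greedy overlap-merge heuristic. It is offered only as an example of a simple deterministic approach. The abstract does not name the deterministic algorithms it compares against, and this is not the Markov-chain Monte Carlo method the authors propose. The function names `overlap` and `greedy_superstring` and the toy reads are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    unique = set(reads)
    # Drop reads that are fully contained in another read.
    frags = [r for r in unique if not any(r != s and r in s for s in unique)]
    frags.sort()  # deterministic tie-breaking

    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)

    return frags[0] if frags else ""


# Toy reads, invented for illustration only.
reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
assembly = greedy_superstring(reads)
```

The greedy choice is not guaranteed to return the shortest superstring in general; it is a heuristic that works well when overlaps are long and unambiguous, as in this toy input.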
Affiliation(s)
- L A Fernandez
- Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
- Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
- V Martin-Mayor
- Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
- Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
- D Yllanes
- Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
- Chan Zuckerberg Biohub - SF, 499 Illinois Street, San Francisco, California 94158, USA
3
Nestor BJ, Bayer PE, Fernandez CGT, Edwards D, Finnegan PM. Approaches to increase the validity of gene family identification using manual homology search tools. Genetica 2023; 151:325-338. [PMID: 37817002 PMCID: PMC10692271 DOI: 10.1007/s10709-023-00196-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 10/01/2023] [Indexed: 10/12/2023]
Abstract
Identifying homologs is an important process in the analysis of genetic patterns underlying traits and evolutionary relationships among species. Analysis of gene families is often used to form and support hypotheses on genetic patterns such as gene presence, absence, or functional divergence which underlie traits examined in functional studies. These analyses often require precise identification of all members in a targeted gene family. Manual pipelines where homology search and orthology assignment tools are used separately are the most common approach for identifying small gene families where accurate identification of all members is important. The ability to curate sequences between steps in manual pipelines allows for simple and precise identification of all possible gene family members. However, the validity of such manual pipeline analyses is often decreased by inappropriate approaches to homology searches including too relaxed or stringent statistical thresholds, inappropriate query sequences, homology classification based on sequence similarity alone, and low-quality proteome or genome sequences. In this article, we propose several approaches to mitigate these issues and allow for precise identification of gene family members and support for hypotheses linking genetic patterns to functional traits.
Affiliation(s)
- Benjamin J Nestor
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- Philipp E Bayer
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- Cassandria G Tay Fernandez
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- David Edwards
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- Patrick M Finnegan
- School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
4
Gryganskyi AP, Golan J, Muszewska A, Idnurm A, Dolatabadi S, Mondo SJ, Kutovenko VB, Kutovenko VO, Gajdeczka MT, Anishchenko IM, Pawlowska J, Tran NV, Ebersberger I, Voigt K, Wang Y, Chang Y, Pawlowska TE, Heitman J, Vilgalys R, Bonito G, Benny GL, Smith ME, Reynolds N, James TY, Grigoriev IV, Spatafora JW, Stajich JE. Sequencing the Genomes of the First Terrestrial Fungal Lineages: What Have We Learned? Microorganisms 2023; 11:1830. [PMID: 37513002 PMCID: PMC10386755 DOI: 10.3390/microorganisms11071830] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/15/2023] [Revised: 07/13/2023] [Accepted: 07/16/2023] [Indexed: 07/30/2023] [Open Access]
Abstract
The first genome sequenced of a eukaryotic organism was for Saccharomyces cerevisiae, as reported in 1996, but it was more than 10 years before any of the zygomycete fungi, which are the early-diverging terrestrial fungi currently placed in the phyla Mucoromycota and Zoopagomycota, were sequenced. The genome for Rhizopus delemar was completed in 2008; currently, more than 1000 zygomycete genomes have been sequenced. Genomic data from these early-diverging terrestrial fungi revealed deep phylogenetic separation of the two major clades: primarily plant-associated saprotrophic and mycorrhizal Mucoromycota versus the primarily mycoparasitic or animal-associated parasites and commensals in the Zoopagomycota. Genomic studies provide many valuable insights into how these fungi evolved in response to the challenges of living on land, including adaptations to sensing light and gravity, development of hyphal growth, and co-existence with the first terrestrial plants. Genome sequence data have facilitated studies of genome architecture, including a history of genome duplications and horizontal gene transfer events, distribution and organization of mating type loci, rDNA genes and transposable elements, methylation processes, and genes useful for various industrial applications. Pathogenicity genes and specialized secondary metabolites have also been detected in soil saprobes and pathogenic fungi. Novel endosymbiotic bacteria and viruses have been discovered during several zygomycete genome projects. Overall, genomic information has helped to resolve a plethora of research questions, from the placement of zygomycetes on the evolutionary tree of life and in natural ecosystems, to the applied biotechnological and medical questions.
Affiliation(s)
- Andrii P. Gryganskyi
- Division of Biological & Nanoscale Technologies, UES, Inc., Dayton, OH 45432, USA
- Jacob Golan
- Department of Botany, University of Wisconsin-Madison, Madison, WI 53706, USA
- Anna Muszewska
- Institute of Biochemistry & Biophysics, Polish Academy of Sciences, 01-224 Warsaw, Poland
- Alexander Idnurm
- School of BioSciences, University of Melbourne, Parkville, VIC 3010, Australia
- Somayeh Dolatabadi
- Biology Department, Hakim Sabzevari University, Sabzevar 96179-76487, Iran
- Stephen J. Mondo
- U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Vira B. Kutovenko
- Department of Agrobiology, National University of Life & Environmental Sciences, 03041 Kyiv, Ukraine
- Volodymyr O. Kutovenko
- Department of Agrobiology, National University of Life & Environmental Sciences, 03041 Kyiv, Ukraine
- Iryna M. Anishchenko
- MG Kholodny Institute of Botany, National Academy of Sciences, 01030 Kyiv, Ukraine
- Julia Pawlowska
- Institute of Evolutionary Biology, Faculty of Biology, Biological & Chemical Research Centre, University of Warsaw, 02-089 Warsaw, Poland
- Ngoc Vinh Tran
- Plant Pathology Department, University of Florida, Gainesville, FL 32611, USA
- Ingo Ebersberger
- Leibniz Institute for Natural Product Research & Infection Biology, 07745 Jena, Germany
- Kerstin Voigt
- Leibniz Institute for Natural Product Research & Infection Biology, 07745 Jena, Germany
- Yan Wang
- Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, ON M5S 1A1, Canada
- Department of Biological Sciences, University of Toronto Scarborough, Toronto, ON M1C 1A4, Canada
- Ying Chang
- Department of Biological Sciences, National University of Singapore, Singapore 119077, Singapore
- Teresa E. Pawlowska
- School of Integrative Plant Science, Cornell University, Ithaca, NY 14850, USA
- Joseph Heitman
- Department of Molecular Genetics & Microbiology, Duke University School of Medicine, Durham, NC 27710, USA
- Rytas Vilgalys
- Biology Department, Duke University, Durham, NC 27708, USA
- Gregory Bonito
- Department of Plant, Soil & Microbial Sciences, Michigan State University, East Lansing, MI 48824, USA
- Gerald L. Benny
- Plant Pathology Department, University of Florida, Gainesville, FL 32611, USA
- Matthew E. Smith
- Plant Pathology Department, University of Florida, Gainesville, FL 32611, USA
- Nicole Reynolds
- School of Integrative Plant Science, Cornell University, Ithaca, NY 14850, USA
- Timothy Y. James
- Department of Ecology & Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Igor V. Grigoriev
- U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Plant & Microbial Biology, University of California, Berkeley, CA 94720, USA
- Joseph W. Spatafora
- Department of Botany & Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
- Jason E. Stajich
- Department of Microbiology & Plant Pathology, University of California, Riverside, CA 93106, USA
5
Darnet E, Teixeira B, Schaller H, Rogez H, Darnet S. Elucidating the Mesocarp Drupe Transcriptome of Açai (Euterpe oleracea Mart.): An Amazonian Tree Palm Producer of Bioactive Compounds. Int J Mol Sci 2023; 24:9315. [PMID: 37298279 DOI: 10.3390/ijms24119315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 05/13/2023] [Accepted: 05/16/2023] [Indexed: 06/12/2023] [Open Access]
Abstract
Euterpe oleracea palm, endemic to the Amazon region, is well known for açai, a violet fruit beverage with nutritional and medicinal properties. During E. oleracea fruit ripening, anthocyanin accumulation is not related to sugar production, in contrast to grape and blueberry. Ripened fruits have a high content of anthocyanins, isoprenoids, fibers, and proteins, and are poor in sugars. E. oleracea is proposed as a new genetic model for metabolism partitioning in the fruit. Approximately 255 million single-end-oriented reads were generated on an Ion Proton NGS platform combining fruit cDNA libraries at four ripening stages. The de novo transcriptome assembly was tested using six assemblers and 46 different combinations of parameters, a pre-processing and a post-processing step. The multiple k-mer approach with TransABySS as an assembler and Evidential Gene as a post-processor showed the best results, with an N50 of 959 bp, a read coverage mean of 70x, a BUSCO complete sequence recovery of 36% and an RBMT of 61%. The fruit transcriptome dataset included 22,486 transcripts representing 18 Mbp, of which 87% had significant homology with other plant sequences. Approximately 904 new EST-SSRs were described, and were common and transferable to Phoenix dactylifera and Elaeis guineensis, two other palm trees. The global GO classification of transcripts showed similar categories to that in P. dactylifera and E. guineensis fruit transcriptomes. For an accurate annotation and functional description of metabolism genes, a bioinformatic pipeline was developed to precisely identify orthologs, such as one-to-one orthologs between species, and to infer multigenic family evolution. The phylogenetic inference confirmed an occurrence of duplication events in the Arecaceae lineage and the presence of orphan genes in E. oleracea. Anthocyanin and tocopherol pathways were annotated entirely.
Interestingly, the anthocyanin pathway showed a high number of paralogs, similar to grape, whereas the tocopherol pathway exhibited a low and conserved gene number and the prediction of several splicing forms. The release of this exhaustively annotated molecular dataset of E. oleracea constitutes a valuable tool for further studies in metabolism partitioning and opens new perspectives for studying fruit physiology with açai as a model.
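The N50 statistic quoted in the abstract above can be computed with a few lines of code. The sketch below uses the common definition (the length L such that sequences of length at least L together cover at least half of the total assembled bases); the abstract does not state which tool or exact convention the authors used, and the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or transcript lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Toy lengths, invented for illustration only (total = 54, half = 27).
example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

In the toy input the three longest sequences (10 + 9 + 8 = 27) already reach half of the total, so the N50 is 8.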
Affiliation(s)
- Elaine Darnet
- Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
- International Associated Laboratory PALMHEAT, French Scientific Research National Center (CNRS)/UFPA, 75016 Paris, France
- Bruno Teixeira
- Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
- Hubert Schaller
- International Associated Laboratory PALMHEAT, French Scientific Research National Center (CNRS)/UFPA, 75016 Paris, France
- Plant Isoprenoid Biology, Institute of Molecular Biology of Plants of the Scientific Research National Center, Strasbourg University, 67081 Strasbourg, France
- Hervé Rogez
- Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
- Sylvain Darnet
- Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
- International Associated Laboratory PALMHEAT, French Scientific Research National Center (CNRS)/UFPA, 75016 Paris, France
- Plant Isoprenoid Biology, Institute of Molecular Biology of Plants of the Scientific Research National Center, Strasbourg University, 67081 Strasbourg, France
6
Chen P, Sun Z, Wang J, Liu X, Bai Y, Chen J, Liu A, Qiao F, Chen Y, Yuan C, Sha J, Zhang J, Xu LQ, Li J. Portable nanopore-sequencing technology: Trends in development and applications. Front Microbiol 2023; 14:1043967. [PMID: 36819021 PMCID: PMC9929578 DOI: 10.3389/fmicb.2023.1043967] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/14/2022] [Accepted: 01/03/2023] [Indexed: 02/04/2023] [Open Access]
Abstract
Sequencing technology is the most commonly used technology in molecular biology research and an essential pillar for the development and applications of molecular biology. Since 1977, when the first generation of sequencing technology opened the door to interpreting the genetic code, sequencing technology has been developing for three generations. It has applications in all aspects of life and scientific research, such as disease diagnosis, drug target discovery, pathological research, species protection, and SARS-CoV-2 detection. However, the first- and second-generation sequencing technology relied on fluorescence detection systems and DNA polymerization enzyme systems, which increased the cost of sequencing technology and limited its scope of applications. The third-generation sequencing technology performs PCR-free and single-molecule sequencing, but it still depends on the fluorescence detection device. To break through these limitations, researchers have made arduous efforts to develop a new advanced portable sequencing technology represented by nanopore sequencing. Nanopore technology has the advantages of small size and convenient portability, independent of biochemical reagents, and direct reading using physical methods. This paper reviews the research and development process of nanopore sequencing technology (NST) from the laboratory to commercially viable tools; discusses the main types of nanopore sequencing technologies and their various applications in solving a wide range of real-world problems. In addition, the paper collates the analysis tools necessary for performing different processing tasks in nanopore sequencing. Finally, we highlight the challenges of NST and its future research and application directions.
Affiliation(s)
- Pin Chen
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
- Zepeng Sun
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
- Jiawei Wang
- School of Computer Science and Technology, Southeast University, Nanjing, China
- Xinlong Liu
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
- Yun Bai
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
- Jiang Chen
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
- Anna Liu
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
- Feng Qiao
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
- Yang Chen
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
- Chenyan Yuan
- Clinical Laboratory, Southeast University Zhongda Hospital, Nanjing, China
- Jingjie Sha
- School of Mechanical Engineering, Southeast University, Nanjing, China
- Jinghui Zhang
- School of Computer Science and Technology, Southeast University, Nanjing, China
- Li-Qun Xu (corresponding author)
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
- Jian Li (corresponding author)
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
7
Mitogenome of a monotypic genus, Oliotius Kottelat, 2013 (Cypriniformes: Cyprinidae): Genomic characterization and phylogenetic position. Gene 2022; 851:147035. [DOI: 10.1016/j.gene.2022.147035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Revised: 11/01/2022] [Accepted: 11/04/2022] [Indexed: 11/10/2022]
8
Liu SC, Ju YR, Lu CL. Multi-CSAR: a web server for scaffolding contigs using multiple reference genomes. Nucleic Acids Res 2022; 50:W500-W509. [PMID: 35524553 PMCID: PMC9252826 DOI: 10.1093/nar/gkac301] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/22/2022] [Revised: 04/09/2022] [Accepted: 04/15/2022] [Indexed: 11/12/2022] [Open Access]
Abstract
Multi-CSAR is a web server that can efficiently and more accurately order and orient the contigs in the assembly of a target genome into larger scaffolds based on multiple reference genomes. Given a target genome and multiple reference genomes, Multi-CSAR first identifies sequence markers shared between the target genome and each reference genome, then utilizes these sequence markers to compute a scaffold for the target genome based on each single reference genome, and finally combines all the single reference-derived scaffolds into a multiple reference-derived scaffold. To run Multi-CSAR, the users need to upload a target genome to be scaffolded and one or more reference genomes in multi-FASTA format. The users can also choose to use the ‘weighting scheme of reference genomes’ for Multi-CSAR to automatically calculate different weights for the reference genomes and choose either ‘NUCmer on nucleotides’ or ‘PROmer on translated amino acids’ for Multi-CSAR to identify sequence markers. In the output page, Multi-CSAR displays its multiple reference-derived scaffold in two graphical representations (i.e. Circos plot and dotplot) for the users to visually validate the correctness of scaffolded contigs and in a tabular representation to further validate the scaffold in detail. Multi-CSAR is available online at http://genome.cs.nthu.edu.tw/Multi-CSAR/.
Affiliation(s)
- Shu-Cheng Liu
- Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan
- Yan-Ru Ju
- Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan
- Chin Lung Lu
- Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan
9
Liu F, Miao Y, Liu Y, Hou T. RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1840-1849. [PMID: 33315571 DOI: 10.1109/tcbb.2020.3044575] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 06/12/2023]
Abstract
Viruses are the most abundant biological entities on earth, and play vital roles in many aspects of microbial communities. As major human pathogens, viruses have caused huge mortality and morbidity to human society in history. Metagenomic sequencing methods could capture all microorganisms from microbiota, with sequences of viruses mixed with those of other species. Therefore, it is necessary to identify viral sequences from metagenomes. However, existing methods perform poorly on identifying short viral sequences. To solve this problem, a deep learning-based method, RNN-VirSeeker, is proposed in this paper. RNN-VirSeeker was trained by sequences of 500 bp sampled from known Virus and Host RefSeq genomes. Experimental results on the testing set have shown that RNN-VirSeeker exhibited AUROC of 0.9175, recall of 0.8640 and precision of 0.9211 for sequences of 500 bp, and outperformed three widely used methods, VirSorter, VirFinder, and DeepVirFinder, on identifying short viral sequences. RNN-VirSeeker was also used to identify viral sequences from a CAMI dataset and a human gut metagenome. Compared with DeepVirFinder, RNN-VirSeeker identified more viral sequences from these metagenomes and achieved greater values of AUPRC and AUROC. RNN-VirSeeker is freely available at https://github.com/crazyinter/RNN-VirSeeker.
10
11
Islam MT, Alam ARU, Sakib N, Hasan MS, Chakrovarty T, Tawyabur M, Islam OK, Al-Emran HM, Jahid MIK, Anwar Hossain M. A rapid and cost-effective multiplex ARMS-PCR method for the simultaneous genotyping of the circulating SARS-CoV-2 phylogenetic clades. J Med Virol 2021; 93:2962-2970. [PMID: 33491822 DOI: 10.1101/2020.10.08.20209692] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/26/2020] [Revised: 01/15/2021] [Accepted: 01/20/2021] [Indexed: 05/23/2023]
Abstract
Tracing the globally circulating severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) phylogenetic clades by high-throughput sequencing is costly, time-consuming, and labor-intensive. We here propose a rapid, simple, and cost-effective amplification refractory mutation system (ARMS)-based multiplex reverse-transcription polymerase chain reaction (PCR) assay to identify six distinct phylogenetic clades: S, L, V, G, GH, and GR. Our multiplex PCR is designed in a mutually exclusive way to identify V-S and G-GH-GR clade variants separately. The pentaplex assay included all five variants and the quadruplex comprised of the triplex variants alongside either V or S clade mutations that created two separate subsets. The procedure was optimized with 0.2-0.6 µM primer concentration, 56-60°C annealing temperature, and 3-5 ng/µl complementary DNA to validate on 24 COVID-19-positive samples. Targeted Sanger sequencing further confirmed the presence of the clade-featured mutations with another set of primers. This multiplex ARMS-PCR assay is a fast, low-cost alternative and convenient to discriminate the circulating phylogenetic clades of SARS-CoV-2.
Affiliation(s)
- Mohammad Tanvir Islam
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Asm Rubayet Ul Alam
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Najmuj Sakib
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Mohammad Shazid Hasan
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Tanay Chakrovarty
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Mohammad Tawyabur
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Ovinu Kibria Islam
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Hassan M Al-Emran
- Department of Biomedical Engineering, Jashore University of Science and Technology, Jashore, Bangladesh
- Mohammad Anwar Hossain
- Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Department of Microbiology, University of Dhaka, Dhaka, Bangladesh
12
Yinda CK, Seifert SN, Macmenamin P, van Doremalen N, Kim L, Bushmaker T, de Wit E, Quinones M, Munster VJ. A Novel Field-Deployable Method for Sequencing and Analyses of Henipavirus Genomes From Complex Samples on the MinION Platform. J Infect Dis 2021; 221:S383-S388. [PMID: 31784761 DOI: 10.1093/infdis/jiz576] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/15/2022] [Open Access]
Abstract
Viruses in the genus Henipavirus encompass 2 highly pathogenic emerging zoonotic pathogens, Hendra virus (HeV) and Nipah virus (NiV). Despite the impact on human health, there is currently limited full-genome sequence information available for henipaviruses. This lack of full-length genomes hampers our ability to understand the molecular drivers of henipavirus emergence. Furthermore, rapidly deployable viral genome sequencing can be an integral part of outbreak response and epidemiological investigations to study transmission chains. In this study, we describe the development of a reverse-transcription, long-range polymerase chain reaction (LRPCR) assay for efficient genome amplification of NiV, HeV, and a related non-pathogenic henipavirus, Cedar virus (CedPV). We then demonstrated the utility of our method by amplifying partial viral genomes from 6 HeV-infected tissue samples from Syrian hamsters and 4 tissue samples from a NiV-infected African green monkey with viral loads as low as 52 genome copies/mg. We subsequently sequenced the amplified genomes on the portable Oxford Nanopore MinION platform and analyzed the data using a newly developed field-deployable bioinformatic pipeline. Our LRPCR assay allows amplification and sequencing of 2 or 4 amplicons in semi-nested reactions. Coupled with an easy-to-use bioinformatics pipeline, this method is particularly useful in the field during outbreaks in resource-poor environments.
Affiliation(s)
- Claude Kwe Yinda
  - Laboratory of Virology, Rocky Mountain Laboratories, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, Montana, USA
- Stephanie N Seifert
  - Laboratory of Virology, Rocky Mountain Laboratories, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, Montana, USA
- Philip Macmenamin
  - Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Neeltje van Doremalen
  - Laboratory of Virology, Rocky Mountain Laboratories, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, Montana, USA
- Lewis Kim
  - Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Trenton Bushmaker
  - Laboratory of Virology, Rocky Mountain Laboratories, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, Montana, USA
- Emmie de Wit
  - Laboratory of Virology, Rocky Mountain Laboratories, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, Montana, USA
- Mariam Quinones
  - Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Vincent J Munster
  - Laboratory of Virology, Rocky Mountain Laboratories, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, Montana, USA
|
13
|
Islam MT, Alam ARU, Sakib N, Hasan MS, Chakrovarty T, Tawyabur M, Islam OK, Al-Emran HM, Jahid MIK, Anwar Hossain M. A rapid and cost-effective multiplex ARMS-PCR method for the simultaneous genotyping of the circulating SARS-CoV-2 phylogenetic clades. J Med Virol 2021; 93:2962-2970. [PMID: 33491822 PMCID: PMC8014803 DOI: 10.1002/jmv.26818] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 12/26/2020] [Revised: 01/15/2021] [Accepted: 01/20/2021] [Indexed: 01/09/2023]
Abstract
Tracing the globally circulating severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) phylogenetic clades by high‐throughput sequencing is costly, time‐consuming, and labor‐intensive. We here propose a rapid, simple, and cost‐effective amplification refractory mutation system (ARMS)‐based multiplex reverse‐transcription polymerase chain reaction (PCR) assay to identify six distinct phylogenetic clades: S, L, V, G, GH, and GR. Our multiplex PCR is designed in a mutually exclusive way to identify V–S and G–GH–GR clade variants separately. The pentaplex assay included all five variants, while the quadruplex comprised the triplex variants alongside either V or S clade mutations, creating two separate subsets. The procedure was optimized with 0.2–0.6 µM primer concentration, 56–60°C annealing temperature, and 3–5 ng/µl complementary DNA, and validated on 24 COVID‐19‐positive samples. Targeted Sanger sequencing with another set of primers further confirmed the presence of the clade‐featured mutations. This multiplex ARMS‐PCR assay is a fast, low‐cost, and convenient alternative for discriminating the circulating phylogenetic clades of SARS‐CoV‐2.
Multiplex ARMS‐PCR (amplification refractory mutation system‐polymerase chain reaction) method for genotyping major severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) clades. The assay identifies the mutated regions of circulating SARS‐CoV‐2 phylogenetic clades. PCR conditions were optimized and validated to identify the V–S and G–GH–GR clades.
Affiliation(s)
- Mohammad Tanvir Islam
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Asm Rubayet Ul Alam
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Najmuj Sakib
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Mohammad Shazid Hasan
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Tanay Chakrovarty
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Mohammad Tawyabur
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Ovinu Kibria Islam
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
- Hassan M Al-Emran
  - Department of Biomedical Engineering, Jashore University of Science and Technology, Jashore, Bangladesh
- Mohammad Anwar Hossain
  - Department of Microbiology, Jashore University of Science and Technology, Jashore, Bangladesh
  - Department of Microbiology, University of Dhaka, Dhaka, Bangladesh
|
14
|
|
15
|
Mboowa G, Sserwadda I, Aruhomukama D. Genomics and bioinformatics capacity in Africa: no continent is left behind. Genome 2021; 64:503-513. [PMID: 33433259 DOI: 10.1139/gen-2020-0013] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/14/2022]
Abstract
Despite the poor genomics research capacity in Africa, efforts have been made to empower African scientists to get involved in genomics research, particularly that involving African populations. As part of the Human Heredity and Health in Africa (H3Africa) Consortium, an initiative was set to make genomics research in Africa an African endeavor and was developed through funding from the United States' National Institutes of Health Common Fund and the Wellcome Trust. H3Africa is intended to encourage a contemporary research approach by African investigators and to stimulate the study of genomic and environmental determinants of common diseases. The goal of these endeavors is to improve the health of African populations. To build capacity for bioinformatics and genomics research, organizations such as the African Society for Bioinformatics and Computational Biology have been established. In this article, we discuss the current status of the bioinformatics infrastructure in Africa as well as the training challenges and opportunities.
Affiliation(s)
- Gerald Mboowa
  - Department of Immunology and Molecular Biology, College of Health Sciences, Makerere University, Uganda, P.O. Box 7072, Kampala, Uganda
  - Department of Medical Microbiology, School of Biomedical Sciences, College of Health Sciences, Makerere University, P.O. Box 7072, Kampala, Uganda
  - The African Center of Excellence in Bioinformatics and Data-Intensive Sciences, Infectious Disease Institute, Makerere University, P.O. Box 22418, Kampala, Uganda
- Ivan Sserwadda
  - Department of Immunology and Molecular Biology, College of Health Sciences, Makerere University, Uganda, P.O. Box 7072, Kampala, Uganda
- Dickson Aruhomukama
  - Department of Medical Microbiology, School of Biomedical Sciences, College of Health Sciences, Makerere University, P.O. Box 7072, Kampala, Uganda
|
16
|
Computational Genomics. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] Open
|
17
|
Persi E, Wolf YI, Horn D, Ruppin E, Demichelis F, Gatenby RA, Gillies RJ, Koonin EV. Mutation-selection balance and compensatory mechanisms in tumour evolution. Nat Rev Genet 2020; 22:251-262. [PMID: 33257848 DOI: 10.1038/s41576-020-00299-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Accepted: 10/16/2020] [Indexed: 12/11/2022]
Abstract
Intratumour heterogeneity and phenotypic plasticity, sustained by a range of somatic aberrations, as well as epigenetic and metabolic adaptations, are the principal mechanisms that enable cancers to resist treatment and survive under environmental stress. A comprehensive picture of the interplay between different somatic aberrations, from point mutations to whole-genome duplications, in tumour initiation and progression is lacking. We posit that different genomic aberrations generally exhibit a temporal order, shaped by a balance between the levels of mutations and selective pressures. Repeat instability emerges first, followed by larger aberrations, with compensatory effects leading to robust tumour fitness maintained throughout the tumour progression. A better understanding of the interplay between genetic aberrations, the microenvironment, and epigenetic and metabolic cellular states is essential for early detection and prevention of cancer as well as development of efficient therapeutic strategies.
Affiliation(s)
- Erez Persi
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Yuri I Wolf
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- David Horn
  - School of Physics and Astronomy, Raymond & Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, Tel-Aviv, Israel
- Eytan Ruppin
  - Cancer Data Science Lab, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Francesca Demichelis
  - Department for Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
  - Caryl and Israel Englander Institute for Precision Medicine, New York Presbyterian Hospital, Weill Cornell Medicine, New York, NY, USA
- Robert A Gatenby
  - Integrated Mathematical Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Robert J Gillies
  - Department of Cancer Physiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Eugene V Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
|
18
|
Naranpanawa DNU, Chandrasekara CHWMRB, Bandaranayake PCG, Bandaranayake AU. Raw transcriptomics data to gene specific SSRs: a validated free bioinformatics workflow for biologists. Sci Rep 2020; 10:18236. [PMID: 33106560 PMCID: PMC7588437 DOI: 10.1038/s41598-020-75270-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/08/2019] [Accepted: 09/21/2020] [Indexed: 02/07/2023] Open
Abstract
Recent advances in next-generation sequencing technologies have paved the path for a considerable amount of sequencing data at a relatively low cost. This has revolutionized the genomics and transcriptomics studies. However, different challenges are now created in handling such data with available bioinformatics platforms both in assembly and downstream analysis performed in order to infer correct biological meaning. Though there are a handful of commercial software and tools for some of the procedures, cost of such tools has made them prohibitive for most research laboratories. While individual open-source or free software tools are available for most of the bioinformatics applications, those components usually operate standalone and are not combined for a user-friendly workflow. Therefore, beginners in bioinformatics might find analysis procedures starting from raw sequence data too complicated and time-consuming with the associated learning-curve. Here, we outline a procedure for de novo transcriptome assembly and Simple Sequence Repeats (SSR) primer design solely based on tools that are available online for free use. For validation of the developed workflow, we used Illumina HiSeq reads of different tissue samples of Santalum album (sandalwood), generated from a previous transcriptomics project. A portion of the designed primers were tested in the lab with relevant samples and all of them successfully amplified the targeted regions. The presented bioinformatics workflow can accurately assemble quality transcriptomes and develop gene specific SSRs. Beginner biologists and researchers in bioinformatics can easily utilize this workflow for research purposes.
Affiliation(s)
- D N U Naranpanawa
  - Agricultural Biotechnology Centre, Faculty of Agriculture, University of Peradeniya, Peradeniya, 20400, Sri Lanka
  - Postgraduate Institute of Science, University of Peradeniya, Peradeniya, 20400, Sri Lanka
- C H W M R B Chandrasekara
  - Agricultural Biotechnology Centre, Faculty of Agriculture, University of Peradeniya, Peradeniya, 20400, Sri Lanka
- P C G Bandaranayake
  - Agricultural Biotechnology Centre, Faculty of Agriculture, University of Peradeniya, Peradeniya, 20400, Sri Lanka
- A U Bandaranayake
  - Department of Computer Engineering, Faculty of Engineering, University of Peradeniya, Peradeniya, 20400, Sri Lanka
|
19
|
Gupta G, Saini S. DAVI: Deep learning-based tool for alignment and single nucleotide variant identification. Machine Learning: Science and Technology 2020. [DOI: 10.1088/2632-2153/ab7e19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/30/2022] Open
Abstract
Next-generation sequencing (NGS) technologies have provided affordable but errorful ways to generate raw genetic data. To extract variant information from billions of NGS reads is still a daunting task which involves various hand-crafted and parameterized statistical tools. Here we propose a deep neural networks (DNN) based alignment and single nucleotide variant (SNV) identifier tool known as DAVI: deep alignment and variant identification. DAVI consists of models for both global and local alignment and for variant calling. We have evaluated the performance of DAVI against existing state-of-the-art tool sets and found that its accuracy and performance is comparable to existing tools used for bench-marking. We further demonstrate that while existing tools are based on data generated from a specific sequencing technology, the models proposed in DAVI are generic and can be used across different NGS technologies as well as across different species. The use of DAVI will therefore help non-human sequencing projects to benefit from the wealth of human ground truth data. Moreover, this approach is a migration from expert-driven statistical models to generic, automated, self-learning models.
|
20
|
Potgieter L, Feurtey A, Dutheil JY, Stukenbrock EH. On Variant Discovery in Genomes of Fungal Plant Pathogens. Front Microbiol 2020; 11:626. [PMID: 32373089 PMCID: PMC7176817 DOI: 10.3389/fmicb.2020.00626] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/01/2019] [Accepted: 03/19/2020] [Indexed: 11/13/2022] Open
Abstract
Comparative genome analyses of eukaryotic pathogens including fungi and oomycetes have revealed extensive variability in genome composition and structure. The genomes of individuals from the same population can exhibit different numbers of chromosomes and different organization of chromosomal segments, defining so-called accessory compartments that have been shown to be crucial to pathogenicity in plant-infecting fungi. This high level of structural variation confers a methodological challenge for population genomic analyses. Variant discovery from population sequencing data is typically achieved using established pipelines based on the mapping of short reads to a reference genome. These pipelines have been developed, and extensively used, for eukaryote genomes of both plants and animals, to retrieve single nucleotide polymorphisms and short insertions and deletions. However, they do not permit the inference of large-scale genomic structural variation, as this task typically requires the alignment of complete genome sequences. Here, we compare traditional variant discovery approaches to a pipeline based on de novo genome assembly of short read data followed by whole genome alignment, using simulated data sets with properties mimicking that of fungal pathogen genomes. We show that the latter approach exhibits levels of performance comparable to that of read-mapping based methodologies, when used on sequence data with sufficient coverage. We argue that this approach further allows additional types of genomic diversity to be explored, in particular as long-read third-generation sequencing technologies are becoming increasingly available to generate population genomic data.
Affiliation(s)
- Lizel Potgieter
  - Environmental Genomics, Max Planck Institute for Evolutionary Biology, Plön, Germany
  - Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Alice Feurtey
  - Environmental Genomics, Max Planck Institute for Evolutionary Biology, Plön, Germany
- Julien Y. Dutheil
  - Molecular Systems Evolution, Max Planck Institute for Evolutionary Biology, Plön, Germany
- Eva H. Stukenbrock
  - Environmental Genomics, Max Planck Institute for Evolutionary Biology, Plön, Germany
  - Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
|
21
|
Medvedev P. Modeling biological problems in computer science: a case study in genome assembly. Brief Bioinform 2020; 20:1376-1383. [PMID: 29394324 DOI: 10.1093/bib/bby003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/01/2017] [Revised: 12/07/2017] [Indexed: 11/14/2022] Open
Abstract
As computer scientists working in bioinformatics/computational biology, we often face the challenge of coming up with an algorithm to answer a biological question. This occurs in many areas, such as variant calling, alignment and assembly. In this tutorial, we use the example of the genome assembly problem to demonstrate how to go from a question in the biological realm to a solution in the computer science realm. We show the modeling process step-by-step, including all the intermediate failed attempts. Please note this is not an introduction to how genome assembly algorithms work and, if treated as such, would be incomplete and unnecessarily long-winded.
|
22
|
Ghansah A, Kamau E, Amambua-Ngwa A, Ishengoma DS, Maiga-Ascofare O, Amenga-Etego L, Deme A, Yavo W, Randrianarivelojosia M, Ochola-Oyier LI, Helegbe GK, Bailey J, Alifrangis M, Djimde A. Targeted Next Generation Sequencing for malaria research in Africa: current status and outlook. Malar J 2019; 18:324. [PMID: 31547818 PMCID: PMC6757370 DOI: 10.1186/s12936-019-2944-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/23/2019] [Accepted: 08/29/2019] [Indexed: 11/10/2022] Open
Abstract
Targeted Next Generation Sequencing (TNGS) is an efficient and economical Next Generation Sequencing (NGS) platform and the preferred choice when specific genomic regions are of interest. So far, only institutions located in middle and high-income countries have developed and implemented the technology, however, the efficiency and cost savings, as opposed to more traditional sequencing methodologies (e.g. Sanger sequencing) make the approach potentially well suited for resource-constrained regions as well. In April 2018, scientists from the Plasmodium Diversity Network Africa (PDNA) and collaborators met during the 7th Pan African Multilateral Initiative of Malaria (MIM) conference held in Dakar, Senegal to explore the feasibility of applying TNGS to genetic studies and malaria surveillance in Africa. The group of scientists reviewed the current experience with TNGS platforms in sub-Saharan Africa (SSA) and identified potential roles the technology might play to accelerate malaria research, scientific discoveries and improved public health in SSA. Research funding, infrastructure and human resources were highlighted as challenges that will have to be mitigated to enable African scientists to drive the implementation of TNGS in SSA. Current roles of important stakeholders and strategies to strengthen existing networks to effectively harness this powerful technology for malaria research of public health importance were discussed.
Affiliation(s)
- Anita Ghansah
  - Noguchi Memorial Institute for Medical Research, College of Health Sciences, University of Ghana, P. O. Box LG 581, Legon, Ghana
- Edwin Kamau
  - Department of Emerging and Infectious Diseases (DEID), United States Army Medical Research Directorate-Africa (USAMRD-A), Kenya Medical Research Institute (KEMRI), Kisumu, Kenya
  - U.S. Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD, USA
- Alfred Amambua-Ngwa
  - Parasite Molecular Biology, Disease Control and Elimination, Medical Research Council Unit The Gambia at LSHTM, Atlantic Road Fajara, Banjul, The Gambia
- Lucas Amenga-Etego
  - West Africa Centre for Cell Biology of Infectious Pathogens, College of Basic and Applied Sciences, University of Ghana, Legon, Ghana
- Awa Deme
  - Department of Parasitology, Faculty of Medicine and Pharmacy, Cheikh Anta Diop University, Dakar, Senegal
- William Yavo
  - Faculty of Pharmacy, Department of Parasitology and Mycology, Félix Houphouët-Boigny University, BPV 34, Abidjan, Côte d'Ivoire
  - Malaria Research and Control Centre, National Institute of Public Health, BPV 47, Abidjan, Côte d'Ivoire
- Gideon Kofi Helegbe
  - Department of Biochemistry and Molecular Medicine, School of Medicine and Health Sciences, University for Development Studies, P. O. Box TL1883, Tamale, Northern Region, Ghana
- Jeffery Bailey
  - Warren Alpert Medical School, Brown University, 55 Claverick St, Rm 314B, Providence, RI, 02903, USA
- Michael Alifrangis
  - Centre for Medical Parasitology, Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Abdoulaye Djimde
  - Malaria Research and Training Centre, University of Science, Techniques and Technologies of Bamako, Bamako, Mali
  - Wellcome Trust Sanger Institute, Hinxton, UK
|
23
|
Proteomic and genomic signatures of repeat instability in cancer and adjacent normal tissues. Proc Natl Acad Sci U S A 2019; 116:16987-16996. [PMID: 31387980 DOI: 10.1073/pnas.1908790116] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/14/2022] Open
Abstract
Repetitive sequences are hotspots of evolution at multiple levels. However, due to difficulties involved in their assembly and analysis, the role of repeats in tumor evolution is poorly understood. We developed a rigorous motif-based methodology to quantify variations in the repeat content, beyond microsatellites, in proteomes and genomes directly from proteomic and genomic raw data. This method was applied to a wide range of tumors and normal tissues. We identify high similarity between repeat instability patterns in tumors and their patient-matched adjacent normal tissues. Nonetheless, tumor-specific signatures both in protein expression and in the genome strongly correlate with cancer progression and robustly predict the tumorigenic state. In a patient, the hierarchy of genomic repeat instability signatures accurately reconstructs tumor evolution, with primary tumors differentiated from metastases. We observe an inverse relationship between repeat instability and point mutation load within and across patients independent of other somatic aberrations. Thus, repeat instability is a distinct, transient, and compensatory adaptive mechanism in tumor evolution and a potential signal for early detection.
|
24
|
Pereira De Martinis EC, Almeida OGGD. Relating next-generation sequencing and bioinformatics concepts to routine microbiological testing. Electronic Journal of General Medicine 2019. [DOI: 10.29333/ejgm/108690] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/23/2023]
|
25
|
Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018; 19:23-40. [PMID: 27742661 DOI: 10.1093/bib/bbw096] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Received: 06/03/2016] [Indexed: 12/15/2022] Open
Abstract
Since the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical and computational challenges in de novo assembly remain, although many bright ideas and heuristics have been suggested to tackle them in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graphs (Hamiltonian and Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of the short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, long-read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
|
26
|
Hehir-Kwa JY, Tops BBJ, Kemmeren P. The clinical implementation of copy number detection in the age of next-generation sequencing. Expert Rev Mol Diagn 2018; 18:907-915. [PMID: 30221560 DOI: 10.1080/14737159.2018.1523723] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/18/2022]
Abstract
INTRODUCTION The role of copy number variants (CNVs) in disease is now well established. In parallel, NGS technologies, such as long-read technologies, are under continual development, and data analysis methods continue to be refined. Clinical exome sequencing data is now a reality for many diagnostic laboratories in both congenital genetics and oncology. This provides the ability to detect and report both SNVs and structural variants, including CNVs, using a single assay for a wide range of patient cohorts. Areas covered: Currently, whole-genome sequencing is mainly restricted to research applications and clinical utility studies. Furthermore, detecting the full-size spectrum of CNVs as well as somatic events remains difficult for both exome and whole-genome sequencing. As a result, the full extent of genomic variants in an individual's genome is still largely unknown. Recently, new sequencing technologies have been introduced which maintain the long-range genomic context, aiding the detection of CNVs and structural variants. Expert commentary: The development of long-read sequencing promises to resolve many CNV and SV detection issues but is yet to become established. The current challenge for clinical CNV detection is how to fully exploit all the data generated by high-throughput sequencing technologies.
Affiliation(s)
- Jayne Y Hehir-Kwa
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
- Bastiaan B J Tops
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
- Patrick Kemmeren
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
|
27
|
Gärtner F, Höner zu Siederdissen C, Müller L, Stadler PF. Coordinate systems for supergenomes. Algorithms Mol Biol 2018; 13:15. [PMID: 30258487 PMCID: PMC6151955 DOI: 10.1186/s13015-018-0133-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 06/14/2017] [Accepted: 09/07/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Genome sequences and genome annotation data have become available at ever increasing rates in response to the rapid progress in sequencing technologies. As a consequence, the demand for methods supporting comparative, evolutionary analysis is also growing. In particular, efficient tools to visualize omics data simultaneously for multiple species are sorely lacking. A first and crucial step in this direction is the construction of a common coordinate system. Since genomes differ not only by rearrangements but also by large insertions, deletions, and duplications, the use of a single reference genome is insufficient, in particular when the number of species becomes large. RESULTS The computational problem then becomes to determine an order and orientations of optimal local alignments that are as co-linear as possible with all the genome sequences. We first review the most prominent approaches to model the problem formally and then show that it can be phrased as a particular variant of the Betweenness Problem. It is NP-hard in general. As exact solutions are beyond reach for the problem sizes of practical interest, we introduce a collection of heuristic simplifiers to resolve ordering conflicts. CONCLUSION Benchmarks on real-life data ranging from bacterial to fly genomes demonstrate the feasibility of computing good common coordinate systems.
Collapse
Affiliation(s)
- Fabian Gärtner
- Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
| | - Christian Höner zu Siederdissen
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
| | - Lydia Müller
- Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
- Automatic Language Processing Group, Department of Computer Science, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany
| | - Peter F. Stadler
- Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, 04107 Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090 Vienna, Austria
- Center for non-coding RNA in Technology and Health, Grønegårdsvej 3, 1870 Frederiksberg C, Denmark
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
|
28
|
SCOP: a novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018; 35:1142-1150. [DOI: 10.1093/bioinformatics/bty773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/20/2017] [Revised: 08/10/2018] [Accepted: 09/01/2018] [Indexed: 12/20/2022] Open
|
29
|
Li M, Tang L, Liao Z, Luo J, Wu F, Pan Y, Wang J. A novel scaffolding algorithm based on contig error correction and path extension. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:764-773. [PMID: 30040649 DOI: 10.1109/tcbb.2018.2858267] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 06/08/2023]
Abstract
The sequence assembly process can be divided into three stages: contig extension, scaffolding, and gap filling. Scaffolding is an essential step in this process, inferring the direction and sequence relationships between the contigs. However, scaffolding still faces the challenges of uneven sequencing depth, repetitive genome regions, and sequencing errors, which often lead to many false relationships between contigs. The performance of scaffolding can be improved by removing potential false conjunctions between contigs. In this study, we propose a novel scaffolding algorithm, called iLSLS, which is based on the Loose-Strict-Loose path extension strategy and contig error correction. iLSLS helps reduce the false relationships between contigs and improves the accuracy of subsequent steps. iLSLS uses a scoring function, which estimates the correctness of candidate paths from the distribution of paired reads, and tries to extend along the path with the highest score. In addition, iLSLS can precisely estimate the gap size. We conduct experiments on two real datasets, and the results show that the LSLS strategy is effective in increasing the correctness of scaffolds and that iLSLS performs better than other scaffolding methods.
|
30
|
Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2018; 16:361-378. [PMID: 28453648 DOI: 10.1093/bfgp/elx006] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/20/2022] Open
Abstract
The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent short-read assembly in an efficient way. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly.
|
31
|
Vale FF, Lehours P. Relating Phage Genomes to Helicobacter pylori Population Structure: General Steps Using Whole-Genome Sequencing Data. Int J Mol Sci 2018; 19:ijms19071831. [PMID: 29933614 PMCID: PMC6073503 DOI: 10.3390/ijms19071831] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 04/16/2018] [Revised: 05/30/2018] [Accepted: 06/15/2018] [Indexed: 12/19/2022] Open
Abstract
This review uses Helicobacter pylori, the gastric bacterium that colonizes the human stomach, to address how to obtain information about prophage biology from bacterial genomes. At a time when the number of available genomes is growing continuously, this review provides tools to explore genomes for the presence of prophages or other mobile genetic elements and virulence factors. The review starts by covering the genetic diversity of H. pylori and then moves to the biological basis and the bioinformatics approaches used for studying H. pylori phage biology from their genomes, and how this is related to the bacterial population structure. Aspects concerning H. pylori prophage biology, evolution and phylogeography are discussed.
Affiliation(s)
- Filipa F Vale
- Host-Pathogen Interactions Unit, Research Institute for Medicines (iMed-ULisboa), Faculdade de Farmácia, Universidade de Lisboa, 1649-003 Lisboa, Portugal.
| | - Philippe Lehours
- Laboratoire de Bacteriologie, Centre National de Référence des Campylobacters et Hélicobacters, Place Amélie Raba Léon, 33076 Bordeaux, France.
- INSERM U1053-UMR Bordeaux Research in Translational Oncology, BaRITOn, 33000 Bordeaux, France.
|
32
|
Dilliott AA, Farhan SMK, Ghani M, Sato C, Liang E, Zhang M, McIntyre AD, Cao H, Racacho L, Robinson JF, Strong MJ, Masellis M, Bulman DE, Rogaeva E, Lang A, Tartaglia C, Finger E, Zinman L, Turnbull J, Freedman M, Swartz R, Black SE, Hegele RA. Targeted Next-generation Sequencing and Bioinformatics Pipeline to Evaluate Genetic Determinants of Constitutional Disease. J Vis Exp 2018. [PMID: 29683450 PMCID: PMC5933375 DOI: 10.3791/57266] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 12/17/2022] Open
Abstract
Next-generation sequencing (NGS) is quickly revolutionizing how research into the genetic determinants of constitutional disease is performed. The technique is highly efficient with millions of sequencing reads being produced in a short time span and at relatively low cost. Specifically, targeted NGS is able to focus investigations to genomic regions of particular interest based on the disease of study. Not only does this further reduce costs and increase the speed of the process, but it lessens the computational burden that often accompanies NGS. Although targeted NGS is restricted to certain regions of the genome, preventing identification of potential novel loci of interest, it can be an excellent technique when faced with a phenotypically and genetically heterogeneous disease, for which there are previously known genetic associations. Because of the complex nature of the sequencing technique, it is important to closely adhere to protocols and methodologies in order to achieve sequencing reads of high coverage and quality. Further, once sequencing reads are obtained, a sophisticated bioinformatics workflow is utilized to accurately map reads to a reference genome, to call variants, and to ensure the variants pass quality metrics. Variants must also be annotated and curated based on their clinical significance, which can be standardized by applying the American College of Medical Genetics and Genomics Pathogenicity Guidelines. The methods presented herein will display the steps involved in generating and analyzing NGS data from a targeted sequencing panel, using the ONDRISeq neurodegenerative disease panel as a model, to identify variants that may be of clinical significance.
Affiliation(s)
- Allison A Dilliott
- Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University; Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University
| | - Sali M K Farhan
- Analytic and Translational Genetics Unit, Center for Genomic Medicine, Harvard Medical School, Massachusetts General Hospital, Stanley Centre for Psychiatric Research, Broad Institute of MIT and Harvard
| | - Mahdi Ghani
- Tanz Centre for Research in Neurodegenerative Diseases, University of Toronto
| | - Christine Sato
- Tanz Centre for Research in Neurodegenerative Diseases, University of Toronto
| | - Eric Liang
- School of Medicine, Faculty of Health Sciences, Queen's University
| | - Ming Zhang
- Tanz Centre for Research in Neurodegenerative Diseases, University of Toronto
| | - Adam D McIntyre
- Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University
| | - Henian Cao
- Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University
| | - Lemuel Racacho
- Faculty of Medicine, Department of Biochemistry, Microbiology and Immunology, University of Ottawa; CHEO Research Institute, Faculty of Medicine, University of Ottawa
| | - John F Robinson
- Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University
| | - Michael J Strong
- Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University; Department of Clinical Neurological Sciences, Western University
| | - Mario Masellis
- Division of Neurology, Department of Medicine, Sunnybrook Health Sciences Centre, University of Toronto; Division of Neurology, Department of Medicine, University of Toronto
| | - Dennis E Bulman
- Faculty of Medicine, Department of Biochemistry, Microbiology and Immunology, University of Ottawa; CHEO Research Institute, Faculty of Medicine, University of Ottawa
| | - Ekaterina Rogaeva
- Tanz Centre for Research in Neurodegenerative Diseases, University of Toronto
| | - Anthony Lang
- Division of Neurology, Department of Medicine, University of Toronto; Morton and Gloria Shulman Movement Disorders Centre, Toronto Western Hospital
| | - Carmela Tartaglia
- Tanz Centre for Research in Neurodegenerative Diseases, University of Toronto; Division of Neurology, Department of Medicine, University of Toronto
| | - Elizabeth Finger
- Department of Clinical Neurological Sciences, Schulich School of Medicine and Dentistry, Western University; Parkwood Institute, St. Joseph's Health Care
| | - Lorne Zinman
- Division of Neurology, Department of Medicine, Sunnybrook Health Sciences Centre, University of Toronto
| | - John Turnbull
- Department of Medicine, Division of Neurology, McMaster University
| | - Morris Freedman
- Division of Neurology, Department of Medicine, University of Toronto; Division of Neurology, Department of Medicine, Baycrest Health Sciences
| | - Rick Swartz
- Division of Neurology, Department of Medicine, Sunnybrook Health Sciences Centre, University of Toronto
| | - Sandra E Black
- Division of Neurology, Department of Medicine, Sunnybrook Health Sciences Centre, University of Toronto; Canadian Partnership for Stroke Recovery Sunnybrook Site, Sunnybrook Health Science Centre, University of Toronto
| | - Robert A Hegele
- Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University; Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University;
|
33
|
Abstract
The output from whole genome sequencing is a set of contigs, i.e. short non-overlapping DNA sequences (sizes 1-100 kilobasepairs). Piecing the contigs together is an especially difficult task for previously unsequenced DNA, and may not be feasible due to factors such as the lack of sufficient coverage or larger repetitive regions which generate gaps in the final sequence. Here we propose a new method for scaffolding such contigs. The proposed method uses densely labeled optical DNA barcodes from competitive binding experiments as scaffolds. On these scaffolds we position theoretical barcodes which are calculated from the contig sequences. This allows us to construct longer DNA sequences from the contig sequences. This proof-of-principle study extends previous studies which use sparsely labeled DNA barcodes for scaffolding purposes. Our method applies a probabilistic approach that allows us to discard “foreign” contigs from mixed samples with contigs from different types of DNA. We satisfy the contig non-overlap constraint by formulating the contig placement challenge as a combinatorial auction problem. Our exact algorithm for solving this problem reduces computational costs compared to previous methods in the combinatorial auction field. We demonstrate the usefulness of the proposed scaffolding method both for synthetic contigs and for contigs obtained using Illumina sequencing for a mixed sample with plasmid and chromosomal DNA.
|
34
|
Larsen PA, Harris RA, Liu Y, Murali SC, Campbell CR, Brown AD, Sullivan BA, Shelton J, Brown SJ, Raveendran M, Dudchenko O, Machol I, Durand NC, Shamim MS, Aiden EL, Muzny DM, Gibbs RA, Yoder AD, Rogers J, Worley KC. Hybrid de novo genome assembly and centromere characterization of the gray mouse lemur (Microcebus murinus). BMC Biol 2017; 15:110. [PMID: 29145861 PMCID: PMC5689209 DOI: 10.1186/s12915-017-0439-6] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 06/20/2017] [Accepted: 10/10/2017] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND The de novo assembly of repeat-rich mammalian genomes using only high-throughput short read sequencing data typically results in highly fragmented genome assemblies that limit downstream applications. Here, we present an iterative approach to hybrid de novo genome assembly that incorporates datasets stemming from multiple genomic technologies and methods. We used this approach to improve the gray mouse lemur (Microcebus murinus) genome from early draft status to a near chromosome-scale assembly. METHODS We used a combination of advanced genomic technologies to iteratively resolve conflicts and super-scaffold the M. murinus genome. RESULTS We improved the M. murinus genome assembly to a scaffold N50 of 93.32 Mb. Whole genome alignments between our primary super-scaffolds and 23 human chromosomes revealed patterns that are congruent with historical comparative cytogenetic data, thus demonstrating the accuracy of our de novo scaffolding approach and allowing assignment of scaffolds to M. murinus chromosomes. Moreover, we utilized our independent datasets to discover and characterize sequences associated with centromeres across the mouse lemur genome. Quality assessment of the final assembly found 96% of mouse lemur canonical transcripts nearly complete, comparable to other published high-quality reference genome assemblies. CONCLUSIONS We describe a new assembly of the gray mouse lemur (Microcebus murinus) genome with chromosome-scale scaffolds produced using a hybrid bioinformatic and sequencing approach. The approach is cost effective and produces superior results based on metrics of contiguity and completeness. Our results show that emerging genomic technologies can be used in combination to characterize centromeres of non-model species and to produce accurate de novo chromosome-scale genome assemblies of complex mammalian genomes.
Affiliation(s)
- Peter A. Larsen
- Department of Biology, Duke University, Durham, NC 27708 USA
| | - R. Alan Harris
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
| | - Yue Liu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
| | - Shwetha C. Murali
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
- Present address: Department of Genome Sciences, University of Washington, Seattle, WA 98195 USA
| | | | - Adam D. Brown
- Department of Pharmacology and Cancer Biology, Duke University, Durham, NC 27710 USA
- Present address: Bristol Myers-Squibb, 420 W Round Grove Rd, Lewisville, TX 75067 USA
| | - Beth A. Sullivan
- Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27710 USA
| | - Jennifer Shelton
- Kansas State University Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS 66506 USA
- Present address: New York Genome Center, 101 Avenue of the Americas, New York, NY 10013 USA
| | - Susan J. Brown
- Kansas State University Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS 66506 USA
| | | | - Olga Dudchenko
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- The Center for Theoretical Biological Physics, Rice University, Houston, TX 77005 USA
- Department of Computer Science, Rice University, Houston, TX 77005 USA
| | - Ido Machol
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- The Center for Theoretical Biological Physics, Rice University, Houston, TX 77005 USA
- Department of Computer Science, Rice University, Houston, TX 77005 USA
| | - Neva C. Durand
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- The Center for Theoretical Biological Physics, Rice University, Houston, TX 77005 USA
- Department of Computer Science, Rice University, Houston, TX 77005 USA
| | - Muhammad S. Shamim
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- The Center for Theoretical Biological Physics, Rice University, Houston, TX 77005 USA
- Department of Computer Science, Rice University, Houston, TX 77005 USA
| | - Erez Lieberman Aiden
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- The Center for Theoretical Biological Physics, Rice University, Houston, TX 77005 USA
- Department of Computer Science, Rice University, Houston, TX 77005 USA
| | - Donna M. Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
| | - Richard A. Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
| | - Anne D. Yoder
- Department of Biology, Duke University, Durham, NC 27708 USA
| | - Jeffrey Rogers
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
| | - Kim C. Worley
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
|
35
|
Two Efficient Techniques to Find Approximate Overlaps between Sequences. BIOMED RESEARCH INTERNATIONAL 2017; 2017:2731385. [PMID: 28293632 PMCID: PMC5331309 DOI: 10.1155/2017/2731385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2016] [Revised: 12/23/2016] [Accepted: 01/17/2017] [Indexed: 11/17/2022]
Abstract
Next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving that solves an approximate version of the overlap problem. The main benefit of this direction is the ability to skip the time-consuming error-detecting techniques that are applied in the prefiltering stage. In this work, we present and compare two techniques to solve the approximate overlap problem. The first adapts a compact prefix tree to efficiently solve the approximate all-pairs suffix-prefix problem, while the other uses a well-known principle, namely the pigeonhole principle, to identify a potential overlap match in order to ultimately solve the same problem. Our results show that our solution using the pigeonhole principle has better space and time consumption than an FM-based solution, while our solution based on the prefix tree has the best space consumption among all three solutions. The number of mismatches (Hamming distance) is used to define the approximate matching between strings in our work.
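The pigeonhole idea described in the abstract above can be sketched for a single pair of reads as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple scan over candidate overlap lengths rather than any index structure, and the names `hamming` and `approx_overlap` are hypothetical. The filter relies on the fact that if two strings differ in at most `k` positions and are cut into `k + 1` blocks, at least one block must match exactly.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approx_overlap(a: str, b: str, k: int, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` matching a prefix of `b`
    with at most `k` mismatches; 0 if no overlap of length >= min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        # Pigeonhole filter: cut both strings into k + 1 aligned blocks.
        # With <= k mismatches, at least one block is mismatch-free
        # (empty blocks count as trivially matching).
        bounds = [length * i // (k + 1) for i in range(k + 2)]
        blocks = zip(bounds, bounds[1:])
        if not any(suffix[s:e] == prefix[s:e] for s, e in blocks):
            continue  # cannot be within k mismatches; skip full check
        # Candidate survived the filter: verify with the Hamming distance.
        if hamming(suffix, prefix) <= k:
            return length
    return 0


if __name__ == "__main__":
    print(approx_overlap("GATTACA", "TACAGGG", k=0))
    print(approx_overlap("GATTACA", "TAGAGGG", k=1))
```

In this sketch the filter only avoids some full comparisons; a practical all-pairs solution would look up the exact blocks in an index instead of scanning every pair and every length.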
|
36
|
Gasc C, Peyret P. Revealing large metagenomic regions through long DNA fragment hybridization capture. MICROBIOME 2017; 5:33. [PMID: 28292322 PMCID: PMC5351058 DOI: 10.1186/s40168-017-0251-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/20/2016] [Accepted: 03/05/2017] [Indexed: 05/07/2023]
Abstract
BACKGROUND High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes from single organisms or metagenomic samples. However, due to the limited capacity of short-read sequence data to assemble complex or low coverage regions, genomes are typically fragmented, leading to draft genomes with numerous underexplored large genomic regions. Revealing these missing sequences is a major goal to resolve concerns in numerous biological studies. METHODS To overcome these limitations, we developed an innovative target enrichment method for the reconstruction of large unknown genomic regions. Based on a hybridization capture strategy, this approach enables the enrichment of large genomic regions, allowing the reconstruction of tens of kilobase pairs flanking a short, targeted DNA sequence. RESULTS Applied to a metagenomic soil sample targeting the linA gene, the biomarker of hexachlorocyclohexane (HCH) degradation, our method permitted the enrichment of the gene and its flanking regions, leading to the reconstruction of several contigs and complete plasmids exceeding tens of kilobase pairs surrounding linA. Thus, through gene association and genome reconstruction, we identified microbial species involved in HCH degradation which constitute targets to improve biostimulation treatments. CONCLUSIONS This new hybridization capture strategy makes surveying and deconvoluting complex genomic regions possible through the enrichment of large genomic regions and allows the efficient exploration of metagenomic diversity. Indeed, this approach makes it possible to assign identity and function to microorganisms in natural environments, one of the ultimate goals of microbial ecology.
Affiliation(s)
- Cyrielle Gasc
- Université Clermont Auvergne, INRA, MEDIS, 63000 Clermont-Ferrand, France
| | - Pierre Peyret
- Université Clermont Auvergne, INRA, MEDIS, 63000 Clermont-Ferrand, France
|
37
|
Pirih N, Kunej T. Toward a Taxonomy for Multi-Omics Science? Terminology Development for Whole Genome Study Approaches by Omics Technology and Hierarchy. OMICS: A Journal of Integrative Biology 2017; 21:1-16. [DOI: 10.1089/omi.2016.0144] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Indexed: 01/30/2023]
Affiliation(s)
- Nina Pirih
- Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Domzale, Slovenia
| | - Tanja Kunej
- Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Domzale, Slovenia
|
38
|
Chan CX, Beiko RG, Ragan MA. Scaling Up the Phylogenetic Detection of Lateral Gene Transfer Events. Methods Mol Biol 2017; 1525:421-432. [PMID: 27896730 DOI: 10.1007/978-1-4939-6622-6_16] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/25/2023]
Abstract
Lateral genetic transfer (LGT) is the process by which genetic material moves between organisms (and viruses) in the biosphere. Among the many approaches developed for the inference of LGT events from DNA sequence data, methods based on the comparison of phylogenetic trees remain the gold standard for many types of problem. Identifying LGT events from sequenced genomes typically involves a series of steps in which homologous sequences are identified and aligned, phylogenetic trees are inferred, and their topologies are compared to identify unexpected or conflicting relationships. These types of approach have been used to elucidate the nature and extent of LGT and its physiological and ecological consequences throughout the Tree of Life. Advances in DNA sequencing technology have led to enormous increases in the number of sequenced genomes, including ultra-deep sampling of specific taxonomic groups and single cell-based sequencing of unculturable "microbial dark matter." Environmental shotgun sequencing enables the study of LGT among organisms that share the same habitat. This abundance of genomic data offers new opportunities for scientific discovery, but poses two key problems. As ever more genomes are generated, the assembly and annotation of each individual genome receives less scrutiny; and with so many genomes available it is tempting to include them all in a single analysis, but thousands of genomes and millions of genes can overwhelm key algorithms in the analysis pipeline. Identifying LGT events of interest therefore depends on choosing the right dataset, and on algorithms that appropriately balance speed and accuracy given the size and composition of the chosen set of genomes.
Affiliation(s)
- Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS, B3H 4R2, Canada
| | - Mark A Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia.
|
39
|
Ebenezer TE, Carrington M, Lebert M, Kelly S, Field MC. Euglena gracilis Genome and Transcriptome: Organelles, Nuclear Genome Assembly Strategies and Initial Features. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2017; 979:125-140. [PMID: 28429320 DOI: 10.1007/978-3-319-54910-1_7] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 01/27/2023]
Abstract
Euglena gracilis is a major component of the aquatic ecosystem and, together with closely related species, is ubiquitous worldwide. Euglenoids are an important group of protists, possessing a secondarily acquired plastid, and are relatives of the Kinetoplastidae, which themselves have global impact as disease agents. To understand the biology of E. gracilis, as well as to provide further insight into the evolution and origins of the Kinetoplastidae, we embarked on sequencing the nuclear genome; the plastid and mitochondrial genomes are already in the public domain. Earlier studies suggested an extensive nuclear DNA content, with likely a high degree of repetitive sequence, together with significant extrachromosomal elements. To produce a list of coding sequences, we have combined transcriptome data from both published and new sources, as well as embarked on de novo sequencing using a combination of 454, Illumina paired-end libraries and long PacBio reads. Preliminary analysis suggests a surprisingly large genome approaching 2 Gbp, with a highly fragmented architecture and extensive repeat composition. Over 80% of the RNAseq reads from E. gracilis map to the assembled genome sequence, which is comparable with the well-assembled genomes of T. brucei and T. cruzi. In order to achieve this level of assembly we employed multiple informatics pipelines, which are discussed here. Finally, as a preliminary view of the genome architecture, we discuss the tubulin and calmodulin genes, which highlight potential novel splicing mechanisms.
Affiliation(s)
- ThankGod Echezona Ebenezer
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QW, UK; School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
| | - Mark Carrington
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QW, UK
| | - Michael Lebert
- Cell Biology Division, Department of Biology, University of Erlangen-Nuremberg, Staudtstraße 5, Erlangen, 91058, Germany
| | - Steven Kelly
- Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK.
| | - Mark C Field
- School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK.
|
40
|
Amirkhah R, Farazmand A, Wolkenhauer O, Schmitz U. RNA Systems Biology for Cancer: From Diagnosis to Therapy. Methods Mol Biol 2016; 1386:305-30. [PMID: 26677189 DOI: 10.1007/978-1-4939-3283-2_14] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/06/2023]
Abstract
Owing to advances in high-throughput omics data generation, RNA species have re-entered the focus of biomedical research. International collaborative efforts, like the ENCODE and GENCODE projects, have revealed thousands of previously unknown functional non-coding RNAs (ncRNAs) with various but primarily regulatory roles. Many of these are linked to the emergence and progression of human diseases. In particular, interdisciplinary studies integrating bioinformatics, systems biology, and biotechnological approaches have successfully characterized the role of ncRNAs in different human cancers. These efforts led to the identification of a new tool-kit for cancer diagnosis, monitoring, and treatment, which is now starting to enter and impact clinical practice. This chapter elaborates on the state of the art in RNA systems biology, including a review and perspective on clinical applications toward an integrative RNA systems medicine approach. The focus is on the role of ncRNAs in cancer.
Affiliation(s)
- Raheleh Amirkhah
- Department of Cell and Molecular Biology, School of Biology, College of Science, University of Tehran, Tehran, Iran
| | - Ali Farazmand
- Department of Cell and Molecular Biology, School of Biology, College of Science, University of Tehran, Tehran, Iran
| | - Olaf Wolkenhauer
- Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; Stellenbosch Institute for Advanced Study (STIAS), Wallenberg Research Centre at Stellenbosch University, Stellenbosch, South Africa
| | - Ulf Schmitz
- Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany.
|
41
|
LaHaye S, Corsmeier D, Basu M, Bowman JL, Fitzgerald-Butt S, Zender G, Bosse K, McBride KL, White P, Garg V. Utilization of Whole Exome Sequencing to Identify Causative Mutations in Familial Congenital Heart Disease. Circ Cardiovasc Genet 2016; 9:320-9. [PMID: 27418595 DOI: 10.1161/circgenetics.115.001324] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 10/06/2015] [Accepted: 06/27/2016] [Indexed: 12/19/2022]
Abstract
BACKGROUND Congenital heart disease (CHD) is the most common type of birth defect with family- and population-based studies supporting a strong genetic cause for CHD. The goal of this study was to determine whether a whole exome sequencing (WES) approach could identify pathogenic-segregating variants in multiplex CHD families. METHODS AND RESULTS WES was performed on 9 kindreds with familial CHD, 4 with atrial septal defects, 2 with patent ductus arteriosus, 2 with tetralogy of Fallot, and 1 with pulmonary valve dysplasia. Rare variants (<1% minor allele frequency) that segregated with disease were identified by WES, and variants in 69 CHD candidate genes were further analyzed. These selected variants were subjected to in silico analysis to predict pathogenicity and resulted in the discovery of likely pathogenic mutations in 3 of 9 (33%) families. A GATA4 mutation in the transactivation domain, p.G115W, was identified in familial atrial septal defects and demonstrated decreased transactivation ability in vitro. A p.I263V mutation in TLL1 was identified in an atrial septal defects kindred and is predicted to affect the enzymatic functionality of TLL1. A disease-segregating splice donor site mutation in MYH11 (c.4599+1delG) was identified in familial patent ductus arteriosus and found to disrupt normal splicing of MYH11 mRNA in the affected individual. CONCLUSIONS Our findings demonstrate the clinical utility of WES to identify causative mutations in familial CHD and demonstrate the successful use of a CHD candidate gene list to allow for a more streamlined approach enabling rapid prioritization and identification of likely pathogenic variants from large WES data sets. CLINICAL TRIAL REGISTRATION URL: https://clinicaltrials.gov; Unique Identifier: NCT0112048.
Affiliation(s)
- Stephanie LaHaye
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Don Corsmeier
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Madhumita Basu
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Jessica L Bowman
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Sara Fitzgerald-Butt
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Gloria Zender
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Kevin Bosse
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Kim L McBride
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus
| | - Peter White
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus.
| | - Vidu Garg
- From the Center for Cardiovascular Research, The Research Institute (S.L., M.B., S.F.-B., G.Z., K.B., K.L.M., V.G.), The Heart Center (S.L., M.B., J.L.B., S.F.-B., K.L.M., V.G.), and Biomedical Genomics Core and the Center for Microbial Pathogenesis, The Research Institute (D.C., P.W.), Nationwide Children's Hospital, Columbus, OH; and Department of Molecular Genetics (S.L., V.G.) and Department of Pediatrics (J.L.B., S.F.-B., K.L.M., P.W., V.G.), The Ohio State University, Columbus.
| |
|
42
|
El-Metwally S, Zakaria M, Hamza T. LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32:3215-3223. [PMID: 27412092 DOI: 10.1093/bioinformatics/btw470] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2016] [Accepted: 06/28/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The deluge of current sequenced data has exceeded Moore's Law, more than doubling every 2 years since the next-generation sequencing (NGS) technologies were invented. Accordingly, we will be able to generate more and more data with high speed at fixed cost, but lack the computational resources to store, process and analyze it. With error-prone high-throughput NGS reads and genomic repeats, the assembly graph contains a massive amount of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory to start their workflows, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine both the power of advanced computing techniques and innovative data structures to encode the assembly graph efficiently in computer memory. RESULTS LightAssembler is a lightweight assembly algorithm designed to be executed on a desktop machine. It uses a pair of cache-oblivious Bloom filters, one holding a uniform sample of [Formula: see text]-spaced sequenced [Formula: see text]-mers and the other holding [Formula: see text]-mers classified as likely correct, using a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves comparable assembly accuracy and contiguity to other competing tools. Our method reduces the memory usage by [Formula: see text] compared to the resource-efficient assemblers using benchmark datasets from GAGE and Assemblathon projects. While LightAssembler can be considered as a gap-based sequence assembler, different gap sizes result in an almost constant assembly size and genome coverage. AVAILABILITY AND IMPLEMENTATION https://github.com/SaraEl-Metwally/LightAssembler CONTACT: sarah_almetwally4@mans.edu.eg SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sara El-Metwally
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
| | - Magdi Zakaria
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
| | - Taher Hamza
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
| |
|
43
|
Kelley JL, Brown AP, Therkildsen NO, Foote AD. The life aquatic: advances in marine vertebrate genomics. Nat Rev Genet 2016; 17:523-34. [DOI: 10.1038/nrg.2016.66] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
44
|
Helmy M, Awad M, Mosa KA. Limited resources of genome sequencing in developing countries: Challenges and solutions. Appl Transl Genom 2016; 9:15-9. [PMID: 27354935 PMCID: PMC4911431 DOI: 10.1016/j.atg.2016.03.003] [Citation(s) in RCA: 79] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
The differences between countries in national income, growth, human development and many other factors are used to classify countries into developed and developing countries. There are several classification systems that use different sets of measures and criteria. The most common classifications are the United Nations (UN) and the World Bank (WB) systems. The UN classification system uses the UN Human Development Index (HDI), an indicator that uses statistics on life expectancy, education, and income per capita to classify countries, while the WB system uses gross national income (GNI) per capita, calculated using the World Bank Atlas method. According to the UN and WB classification systems, there are 151 and 134 developing countries, respectively, with 89% overlap between the two systems. Developing countries have limited human development, and limited expenditure in education and research, among several other limitations. The biggest challenge facing genomic researchers and clinicians is limited resources. As a result, genomic tools, specifically genome sequencing technologies, which are rapidly becoming indispensable, are not widely available. In this report, we explore the current status of sequencing technologies in developing countries, describe the associated challenges and emphasize potential solutions.
Affiliation(s)
- Mohamed Helmy
- Donnelly Centre for Cellular and Biomedical Research, University of Toronto, Toronto, ON M5S 3E1, Canada
| | - Mohamed Awad
- Department of Biotechnology, Faculty of Agriculture, Al-Azhar University, Cairo 11651, Egypt
| | - Kareem A. Mosa
- Department of Biotechnology, Faculty of Agriculture, Al-Azhar University, Cairo 11651, Egypt
- Department of Applied Biology, College of Sciences, University of Sharjah, P.O. Box 27272, Sharjah, United Arab Emirates
- Corresponding author at: Department of Biotechnology, Faculty of Agriculture, Al-Azhar University, Cairo 11651, Egypt; Department of Applied Biology, College of Sciences, University of Sharjah, P.O. Box 27272, Sharjah, United Arab Emirates.
| |
|
45
|
Evtushenko EV, Levitsky VG, Elisafenko EA, Gunbin KV, Belousov AI, Šafář J, Doležel J, Vershinin AV. The expansion of heterochromatin blocks in rye reflects the co-amplification of tandem repeats and adjacent transposable elements. BMC Genomics 2016; 17:337. [PMID: 27146967 PMCID: PMC4857426 DOI: 10.1186/s12864-016-2667-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2015] [Accepted: 04/25/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A prominent and distinctive feature of the rye (Secale cereale) chromosomes is the presence of massive blocks of subtelomeric heterochromatin, the size of which is correlated with the copy number of tandem arrays. The rapidity with which these regions have formed over the period of speciation remains unexplained. RESULTS Using a BAC library created from the short arm telosome of rye chromosome 1R we uncovered numerous arrays of the pSc200 and pSc250 tandem repeat families which are concentrated in subtelomeric heterochromatin and identified the adjacent DNA sequences. The arrays show significant heterogeneity in monomer organization. 454 reads were used to gain a representation of the expansion of these tandem repeats across the whole rye genome. The presence of multiple, relatively short monomer arrays, coupled with the mainly star-like topology of the monomer phylogenetic trees, was taken as indicative of a rapid expansion of the pSc200 and pSc250 arrays. The evolution of subtelomeric heterochromatin appears to have included a significant contribution of illegitimate recombination. The composition of transposable elements (TEs) within the regions flanking the pSc200 and pSc250 arrays differed markedly from that in the genome as a whole. Solo-LTRs were strongly enriched, suggestive of a history of active ectopic exchange. Several DNA motifs were over-represented within the LTR sequences. CONCLUSION The large blocks of subtelomeric heterochromatin have arisen from the combined activity of TEs and the expansion of the tandem repeats. The expansion was likely based on a highly complex network of recombination mechanisms.
Affiliation(s)
- E V Evtushenko
- Institute of Molecular and Cellular Biology, Siberian Branch of the RAS, Novosibirsk, Russia
| | - V G Levitsky
- Institute of Cytology and Genetics, Siberian Branch of the RAS, Novosibirsk, Russia
- Novosibirsk State University, Novosibirsk, Russia
| | - E A Elisafenko
- Institute of Cytology and Genetics, Siberian Branch of the RAS, Novosibirsk, Russia
| | - K V Gunbin
- Institute of Cytology and Genetics, Siberian Branch of the RAS, Novosibirsk, Russia
- Novosibirsk State University, Novosibirsk, Russia
| | - A I Belousov
- Institute of Molecular and Cellular Biology, Siberian Branch of the RAS, Novosibirsk, Russia
| | - J Šafář
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czech Republic
| | - J Doležel
- Institute of Experimental Botany, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czech Republic
| | - A V Vershinin
- Institute of Molecular and Cellular Biology, Siberian Branch of the RAS, Novosibirsk, Russia.
| |
|
46
|
Mariano DCB, Sousa TDJ, Pereira FL, Aburjaile F, Barh D, Rocha F, Pinto AC, Hassan SS, Saraiva TDL, Dorella FA, de Carvalho AF, Leal CAG, Figueiredo HCP, Silva A, Ramos RTJ, Azevedo VAC. Whole-genome optical mapping reveals a mis-assembly between two rRNA operons of Corynebacterium pseudotuberculosis strain 1002. BMC Genomics 2016; 17:315. [PMID: 27129708 PMCID: PMC4851793 DOI: 10.1186/s12864-016-2673-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2015] [Accepted: 04/22/2016] [Indexed: 12/13/2022] Open
Abstract
Background Studies have detected mis-assemblies in genomes of the species Corynebacterium pseudotuberculosis. These new discoveries have been possible due to the evolution of the Next-Generation Sequencing platforms, which have provided sequencing with accuracy and reduced costs. In addition, the improvement of techniques for construction of high-accuracy genomic maps, for example, Whole-genome mapping (WGM) (OpGen Inc), has allowed high-resolution assembly that can detect large rearrangements. Results In this work, we present the resequencing of Corynebacterium pseudotuberculosis strain 1002 (Cp1002). Cp1002 was the first strain of this species sequenced in Brazil, and its genome has been used as a model for several in silico studies of caseous lymphadenitis disease. The sequencing was performed using the Ion PGM platform and a fragment library (200 bp kit). A restriction map was constructed using the WGM technique with the enzyme KpnI. After the new assembly process, using WGM as scaffolder, we detected a large inversion spanning more than one-half of the genome. A specific analysis using BLAST and the NR database shows that the inversion occurs between two homologous ribosomal RNA regions. Conclusion In conclusion, the results show that WGM can be used to detect mismatches in assemblies, providing high-resolution genomic maps and allowing assemblies with more accuracy and completeness. The new assembly of C. pseudotuberculosis was deposited in GenBank under the accession no. CP012837. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2673-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Diego César Batista Mariano
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Thiago de Jesus Sousa
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Felipe Luiz Pereira
- National Reference Laboratory for Aquatic Animal Diseases of Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Flávia Aburjaile
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Debmalya Barh
- Centre for Genomics and Applied Gene Technology, Institute of Integrative Omics and Applied Biotechnology (IIOAB), Nonakuri, Purba Medinipur, WB, 721172, India
| | - Flávia Rocha
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Anne Cybelle Pinto
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Syed Shah Hassan
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Tessália Diniz Luerce Saraiva
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Fernanda Alves Dorella
- National Reference Laboratory for Aquatic Animal Diseases of Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Alex Fiorini de Carvalho
- National Reference Laboratory for Aquatic Animal Diseases of Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Carlos Augusto Gomes Leal
- National Reference Laboratory for Aquatic Animal Diseases of Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Henrique César Pereira Figueiredo
- National Reference Laboratory for Aquatic Animal Diseases of Ministry of Fisheries and Aquaculture, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil
| | - Artur Silva
- Institute of Biological Sciences, Federal University of Pará, Belém, Pará, Brazil
| | | | - Vasco Ariston Carvalho Azevedo
- Laboratory of Cellular and Molecular Genetics, Department of General Biology, Institute of Biological Sciences, Federal University of Minas Gerais, CEP 31270-901, Belo Horizonte, Minas Gerais, Brazil.
| |
|
47
|
Beltman JB, Urbanus J, Velds A, van Rooij N, Rohr JC, Naik SH, Schumacher TN. Reproducibility of Illumina platform deep sequencing errors allows accurate determination of DNA barcodes in cells. BMC Bioinformatics 2016; 17:151. [PMID: 27038897 PMCID: PMC4818877 DOI: 10.1186/s12859-016-0999-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2015] [Accepted: 03/23/2016] [Indexed: 12/31/2022] Open
Abstract
Background Next generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations that can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, as it is difficult to distinguish rare true sequences from spurious sequences that are generated by PCR or sequencing errors. This issue, for instance, applies to cellular barcoding strategies that aim to follow the amount and type of offspring of single cells, by supplying these with unique heritable DNA tags. Results Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences. Conclusions Application of our new method demonstrates its value in the identification of true sequences amongst spurious sequences in biological data sets. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0999-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Joost B Beltman
- Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands; Division of Toxicology, Leiden Academic Centre for Drug Research, Leiden University, 2333 CC, Leiden, The Netherlands.
| | - Jos Urbanus
- Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
| | - Arno Velds
- Genomics Core Facility, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
| | - Nienke van Rooij
- Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
| | - Jan C Rohr
- Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands; Center for Chronic Immunodeficiency (CCI), University Medical Center Freiburg and University of Freiburg, Freiburg, Germany
| | - Shalin H Naik
- Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands; Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC, 3052, Australia; Department of Medical Biology, The University of Melbourne, Parkville, VIC, 3010, Australia
| | - Ton N Schumacher
- Division of Immunology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands.
| |
|
48
|
Alves JMP, de Oliveira AL, Sandberg TOM, Moreno-Gallego JL, de Toledo MAF, de Moura EMM, Oliveira LS, Durham AM, Mehnert DU, Zanotto PMDA, Reyes A, Gruber A. GenSeed-HMM: A Tool for Progressive Assembly Using Profile HMMs as Seeds and its Application in Alpavirinae Viral Discovery from Metagenomic Data. Front Microbiol 2016; 7:269. [PMID: 26973638 PMCID: PMC4777721 DOI: 10.3389/fmicb.2016.00269] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2015] [Accepted: 02/19/2016] [Indexed: 01/01/2023] Open
Abstract
This work reports the development of GenSeed-HMM, a program that implements seed-driven progressive assembly, an approach to reconstruct specific sequences from unassembled data, starting from short nucleotide or protein seed sequences or profile Hidden Markov Models (HMM). The program can use any one of a number of sequence assemblers. Assembly is performed in multiple steps and relatively few reads are used in each cycle, consequently the program demands low computational resources. As a proof-of-concept and to demonstrate the power of HMM-driven progressive assemblies, GenSeed-HMM was applied to metagenomic datasets in the search for diverse ssDNA bacteriophages from the recently described Alpavirinae subfamily. Profile HMMs were built using Alpavirinae-specific regions from multiple sequence alignments (MSA) using either the viral protein 1 (VP1; major capsid protein) or VP4 (genome replication initiation protein). These profile HMMs were used by GenSeed-HMM (running Newbler assembler) as seeds to reconstruct viral genomes from sequencing datasets of human fecal samples. All contigs obtained were annotated and taxonomically classified using similarity searches and phylogenetic analyses. The most specific profile HMM seed enabled the reconstruction of 45 partial or complete Alpavirinae genomic sequences. A comparison with conventional (global) assembly of the same original dataset, using Newbler in a standalone execution, revealed that GenSeed-HMM outperformed global genomic assembly in several metrics employed. This approach is capable of detecting organisms that have not been used in the construction of the profile HMM, which opens up the possibility of diagnosing novel viruses, without previous specific information, constituting a de novo diagnosis. Additional applications include, but are not limited to, the specific assembly of extrachromosomal elements such as plastid and mitochondrial genomes from metagenomic data. 
Profile HMM seeds can also be used to reconstruct specific protein coding genes for gene diversity studies, and to determine all possible gene variants present in a metagenomic sample. Such surveys could be useful to detect the emergence of drug-resistance variants in sensitive environments such as hospitals and animal production facilities, where antibiotics are regularly used. Finally, GenSeed-HMM can be used as an adjunct for gap closure on assembly finishing projects, by using multiple contig ends as anchored seeds.
Affiliation(s)
- João M P Alves
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - André L de Oliveira
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - Tatiana O M Sandberg
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | | | - Marcelo A F de Toledo
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - Elisabeth M M de Moura
- Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - Liliane S Oliveira
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil; Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
| | - Alan M Durham
- Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
| | - Dolores U Mehnert
- Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - Paolo M de A Zanotto
- Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| | - Alejandro Reyes
- Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia; Center for Genome Sciences and Systems Biology, Department of Pathology and Immunology, Washington University in Saint Louis, MO, USA
| | - Arthur Gruber
- Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
| |
|
49
|
Orsini M, Cuccuru G, Uva P, Fotia G. Bacterial Genomic Data Analysis in the Next-Generation Sequencing Era. Methods Mol Biol 2016; 1415:407-422. [PMID: 27115645 DOI: 10.1007/978-1-4939-3572-7_21] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Bacterial genome sequencing is now an affordable choice for many laboratories for applications in research, diagnostic, and clinical microbiology. Nowadays, an overabundance of tools is available for genomic data analysis. However, tools differ in algorithms, languages, hardware requirements, and user interface, and combining them, as is necessary for sequence data interpretation, often requires (bio)informatics skills which can be difficult to find in many laboratories. In addition, multiple data sources, as well as exceedingly large dataset sizes and increasing computational complexity, further challenge the accessibility, reproducibility, and transparency of the entire process. In this chapter we will cover the main bioinformatics steps required for a complete bacterial genome analysis using next-generation sequencing data, from the raw sequence data to assembled and annotated genomes. All the tools described are available in the Orione framework ( http://orione.crs4.it ), which uniquely combines in a transparent way the most used open source bioinformatics tools for microbiology, allowing microbiologists without any specific hardware or informatics skill to conduct data-intensive computational analyses from quality control to microbial gene annotation.
Affiliation(s)
- Massimiliano Orsini
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
- Gianmauro Cuccuru
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
- Paolo Uva
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
- Giorgio Fotia
- CRS4, Science and Technology Park Polaris, Piscina Manna, 09010, Pula, CA, Italy
|
50
|
Perales C, Quer J, Gregori J, Esteban JI, Domingo E. Resistance of Hepatitis C Virus to Inhibitors: Complexity and Clinical Implications. Viruses 2015; 7:5746-66. [PMID: 26561827 PMCID: PMC4664975 DOI: 10.3390/v7112902] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 09/22/2015] [Revised: 10/23/2015] [Accepted: 10/26/2015] [Indexed: 12/20/2022] Open
Abstract
Selection of inhibitor-resistant viral mutants is universal for viruses that display quasi-species dynamics, and hepatitis C virus (HCV) is no exception. Here we review recent results on drug resistance in HCV, with emphasis on resistance to the newly-developed, directly-acting antiviral agents, as they are increasingly employed in the clinic. We put the experimental observations in the context of quasi-species dynamics, in particular what the genetic and phenotypic barriers to resistance mean in terms of exploration of sequence space while HCV replicates in the liver of infected patients or in cell culture. Strategies to diminish the probability of viral breakthrough during treatment are briefly outlined.
Affiliation(s)
- Celia Perales
- Liver Unit, Internal Medicine, Laboratory of Malalties Hepàtiques, Vall d'Hebron Institut de Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Universitat Autònoma de Barcelona, 08035 Barcelona, Spain.
- Centro de Biologia Molecular "Severo Ochoa" (CSIC-UAM), Cantoblanco, 28049 Madrid, Spain.
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08035 Barcelona, Spain.
- Josep Quer
- Liver Unit, Internal Medicine, Laboratory of Malalties Hepàtiques, Vall d'Hebron Institut de Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Universitat Autònoma de Barcelona, 08035 Barcelona, Spain.
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08035 Barcelona, Spain.
- Universitat Autònoma de Barcelona, Bellaterra 08193, Spain.
- Josep Gregori
- Liver Unit, Internal Medicine, Laboratory of Malalties Hepàtiques, Vall d'Hebron Institut de Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Universitat Autònoma de Barcelona, 08035 Barcelona, Spain.
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08035 Barcelona, Spain.
- Roche Diagnostics SL, 08174 Sant Cugat del Vallès, Spain.
- Juan Ignacio Esteban
- Liver Unit, Internal Medicine, Laboratory of Malalties Hepàtiques, Vall d'Hebron Institut de Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Universitat Autònoma de Barcelona, 08035 Barcelona, Spain.
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08035 Barcelona, Spain.
- Universitat Autònoma de Barcelona, Bellaterra 08193, Spain.
- Esteban Domingo
- Centro de Biologia Molecular "Severo Ochoa" (CSIC-UAM), Cantoblanco, 28049 Madrid, Spain.
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08035 Barcelona, Spain.
|