
Reading the language of nature: Applying AI-augmented algorithms to enriched metagenomic databases


November 28, 2022

With our current drive to preserve and maintain our planet in the midst of climate change, and even reach beyond its boundaries into space, having a better understanding of our world’s biological building blocks – proteins – is key.

Proteins compose an essential part of our biological world – providing structural, enzymatic and other types of functions, in our human bodies, in animals, and in the environment. Having a deep knowledge of such fundamental molecules is therefore critical to understanding our world, and how we can help maintain and protect it.

Characterizing protein function in the laboratory is expensive, laborious and time-consuming. Alternatively, we can attempt to decipher protein function through sophisticated computational analysis. One such method is Google’s ProteInfer [1], which uses deep convolutional neural networks to predict protein functions directly from primary amino acid sequences. Other approaches predict 3D structures, building on one of the basic premises of molecular biology – that structure indicates function.
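
To make the idea of predicting "directly from primary amino acid sequences" more concrete, the sketch below shows one common way a sequence can be turned into numeric input for a neural network: one-hot encoding over the 20 standard amino acids. It is an illustration only and is not presented as ProteInfer's actual preprocessing; the function name and the example fragment are hypothetical.

```python
# Illustrative sketch only: one-hot encoding of a protein's primary sequence.
# This is a generic technique, not ProteInfer's actual preprocessing code.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one row per residue, with a single 1 in that residue's column.

    Residues outside the standard 20 (for example 'X') become all-zero rows.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows


# Hypothetical three-residue fragment, used purely as an example.
encoded = one_hot_encode("MKV")
```

Each residue becomes a row of 20 values, so a sequence of length N becomes an N x 20 matrix that a convolutional network could scan along its length.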

DeepMind has developed deep learning algorithms for protein structure prediction, such as AlphaFold and AlphaFold2, which incorporate evolutionary information about proteins using deep multiple-sequence alignments [2]. Such algorithms can achieve close to experimental accuracy in their predictions [2], using novel neural network architectures and training procedures that draw upon the evolutionary, physical, and geometric constraints of protein structures. However, pipelines such as AlphaFold2 and RoseTTAFold [3] are computationally intensive, and predicting the 3D fold of a single protein can take considerable time.

Meta’s ESMFold uses a large language model that learns evolutionary patterns and generates structure predictions end to end, directly from a protein sequence. It is reported to increase analysis speed significantly, making predictions up to 60 times faster than previous algorithms while still offering accuracy close to that of experimental methods. ESMFold has been used to analyze vast metagenomic datasets, providing fold predictions for over 600 million proteins, including those derived from soil, deep ocean, and human microbes [3,4]. These sequences were generated by whole genome shotgun sequencing experiments and are stored in databases hosted at public institutions such as the NCBI, the European Bioinformatics Institute and the Joint Genome Institute [4].

Such metagenomic databases are essentially uncharted territory, and we do not yet understand all the useful information that they might contain. As Meta states about the protein structure predictions that it has provided thus far – “they are the least understood proteins on Earth” [4].

Such databases will be significant sources of bio-inspired innovations that could help prevent, or even reverse, the effects of climate change, for example, or help treat chronic diseases for which there is currently no cure. These biologically driven advances have also been described by McKinsey & Company as part of the ongoing Bio Revolution.

Micro-organisms such as bacteria can survive in almost any extreme environment on Earth: deep-sea vents, where temperatures can reach 400°C, as well as very high elevations, cold weather conditions, permafrost, acidity, and extremely low organic matter content. The latter conditions can be found around the Ojos del Salado volcano and its crater lakes, in the Andes mountain range on the Argentina-Chile border, which is considered to be a terrestrial astrobiological analogue of Mars [5]. The adaptability of micro-organisms means that, in some cases, metabolism can even rely on sulphur compounds rather than oxygen, as in phototrophic sulphur bacteria [6]. It is reasonable to expect, therefore, that the proteins produced by such micro-organisms would offer similarly ‘extreme’ properties. Other interesting organisms with the ability to survive in space include tardigrades (tiny animals also known as water bears or moss piglets) and lichens, symbiotic organisms that combine a fungus with an alga or cyanobacterium.

Coupling structure and function prediction with information about a protein’s source environment can help identify proteins that can perform their roles in harsh environments, such as very hot temperatures, or even proteins that could survive and function in space, outside of Earth’s protective atmosphere. 
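
As a rough illustration of what coupling predictions with source-environment information could look like in practice, the sketch below filters a small set of protein records by the temperature of the environment they were sampled from and by the confidence of their predicted function. Every record, field name and threshold here is hypothetical and invented for the example; none of it is real data or the schema of any existing database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """Hypothetical record joining a prediction with source-environment metadata."""
    protein_id: str
    predicted_function: str
    prediction_confidence: float  # 0.0 to 1.0, made-up scale for this sketch
    source_biome: str
    source_temperature_c: float
    source_ph: float


def candidates_for_heat(records, min_temperature_c=80.0, min_confidence=0.7):
    """Keep records sampled from hot environments whose prediction is confident."""
    return [
        r for r in records
        if r.source_temperature_c >= min_temperature_c
        and r.prediction_confidence >= min_confidence
    ]


# Invented example records; the identifiers and values are placeholders.
records = [
    ProteinRecord("example_001", "hydrolase", 0.91, "hydrothermal vent", 95.0, 6.1),
    ProteinRecord("example_002", "hydrolase", 0.55, "hydrothermal vent", 92.0, 5.8),
    ProteinRecord("example_003", "oxidoreductase", 0.88, "permafrost", -5.0, 7.2),
]

hits = candidates_for_heat(records)
print([r.protein_id for r in hits])
```

The same pattern extends to other environmental variables, such as pH or pressure, by adding further conditions to the filter.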

To enable such discoveries and research, the European Nucleotide Archive (ENA) currently hosts many thousands of metagenomic datasets. The MGnify resource, based at EMBL-EBI, analyzes this publicly available data, assembling the nucleotide sequences and predicting genes and proteins. It makes the results available, along with information about the source of the proteins, including the environmental conditions in which they are found, providing an open-access hub for the exploration and analysis of key microbiome data [7].

Drawing upon the data generated by publicly funded research and hosted by public research institutions, along with advanced analysis algorithms such as those developed by Google and Meta, we now have the key components in place to delve into the uncharted metagenomic protein world and answer important questions about how proteins perform their roles in particular environments. What temperature, pH, and atmospheric pressure have they been exposed to, or survived in? Which proteins interact with one another under these different conditions? Such information will be vital to those trying to make sense of, and unlock, the scientific and commercial potential of metagenomic protein data. This includes organizations investigating such proteins for specific applications, whether in Beauty and Personal Care, AgBio, Food and Nutrition, or BioPharma. An improved understanding of such proteins could also help us understand how to maintain and protect our planet from a One Health perspective, encompassing the health of humans, animals, and the environment.

Eagle Genomics’ AI-augmented knowledge discovery platform, e[datascientist]™, provides approaches that enable such data interpretation and learning. The platform uses multilayer hypergraphs to structure and interrogate data, applying AI to network science to derive data-driven insight journeys into complex problems at scale. By applying such technologies to the metagenomic protein space, we hope to achieve impact-orientated innovation outcomes, unlocking the vast potential of this data. 
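
For readers unfamiliar with the term, a hypergraph generalizes an ordinary graph by letting a single edge connect any number of nodes, and a multilayer hypergraph additionally assigns each edge to a layer. The sketch below is a minimal, generic illustration of that data structure. It is not Eagle Genomics' implementation, and the class, layer and node names are all hypothetical.

```python
from collections import defaultdict


class MultilayerHypergraph:
    """Minimal generic sketch: hyperedges join any number of nodes within a named layer."""

    def __init__(self):
        self.edges = {}                        # edge_id -> (layer, frozenset of nodes)
        self.node_to_edges = defaultdict(set)  # node -> ids of hyperedges containing it

    def add_hyperedge(self, edge_id, layer, nodes):
        members = frozenset(nodes)
        self.edges[edge_id] = (layer, members)
        for node in members:
            self.node_to_edges[node].add(edge_id)

    def neighbours(self, node, layer=None):
        """Nodes sharing at least one hyperedge with `node`, optionally within one layer."""
        found = set()
        for edge_id in self.node_to_edges.get(node, ()):
            edge_layer, members = self.edges[edge_id]
            if layer is None or edge_layer == layer:
                found |= members
        found.discard(node)
        return found


# Invented example: two layers linking placeholder proteins to an environment and a function.
graph = MultilayerHypergraph()
graph.add_hyperedge("e1", "environment", ["protein_A", "protein_B", "hot_spring"])
graph.add_hyperedge("e2", "function", ["protein_A", "protein_C", "hydrolase"])

print(sorted(graph.neighbours("protein_A", layer="function")))
```

Because one hyperedge can tie together several proteins and a shared context at once, questions such as "what else is connected to this protein in the environment layer?" become simple lookups.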

References 

  1. Sanderson, T., Bileschi, M.L., Belanger, D., Colwell, L.J. ProteInfer: deep networks for protein functional inference. bioRxiv 2021.09.20.461077; doi: https://doi.org/10.1101/2021.09.20.461077
  2. Jumper, J., Evans, R., Pritzel, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021); doi: https://doi.org/10.1038/s41586-021-03819-2
  3. Lin, Z., Akin, H., Rao, R., et al. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv 2022.07.20.500902; doi: https://doi.org/10.1101/2022.07.20.500902
  4. Meta AI. ESM Metagenomic Atlas: The first view of the ‘dark matter’ of the protein universe. Available at: https://ai.facebook.com/blog/protein-folding-esmfold-metagenomics/ [accessed on 28th November 2022]
  5. Aszalós, J.M., Szabó, A., Felföldi, T., et al. Effects of Active Volcanism on Bacterial Communities in the Highest-Altitude Crater Lake of Ojos del Salado (Dry Andes, Altiplano-Atacama Region). Astrobiology. 2020 Jun;20(6):741-753; doi: https://doi.org/10.1089/ast.2018.2011
  6. Frigaard, N.U., Dahl, C. Sulfur metabolism in phototrophic sulfur bacteria. Adv Microb Physiol. 2009;54:103-200; doi: https://doi.org/10.1016/S0065-2911(08)00002-7
  7. Mitchell, A.L., Almeida, A., Beracochea, M., et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D570–D578; doi: https://doi.org/10.1093/nar/gkz1035


More solutions from: Eagle Genomics Ltd.


Website: http://www.eaglegenomics.com


