Sydney Brenner recently described the radical revolution in life sciences in the early 1950s: the occupation of biology by quantum mechanics examining the fundamental questions of matter and energy, followed by the rise of genetics, which showed that chromosomes were the carriers of genes. The discovery of the double helix resulted in the new paradigm that information is physically embodied in DNA sequences built from four different types of nucleotides. In contrast to the years before 1953, the question of “information” now became central: the components of DNA are simple chemicals, but the biological complexity that can be generated by the information of different sequences is revolutionary. The fundamental concept that integrated this new biological “information” with matter and energy was enshrined in the universal Turing machine and von Neumann’s self-reproducing machines[2-4]. It follows that biology is physics with computation. This was the core paradigm of molecular biology for almost the next half-century. The crucial step in the serious discussion of “information” as an essential part of definitions of “life” was taken by Manfred Eigen.
MANFRED EIGEN COMBINES PHYSICS, CHEMISTRY, MATHEMATICS AND INFORMATION THEORY
In a series of articles and books Manfred Eigen developed a model of how the essential features of life and its inherent complexity can be explained by physical properties of matter[6,7]. If certain chemical properties exist on a planet and certain physical conditions obtain, life will start by self-reproducing macromolecular cycles which act in a complementary way. On the one hand there are “information”-carrying nucleic acids which build a reproductive cycle. On the other there are functional amino acids which build the protein bodies. Both code-systems together can build a catalytic “hypercycle” which is the basis of the self-reproductivity of life.
Both parts can be reconstructed physically. Nucleic acids (information) and proteins (function) represent a closed system, because there is no function without information, and information gets meaning from function. “Mutations” are replication errors with selective advantage, i.e., instabilities in this system represent irreversible thermodynamic processes. A series of such mutations in nucleic acid sequences leads to quasi-species, which are mutant distributions of primitive replicating entities. Such dynamic distributions of genomes that share genetic variation, competition and selection generate the fittest types (“master copies”) and therefore avoid “error thresholds”, i.e., excessively high mutation rates at which information can no longer be reproduced. The resulting evolution of life is an optimising process in which Darwinian selection evaluates the fittest results of mutations[8,9].
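The error threshold mentioned above is often illustrated with the standard approximation from quasispecies theory, in which the longest sequence that can be maintained is roughly ln(σ)/(1 − q), where σ is the selective superiority of the master copy and q is the copying fidelity per nucleotide. The following Python sketch only illustrates that approximation; the function name and the example values are assumptions chosen for illustration and do not come from the cited works.

```python
import math


def max_maintainable_length(superiority, fidelity):
    """Approximate error threshold: ln(sigma) / (1 - q).

    superiority -- sigma, the selective advantage of the master copy (> 1)
    fidelity    -- q, the probability of copying one nucleotide correctly
    Both parameter names are hypothetical labels for this sketch.
    """
    if superiority <= 1.0:
        raise ValueError("superiority must be greater than 1")
    if not 0.0 < fidelity < 1.0:
        raise ValueError("fidelity must lie strictly between 0 and 1")
    return math.log(superiority) / (1.0 - fidelity)


# Illustrative values only: a tenfold drop in the error rate raises the
# maintainable sequence length tenfold.
low_fidelity = max_maintainable_length(superiority=10.0, fidelity=0.99)
high_fidelity = max_maintainable_length(superiority=10.0, fidelity=0.999)
```

In this picture the limit on information content depends only on two numbers, which is precisely the kind of purely quantitative description the remainder of this article calls into question.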
Manfred Eigen adapted the concept of Hamming’s “sequence space” to explain the hypercycle concept by physical properties of matter. Similar to Hilbert space, in which every ontological entity could be defined by an unequivocal point in a mathematical axiomatic system, in the abstract “information space” concept each point represents a unique syntactic structure and the value of their separation represents their dissimilarity. In this concept molecular features of the genetic code evolve by means of self-organisation of matter. Each point in the sequence space can be occupied by one of four different nucleotides. But each point can also be represented by digital computation (1 and 0).
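The sequence space idea can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: the two-bit encoding of the nucleotides is one arbitrary choice among many, and all function names are hypothetical. It shows how the separation between two points is measured as the number of differing positions (the Hamming distance) and how a nucleotide sequence maps onto a string of binary digits.

```python
NUCLEOTIDES = "ACGU"

# Assumed two-bit encoding; any one-to-one assignment would serve equally well.
BITS = {"A": "00", "C": "01", "G": "10", "U": "11"}


def hamming_distance(first, second):
    """Count the positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(first, second))


def to_bits(sequence):
    """Represent a nucleotide sequence as a string of binary digits."""
    return "".join(BITS[nucleotide] for nucleotide in sequence)


def sequence_space_size(length):
    """Number of distinct points in the space of sequences of this length."""
    return len(NUCLEOTIDES) ** length


def one_error_neighbours(sequence):
    """All sequences at Hamming distance 1 from the given sequence."""
    return [
        sequence[:i] + nucleotide + sequence[i + 1:]
        for i in range(len(sequence))
        for nucleotide in NUCLEOTIDES
        if nucleotide != sequence[i]
    ]


distance = hamming_distance("GAUUACA", "GACUACA")  # one differing position
encoded = to_bits("ACGU")
```

Note that nothing in this representation records the context in which a sequence is used: two identical sequences always occupy the same point, whatever their function.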
BIOLINGUISTICS, BIOINFORMATICS, SYSTEMS BIOLOGY AND SYNTHETIC BIOLOGY
A variety of mathematical theories of language emerged, such as biolinguistics, bioinformatics and systems biology. They all interpret and investigate genetic structures in the light of linguistic categories as quantifiable sets of signs[10-12] and use statistical methods and algorithms to identify genetic sequence orders.
An emerging hybrid of information-theoretical aspects of nucleic acid language is synthetic biology. Its theoretical assumptions clearly derive from systems biology and information theory and generally from a mathematical theory of language. Proponents of synthetic biology want to deconstruct complex biological systems into their parts and artificially reconstruct and even evolve biological systems. This kind of artificial molecular design could serve as an appropriate tool in genetic engineering for, e.g., new vaccines or immune functions. This rather mechanistic concept depends on syntax structure identification that represents meaning/function. The context-dependent epigenetic imprinting, which represents a deep grammar hidden in the superficial grammar of nucleic acid sequences, is not the focus of synthetic biology approaches. In contrast to predominant genetic engineering, synthetic biology tries to construct complex biological systems which are then subject to selection processes. They are expected to be mutation-resistant in a certain sense.
UNEXPECTED EARTHQUAKE IN THE FOUNDATIONS OF MATHEMATICS
The original mainstream assumptions regarding the several mathematical theories of language are still present in concepts, curricula and even the definition of life and animated nature[14,15]. The conviction of an exact science based on exact definitions of scientific sentences in contrast to non-scientific ones is at the basis of scientific communities and their self-understanding.
The history of science, and even the sociology of knowledge, shows that it is still largely ignored that 50 years ago the basis of this world view was shaken to the core. The belief that mathematics was the best tool for depicting the physical reality of matter and natural laws marginalised world views other than mathematical ones[16-19].
In his Unvollständigkeitssatz (incompleteness theorem) Kurt Gödel investigated a formal system by converting a meta-theoretical statement into an arithmetical one. He strove to convert the statements formulated in a meta-language into the object language S. This led Gödel to two prominent and critical conclusions: (1) If system S is consistent, then it will contain at least one formally undecidable sentence. This means that one sentence is inevitably present that can be neither proved nor disproved within the system; and (2) If system S is consistent, then this consistency of S cannot be proved within S.
The consequence of the incompleteness theorem for the automaton theory of Turing and von Neumann was significant: a machine can in principle calculate only those functions for which an algorithm can be provided. Sign-mediated interactions between living organisms, in which the meaning of the signs depends on real-life circumstances, rely on non-formalisable sequence generation, for which no algorithm can be provided. Essential functions of every natural language, such as its non-formalisable features, are not the object of algorithm-based calculations. Living organisms are not machines.
PRAGMATIC TURN IN BIOLOGY: NATURAL GENETIC CONTENT OPERATORS EDIT GENOMES
Manfred Eigen’s concept of natural languages/codes and the current concepts embraced by bioinformatics, biolinguistics, systems biology and synthetic biology are not coherent with current knowledge about key features of natural languages or codes, i.e., the three levels of rules that govern natural code use by competent code-using groups: combinatorial rules (syntax), contextual rules (semantics) and context-dependent rules (pragmatics). In all mathematical theories of language the syntax determines semantics (function), but in natural codes pragmatics (context) determines semantics. Pragmatic rules do not exist in Eigen’s concept. Natural code-inherent rules are absent in abiotic matter that is determined strictly by natural laws: no syntactic, pragmatic or semantic rules are present if water freezes to ice. Therefore the account of the evolution of biological macromolecules in Eigen’s concept, as well as in other mathematical theories of language, cannot explain the evolution of natural codes and their inherent rules[22-25].
RNAS THAT ORGANIZE GENETIC CONTENT COMPOSITION
The change from a read-only-memory genome with copying errors to a read-and-write genome with active change operators is fundamental. In contrast to the decades-long assumption that the driving forces of evolution were chance mutation (statistical replication errors) and selection, it is now recognised that although mutation is an empirical fact it does not contribute very much to genetic novelty. Key roles are now assigned to non-random genetic change operators in the production of complex evolutionary inventions[26-28].
Now we can investigate several key players that organise the genetic content compositions of host organisms, such as endogenous viruses and their defectives, transposons, retrotransposons, LTRs (long terminal repeats), non-LTRs (non-long terminal repeats), LINEs (long interspersed nuclear elements), SINEs (short interspersed nuclear elements), ALUs, group I introns, group II introns, phages and plasmids[29-31]. We now recognise that DNA is not solely a genetic storage medium but is also a kind of ecological habitat. Many such mobile genetic elements have been found within the last 40 years as inhabitants of all genomes[32-35]. Some cut and paste, others copy and paste, and both spread within the genome. They modify host genetic identities through insertion, recombination, or the epigenetic regulation of genetic content. They co-evolve with the host, interact in a modular manner and additionally generate highly adaptive immune systems for host organisms from the simplest prokaryotes (CRISPR/Cas system) to the most complex eukaryotes (VDJ system). Such mobile genetic elements shape both genome architecture and regulation. Therefore they are agents of change not only over evolutionary time but also in real time as domesticated agents[36-38].
FROM MOLECULAR BIOLOGICAL ENTITIES TO SOCIAL GROUPS
The question arises how these RNA populations, their closely related RNA viruses and their complex interactions can be explained and understood without mathematical theories of language. How should we investigate non-coding RNA interactions, competencies and even their role in epigenetic imprinting without formalisable tools? This world of life processes is dominated by RNA, whereas DNA remains a habitat, an ecosphere of interacting RNAs that behave like inhabitants and as genetic information storage[39-43].
All these RNAs share a secondary structure like a hairpin, or a stem-loop. In more complex ligated consortia of such stem loops we can look at tRNAs, or ribosomal subunits, RNA polymerases or a great variety of RNA viruses and their defectives as listed above. The RNA stem loops have two characteristic parts: stems that consist of base-paired nucleic acids, and loops, bulges and junctions that consist of unpaired regions limited by stems. Most interesting from an evolutionary perspective are two recently found key features[44-46]: (1) Randomly associated RNAs that have no evolutionary history show the same structure-dependent compositional bias as ribosomal RNAs. This means that the differences do not depend on selection processes but on the overall composition of the RNA consortium; and (2) The singular RNA stem loop behaves like a random assembly of nucleotides without selective forces and is subject exclusively to physico-chemical laws. Only if stem loops build groups do they share a culture of interactional patterns and a history of defined timescales, i.e., only then are they subject to biological selective forces.
This looks like the true split between life and non-life processes. To better understand behavioural motifs of RNA stem-loop swarms and viruses, one should add group membership features that are absent in the inanimate world. The basic tool of such RNAs is their complementary composition of base-pairing stems and non-base-pairing loops, the result of an inherent property of RNA chemicals, the foldback of polyRNAs. The variety of regulations on protein-coding genes, as well as the processing of these regulatory RNAs by phases of splicing and editing RNA transcripts, makes their algorithm-based predictability nearly impossible because of their complexity.
These populations of RNAs share properties with RNA viruses, which have defined capabilities. In contrast to DNA viruses, RNA viruses have much smaller genomes on RNA bases without proofreading and repair. In contrast to the previous perspective (mutation, i.e., replication error), the new perspective emphasises the de novo invention of new sequence contents that have not existed before and for which no algorithms are available in principle. This is important for variation and innovation, as well as infection, immunity and identity, for both diversified viral and cellular populations[47-50].
RNA STEM LOOP STRUCTURES CONSTITUTE LIFE
This change in perspective from molecules to agent-based behaviour looks at interactions of RNA viruses, DNA viruses, RNA-DNA viruses, viral swarms, and sub-viral groups like any ligated RNA stem loop groups that cooperate and coordinate (regulate) within cellular genomes as replication-relevant co-players[51-53]. Or they interact as suppression-relevant silencers or as infection-derived modular tools of non-coding RNAs that have built consortia of complementary agents that function together, such as retrovirus-derived remnants, LTRs, non-LTRs, group II introns, rRNAs, tRNAs, spliceosomes, editosomes, and other counterbalancing modules[54-58]. Such populations determine regulations in many ways and may newly adapt different functions. The use of a natural language or code depends on consortia of living agents, because natural languages and codes function according to rules. In contrast to the inevitability of natural laws, rule-following is a feature of social interaction and not solely one of physico-chemical necessity.
Investigating syntactic sequences without knowing something about the real-life context of code-using agents is senseless because syntactic structures do not unequivocally represent semantic meaning. Quantifiable analyses of signs, words and sequences cannot extract meaning. Only in a restricted (statistical) sense is this possible through sequence comparison.
EVOLUTIONARY GENETIC INVENTION IS NOT THE RESULT OF REPLICATION ERRORS
The virosphere in particular exemplifies how genetic innovation derives from novel nucleic acid sequences and their combination. If cells are infected by more than one virus, the genomes of different viruses are copackaged into the viral progeny. During reverse transcription the reverse transcriptase switches between two or more templates, generating a new DNA sequence. Similar sequence generations are known in various co-infection events, such as the combination of external RNA viruses and persistent endogenous retroviruses, or of infectious RNA viruses with former viruses, retaining defective parts which can be combined into new sequence orders of still functioning viruses[52,53]. Interestingly, it is not only viruses that generate sequences de novo, or combine and recombine them. With this innovation competence, quasispecies-consortia (qs-c) also transfer this adaptive principle to all forms of cellular life. The defective parts of infectious genetic parasites represent an abundance of appropriate tools for cellular needs, documented in the variety of non-coding RNAs which are essential actors in all stages of cellular life such as transcription, translation, repair, recombination and immune functions[60-65].
REMEMBER GÖDEL: NATURAL CODES ARE OPEN “SYSTEMS”
RNA group membership can be described by its various features. But this membership can never be completely specified, since it can always be further parasitised by unknown and even unpredictable parasites. This essential feature makes a complete specification of membership impossible. Additionally, this means that absolute immunity in this open “system” is impossible in principle. This “insecurity” provides the inherent capacity for novelty, that is, the precondition for greater complexity. It seems we are here at the core competence of variation, the essential feature for biological selection.
How do agents emerge from ribozymes to form identity of replicators and then form groups that learn membership? The emergence of single RNA stem-loops solely depends on physico-chemical properties. As mentioned above, if stem-loop groups build complex consortia, biological selection and social interactions emerge that are not present in a purely chemical world[44-46]. This looks coherent with the results of sociology and the evolution of natural languages. Natural languages and codes depend on competent agents that follow semiotic (syntactic, pragmatic, semantic) rules, and rule-following is a social interaction. This means one agent alone cannot generate or follow a rule. Evolution of identity implies emergence of self/non-self differentiation competence. This is a crucial step from single RNA stem loops to RNA stem loop groups[28,36,45,46].
RNA GROUP BUILDING: CONTEXT DETERMINES MEANING
If we apply some interactional motifs of RNA agents to form biotic structures that follow biological selection processes and not mere physico-chemical reaction patterns we must also look at the group-building of RNA stem-loop structures.
As previously mentioned, it has been found that single stem loops react in a purely physico-chemical reaction mode without selective forces, regardless of whether they were derived randomly or constructed under in vitro conditions[39,46]. Conversely, if these single RNA stem loops build groups they overrule pure physico-chemical reaction patterns and emerge as biological selection forces: biological identities of self/non-self recognition and preclusion, immune functions, dynamically changing (adapting) membership roles. A single alteration in a base-pairing RNA stem that leads to a new bulge may dynamically alter not only a single stem loop but the whole group identity from which this stem loop containing the newly emerged bulge derives[39,46].
Simple self-ligating RNA stem loops can build much larger groups of RNA stem loops that serve to increase complexity. This may lead to ribozymatic consortia, which later on build success stories, such as the merger of the two subunits of transfer RNAs or RNA-dependent RNA-polymerases for replication of RNA through RNA or the subunits of ribosomal RNAs, all of them being former groups that evolved and functioned for different reasons than those applicable to subsequent conserved modes[67-69].
If RNA fragments self-ligate into self-replicating ribozymes they constitute networks. For example, three-membered networks represent highly cooperative growth behavior. If such networks compete directly with selfish autocatalytic cycles, the former grow faster. This clearly indicates the ability of RNA populations to evolve into higher complexity through cooperation which clearly outruns selfishness.
Another intriguing example of the biological (selective) group-building competence of RNA stem loop consortia is the chemical interaction based on the molecular syntax in stem-loop “kissing”, in which single-stranded regions of RNA stem loops bind according to Chargaff rules to other single-stranded stem loop structures to unite and build more complex group identities for several functions, such as dimerisation of genomic RNA in viruses, e.g., HIV-1. Such complementary interactions are also important in RNA replication of the hepatitis C virus[70-72].
Complex three-dimensional structures can be built by consortia of single RNA sequence strings. One of the most interesting structures is the pseudoknot, composed of two helical segments connected by single-stranded regions or loops. Bases in the single-stranded loop form base pairs with bases outside the loop. This interaction pattern clearly depends on the rules of molecular syntax but is initiated for adaptational purposes by different ecosphere habitat dynamics. So the results of these interactions may lead to structurally diverse groups with important different biological roles, such as the catalytic core of key players of the present RNA world, i.e., ribozymes, self-splicing introns and telomerase, and the context-dependent alteration of gene expression by inducing ribosomal frameshifting in several viruses[73-75].
Most interestingly, the base-pairing in pseudoknots is strictly context-sensitive, and base pairs overlap with one another in sequence positions. This leads to the limits of algorithm-based prediction models such as dynamic programming or stochastic context-free grammars. This indicates the natural language nature of the nucleic acid code, which represents the possibility of coherent de novo generation and context-dependent alterations for a diversity of different meanings (functions) for the same syntax structures.
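The overlap of base pairs in sequence positions can be made concrete with a small Python sketch. It assumes that a secondary structure is given as a list of index pairs (i, j) of paired bases; the function name and the two example structures are hypothetical and chosen only for illustration. Properly nested pairs are the kind of pattern a context-free grammar can describe, whereas a pseudoknot contains at least two pairs that cross each other.

```python
def has_crossing_pairs(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    pairs -- iterable of (i, j) index tuples, one tuple per base pair.
    Nested pairs (i < k < l < j) and disjoint pairs do not count as crossing.
    """
    # Order each pair as (smaller index, larger index), then sort by start.
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for position, (i, j) in enumerate(ordered):
        for k, l in ordered[position + 1:]:
            if i < k < j < l:
                return True
    return False


# Assumed toy structures, not taken from any measured RNA:
# a simple hairpin has only nested pairs ...
hairpin = [(0, 8), (1, 7), (2, 6)]
# ... whereas here a loop base of the first stem pairs outside that stem.
pseudoknot = [(0, 6), (1, 5), (3, 9), (4, 8)]

hairpin_crosses = has_crossing_pairs(hairpin)        # False
pseudoknot_crosses = has_crossing_pairs(pseudoknot)  # True
```

Detecting a crossing in a known structure is simple; the limit referred to above concerns predicting such structures from the sequence alone, where models restricted to nested pairings cannot represent them at all.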
How long will biology remain a subdiscipline of physics and chemistry? As I have tried to demonstrate, the investigation of natural languages/codes such as the genetic code (in terms of both its superficial syntax and the deep grammar hidden as a result of epigenetic imprintings) in the light of mathematical theories of language and their derivatives, such as biolinguistics, bioinformatics, systems biology and synthetic biology, can lead to quantifiable, i.e., statistical, results which can be compared, measured and computed. The question remains whether it is sensible to measure, investigate and compare the wavelength and modulations of phonetic utterances of humans in order to extract a meaning. Can we extract semantics from investigations of certain features of syntax structure?
In natural languages/codes it is not the structure of syntax that determines the meaning of sequences. In nearly all cases it is the hidden deep grammar which determines meaning for the recipient of the message. The deep grammar depends on how the superficial syntax is marked: in the genetic code by epigenetic imprintings or in sign sequences of utterances by gestures and emphasis. In all cases the hidden deep grammar decides whether a competent recipient can understand the intended meaning of the sender or not.
The real-life world in which natural sign users are included decides the meaning of a natural language or code, not the in vitro experimental set ups, the universal grammar or similar algorithm-based components. In contrast with previous approaches the real action between interactors determines what signs of communication and coordination are used to express what should be transported, what is intended, and what is focused. The real actions are the driving force of content and represent the context which determines the meaning of thoughts and interpretations. Therefore pragmatics is of essential relevance to identify the meaning of natural languages/codes, not syntax or semantics.
This aspect is missing completely in Eigen’s concept of a sequence space in which each nucleotide sequence occupies a unique position that can be computed by digital units. Because each nucleotide sequence can have several meanings, depending on the contextual use, sequence space position cannot explain the variety of its functions.
This means the mathematical concept of language and its derivatives is based upon a fundamental error. Natural languages/codes are not the core objects of natural sciences because the latter’s tools for appropriate investigations are rather limited and cannot lead to a full explanation or understanding. As a consequence we need a pragmatic turn in biology to liberate this discipline from its role as a subdiscipline of physics and chemistry.