Wherever you go, information is available digitally, whether you reach it on a personal computer, a laptop, or a smartphone, and whether you are looking something up on a website or sending an e-mail. This form of technology has matured over the past 20 years as computer systems worldwide began to interconnect, transforming every aspect of our lives. But manipulating, storing, and accessing information in this way is only the human use of the technology, and the use of information is really nothing new. Life was using information technology long before humans arrived. Indeed, life is the first example of information technology: information such as how to reproduce and how to make a living is stored in the molecule DNA, which functions as a storehouse of biological information ranging from how the organism will function down to how that information is to be passed on from generation to generation.
Molecular biology studies the molecular aspects of life, mainly how cells and whole organisms store their hereditary information in the form of nucleic acids and how the complex activities inside every organism result from proteins. As this science began to mature, it did not go unnoticed that apparently all of life uses information held in sequences of nucleotides, the building blocks of nucleic acids. That information specifies how amino acids, the building blocks of proteins, are arranged in one particular order, which allows a particular protein to carry out a task such as breaking down food molecules for the organism to use. As molecular biology was maturing, so was computer science, the study of how computers use, manipulate, and store information in sequences of zeros and ones, which is how all digital computers of today function.
It is no coincidence that information, in the sense used in computer science, was already at work in life: protein synthesis depends on genetic information just as today's computers depend on software. Some scientists then began to realize that what is studied in computer science can apply to molecular biology.
The Nature of Information and its Relation to Biology
Just what is information exactly? The word has several definitions depending on how it is used, and with so many definitions confusion can arise. For the sake of this argument, one definition will do: information is any property of a system, whether physical or abstract, that can tell what the system is, how it came to be, how it functions, and what it can do. More simply, it tells what a system does and how it does it. As stated clearly by Loewenstein (1998), “Information, in its connotation in physics, is a measure of order- a universal measure applicable to any structure, any system” (pg. 6). One example is a sequence of 0’s and 1’s that codes for a number or a word that a computer can easily read. Another is the sequence of amino acids in a protein molecule, or any sequence of material or abstract things with a given order.
Information can be present in the abstract world of mathematics as well as in the physical universe, and the part of the universe I want to single out is the biosphere. All the species that make up the biosphere, large and small (and perhaps other undiscovered biospheres throughout the cosmos), are related to one another through evolution. Under the definition of information above, what a living system is and does includes the ability to reproduce, and that ability resides in its cells, the basic building blocks of life. Cells come in two main forms, prokaryotic (simple) cells and eukaryotic (complex) cells. In both, the ability to reproduce comes from the molecular structure of DNA, and the information in DNA contains the instructions for proteins. With proteins a cell can do a great deal, and through evolution by natural selection cells have either remained unicellular or evolved into multicellular beings in the forms of fungi, plants, and animals. Notice that this description of biology fits nicely with the definition of information that I’m describing, and you can then sense the connection between information and biology.
Just as a computer has both hardware and software, the same can be argued for life: in biology the equivalents of hardware and software are the phenotype and the genotype, respectively. What do these biological terms have in common with computer science?
If you have read one of my blog posts where I define genotype and phenotype, or if you have looked up the biological terms used in my posts, you will have no difficulty connecting these two terms to their counterparts in computer science. For those who are unfamiliar, here is a definition. In biology, the genotype is the set of genetic instructions that every organism possesses, while the phenotype includes the organism's structure as well as its function.
In computer science, software is the set of instructions that determines how a computer will function, while hardware includes the physical components such as the screen and mouse. The similarity between the two pairs of terms is that the genotype is like software while the phenotype is like hardware.
We can go much further and say, without hesitation, that life invented information technology in the form of the genetic code and the central dogma. The major difference is that computer science is a human activity in which software is developed with a goal in mind, namely a program that allows a computer to do a given task. Natural selection, on the other hand, can only operate on what is going on in the present and what has happened in the past; it cannot see ahead into the future.
During the origin (or possibly origins) of life, the ancestor of all life, whatever it may have been, would have needed a genetic system of replicating molecules, and the information to replicate would have been a sequence of nucleotides. According to one hypothesis, that molecule was RNA, a nucleic acid that, like DNA, can store genetic information. Unlike DNA, which only stores genetic information, RNA also has catalytic properties, and under this hypothesis some RNA molecules could even replicate themselves without the need for protein enzymes. With RNA acting as both the catalyst for its own replication and the storehouse of genetic information, natural selection, which can act at the molecular level as well as the cellular level, would have favored RNA molecules that could direct the making of enzymes for metabolism and pass information from one generation to the next. In each generation mutations are inevitable, and they become the source of variation for natural selection to operate on, favoring variants with superior abilities of information transmission while rejecting those without them. This would have continued until DNA became the molecule that stores information while RNA was relegated to the role of transmitting genetic information for protein synthesis. The result was the ancestor of all life, which uses DNA to store the information for all protein molecules, including those that replicate DNA.
So much for comparing genotype and phenotype to software and hardware. There is more to biological information than just a given sequence of nucleotide bases. In biology, a sequence of bases in a DNA molecule codes for a protein molecule. Suppose I were to isolate from one of my body cells the gene that codes for the blood protein hemoglobin, which carries oxygen from my lungs to all my body cells. After isolation, the gene is purified, crystallized, placed in a vial, and left to stand on a shelf. What can the DNA molecule do there? Absolutely nothing at all. If the DNA were somehow reintroduced into a living cell of my body, it could again be used to express the information for making hemoglobin. The information for creating a functional hemoglobin molecule is present in my cells as well as yours, but only while the gene is in a living cell will the genetic information be translated into protein, and only after the protein is synthesized will it take up the task of delivering oxygen to the body’s cells. The gene crystallized in a vial, by contrast, does nothing at all.
The information of the gene is relevant only in the right context. Here that context is the hemoglobin gene in the cell nucleus, within a cell that contains protein-synthesizing structures called ribosomes. Messenger RNA carries the information of the base sequence to the ribosome, another set of RNA molecules, the transfer RNAs, each carry a given amino acid, and at the ribosome a protein molecule is synthesized.
This indicates that information has a broader aspect. For the activities of the cell, the fundamental unit of life, the concept of information must be broadened to include where and when the information can be used. There are three levels of information, called syntactic, semantic, and pragmatic. Using the example of the hemoglobin gene, I will explain how each level relates to molecular biology in a general way.
The syntactic level is nothing more than the order of a sequence, whether of numbers, amino acids, nucleotides, or anything else, physical or abstract, arranged in a particular order. A polypeptide is a chain of amino acids, the building blocks of proteins. There are twenty standard amino acids in nature, and a polypeptide can be formed from any sequence of them. Likewise, a polynucleotide is a chain of nucleotides, the building blocks of DNA and RNA. There are four nucleotides: in DNA the bases are adenine, thymine, guanine, and cytosine, while in RNA the base uracil replaces thymine and the other three bases remain. Here too many different sequences are possible.
You can begin to sense the parallel. In information technology, computers manipulate and store information in a binary language of 0’s and 1’s. Biological information, in the form of proteins and nucleic acids, is based on monomers, or building blocks: twenty amino acids for proteins and four nucleotides for nucleic acids. Just as a computing device such as a desktop or laptop cannot function without software, the information that makes it work, life cannot function without its biological equivalent, the sequence of nucleotides in genes, which determines the sequence of amino acids in proteins. Without the correct nucleotide sequence there can be no proteins of definite sequence, and likely no vital life processes such as metabolism, which depends on enzymes, protein molecules that speed up biochemical reactions and are highly specific for the molecules they act on. Here is the problem, since we are still considering the syntactic level of information. For an enzyme to break down a molecule such as glucose, its amino acids must be in the correct sequence, and that sequence is determined by the corresponding sequence of nucleotides. But there is not one possible arrangement of nucleotides but many, and likewise not one but many possible sequences of amino acids in a polypeptide.
To make this even clearer, consider once again the hemoglobin molecule. Its amino acid sequence has been determined using sophisticated biochemical, biophysical, and computational methods, and hemoglobin can function as an oxygen carrier because of its sequence. In the hereditary disease sickle cell anemia, a single mutation in the gene for one of the hemoglobin chains changes a single amino acid. The altered hemoglobin tends to clump together once it gives up its oxygen, which distorts the red blood cells and impairs oxygen delivery. Hemoglobin is built from the 20 different amino acids, and its beta chain, the one affected in sickle cell anemia, is 146 amino acids long. I have mentioned that for a polypeptide such as hemoglobin there is not one but many possible arrangements. How many alternative sequences of the same length are possible? Since each of the 146 positions can hold any of the 20 amino acids, the answer is 20^146.
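The size of this ensemble can be checked with a short calculation. What follows is a minimal sketch in Python, assuming each of the 146 positions may independently hold any of the 20 amino acids; the names `count_sequences`, `NUM_AMINO_ACIDS`, and `CHAIN_LENGTH` are my own, chosen for illustration.

```python
# Count the possible amino acid sequences for a chain of a given length,
# assuming any of the 20 standard amino acids may occupy each position.
NUM_AMINO_ACIDS = 20
CHAIN_LENGTH = 146  # chain length quoted in the text for hemoglobin


def count_sequences(alphabet_size: int, length: int) -> int:
    """Return the number of distinct sequences of the given length."""
    return alphabet_size ** length


total = count_sequences(NUM_AMINO_ACIDS, CHAIN_LENGTH)
digits = len(str(total))  # Python integers are exact, so no rounding occurs

print(f"Possible sequences: 20^{CHAIN_LENGTH}")
print(f"That number has {digits} decimal digits")
```

The result has 190 digits, far more than any process could try one by one, which is why the functional hemoglobin sequence counts as one arrangement out of a vast ensemble.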
Since the syntactic level is a sequence of symbols, whether the bases in genes or the 0’s and 1’s in computer memory, can the information be quantified with a mathematical formula? As a matter of fact yes, but only at the syntactic level. The science that studies information, aptly named information theory, was developed in the late 1940’s in the then-infant field of communication and computer engineering. In 1948 the American mathematician Claude Shannon devised a formula based on the probabilities of the messages that a source can send. Shannon was concerned with whether the message received matches the message sent, since a message can be corrupted by noise, such as static on a telephone line, which can interfere with its meaning (meaning belongs to the semantic level, which we will get to shortly). His formula measures the uncertainty over the many ways a message could be arranged, and the formula is
$$H = \sum_i p_i I_i = -\sum_i p_i \ln p_i$$
In words: out of N possible messages, each message, that is, each arrangement of the symbols, has a probability p of being found, and that probability is multiplied by I = -ln(p), the information content gained on receiving that message; the products are then summed over all messages. In the simplest case every message is equally probable, just as each of the numbers 1, 2, 3, 4, 5, and 6 has an equal chance of landing face up on a tossed die, and H reduces to ln(N). As far as probability is concerned, a sequence that conveys something meaningful to an observer is no more probable than one that is pure nonsense. Once again consider the hemoglobin molecule. Its polypeptide sequence is known; it is 146 amino acids long and draws on the 20 naturally occurring amino acids. The sequence is coded by the hemoglobin gene according to the genetic code: the DNA is copied into RNA, and the RNA is then translated from nucleic acid language into protein language. The hemoglobin sequence is just one out of a vast number of possible sequences, and taken purely as arrangements each is as likely as any other. H, called the Shannon entropy, measures the average uncertainty over this whole ensemble of messages, but the formula deals only with probabilities; it says absolutely nothing about what a message means or what it can do. Interestingly enough, as information theory matured it became clear that a message can only mean something if there is someone or something to receive it, that is, if it is sent from a source to a receiver, and that out of an ensemble of messages, a message with meaning is one that has an effect.
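To make the formula concrete, here is a small Python sketch of the Shannon entropy using natural logarithms, as in the formula above. The function name `shannon_entropy` and the example probabilities for the loaded die are my own assumptions for illustration.

```python
import math


def shannon_entropy(probabilities):
    """Return H = -sum(p * ln p) in nats; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# A fair die: six equally probable outcomes, so H should equal ln(6).
fair_die = [1 / 6] * 6
h_fair = shannon_entropy(fair_die)

# A hypothetical loaded die: one face comes up half of the time.
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
h_loaded = shannon_entropy(loaded_die)

print(f"fair die:   H = {h_fair:.4f} nats, ln(6) = {math.log(6):.4f}")
print(f"loaded die: H = {h_loaded:.4f} nats")
```

The fair die gives the largest possible entropy for six outcomes, and the loaded die gives less because its outcome is easier to guess. Nothing in the calculation refers to what any outcome means.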
Out of the ensemble of alternative polypeptide sequences, the sequence of normal hemoglobin was found by natural selection because it aids the survival of the organism and hence of the species. For a sequence to be of any use, it must benefit the organism that carries it, such as ourselves. On this view, information is meaningful only insofar as it affects a receiver, whatever the source. In cells, information flows from genes to proteins. In eukaryotic cells, the cells with nuclei, the genetic information in the nucleus is the source, and the protein-making machinery in the rest of the cell, which needs that information in the form of RNA, is the receiver. Cells therefore already contain a source and a receiver, and to see how sequences can be meaningful we must go from the syntactic to the semantic level.
Any symbol in a sequence can have any probability, but whether a sequence makes a difference depends not just on its arrangement but on the effect it has between two related levels. Call one of these levels the “sender” and the other the “receiver.” Meaning arises from the relation between the two, and this gives a new dimension of information called the semantic level.
The semantic level depends on the relation between two or more levels, and this is where information shows its effects. Recall that the syntactic level is nothing but sequences, of letters, marks on paper, or amino acids in a polypeptide, together with the probabilities of their arrangements. At the simplest level are the microstates: the number and kinds of things, whether physical, such as an amino acid in a protein, or abstract, such as a number or letter. With a large number of possible arrangements there is a transition to a higher level, and this higher level is called the macrostate.
The relation between the two levels determines whether a message can have an effect, and this depends on what the two levels have in common. An example will make this clear. Suppose a “sender” who speaks and a “receiver” who listens and takes action. Sender and receiver have something in common, the use of a shared language. Suppose that the sender sends out a message consisting of this one particular arrangement of symbols, which in this case are drawn from the 26 letters of the English alphabet:
OG OT HET ITKHNEC
At the syntactic level, every arrangement of the letters within each group is equally probable, so this is just one arrangement out of many. Consider this particular sequence first. Recall that the microstate consists of the fundamental constituents. Here segments of curves and lines form letters of the Latin alphabet, namely O, G, T, H, E, I, K, N, and C, and these letters define the microstate. Letters combine into words, which are the macrostate. But this particular sequence, which is what the sender says to the receiver, looks like gibberish. It has no effect, and the receiver does nothing at all. Why?
Both sender and receiver use the same language and know its grammar and syntax, and a sentence like that carries no meaning that would influence the future behavior of the receiver. But suppose that another arrangement of the same letters gives
GO OT THE KITNHEC
The receiver now recognizes two of the words, “GO” and “THE,” but still has no idea whether the new sentence is some sort of instruction. Then comes another arrangement, and something novel happens:
GO TO THE KITCHEN
This is a recognizable sentence, and the receiver now understands it to be a command. Assuming that sender and receiver are inside a house, the receiver carries out the order by going to the kitchen.
In this example, meaning is apparent in one particular sequence out of the many possible arrangements of the letters G, T, O, H, E, I, C, N, and K. Sender and receiver, the two levels in this example, can relate to one another because both possess a knowledge of the grammar and syntax of the language and also the experience of living in a place they share. Information acquires meaning only if it can have an effect across these levels.
This is true not just in linguistics but also in molecular biology. Now let the sender be DNA and the receiver be the part of the cell involved in protein synthesis, a subcellular structure present in all life forms called the ribosome.
How does a ribosome, which is found in all kinds of cells, use the information in DNA to produce the string of amino acids that will become a protein? First, the information in DNA must be copied into another nucleic acid, ribonucleic acid or RNA. There are several kinds of RNA, and these are the nucleic acids that actually carry out the work, whereas DNA is something of a repository of genetic information. The kind of RNA that carries information from DNA to the ribosome is aptly named messenger RNA. Along its length are the four bases adenine, guanine, cytosine, and, in place of thymine, uracil. As in DNA, every three bases correspond to a given amino acid. A special enzyme, RNA polymerase, synthesizes the strand of messenger RNA carrying the three-base sequence for each amino acid. If along the gene there is a sequence of DNA bases that reads
AAT TTG GGC CAT GCC CCA
then, because A pairs with T and G pairs with C in DNA while in RNA uracil (U) takes the place of T, the corresponding messenger RNA would be
UUA AAC CCG GUA CGG GGU
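The base-pairing rule used here can be written as a simple lookup. The Python sketch below treats the DNA sequence above as the strand that RNA polymerase copies and pairs each base with its RNA partner. It ignores the fact that the two strands run in opposite directions, and the names `DNA_TO_RNA` and `transcribe` are mine, not standard ones.

```python
# RNA base that pairs with each base on the DNA strand being copied.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """Return the messenger RNA complementary to a DNA template strand.

    Characters that are not DNA bases (such as the spaces between
    codons) are passed through unchanged.
    """
    return "".join(DNA_TO_RNA.get(base, base) for base in template)


dna = "AAT TTG GGC CAT GCC CCA"
print(transcribe(dna))  # UUA AAC CCG GUA CGG GGU
```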
Recall that DNA is located in chromosomes, which are found in both eukaryotic and prokaryotic cells. In eukaryotic cells, DNA is wrapped around special proteins called histones, and this combination of DNA and protein forms the chromosomes. All the chromosomes are enclosed by a double membrane, the nuclear envelope, which defines the cell’s nucleus. Prokaryotic cells, on the other hand, have no nucleus, and typically there are no histones surrounding the DNA; the DNA molecule is tightly coiled inside the tiny volume of the prokaryotic cell.
As far as information is concerned, prokaryotic and eukaryotic cells have something in common: in both, DNA is the sender of genetic information. Messenger RNA carries the information for a polypeptide to the ribosome, and both messenger RNA and ribosomes are present in eukaryotic and prokaryotic cells. The ribosome, which is itself built largely of ribosomal RNA, moves along the messenger RNA codon by codon. For each codon it uses another form of RNA, called transfer RNA, which is bonded to a given amino acid. The ribosome, together with the messenger RNA and the transfer RNAs carrying amino acids, is how the cell translates nucleic acid language, written in just four bases or letters, into the language of proteins, written in 20 amino acids.
Recall also that in the analogy of sender and receiver, the two share a common language. If DNA is the sender in cells, then the ribosomes are the receivers. What is the common language between DNA (sender) and ribosomes (receiver) that allows protein synthesis, as it is called, to occur? The answer is that all life on earth depends on a code, appropriately named the genetic code.
The genetic code relates codons, the three-letter groups of bases in a nucleic acid, whether DNA or RNA, to amino acids. With four bases taken three at a time, how many possible codons are there? There are 4^3 = 64. Since there are only 20 amino acids, the code is redundant: each codon specifies just one amino acid (or a stop signal), some amino acids are specified by a single codon, and others are specified by several. The code is universal in all life forms (in reality there are some exceptions, but they do not affect the main argument here), and at the level of the cell the genetic code is the language understood between DNA and ribosome.
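The count of 64 can be confirmed by listing every triplet. This is a minimal Python sketch using the standard library's `itertools.product`; the variable names are my own.

```python
from itertools import product

BASES = "UCAG"    # the four RNA bases
CODON_LENGTH = 3  # bases per codon

# Every ordered triplet of bases, e.g. "UUU", "UUC", "UUA", ...
codons = ["".join(triplet) for triplet in product(BASES, repeat=CODON_LENGTH)]

print(len(codons))  # 64, which is 4 ** 3
print(codons[:4])   # ['UUU', 'UUC', 'UUA', 'UUG']
```

With 64 codons and only 20 amino acids to encode, at least some amino acids must be served by more than one codon.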
In addition there is a one-way flow of molecular biological information from DNA to RNA to protein. Like the genetic code, this flow is found throughout life, and it is known as the central dogma. In the terms used here, information flows from the sender (DNA) to the receiver (ribosome), and through the ribosome proteins are synthesized. More simply: DNA to RNA to protein.
To see how the genetic code operates between DNA and ribosomes, consider the DNA codon TTT, or thymine, thymine, thymine. During transcription, RNA polymerase reads the complementary strand of that section of DNA (AAA) and synthesizes messenger RNA, and since uracil takes the place of thymine in RNA, the messenger RNA codon is UUU, or uracil, uracil, uracil. According to the genetic code, this codon specifies just one amino acid, phenylalanine. When the messenger RNA makes contact with a ribosome, the codon UUU is matched by base pairing with a complementary triplet, the anticodon, on a transfer RNA; in the simplest picture this anticodon is AAA, or adenine, adenine, adenine. The ribosome thus selects the transfer RNA that carries phenylalanine. Because of the common language that exists between DNA and ribosome, a phenylalanine is then inserted into the growing polypeptide, and the ribosome moves on to the next codon of the messenger RNA to add another amino acid. Thus, in addition to the syntactic level, which deals only with the probabilities of symbols in a sequence such as the arrangement of codons in DNA, there is the semantic level. Here a common language, the genetic code, links a molecule, DNA, through an information channel, messenger RNA, to a molecular complex, the ribosome, and the result is another property of life: protein synthesis.
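The lookup that the ribosome and transfer RNAs carry out can be imitated with a table. The Python sketch below builds the standard genetic code from a 64-letter string of one-letter amino acid codes (F is phenylalanine, and "*" marks a stop codon) and translates codon by codon. The helper names `translate` and `codons_for` are my own, and the sketch makes no attempt to model start codons, stop handling, or the chemistry involved.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons taken in
# UCAG order (UUU, UUC, UUA, UUG, UCU, ...). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate space-separated mRNA codons into one-letter amino acid codes."""
    return "".join(CODON_TABLE[codon] for codon in mrna.split())


def codons_for(amino_acid: str) -> list:
    """List every codon that specifies the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


print(translate("UUU"))                      # F (phenylalanine)
print(codons_for("F"))                       # ['UUU', 'UUC']
print(translate("UUA AAC CCG GUA CGG GGU"))  # LNPVRG
```

Each codon maps to exactly one entry in the table, while an amino acid such as phenylalanine can be reached from more than one codon.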
Thanks to the genetic code, any protein molecule can be synthesized. But DNA codes for a great many kinds of proteins that allow the organism to survive, and this leads to another important level of information: how a protein, in the context of the cell, whether the organism is unicellular like an amoeba or multicellular like us, makes possible important biological functions such as growth and adaptation. This is known as the pragmatic aspect.
If sender and receiver have a language in common and the receiver understands what was sent, the message provokes the receiver to act. The pragmatic aspect concerns this action, which is taken only when sender and receiver share a language. The same is true at the cellular level. The nucleus holds the instructions for a particular protein, and the ribosome, the receiver, uses messenger RNA to synthesize that protein, which ends up affecting the cell in some way. The protein could be an enzyme used by a unicellular organism or, in a multicellular organism such as ourselves, a hemoglobin molecule needed for oxygen transport.
In pragmatics, a message needs two ingredients in order to bring about an action. The first is called confirmation: the message must have at least some degree of constancy, for if everything in it were completely novel it might have no effect. The other is novelty: the message may differ enough from what came before to elicit from the receiver a new response that was not present in its previous behavior.
Cells reproduce, and at the molecular level DNA carries the instructions for its own replication. In a step-by-step sequence of events one cell becomes two, and both cells carry the information that was present in the parent cell. Here we can see the confirmation part of the pragmatic aspect of information: all the information needed to construct a cell is present in both daughter cells, so each has all it needs to carry out its normal functions. DNA replication is not an exact process, however, and mutations in the sequence of genes are inevitable. If mutations were frequent enough to disrupt the base sequence of every gene, they would likely disrupt the entire physiological function of the cell, and it might not divide. To ensure faithful transmission of genetic information, cells have evolved a set of special “proofreading” enzymes that keep errors in DNA replication to a minimum. This too is the confirmation aspect.
Of course, there is no perfect transmission of information, and errors will occur, a fact well known to Shannon and also evident from findings in molecular biology. No matter how carefully the proofreading enzymes keep mutations in check, some progeny will have a slightly different DNA sequence. Offspring with a slightly different genotype may end up with a novel enzyme that confers an advantage their parents did not have. If, for instance, the enzyme allows more efficient synthesis of protein molecules, it is likely to be favored by natural selection, and in time there may be a new life form better able to survive in its environment. This is novelty, in the sense of a new form of information that affects the survival of future organisms. For life to evolve there has to be a balance between confirmation and novelty. If there were only confirmation and no novelty, then, since environments change, organisms with perfect replication of genetic information would be less likely to survive the changes. If there were only novelty, in the form of rapid mutation between generations, the drastic mutations would likely damage biological systems. The evolution of life is thus a balance between novelty and confirmation, which defines life’s pragmatic aspect.
Although the computers of today run software that has been under development only since the late 1940’s, when the first electronic computers were built, life has been using software for about 3.8 billion years. Life has indeed used a form of information technology. I have written elsewhere about definitions of life, and one of them is that the organization of life is made possible by a single piece of software, the genetic code, together with an information channel, the central dogma. Add the three dimensions of information, which life uses so well at the molecular level, and the informational aspect of life can be clearly appreciated.
Aldridge, S. (1996). The Thread of Life: The Story of Genes and Genetic Engineering. Cambridge, England: Cambridge University Press.
Computer Science (n.d.). Retrieved June 28, 2016, from https://en.wikipedia.org/wiki/Computer_science
Frank-Kamenetskii (1996). Unraveling DNA: The Most Important Molecule of Life (L. Liapin, Trans.). Reading, MA: Addison-Wesley. (Original work published 1993)
Gribbin, J. (1985). In Search of the Double Helix: Quantum Physics and Life. London, England: Penguin Books Ltd.
Hemoglobin (n.d.). Retrieved October 4, 2016, from https://en.wikipedia.org/wiki/Hemoglobin
Hemoglobin (n.d.). Retrieved October 4, 2016, from http://www.proteopedia.org/wiki/index.php/Hemoglobin
Küppers, O. B. (1990). Information and the Origin of Life (M. Scripta, Trans.). MIT Press.
Information (n.d.). Retrieved June 28, 2016, from https://en.wikipedia.org/wiki/Information
Information Technology (n.d.). Retrieved June 28, 2016, from https://en.wikipedia.org/wiki/Information_technology
Lesk, A. M. (2008). Introduction to Bioinformatics (3rd ed.). Oxford, England: Oxford University Press.
Loewenstein, W. A. (1998). The Touchstone of Life: Molecular Information, Cell Communication, and the Foundations of Life. New York, NY: Oxford University Press.
Yockey, P. H. (1992). Information Theory and Molecular Biology. New York, NY: Cambridge University Press.
Enzymlogic DNA https://www.flickr.com/photos/101755654@N08/9735192821/in/photolist-ae1zYZ-7TqgUV-qZChgN CC BY-SA 2.