Biopython: Comparing the DNA Polymerase I (polA) Gene of Thermophilic, Mesophilic, and Psychrophilic Bacteria

Biopython is a specialized Python tool for computational molecular biology. Various computational molecular analyses can be performed using Biopython, such as reconstructing phylogenetic trees, multiple sequence analysis, generating complementary sequences, and counting amino acids. This technical notes paper describes in detail the procedures for computational DNA sequence analysis using Biopython. The DNA polymerase I (polA) gene sequences of bacteria were used in this study to compare thermophilic, mesophilic, and psychrophilic bacteria. This is an open access article distributed under the Creative Commons 4.0 Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. ©2021 by author.


INTRODUCTION
Computational biology in the 21st century is growing rapidly, reaching almost all areas of the biological sciences. Markowetz (2017) stated that "all biology is computational biology": biological knowledge nowadays is analyzed, defined, organized, and accessed through computation. Computational biology can be used to solve problems in bioinformatics, such as analyzing large volumes of genetic code data. Analysis in bioinformatics mainly focuses on three types of big data sets available in molecular biology, e.g. genome sequences (Bayat, 2002; Luscombe et al., 2001; Markowetz, 2017).
Several programming languages are commonly used in bioinformatics, such as Python, R, Java, C++, and Bash (Fourment & Gillings, 2008). Python is the most popular among them, as a Python package has been developed specifically for biologists, namely Biopython (Chapman & Chang, 2000).
Biopython is a specialized Python tool for computational molecular biology. Various computational molecular analyses can be performed using Biopython, such as reconstructing phylogenetic trees, multiple sequence analysis, generating complementary sequences, counting amino acids, and finding specific sequences. In this paper, Biopython is used to analyze the DNA polymerase I (polA) gene sequences of several bacterial species from different groups in order to understand the genetic differences between them, as several bacterial species are capable of surviving at extremes of temperature and pressure. Based on their ability to survive in an environment with a certain temperature range, bacteria can be classified into three groups, namely thermophilic, mesophilic, and psychrophilic (Chen & Berns, 1980; Shing et al., 1975). These three bacterial groups differ genetically.
DNA polymerases are enzymes needed for DNA replication each time a cell divides. These enzymes form two identical DNA strands by assembling nucleotides (the building blocks of DNA) based on one original DNA molecule (Garcia-Diaz & Bebenek, 2007; Lehman & Uyemura, 1976; Loeb & Monnat, 2008; Yoon et al., 2014). The essential role of DNA polymerases is to replicate the genome precisely and effectively so that the inherited genetic information is maintained through generations (Allen et al., 2011; Garcia-Diaz & Bebenek, 2007).
Specific DNA polymerases such as Taq polymerase play a key role in the polymerase chain reaction (PCR) process (Chien et al., 1976; Ishino & Ishino, 2014). Taq polymerase is a thermostable DNA polymerase that was originally isolated from the thermophilic bacterium Thermus aquaticus in 1976 (Chien et al., 1976). T. aquaticus lives in hot springs and hydrothermal vents, which is why Taq polymerase tolerates high temperatures.
This technical notes paper describes in detail the procedures for computational DNA sequence analysis using Biopython. The DNA polymerase I (polA) gene sequences of bacteria were used in this study to compare thermophilic, mesophilic, and psychrophilic bacteria.

DNA SEQUENCE ANALYSIS USING BIOPYTHON

DNA Polymerase I (polA) Gene Sequences
The DNA polymerase I (polA) gene sequences of thermophilic, mesophilic, and psychrophilic bacteria were downloaded from NCBI directories. Detailed information about the DNA polymerase I gene sequences used in this paper is presented in Table 1.
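Once downloaded, the sequences have to be read into Python before they can be analyzed. The sketch below assumes the NCBI downloads are saved as FASTA files, which the text does not state; the function name read_fasta and the example record are hypothetical. It uses Bio.SeqIO when Biopython is installed and otherwise falls back to a small plain-Python parser.

```python
import io

try:
    from Bio import SeqIO  # Biopython, if installed
except ImportError:
    SeqIO = None


def read_fasta(handle):
    """Return {record_id: sequence_string} for every record in a FASTA handle."""
    if SeqIO is not None:
        return {rec.id: str(rec.seq) for rec in SeqIO.parse(handle, "fasta")}
    # Fallback: minimal FASTA parser (assumption: plain, well-formed FASTA).
    records, current = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""  # id = first word of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record standing in for a downloaded polA FASTA file.
example = io.StringIO(">example_polA hypothetical record\nATGGCCATT\nGTAATGTAG\n")
sequences = read_fasta(example)
print(sequences)
```

For a real file, the same function can be called on an open file object, for example inside a with open(...) block.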

DNA Sequence Analysis
The DNA polymerase I gene sequences are analyzed computationally using Biopython. The genetic properties analyzed in this technical notes paper are GC-content and amino acids. Detailed information about the software used is presented in Table 2. The following DNA sequence analysis procedures consist of Python scripts adapted from the official Biopython Tutorial and Cookbook (Chang et al., 2020).

Calculate GC-Content
GC-content is the percentage of nitrogenous bases in DNA or RNA that are guanine (G) or cytosine (C), and is one of the basic parameters commonly used to describe genomes (Gao & Zhang, 2006; Piovesan et al., 2019). There are two methods of using Biopython scripts to calculate the GC-content of the gene sequences, based on the Biopython modules Bio.Seq and Bio.SeqUtils. The following example shows the application of these scripts to calculate the GC-content.
Figure 2 shows the results of the GC-content analysis using Biopython scripts. The DNA sequence is the DNA polymerase I (polA) gene of Thermus thermophilus. The sequence ATG....TAG presented above is a shortened form of the complete sequence, used to simplify the example. Based on the results, the polA gene of T. thermophilus has a GC-content of 67.86% over the whole sequence.
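The two approaches can be sketched as follows. The first counts G and C directly (the same count() and len() calls also work on a Bio.Seq.Seq object), and the second uses gc_fraction from Bio.SeqUtils, which is available in recent Biopython releases (older releases provided a GC function that returned a percentage). The function names and the short fragment are illustrative assumptions; the fragment is not the T. thermophilus polA sequence.

```python
def gc_percent_manual(dna):
    """GC-content (%) by counting G and C directly in the sequence."""
    seq = str(dna).upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_percent(dna):
    """GC-content (%) via Bio.SeqUtils when available, else the manual count."""
    try:
        from Bio.SeqUtils import gc_fraction  # recent Biopython releases
    except ImportError:
        # Biopython missing or too old: fall back to the direct count.
        return gc_percent_manual(dna)
    return 100.0 * gc_fraction(dna)  # gc_fraction returns a value in [0, 1]


# Hypothetical short fragment used only to exercise the functions.
fragment = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(f"manual:   {gc_percent_manual(fragment):.2f}%")
print(f"SeqUtils: {gc_percent(fragment):.2f}%")
```

For a sequence containing only A, C, G, and T, both functions should return the same value.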

Amino Acids
Amino acids are organic compounds that combine to form proteins as the product of gene expression. Amino acids serve as the nitrogenous backbones for compounds such as enzymes and hormones (LaPelusa & Kaushik, 2021; Lopez & Mohiuddin, 2021; Wu, 2009) (Rother & Krzycki, 2010; Wu, 2009; Yuan et al., 2010; Zhang et al., 2005). The following Biopython scripts can be used to analyze and calculate amino acids from the DNA sequence.
Transcription (1): >>> from Bio.Seq import Seq
The scripts above represent the common process of protein synthesis, which consists of two main phases, namely transcription and translation. Transcription is the process of copying DNA segments into mRNA by the enzyme RNA polymerase. mRNA (messenger RNA) consists of genetic codes called codons, sequences of three nucleotides that correspond to specific amino acids. Translation is the process of translating mRNA codons into amino acids to form proteins (Pánek et al., 2017; Smith, 2020; Taylor, 2006). The following example shows the application of the transcription and translation Biopython scripts to generate mRNA and protein sequences.
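A minimal sketch of the two phases, assuming the input is the coding strand of the gene. When Biopython is installed it calls Seq.transcribe() and Seq.translate(to_stop=True); otherwise it falls back to a plain-Python equivalent built on the standard genetic code. The wrapper names transcribe and translate and the short test sequence are illustrative assumptions, not the paper's polA data.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), used only by the fallback.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

try:
    from Bio.Seq import Seq  # Biopython, if installed
except ImportError:
    Seq = None


def transcribe(coding_dna):
    """Return the mRNA string for a coding-strand DNA string (T becomes U)."""
    if Seq is not None:
        return str(Seq(coding_dna).transcribe())
    return coding_dna.replace("T", "U").replace("t", "u")


def translate(mrna):
    """Translate mRNA to one-letter amino acids, stopping at the first stop codon."""
    if Seq is not None:
        return str(Seq(mrna).translate(to_stop=True))
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):  # walk the sequence codon by codon
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical short coding sequence used only for illustration.
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
mrna = transcribe(coding_dna)
protein = translate(mrna)
print(mrna)
print(protein)
```

Passing to_stop=True ends translation at the first in-frame stop codon, so the resulting protein string contains no "*" characters.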

Figure 4. Transcription and Translation using Biopython Scripts
The amino acid composition is calculated after the protein sequence has been generated. The basic principle is to take the ratio of each particular amino acid to the total number of amino acids in the protein sequence; this ratio is then implemented as a Biopython script, as presented in the following example.
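The ratio described above can be sketched in plain Python as follows. The function name amino_acid_composition and the short peptide are illustrative assumptions; the input is the one-letter protein string, for example str() of a translated Biopython Seq.

```python
from collections import Counter


def amino_acid_composition(protein):
    """Return {amino_acid: percentage} for a one-letter protein string."""
    residues = str(protein).upper().replace("*", "")  # ignore stop symbols
    total = len(residues)
    if total == 0:
        return {}
    counts = Counter(residues)
    # Percentage = count of one amino acid / total number of amino acids.
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


# Hypothetical short peptide used only for illustration.
composition = amino_acid_composition("MAIVMGR")
for amino_acid, percent in composition.items():
    print(f"{amino_acid}: {percent:.2f}%")
```

The percentages of all amino acids in a sequence sum to 100, which gives a quick sanity check on the result.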

GC-content
The comparison of the DNA polymerase I gene of thermophilic, mesophilic, and psychrophilic bacteria is based on the results of the sequence analysis using Biopython scripts. The results (Table 2 and Figure 6) show that the thermophilic bacterial group, consisting of Thermus thermophilus and Geobacillus stearothermophilus, has the highest GC-content of the DNA polymerase I (polA) gene, with 67.86% and 55.21% respectively. The mean GC-content of the thermophilic polA genes is 61.54%, which is greater than the mean GC-content of the mesophilic (47.96%) and psychrophilic (36.44%) bacteria. The thermophilic bacteria tend toward high GC-content (Figure 6), which is attributed to their adaptation to high temperatures (Kagawa et al., 1984; Musto et al., 2004; Saunders et al., 2003).
These results also suggest that the DNA polymerase I enzymes of thermophilic bacteria are more thermally stable and are ideal for use in the PCR cycle, since their gene sequences have a higher GC-content. Polymerase chain reaction (PCR) is a molecular biology technique used to amplify target DNA sequences from a DNA template by thermal cycling (Erlich, 1989; Wages, 2005). PCR requires a temperature-dependent DNA polymerase to enzymatically replicate the desired target sequences (Caetano-Anollés, 2013).
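The thermophilic group mean follows directly from the two values quoted above: (67.86 + 55.21) / 2 = 61.535, which is reported as 61.54%. The short sketch below reproduces that arithmetic and the gaps to the other group means; it uses only the figures given in the text, and the variable names are illustrative.

```python
from statistics import mean

# GC-content (%) of the polA gene for the two thermophilic species, as quoted.
thermophilic = {
    "Thermus thermophilus": 67.86,
    "Geobacillus stearothermophilus": 55.21,
}
# Group means (%) quoted in the text for the other two groups.
reported_means = {"mesophilic": 47.96, "psychrophilic": 36.44}

thermophilic_mean = mean(thermophilic.values())
print(f"thermophilic mean: {thermophilic_mean:.3f}%")

# Difference between the thermophilic mean and each other group mean.
differences = {group: thermophilic_mean - value for group, value in reported_means.items()}
for group, diff in differences.items():
    print(f"thermophilic minus {group}: {diff:.2f} percentage points")
```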

Amino Acids
The length of the protein sequences generated from the complete DNA polymerase I gene sequences of thermophilic, mesophilic, and psychrophilic bacteria ranges from 835 to 929 amino acids (Table 3). The amino acid composition analysis using Biopython scripts shows that the DNA polymerase I (polA) genes of thermophilic, mesophilic, and psychrophilic bacteria have relatively similar amino acid compositions. The DNA polymerase genes from the three groups of bacteria are rich in leucine (Leu; L), alanine (Ala; A), and glutamic acid (Glu; E), and low in cysteine (Cys; C), as presented in Figure 7. A sequence characterization study of DNA polymerase I (Brown et al., 1982) compares the amino acid composition of E. coli DNA polymerase I and its derived fragments, which are rich in leucine and glutamic acid and low in cysteine. Leucine is one of nine essential amino acids and is important for many metabolic functions, whereas glutamic acid and cysteine are nonessential amino acids (Lopez & Mohiuddin, 2021; Tessari et al., 2016).

CONCLUSION
DNA sequence analysis, including GC-content and amino acid composition analysis, can be done using Biopython. Bio.Seq and Bio.SeqUtils are the main Biopython modules used in this paper to compare the GC-content and amino acid composition of the DNA polymerase I (polA) gene of thermophilic, mesophilic, and psychrophilic bacteria. The sequence analysis shows the following results: 1) thermophilic bacteria have the highest GC-content of the DNA polymerase I (polA) gene compared to mesophilic and psychrophilic bacteria; 2) the DNA polymerase I (polA) genes of thermophilic, mesophilic, and psychrophilic bacteria have relatively similar amino acid compositions.