Computational Tools and Resources – whatsnewinpreprint.com

Computational tools and resources in bioinformatics encompass a wide array of software, databases, algorithms, and technologies designed to analyze, interpret, and manage biological data. These tools help researchers explore genetic sequences, protein structures, and other biological information. The main categories are:

Software Tools: Programs and applications developed for specific bioinformatics tasks such as sequence alignment (e.g., BLAST), genome assembly (e.g., SPAdes), protein structure prediction (e.g., SWISS-MODEL), and data visualization (e.g., IGV).

Databases and Repositories: Centralized repositories storing biological data like genomic sequences (GenBank, ENA), protein sequences (UniProt), genetic variations (dbSNP), and 3D protein structures (PDB).

Algorithms and Computational Methods: Mathematical and statistical algorithms designed to analyze DNA/RNA sequences, predict protein structures, identify motifs or patterns, perform phylogenetic analysis, and more. Examples include sequence alignment algorithms (Needleman-Wunsch, Smith-Waterman), clustering methods, and machine learning algorithms applied to biological data.

Web-based Tools and Servers: Online platforms providing access to various bioinformatics tools, enabling researchers to perform analyses remotely without needing to install software locally. Examples include Galaxy, NCBI’s suite of tools, and web front ends to the EMBOSS package (which is itself a command-line suite).

Data Integration and Analysis Platforms: Comprehensive platforms allowing integration, analysis, and interpretation of multiple omics data types (genomics, transcriptomics, proteomics) to generate insights into biological systems. Examples include Bioconductor for the R programming language and commercial platforms from vendors such as Illumina or Thermo Fisher.

These tools and resources are essential for managing the vast amounts of biological data generated by modern technologies and are integral to advancing research in fields such as genetics, medicine, agriculture, and beyond. They empower scientists to extract meaningful information, uncover patterns, and make significant discoveries in the biological sciences.

Computational biology tools and resources

Programming in bioinformatics

Code for bioinformatics can be written in many languages, but Python has become one of the most widely used, in bioinformatics as in many other domains, for several reasons:

Ease of Learning and Readability: Python’s simple and readable syntax makes it accessible to beginners. Its clean and straightforward code structure allows researchers with biological expertise to quickly learn and apply programming concepts.
Versatility and Extensive Libraries: Python boasts a rich ecosystem of libraries and tools specifically tailored for bioinformatics tasks. Libraries like Biopython offer a wide range of functionalities, from sequence manipulation to accessing biological databases, simplifying complex tasks.
Community Support: Python has a vast and active community in bioinformatics. This community continually develops and maintains bioinformatics-specific tools and resources, providing a wealth of tutorials, documentation, and forums for support and collaboration.
Integration Capabilities: Python interfaces well with other languages and software, facilitating integration with existing bioinformatics tools written in different languages. It can easily incorporate modules from C/C++, Java, or R, enhancing its capabilities.
Data Handling and Analysis: Python’s libraries, such as Pandas and NumPy, excel in data manipulation, making it efficient for handling large biological datasets, performing statistical analyses, and visualizing results.
Adaptability to Rapid Prototyping: In research environments, Python’s flexibility allows for quick prototyping and testing of algorithms or ideas. Its rapid development capabilities enable researchers to iterate and refine their approaches swiftly.
Availability and Portability: Python is open-source and platform-independent, making it accessible across various operating systems. This ensures consistency and compatibility in bioinformatics workflows, regardless of the computational environment.

The following code shows how data can be retrieved from the UniProt database:

from Bio import ExPASy
from Bio import SwissProt

# Accessing a specific UniProt entry by its ID
accession_id = "P21802"  # Replace with your desired UniProt ID
handle = ExPASy.get_sprot_raw(accession_id)
record = SwissProt.read(handle)

# Printing basic information about the protein
print(f"Accession ID: {record.accessions[0]}")
print(f"Entry Name: {record.entry_name}")
print(f"Organism: {record.organism}")
print(f"Description: {record.description}")
print(f"Sequence Length: {len(record.sequence)}")

This will output:

Accession ID: P21802
Entry Name: FGFR2_HUMAN
Organism: Homo sapiens (Human).
Description: RecName: Full=Fibroblast growth factor receptor 2; Short=FGFR-2; EC=2.7.10.1 {ECO:0000269|PubMed:16844695, ECO:0000269|PubMed:18056630, ECO:0000269|PubMed:19410646, ECO:0000269|PubMed:21454610}; AltName: Full=K-sam; Short=KGFR; AltName: Full=Keratinocyte growth factor receptor; AltName: CD_antigen=CD332; Flags: Precursor;
Sequence Length: 821

The next example uses BLAST to align a nucleotide sequence against the NCBI nucleotide database. To run it, create a file called sequence.fasta containing the following nucleotide sequence (an extract of the human alcohol dehydrogenase 1A gene, ADH1A):

>NC_000004.12:c99290985-99276369 ADH1A [organism=Homo sapiens] [GeneID=124] [chromosome=4]
TAAGTAATTCAGAAATGGAGAACCAAACATCATATGTTCTCATTTATAAGAGAGAGTTAGGCTATGAGGA
TGCAAAGGCATGAGAATGATATCATGAACTTTGGGAACTCGAGGGGGAAGGTTGAAAGGGGAGGTGAGGG
ATAAAAGACTACATATTGGGTGCAATGTACACTGCTTGAGTGAAGGGTGCACCAAAATCTCAGAAATCAC
CACTAAAGAACTTACCCAGGTAACCAGAAAACACCTGCACCCCAAAAACTATTGAAATTAAAAATAAATT
TTATAAATAAATATAATGCAAATAACTCACCGAGCACAGGCCACAATGTGCCTGTCACATCATAGGCACC
TGATAATCAGGAACTCCTACTATTAGTACTTTACTTCCAAGTATTCTGATATTTATTTGGCATTGATCAT
GTTGAACATCCTGCAAAGTTGAATAATTTGACAAATAAATAATGACAACGAATTATCAGAAGAACAAATT
TCAATTGAATTTTTATCATAATTGCAATTTTGTTGATTCTCTCTGGAAAGAGTTTTGTGAATTGTACATG
CTTAACCACAATGAAGACAGATATTACAGTTCACTGCCAAGCACTGGAAATTATGTGTACATCTTTTATT
TTTAAGAACATCTATGGATAAGATCTGCTAGCTAGAAATTCAAACACACAGTGTCTCCAATTTTGAAATA
TGTATGTGTAGGTCTGAGTGTATACACACACACACATACACACACACACACACACACACCCAACTCTAAC
CTTTGTAATTTGGGGCTCAAATCAAAGCTGGAAGGAGTCTCTACACATAGATCAGATAACAGTACAGCCT
CTCAAAGCTAAGTTGCAAATGGGTGTTGGAACAAGTAACACACAGAAGTGGTGGAGAGAGAAAATTAAAA
GAGGAGAATAGGAAAAGAGAGAGAAAGAGATGTCATCCCAGGCAGGAAGGTAAATATATCCTTATAAACT
CAGATCCTTTTGGCCTAAGCAGAAAATTGAGAAAGAAAACAGTGTATTTTATGTGAACAGACACTGACTC

The following code then submits the sequence to NCBI's online BLAST service and saves the result (the search can take several minutes):

from Bio.Blast import NCBIWWW
from Bio import SeqIO

# Load your sequence from a file
input_file = "sequence.fasta"  # Replace with your sequence file in FASTA format
sequence = SeqIO.read(input_file, format="fasta")

# Perform a BLAST search against the NCBI database
result_handle = NCBIWWW.qblast("blastn", "nt", sequence.seq)  # Use blastn for nucleotide sequences

# Write the results to an output file (optional)
output_file = "blast_result.xml"
with open(output_file, "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()

This writes an XML file. An extract of its content, showing the title of one hit followed by the two aligned sequences:

Homo sapiens alcohol dehydrogenase 1C (class I), gamma polypeptide (ADH1C), RefSeqGene on chromosome 4

AAGTAATTCAGAAATGGAGAACCAAACATCATATGTTCTCATTTATAAGAGAGAGTTAGGCTATGAGGATGCAAAGGCATGAGAATGATATCATGAACTTTGGGAACTCGAGGGGGAAGGTTGAAAGGGGAGGTGAGGGATAAAAGACTACATATTGGGTGCAATGTACACTGCTTGAGTGAAGGGTGCACCAAAATCTCAGAAATCACCACTAAAGAACTTACCCAGGTAACCAGAAAACACCTGCACCCCAAAAACTATTGAAATTAAAAATAAAT----TTTATAAATAAATATAATGCAAATAACTCACCGAGCACAGGCCACAATGTGCCTGTCACATCATAGGCACCTGATAATCAGGAACTCCTACTATTAGTACTTTACTTCCAAGTATTCTGATATTTATTT-GGCATTGATCATGTTGAACATCCTGCAAAGTTGAATAATTTGACAAATAAATAATGACAACGAATTATCAGAAGAACAAATTTCAATTGAATTTTTATCATAATTGCAATTTTGTTGATTCTCTCTGGAAAGAGTTTTGTGAATTGTACATGCTTAACCACAATGAAGACAGATATTACAGTTCACTGCCAAGCACTGGAAATTATGTGTACATCTTTTATTTTTAAGAACATCTATGGATAAGATCTGCTAGCTAGAAATTCAAACACACAGTGTCTCCAATTTTGAAATATGTATGTGTAGGTCTGAGTGTATACACACACACACATACACACACACACACACACACA--CCCAACTCTAACCTTTGTAATTTGGGGCTCAAATCAAAGCTGGAAGGAGTCTCTACACATAGATCAGAT-AACAGTACAGCCTCTCAAAGCTAAGTTGCAAATGGGTGTTGGA-ACAAGTAACACACAGAAGTGGTGGAGAGAGAAAATTAAAAGAGGAGAATAGGAAAAGAGAGAGAAAGAGATGTCATCCCAGGCAGGAAGGTAAATATA------TCCTTATAAACTCAGATCCTTTTGGCCTAAGCAGAAAATTGAGAAAGAAAACAGTGTATTTTATGTGAACAGACACTGACTC
AAGTAACTCAGGAATGGAAAACCAAATACCATATGGTCTCACTTATAAGTGGGAGCTAAGCTATGAGGATACAAAGGCATAAGAATGATATAGAGGACTTTGG------------------------------TTAGGGATAAAAGACTACACGTTGGATACAGTGTACACTGCTCGGGTGACACATGCAGCAAA-TCTCAAAAACCACCACTAAAGAACTTATCCATGTAATCAAAAACAACCTGCACCCCAAAAACTATTGAAATTAAAAATAAATAAATTTTAAAAATAAAAATAATGCATGTAACTCACTCAGCACAAGCCACAGTGCACCTGTCACATCATAGGCACCTGACAATCAGAAACTCCTACTATTAGAACTTTATTTCCAAGTATTCTGATATTTATTTTGGCATGGACCACATTGAAAATCCTGCAAAGTTGAAGGATTTGACAAATGAATAATGAAAATTAATTGGCAGAAGATGAAATTTCAATTGAATTTTTATTATATTTGCAATTATGTTGATTCTGTAAAGAAAGAGCTCAGTGAATTGTACATGCTTTACCACAATAAAGATGGGTTTTACAGTACACTGCCAAGCACTGGAAA-TATGTGATCAGCTTTTATTTTTAAGGACATCTGTGGATAAGCT-TGCTAGCTAGAAACTCAAATACACAGTTTCTCCAGTTTTTAAATACATGTGTGT-GGTCTGAATGCACACACACACACACACACACAGAAACACACATACACAACACCAACTCTAACCTCTGTAATTTTGGGCTCAAATCTAAGCTGGAAGGAGTCTCTGCACATAAATTAGATGAACAAAACAGGCTGCCAAAGCTAAGTTTCAAATGGGTGTTTGAGACAAGTA--ACACAGAA---GTGGAGAGAGAAAATTAAAAGAGGACAATAGGCAA--ATAGAGAAAGAGACATCATCCTAAGCAGGAAGGTAAATGTATAACTGTTTTTATGGAGTCAGATCCTTTTGGCCTAAGCAGAAAATTGAGCAAGAAAGCAGTGTATTT----TGGACAGACACTGACTC
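The hit title and the two aligned strings shown in the extract can be pulled out of the XML file programmatically. The following is a minimal sketch using only Python's standard library. It assumes the element names of the classic BLAST XML format (Hit_def, Hsp_evalue, Hsp_qseq, Hsp_hseq), and SAMPLE_XML is a hand-written, heavily shortened stand-in with invented values, not real BLAST output. The helper name summarize_hits is likewise only illustrative. Biopython also ships its own parser for this format in Bio.Blast.NCBIXML.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a BLAST XML result (assumption: classic
# BLAST XML element names). Sequences and values are invented.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>Example hit title (illustrative)</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-20</Hsp_evalue>
              <Hsp_qseq>AAGT-AATTC</Hsp_qseq>
              <Hsp_hseq>AAGTCAACTC</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def summarize_hits(xml_text):
    """Return one dict per HSP with the hit title and the aligned strings."""
    root = ET.fromstring(xml_text)
    summaries = []
    for hit in root.iter("Hit"):
        title = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            summaries.append({
                "title": title,
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "query": hsp.findtext("Hsp_qseq"),      # aligned query, with gaps
                "subject": hsp.findtext("Hsp_hseq"),    # aligned database hit
            })
    return summaries


for entry in summarize_hits(SAMPLE_XML):
    print(entry["title"])
    print(entry["query"])
    print(entry["subject"])
```

To try this on the real result, the content of blast_result.xml would be read into a string and passed to summarize_hits in place of SAMPLE_XML.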

The two long lines in the BLAST extract are the aligned query (top) and the matching database sequence (bottom); the dashes mark indels. Indels, short for “insertions and deletions,” are a common type of mutation observed in biological sequences such as DNA, RNA, or protein sequences. They represent the presence of additional nucleotides or amino acids (insertions) or the absence of one or more nucleotides or amino acids (deletions) in a sequence relative to a reference or another aligned sequence.

In the context of sequence alignment in bioinformatics, indels are essential elements in comparing sequences. When aligning two or more sequences to identify similarities or differences, gaps in the alignment are introduced to account for these insertions or deletions. Indels contribute to the variations observed between sequences and can reveal evolutionary relationships, functional changes, or structural variations.
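How an alignment algorithm decides where to place gaps can be illustrated with a small global alignment in the style of Needleman-Wunsch. This is a teaching sketch, not what BLAST does internally (BLAST uses a faster heuristic local search). The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for the example, and the function name is illustrative.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two characters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


top, bottom, total = needleman_wunsch("GATTACA", "GATACA")
print(top)
print(bottom)
print("score:", total)
```

Because the second sequence is one nucleotide shorter, the algorithm has to insert exactly one gap into it; the scoring scheme determines where that gap is cheapest.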

For example, consider aligning two DNA sequences:

Sequence 1: ATGC-TAAGCT
Sequence 2: ATGCGTAAGCT

Here, the “-” in the first sequence is a gap: Sequence 2 contains a nucleotide (G) at a position where Sequence 1 has none. Depending on the point of view, this is a deletion in Sequence 1 or an insertion in Sequence 2, which is why the neutral term indel is used. Conversely, a “-” in Sequence 2 would mark a nucleotide that is present only in Sequence 1.
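Gap positions in an aligned pair can also be located programmatically. The sketch below assumes two aligned strings of equal length that use "-" as the gap character; the function names and the example sequences are made up for illustration. A run of "-" in one row marks positions that are present only in the other sequence.

```python
def gap_runs(aligned):
    """Return (start, length) for each run of '-' in one aligned sequence."""
    runs = []
    start = None
    for pos, char in enumerate(aligned):
        if char == "-":
            if start is None:
                start = pos          # a new gap run begins here
        elif start is not None:
            runs.append((start, pos - start))
            start = None
    if start is not None:            # gap run reaching the end of the row
        runs.append((start, len(aligned) - start))
    return runs


def summarize_indels(aligned_a, aligned_b):
    """Collect the gap runs of both rows of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return {"gaps_in_a": gap_runs(aligned_a), "gaps_in_b": gap_runs(aligned_b)}


# Invented example alignment: a two-base gap in the first row,
# a one-base gap in the second row.
print(summarize_indels("ACGT--TGCA", "ACGTAATG-A"))
```

Counting runs rather than individual "-" characters treats a multi-base insertion or deletion as a single event, which is usually closer to the underlying mutation.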

Indels are crucial in evolutionary studies, as they can signify genetic rearrangements, gene duplications, or deletions that might have functional implications. In structural biology, indels might alter the secondary or tertiary structure of proteins, impacting their functions.

Analyzing indels within sequence alignments aids in understanding the evolutionary history, functional diversification, and structural variations among different biological sequences, contributing significantly to the field of bioinformatics and molecular biology.

Databases and repositories

Databases and repositories in bioinformatics play a crucial role in storing, organizing, and providing access to vast amounts of biological data generated by research efforts worldwide. The most prominent ones are:

GenBank: Managed by the National Center for Biotechnology Information (NCBI), GenBank is a comprehensive database for storing DNA sequences submitted by researchers globally. It includes sequences from various organisms, annotated with information about genes, proteins, and their functions.
UniProt: UniProt is a collaboration between several organizations and is an extensive repository of protein sequences and functional information. It contains high-quality, curated, and annotated protein sequences along with details on their functions, domains, structures, and interactions.
NCBI (National Center for Biotechnology Information): NCBI is a key resource providing access to various databases and tools, including GenBank, PubMed (for biomedical literature), BLAST (sequence similarity search), PubMed Central (full-text articles), and many more. It’s a central hub for biological information and tools.
PDB (Protein Data Bank): PDB is the primary repository for 3D structural data of biological macromolecules, such as proteins and nucleic acids. It provides detailed atomic coordinates, structures, and information about molecular interactions.
Ensembl: Ensembl is a genome browser and database offering genome sequences, gene annotations, variations, and comparative genomics data for various organisms. It provides a user-friendly interface to explore and analyze genomic information.
KEGG (Kyoto Encyclopedia of Genes and Genomes): KEGG is a resource providing information on biological pathways, diseases, drugs, and genomes. It offers curated data on metabolic pathways, signaling networks, and functional annotations.
Reactome: Reactome is a database of biological pathways, focusing on human biological processes. It contains detailed pathway diagrams, descriptions, and annotations for molecular events in various cellular processes.

These databases and repositories serve as invaluable resources for researchers, enabling them to access, analyze, compare, and interpret biological data. They support a wide range of studies, from genome analysis and protein structure determination to pathway analysis and functional annotations, fostering discoveries and advancements in the life sciences.
