Structural alignment: Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions.
Sequence logo: In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences). A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and diversity of the sequences. Sequence logos are frequently used to depict sequence characteristics such as protein-binding sites in DNA or functional units in proteins. A sequence logo consists of a stack of letters at each position.
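As a rough illustration of how the letter stacks can be sized, the sketch below uses the common convention in which the total stack height at a position is its information content in bits and each letter's height is its frequency times that total. This convention is an assumption not stated in the entry above, the small-sample correction is omitted, and the function name `logo_heights` is hypothetical.

```python
import math
from collections import Counter


def logo_heights(seqs, alphabet="ACGT"):
    """Return, for each alignment column, a dict of letter -> height in bits.

    Assumes equal-length, pre-aligned sequences; symbols outside
    `alphabet` (for example gaps) are ignored.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits for DNA/RNA
    columns = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts[a] for a in alphabet)
        if total == 0:  # column contains only gaps or unknown symbols
            columns.append({})
            continue
        freqs = {a: counts[a] / total for a in alphabet if counts[a]}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        info = max_bits - entropy  # total stack height at this position
        columns.append({a: f * info for a, f in freqs.items()})
    return columns


example = ["ACGT", "ACGA", "ACCA", "ATCA"]
heights = logo_heights(example)
```

A fully conserved column yields a single tall letter, while a column split evenly between two nucleotides yields two short letters of equal height.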
Coding region: The coding region of a gene, also known as the coding sequence (CDS), is the portion of a gene's DNA or RNA that codes for protein. Comparing the length, composition, regulation, splicing, structures, and functions of coding regions with those of non-coding regions, across species and time periods, can reveal much about gene organization and the evolution of prokaryotes and eukaryotes. This can further assist in mapping the human genome and developing gene therapy.
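To make the idea of a coding sequence concrete, here is a minimal sketch that locates the first start-to-stop open reading frame on one strand and translates it with the standard genetic code. It is a toy under strong assumptions: no introns or splicing, forward strand only, and the helper names `find_cds` and `translate` are hypothetical. Real gene annotation is considerably more involved.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_cds(dna):
    """Return the first ATG...stop segment read in frame, or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon from this ATG until a stop codon appears.
        for i in range(start, len(dna) - 2, 3):
            if CODON_TABLE.get(dna[i:i + 3]) == "*":
                return dna[start:i + 3]
        start = dna.find("ATG", start + 1)
    return None


def translate(cds):
    """Translate a coding sequence, dropping the trailing stop symbol."""
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        for i in range(0, len(cds) - 2, 3)
    )
    return protein.rstrip("*")


cds = find_cds("CCATGGCTTAAGG")
protein = translate(cds)
```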
Substitution matrix: In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix.
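The log-odds form mentioned above can be sketched as follows: the score for a pair of characters is the logarithm of how often they are observed aligned, divided by how often they would be expected to align by chance given their background frequencies. This is a simplified illustration with no pseudocounts, scaling, or rounding, so it does not reproduce how any published matrix was derived; the function name `log_odds_matrix` and the toy counts are assumptions.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Return {(a, b): log2(observed / expected)} from aligned character pairs."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count both orderings so the resulting matrix is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Background frequency of each character, taken from the same pairs.
    background = Counter()
    for (a, _), n in pair_counts.items():
        background[a] += n

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = (background[a] / total) * (background[b] / total)
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy data: matches are common, A/G mismatches are rare.
pairs = [("A", "A")] * 8 + [("G", "G")] * 8 + [("A", "G")] * 4
scores = log_odds_matrix(pairs)
```

Pairs that co-occur more often than chance receive positive scores and pairs that co-occur less often receive negative scores, which is what lets an alignment algorithm reward plausible substitutions.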
CASP: Critical Assessment of Structure Prediction (CASP), sometimes called Critical Assessment of Protein Structure Prediction, is a community-wide, worldwide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users.
FASTA format: In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format allows for sequence names and comments to precede the sequences. It originated from the FASTA software package, but has now become a near-universal standard in the field of bioinformatics. The simplicity of the FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages.
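To show how little code such parsing needs, the sketch below reads FASTA records with plain string handling. It assumes the widely used layout in which each record begins with a `>` header line followed by one or more sequence lines; the function name `parse_fasta` and the sample records are illustrative only.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


EXAMPLE = """>seq1 demo nucleotide record
ACGT
GGCC
>seq2
MKV
"""

records = list(parse_fasta(EXAMPLE.splitlines()))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example `parse_fasta(open(path))` with a path of your own.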
BioRuby: BioRuby is a collection of open-source Ruby code, comprising classes for computational molecular biology and bioinformatics. It contains classes for DNA and protein sequence analysis, sequence alignment, biological database parsing, structural biology and other bioinformatics tasks. BioRuby is released under the GNU GPL version 2 or Ruby licence and is one of a number of Bio* projects, designed to reduce code duplication. In 2011, the BioRuby project introduced the Biogem software plugin system, with two or three new plugins added every month.
BioPerl: BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project. BioPerl is an active open-source software project supported by the Open Bioinformatics Foundation. The first BioPerl code was written by Tim Hubbard and Jong Bhak at the MRC Centre in Cambridge, where the first genome sequencing was carried out by Fred Sanger.
Biopython: The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers. It contains classes to represent biological sequences and sequence annotations, and it can read and write a variety of file formats. It also provides programmatic access to online databases of biological information, such as those at NCBI.
GenBank: The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 500,000 formally described species.