Tuesday, November 23, 2010

Bioinformatics in Health Sciences Mini Research Project Final Report

Task 1 Research topic:
Sequence Alignment

Task 2 Journal review:
Nucleic Acids Research – Analysis of protein sequence and interaction data for candidate disease gene prediction
Summary:
Inflammatory bowel disease (IBD) is a group of inflammatory conditions of the colon and small intestine. New evidence shows that patients with IBD may have an elevated risk of endothelial dysfunction and coronary artery disease. Hence it is crucial to identify the functional loci and to determine the relationships among them and their significance.

Task 3:
Using the National Center for Biotechnology Information (NCBI) data bank, choose two sequences that are relatively similar in structure and sequence, and compare them to assess their significance.

Task 4:
Method:
        Two similar bowel protein sequences are chosen from the NCBI data bank and viewed in GenBank, EMBL and FASTA formats for comparison; the FASTA-format data for both sequences are then copied into Notepad.  The results are displayed as graphics showing the genome/chromosome map location of the gene.  Both sequences are then run through the bioinformatics tool BLASTp to compare their similarity.

Task 5:
The following proteins are selected from the National Center for Biotechnology Information (NCBI):
1.    NCBI Reference Sequence: NP_033429.1
2.    NCBI Reference Sequence: NP_005840.2


FASTA format of both sequences:

First sequence:
NCBI Reference Sequence: NP_005840.2
>gi|93004094|ref|NP_005840.2| immunoglobulin superfamily member 6 precursor [Homo sapiens]
MGTASRSNIARHLQTNLILFCVGAVGACTLSVTQPWYLEVDYTHEAVTIKCTFSATGCPSEQPTCLWFRYGAHQPENLCLDGCKSEADKFTVREALKENQVSLTVNRVTSNDSAIYICGIAFPSVPEARAKQTGGGTTLVVREIKLLSKELRSFLTALVSLLSVYVTGVCVAFILLSKSKSNPLRNKEIKEDSQKKKSARRIFQEIAQELYHKRHVETNQQSEKDNNTYENRRVLSNYERP

Second sequence:
NCBI Reference Sequence: NP_033429.1
>gi|6678387|ref|NP_033429.1| tumor necrosis factor ligand superfamily member 8 [Mus musculus]
MEPGLQQAGSCGAPSPDPAMQVQPGSVASPWRSTRPWRSTSRSYFYLSTTALVCLVVAVAIILVLVVQKKDSTPNTTEKAPLKGGNCSEDLFCTLKSTPSKKSWAYLQVSKHLNNTKLSWNEDGTIHGLIYQDGNLIVQFPGLYFIVCQLQFLVQCSNHSVDLTLQLLINSKIKKQTLVTVCESGVQSKNIYQNLSQFLLHYLQVNSTISVRVDNFQYVDTNTFPLDNVLSVFLYSSSD
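
A FASTA record such as the two above is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch of how such text could be read before comparison, the function below splits FASTA text into (header, sequence) pairs. The function name `parse_fasta` and the toy records are assumptions made for illustration; they are not part of the NCBI tools used in this report.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, not the report's sequences).
example = """>seq1 first toy protein
MGTASRSNIA
RHLQTN
>seq2 second toy protein
MEPGLQQAGS
"""

for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```

Pasting the two records above into `example` would give the sequence lengths that BLAST reports as the query and subject lengths.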

Running the BLASTp tool on both sequences gives the following result:

BLAST
Basic Local Alignment Search Tool
Blast 2 sequences:
Protein Sequence (241 letters)
Query ID: lcl|25993
Description: None
Molecule type: amino acid
Query Length: 241
Subject ID: 25995
Description: None
Molecule type: amino acid
Subject Length: 239
Program: BLASTP 2.2.24+

Search parameters
Parameter                  Value
Program                    blastp
Word size                  3
Expect value               10
Hitlist size               100
Gapcosts                   11,1
Matrix                     BLOSUM62
Filter string              F
Genetic Code               1
Window Size                40
Threshold                  11
Composition-based stats    2

Karlin-Altschul statistics
Params    Ungapped    Gapped
Lambda    0.317322    0.267
K         0.130185    0.041
H         0.38006     0.14

Results Statistics
Parameter                  Value
Effective search space     47088
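
The Karlin-Altschul parameters listed above link a raw alignment score S to the bit score and E-value that BLAST prints. The sketch below applies the standard relations, bit score = (lambda * S - ln K) / ln 2 and E = search space * 2^(-bit score), to the first alignment reported further down (raw score 45). The function names are assumptions made for illustration. The sketch ignores composition-based adjustment and the rounding of the printed parameters, so it is an approximation and not BLAST's exact procedure.

```python
import math

LAMBDA_GAPPED = 0.267   # gapped Lambda from the statistics above
K_GAPPED = 0.041        # gapped K from the statistics above
SEARCH_SPACE = 47088    # effective search space from the statistics above


def bit_score(raw_score, lam=LAMBDA_GAPPED, k=K_GAPPED):
    """Convert a raw alignment score into a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, search_space=SEARCH_SPACE,
                 lam=LAMBDA_GAPPED, k=K_GAPPED):
    """Expected number of chance alignments scoring at least raw_score."""
    return search_space * 2 ** (-bit_score(raw_score, lam, k))


# Raw score 45 is the value in parentheses in "Score = 21.9 bits (45)".
print(round(bit_score(45), 1), round(expect_value(45), 3))  # 21.9 0.012
```

For this alignment the sketch reproduces the reported 21.9 bits and E-value of 0.012; values for other alignments may differ somewhat from the printed ones for the reasons stated above.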

Graphic Summary
[Figure: Distribution of 2 BLAST hits on the query sequence; alignments are color-coded by score]
Dot Matrix View
[Figure: Plot of lcl|25993 vs 25995; query residues on the X-axis, subject residues on the Y-axis, each alignment drawn as a line]
Descriptions

Sequences producing significant alignments:
Accession    Description                Max score    Total score    Query coverage    E value
25995        unnamed protein product    21.9         38.5           36%               0.012



Alignments
>lcl|25995 unnamed protein product
Length=239

Score = 21.9 bits (45), Expect = 0.012, Method: Compositional matrix adjust.
Identities = 17/66 (26%), Positives = 29/66 (44%), Gaps = 11/66 (16%)
Query  184  LRNKEIKEDS-----QKKKSARRIFQEIAQELYHKRHVETNQQSEKDN------NTYENR  232
            L N +IK+ +          ++ I+Q ++Q L H   V +      DN      NT+
Sbjct  168  LINSKIKKQTLVTVCESGVQSKNIYQNLSQFLLHYLQVNSTISVRVDNFQYVDTNTFPLD  227

Query  233  RVLSNY  238
             VLS +
Sbjct  228  NVLSVF  233

Score = 16.5 bits (31), Expect = 0.54, Method: Compositional matrix adjust.
Identities = 11/49 (23%), Positives = 19/49 (39%), Gaps = 8/49 (16%)

Query  152  RSFLTALVSLLSVYVTGVCVAFILLSKSKSN--------PLRNKEIKED  192
            RS+     + L   V  V +  +L+ + K +        PL+     ED
Sbjct  42   RSYFYLSTTALVCLVVAVAIILVLVVQKKDSTPNTTEKAPLKGGNCSED  90
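
The "Identities" and "Gaps" figures in the BLAST output can be checked directly from the aligned rows. The sketch below counts them for the first alignment (query 184-238 against subject 168-233), with the two printed rows of each sequence joined into one string. The function name `alignment_counts` is an assumption made for illustration. Counting "Positives" would also require the BLOSUM62 matrix and is left out.

```python
def alignment_counts(query, subject):
    """Count identities and gap columns for two equal-length aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    # An identity is a column where both rows carry the same residue.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A gap column has "-" in either row.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


# First alignment from the BLAST output, with its two rows joined.
query = ("LRNKEIKEDS-----QKKKSARRIFQEIAQELYHKRHVETNQQSEKDN------NTYENR"
         "RVLSNY")
subject = ("LINSKIKKQTLVTVCESGVQSKNIYQNLSQFLLHYLQVNSTISVRVDNFQYVDTNTFPLD"
           "NVLSVF")

print(alignment_counts(query, subject))  # (17, 11, 66)
```

The counts agree with the reported "Identities = 17/66" and "Gaps = 11/66".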

Results and Discussion:
 
After comparing the two selected sequences using BLAST, the result shows two local alignments between the sequences (E-values 0.012 and 0.54); this also suggests a relationship between the two sequences and the bowel disease.
For further research, additional sequences are required so that their results can be compared with those of the current sequences.  If the additional sequences prove more relevant than the current sequences, they are said to be more closely associated with the inflammatory disease; if they are less relevant, the current sequences are said to be more closely associated with it.

Tuesday, October 19, 2010

Bioinformatics in Health Sciences Mini Research Project submission

HTI 5052         Bioinformatics in Health Sciences
Mini Research Project

Student Name: Tse Mavis Wing Yee
Student No. : 10677249G

Task 1:
The research topic is Sequence Alignment.

Task 2:
Journal review: Nucleic Acids Research
Journal topic: AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis
Mohamed Radhouene Aniba1,2,3,4, Olivier Poch1,2,3,4, Aron Marchler-Bauer5 and Julie Dawn Thompson1,2,3,4,*
1Department of Structural Biology and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), 2Institut National de la Santé et de la Recherche Médicale (INSERM), 3The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch, 4Université Louis Pasteur, F-67000 Strasbourg, France and 5NCBI/NLM/NIH, 8600 Rockville Pike, Bldg. 38A, Bethesda, MD 20894, USA
Received January 29, 2010; Revised April 26, 2010; Accepted May 25, 2010
Multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences. The input set of query sequences is assumed to share a common ancestor through an evolutionary lineage, so the resulting MSA can be used for phylogenetic analysis to infer the evolutionary origins of the sequences.  MSAs are computationally more complex than pairwise alignments and hence require more sophisticated methodologies.
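
To make the comparison with pairwise alignment concrete, the sketch below scores an optimal global alignment of two sequences with the Needleman-Wunsch dynamic-programming recurrence. It is a swapped-in textbook method chosen for illustration, not one used by the reviewed paper, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions. The table it fills has one cell per pair of positions, so the work grows with the product of the two lengths. Extending the same exact approach to k sequences needs a k-dimensional table, which is why MSA programs rely on heuristics.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("AC", "A"))             # -1
```

Only two rows of the table are kept at a time because the sketch returns the score alone; recovering the alignment itself would require storing the full table or traceback pointers.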
Many different algorithms have been developed to construct MSAs. AlexSys, a new system, is an intelligent engine that automatically selects an appropriate aligner a priori, based on the input sequences; it is suitable for high-throughput projects because it offers a good compromise between alignment quality and running time.  Previous studies show that no single aligner outperforms all the others. Earlier approaches provide more accurate alignments but are less efficient, because the sequences must be run through the aligners and the best solution can only be chosen a posteriori.  AlexSys is therefore designed to combine the power of the existing approaches in a single system that is both efficient and easy to use.
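
The idea of choosing an aligner a priori can be pictured as a function that inspects the input set and returns an aligner name before any alignment is run. The sketch below is purely illustrative: the features (number of sequences, mean length), the thresholds and the aligner names are all hypothetical and are not taken from AlexSys, whose actual rules are described in the paper.

```python
def choose_aligner(sequences):
    """Pick an aligner name from simple properties of the input set.

    Illustrative rules only; thresholds and names are hypothetical.
    """
    n = len(sequences)
    mean_len = sum(len(s) for s in sequences) / n if n else 0
    # Hypothetical rule: large inputs go to a fast aligner,
    # small inputs can afford a slower, more accurate one.
    if n > 200 or mean_len > 1000:
        return "fast_aligner"
    return "accurate_aligner"


print(choose_aligner(["MKV", "MKL", "MRV"]))  # accurate_aligner
```

An a posteriori approach would instead run every aligner on the input and keep the best result, which costs the running time of all of them.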

Task 3:
AlexSys is designed to combine the power of the existing approaches in a single system that is both efficient and easy for the biologist to use; however, no single algorithm works best on all problems.
The problem could be addressed by changing the system from a combined system to a multi-step single system; the disadvantage, however, is that collecting the data would take longer.