Mouse SNP Miner user manual

 

Introduction

Mouse SNP Miner is an annotated database with a web-based runtime system for browsing missense, nonsense, frameshift, and splice-site mutations between mouse inbred strains, based on the Ensembl prediction algorithms. Its uniqueness lies in the functional annotation of missense mutations using the PolyPhen and PANTHER algorithms, in easy linking and access to external databases such as OMIM, GO, and the Symatlas expression database, and in easy linking between gene accessions and GO terms using the GeneMerge clustering algorithm. In addition to the functional annotation, Mouse SNP Miner includes a runtime graphical display that allows chromosome walking, zooming in and out, and jumping between adjacent SNPs. The Mouse SNP Miner database is highly useful for identifying functional DNA sequence variations within quantitative trait loci (QTL) in an integrated system.

 

System requirements

Mouse SNP Miner data is stored in a MySQL relational database and is accessed through a web-based Java applet that works only in Java-enabled browsers. To use this applet, you need JVM 1.4 or above installed on your system. The latest JVM release is freely available from the Java download site.

 
How to search Mouse SNP Miner

The Mouse SNP Miner web interface is composed of three modules:

 

Query module

The query module (Figure 1) allows the user to filter SNPs according to the following criteria:
 

Figure 1. Mouse SNP Miner query form.

Strains

A user can search for SNPs between individual strains, between groups of strains, or between a group of strains and individual strains. Grouping assumes that the variation between strains of the same group is not significant, although such strains may still add relevant data to the comparison.

 

Gene Symbol

The gene symbol. A user may enter several symbols, separated by semicolons.

 

Ensembl Transcript

Ensembl transcript stable ID (e.g., ENSMUST#)

 

SNP Accession

The SNP accession defined by dbSNP

 

Gene Ontology (GO)

Gene Ontology accession (e.g., GO#). 
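
The query fields above can be pre-checked before a search is submitted. The following is a minimal sketch in Python, not part of Mouse SNP Miner itself. It assumes the usual public identifier formats (an Ensembl mouse transcript ID as ENSMUST followed by 11 digits, a GO accession as GO: followed by 7 digits, and a dbSNP accession as rs followed by digits); the manual does not state these formats, and the function names are hypothetical.

```python
import re

# Assumed identifier formats (not stated in this manual):
ENSEMBL_TRANSCRIPT = re.compile(r"ENSMUST\d{11}")  # Ensembl mouse transcript stable ID
GO_ACCESSION = re.compile(r"GO:\d{7}")             # Gene Ontology accession
DBSNP_ACCESSION = re.compile(r"rs\d+")             # dbSNP reference SNP accession


def split_gene_symbols(text):
    """Split a semicolon-delimited string into a list of non-empty gene symbols."""
    return [symbol.strip() for symbol in text.split(";") if symbol.strip()]


def is_valid(pattern, value):
    """Return True when the whole (whitespace-trimmed) value matches the pattern."""
    return pattern.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    # Example input only; the symbols are illustrative.
    print(split_gene_symbols("Klra5; Klra8;Tlr4"))
    print(is_valid(GO_ACCESSION, "GO:0006955"))
```

Empty entries (for example, from a trailing semicolon) are dropped rather than reported as errors; a stricter form could reject them instead.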

 

Chromosome Viewer

The ‘SNP View’ mode presents search results in a graphical format for convenient run-time scanning. SNPs in the interval are listed by functional consequence and PolyPhen prediction in the upper left, and can be rapidly added or removed by clicking on the associated box. In the graphical display, boxes indicating transcripts and lines indicating SNPs are color and symbol coded by functional consequence. Boxes above and below the line represent the plus and minus strands, respectively. Clicking on a SNP displays detailed SNP information in the ‘Details’ window above. The ‘Associations’ window displays GO, OMIM, and PolyPhen information and links for the selected SNP. Buttons at the bottom of the graphical display allow movement across the chromosome and between SNPs. In the example shown, a search was performed for SNPs differing between C57BL/6J and the entire 129 group.

 


Figure 2. SNP distribution between C57BL/6J and the 129 group. A user can select a prediction type (e.g., PolyPhen or PANTHER) for three categories: a) deleterious, b) non-deleterious, and c) unknown. Selecting both prediction types returns the intersection of their results. A subset of SNPs in an interval of ~15 Mb was selected (Figure 2A) using mouse drag-and-drop selection. Further focusing on ~0.5 Mb in the same region shows regions rich in stop-codon and deleterious SNPs (Figure 2B). Small squares within the chromosome boundaries represent SNPs and their locations on the transcripts (shown below and above the chromosome boundaries). Selecting a SNP from the graphical layout summarizes the SNP and gene details (Figures 2A and 2B show details of the Klra5 gene).
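
The intersection behavior described in the Figure 2 caption can be illustrated with a short Python sketch. It is only an illustration of the filtering logic under stated assumptions, not the applet's actual code: the Snp record, its field names, and the sample accessions are hypothetical.

```python
from dataclasses import dataclass

# The three prediction categories named in the Figure 2 caption.
CATEGORIES = {"deleterious", "non-deleterious", "unknown"}


@dataclass(frozen=True)
class Snp:
    accession: str  # hypothetical field names, for illustration only
    polyphen: str   # category assigned by PolyPhen
    panther: str    # category assigned by PANTHER


def filter_snps(snps, polyphen=None, panther=None):
    """Keep SNPs that match every selected prediction category.

    A prediction type left as None is not used as a filter, so selecting
    both types returns the intersection of the two individual selections.
    """
    for name, value in (("polyphen", polyphen), ("panther", panther)):
        if value is not None and value not in CATEGORIES:
            raise ValueError(f"unknown {name} category: {value!r}")
    return [
        snp for snp in snps
        if (polyphen is None or snp.polyphen == polyphen)
        and (panther is None or snp.panther == panther)
    ]
```

For example, with three SNPs of which only one is deleterious under both methods, filter_snps(snps, polyphen="deleterious", panther="deleterious") returns that single SNP, while filtering on PolyPhen alone may return more.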

 

Predictions

We used the PolyPhen and PANTHER algorithms to assess whether an nsSNP (non-synonymous SNP) is likely to damage protein function. The PolyPhen prediction method is based on four criteria: 1) how conservative the amino acid substitution is (e.g. polar to non-polar, charged to non-charged), 2) sequence conservation among orthologs, 3) changes in free energy derived from molecular structure modeling (e.g. significant deviations in dihedral angle or backbone folding), and 4) overtly compromised amino acid function (e.g. loss of a glycosylation site, phosphorylation site, or metal-binding residue). The PANTHER Version 6.0 library contains over 5,000 protein families and about 30,000 subfamilies derived from those families, each represented by a multiple sequence alignment and a Hidden Markov Model (HMM). Each subfamily is a subset of proteins that can be associated with a functional classification (cellular process and molecular function) through manual expert curation. Missense SNPs can be scored against these HMMs to estimate their likelihood of disrupting conserved amino acid elements, and thus protein function, using the subPSEC (substitution position-specific evolutionary conservation) algorithm. The PolyPhen and subPSEC scores are described below.

PolyPhen:

PolyPhen prediction is based on straightforward empirical rules that classify the effect of a SNP as probably damaging, possibly damaging, benign, or unknown.

subPSEC:

The prediction is based on evolutionarily conserved amino acids and is reported with the following statistics:

1) subPSEC - a score that estimates the likelihood of a functional effect from a single amino acid substitution. Values range from 0 (neutral) to about -10 (most likely to be deleterious); -3 is the cutoff for functional significance.

2) P Deleterious - the probability that an nsSNP is deleterious. A cutoff of 0.5 (50% probability of being deleterious) corresponds to a subPSEC of -3.

3) HMM - the input protein sequence is scored against the HMMs in the PANTHER library. The alignment to the HMM with the most significant score is used for the analysis. This can be a subfamily (indicated with :SF) or a family.

4) NIC (number of independent counts) - an estimate of the number of independent observations used to calculate the amino acid probabilities. The probabilities are calculated from a combination of prior knowledge (e.g. that isoleucine often substitutes for valine) and observations, so the larger the NIC, the more the probabilities rely on the amino acids observed in the multiple sequence alignment.

5) Messages:

- "wild type and substituted amino acids required" - you did not give amino acids, or you did not give exactly two amino acids.
- "invalid amino acid" - you did not give one of the 20 amino acids.
- "Missing sequence" - the protein is not in the FASTA file.
- "SNP position not within protein" - the position of the SNP does not exist in the protein.
- "wild type amino acid is ..." - none of the input amino acids matches the amino acid in the protein.
- "substitution position incorrect" - the position given for the protein is incorrect.
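
The correspondence between a subPSEC of -3 and a P Deleterious of 0.5 can be checked numerically. The sketch below assumes the logistic relation given in the PANTHER documentation, P Deleterious = 1 - 1 / (1 + exp(-subPSEC - 3)); this manual states only the cutoff correspondence, so the formula, the function names, and the choice to treat a score equal to the cutoff as deleterious are assumptions.

```python
import math

SUBPSEC_CUTOFF = -3.0  # cutoff for functional significance, as stated above


def p_deleterious(subpsec):
    """Convert a subPSEC score to a probability of being deleterious.

    Assumes the logistic relation from the PANTHER documentation;
    it is not spelled out in this manual.
    """
    return 1.0 - 1.0 / (1.0 + math.exp(-subpsec - 3.0))


def is_deleterious(subpsec, cutoff=SUBPSEC_CUTOFF):
    """Return True when the score is at or below the cutoff (assumed inclusive)."""
    return subpsec <= cutoff


if __name__ == "__main__":
    for score in (0.0, -3.0, -10.0):
        print(score, round(p_deleterious(score), 4), is_deleterious(score))
```

Under this relation, a score of -3 gives exactly 0.5, scores near 0 give probabilities close to 0, and scores near -10 give probabilities close to 1, which matches the ranges described in items 1 and 2.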

 

 

Data Export and Association frame

The data export module (Figure 3) allows the user to export data from the database in tab-delimited format, with the option to perform GO clustering using the GeneMerge algorithm.
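
An exported tab-delimited file can be post-processed with standard tools. The following Python sketch groups gene symbols by GO accession. It assumes the export has a header row, and the column names gene_symbol and go_accession are hypothetical, since the manual does not list the exported columns. It performs simple grouping only and does not reproduce the GeneMerge algorithm.

```python
import csv
import io
from collections import defaultdict


def genes_by_go_term(tsv_text, gene_col="gene_symbol", go_col="go_accession"):
    """Group gene symbols by GO accession from tab-delimited text.

    The column names are hypothetical placeholders; adjust them to the
    header of the actual export file.
    """
    groups = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene = (row.get(gene_col) or "").strip()
        go = (row.get(go_col) or "").strip()
        if gene and go:  # skip rows with a missing gene or GO value
            groups[go].add(gene)
    # Sort the gene symbols so the result is deterministic.
    return {go: sorted(genes) for go, genes in groups.items()}
```

To use it on a saved export, read the file contents with open(path, newline="").read() and pass the resulting string to genes_by_go_term.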

 

Figure 3. Data export and Association frames. Clicking an arrow (->) button opens the corresponding OMIM or EBI QuickGO website.