DIPSBC Tutorial


1. Introduction

In this tutorial we describe how the Solr- and wiki-based data integration system can be used to query indexed data sets. Furthermore, we show how different data sets can be viewed and analyzed with respective helper applications.


2. Index Search Machine

The local search machine provides the main entry point to query all indexed data. Currently the index contains more than 34.9 million records, including protein sequences, gene annotations, molecular interactions, abstracts of publications, and results of transcriptome profiling and mass spectrometry experiments. The index is based on Lucene / Solr technology and can be queried by keyword, just like a typical web search engine. It offers quick response times and an advanced query syntax.

The following table lists some of the most important syntax rules:

| Query type | Example | Comments |
| One-keyword query | "Lipoprotein", "E2F6" | records that contain the word 'Lipoprotein' / 'E2F6' |
| Negative query (exclusion) | "-Lipoprotein" | records that do not contain 'Lipoprotein' |
| Field query | "vindex:biomodels", "identifier:OCT4" | searches specific fields, e.g. vindex (= subindex) or identifier |
| Boolean combinations: OR, AND (default; AND can be omitted) | "(oct4 OR klf) AND sox2 AND -vindex:uniprot" | records containing OCT4 or KLF together with SOX2, but without Uniprot results |
| Wildcard query (one character) | "te?t" | records that contain 'test', 'tent', 'text', etc. |
| Wildcard query (multiple characters) | "te*t" | records that contain 'test', 'tent', 'text', 'tempest', 'testament', etc. |
| Fuzzy search | "proteome~" | records that contain similar terms like protein, proteomics, proteasome, etc. |
| Range query | "field:[0 TO 0.05]", "vindex:[a* TO z*]" | records with values within a given range, e.g. p-values between 0 and 0.05 (bounds included), or words starting with certain characters |
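The syntax rules above can also be tried out from a script. The following is a minimal sketch, not part of DIPSBC itself: it builds request URLs for Solr's standard select handler and emulates the two wildcard rules on a small word list. The endpoint address `SOLR_SELECT_URL` and the helper names are assumptions made for illustration; the actual address of the index depends on the local installation.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the address of the actual Solr index.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(query, rows=10, base_url=SOLR_SELECT_URL):
    """Return a select URL for a query string written in the syntax above."""
    params = {
        "q": query,        # the query string, e.g. "vindex:biomodels"
        "rows": rows,      # number of hits to return
        "fl": "*,score",   # all stored fields plus the relevance score
        "wt": "json",      # ask for a JSON response
    }
    return base_url + "?" + urlencode(params)


def wildcard_matches(pattern, terms):
    """Emulate term wildcards: '?' is exactly one character, '*' is any run."""
    return [term for term in terms if fnmatchcase(term, pattern)]


if __name__ == "__main__":
    for q in ["E2F6", "vindex:biomodels",
              "(oct4 OR klf) AND sox2 AND -vindex:uniprot"]:
        print(build_query_url(q))

    words = ["test", "tent", "text", "tempest", "termite"]
    print(wildcard_matches("te?t", words))
    print(wildcard_matches("te*t", words))
```

Note that a wildcard pattern has to match the whole term, so "te*t" only finds words that start with 'te' and end with 't'.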


2.1. Example Query

As an example, the index was queried for the gene name "E2F6". The first 11 result hits are shown below:

tutorial_fig_1.jpg

As stated at the top of the result list, the query yielded 285 hits and the query time was 0.53 seconds. The result list shows the title of each record, together with any available annotation, identifier, or content information. To the right of each record, its data type is indicated by a color-coded icon. Below the data type icon, the hit's score (a measure of its relevance) is given. Each record can be clicked to inspect the result in more detail. Generally, there are three possible link events:
  1. Report page - gives a tabular overview of the record
  2. Helper application (Java applet) - enables the user to inspect the record in more detail
  3. Outgoing link - opens an external URL (e.g. BioModels database, ConsensusPathDB etc.)

The following sections describe these events associated with the different result hits in more detail.
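The hit count, query time, and per-record score shown in the result list correspond to fields that Solr returns in its JSON response. The sketch below reads them from a small, made-up response. The overall layout (`responseHeader`, `response`, `numFound`, `docs`) follows Solr's JSON writer, but the documents and their values are invented for illustration and do not reflect the real DIPSBC schema beyond the `vindex` and `identifier` field names mentioned above.

```python
import json

# Made-up response in the shape produced by Solr with wt=json and fl=*,score.
# The documents and their field values are illustrative assumptions.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 530},
  "response": {
    "numFound": 285,
    "start": 0,
    "docs": [
      {"identifier": "doc-1", "vindex": "foswiki", "score": 2.1},
      {"identifier": "doc-2", "vindex": "interactions", "score": 1.7}
    ]
  }
}
"""


def summarize(raw):
    """Return (hit count, query time in seconds, [(identifier, vindex, score)])."""
    data = json.loads(raw)
    hits = data["response"]["numFound"]
    seconds = data["responseHeader"]["QTime"] / 1000.0  # QTime is in milliseconds
    rows = [(doc.get("identifier"), doc.get("vindex"), doc.get("score"))
            for doc in data["response"]["docs"]]
    return hits, seconds, rows


if __name__ == "__main__":
    hits, seconds, rows = summarize(SAMPLE_RESPONSE)
    print(f"{hits} hits in {seconds:.2f} seconds")
    for identifier, vindex, score in rows:
        print(identifier, vindex, score)
```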


2.1.1. Foswiki page

The first hit of the "E2F6" query is a Foswiki page, in this case the page for Affymetrix microarray annotations. It contains the query term in its text body:

res_foswiki.jpg


2.1.2. Molecular interactions

The second hit is a molecular interaction involving E2F6, taken from a yeast two-hybrid experiment. Clicking on the link opens the GraphBrowser applet, which loads the underlying XML file and visualizes the respective interactions. Each node in the interaction graph can be clicked to display its protein name (shown in the right part of the window), and nodes can be expanded (i.e., their additional interactions are shown) or collapsed by right-clicking them.

res_graphBrowser.jpg


2.1.3. Test result tables (transcriptome profiling)

The third hit is part of a test result table, originating from the results of a transcriptome profiling experiment, in this case the Novartis tissue atlas [1]. This study compared the transcription intensity of all genes in more than 70 different tissues, in both human and mouse. Here, clicking on the link brings up a graphical representation of the expression levels of the E2F6 gene in several mouse tissues. More specifically, the log2ratio of each tissue in relation to the mean expression of the remaining tissues is shown. As can be seen, E2F6 is highly expressed in heart and skeletal muscle in relation to other tissues.

res_testResultTable_tissueAtlas.jpg
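The quantity plotted here, the log2ratio of one tissue against the mean of the remaining tissues, can be written down in a few lines. The following sketch assumes strictly positive intensities and uses invented numbers; it illustrates the calculation as described above and is not the code used to produce the figure.

```python
import math


def log2_ratio_vs_rest(expression):
    """Map each tissue to log2(own intensity / mean intensity of the other tissues).

    `expression` maps tissue names to strictly positive intensities.
    """
    if len(expression) < 2:
        raise ValueError("need at least two tissues")
    total = sum(expression.values())
    n = len(expression)
    ratios = {}
    for tissue, value in expression.items():
        mean_rest = (total - value) / (n - 1)  # mean over all other tissues
        ratios[tissue] = math.log2(value / mean_rest)
    return ratios


if __name__ == "__main__":
    # Made-up intensities for illustration only (not taken from the tissue atlas).
    example = {"heart": 400.0, "liver": 100.0, "kidney": 100.0, "brain": 100.0}
    for tissue, ratio in log2_ratio_vs_rest(example).items():
        print(f"{tissue}: {ratio:+.2f}")
```

A positive value means that the gene is expressed more strongly in that tissue than in the average of the others; a value of +2 corresponds to a fourfold higher intensity.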


Other statistical studies stored in DIPSBC can be found by querying "vindex:studies".

res_studies.jpg


The accompanying test result tables can be downloaded in TSV or XML format from the respective overview pages. To find genes that were strongly differentially expressed, one can simply enter a study number, e.g. "study 4" (a case-control study investigating prostate cancer progression [2]), which returns the genes expressed in that study, ordered by score. In our example, the log2ratio and p-value were used to boost the score of the entries, so that highly significant genes are listed at the top of the result page:

res_study4.jpg


To check whether E2F6 was differentially expressed in study 4, the query term "E2F6 study 4" can be used. The index returns one hit; clicking on it opens the standard test result table report. The table gives a straightforward overview of the underlying XML file (STAT-ML), including general study annotation and several numeric values, such as absolute expression intensities, coefficient of variation (CV), log2ratio, and p-value:

res_testResultTable_study4.jpg
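For readers unfamiliar with the numeric columns, the sketch below shows how a coefficient of variation and a log2ratio could be computed from raw intensities. It assumes that the CV is the sample standard deviation divided by the mean and that the log2ratio compares the mean case intensity with the mean control intensity. Both are common definitions, but the exact definitions used in the STAT-ML files are not stated here, and the numbers are invented.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed CV definition)."""
    return statistics.stdev(values) / statistics.mean(values)


def log2_ratio(case, control):
    """log2 of mean case intensity over mean control intensity (assumed definition)."""
    return math.log2(statistics.mean(case) / statistics.mean(control))


if __name__ == "__main__":
    # Made-up intensities for one gene; not taken from study 4.
    case = [200.0, 220.0, 180.0]
    control = [100.0, 110.0, 90.0]
    print("CV (case):   ", round(coefficient_of_variation(case), 3))
    print("CV (control):", round(coefficient_of_variation(control), 3))
    print("log2ratio:   ", round(log2_ratio(case, control), 3))
```

The p-value is left out on purpose, since it depends on the statistical test chosen for the study.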


2.1.4. Gene annotations

Hits 5 to 7 represent gene annotations, in this case Affymetrix human probeset annotations. An overview table shows general information such as gene title, chromosomal location, probe sequences, and GO terms, as well as database cross-references including links to Ensembl, UniGene, Swiss-Prot and ConsensusPathDB [3].

res_annotations.jpg


Furthermore, a link to the Argo genome browser [4] is provided. This Java applet can be used to inspect the genomic region in more detail, as shown below. To zoom into the genomic region, click while holding down the shift key; to zoom out, right-click while holding down the shift key. Clicking on a feature (probe, probeset, QTL etc.) provides additional information about its title, location, target, and feature type.

res_argo.jpg


2.1.5. BioModels

The next hit is a link to the BioModels database, where computational models of biological processes can be utilized. Clicking on the link forwards the user to the respective BioModels page. Our E2F6 example is an uncurated model lacking some features of curated models; therefore we suggest another example query: "biomodels cdc2", in order to retrieve information about cell cycle related models. The following picture shows the first hit, a model of cdc2 and cyclin interactions during the cell division cycle. The network graphics can be retrieved by clicking 'Actions' -> 'View Bitmap Reaction Graph'.

res_biomodels_cellCycle.jpg


2.1.6. ConsensusPathDB

The next hit in the result list is a link to the ConsensusPathDB (CPDB [3]), where interactions of the human E2F6 protein can be visualized. CPDB integrates several kinds of interaction data (gene regulation, protein-protein interactions, metabolic pathways) from other public databases such as KEGG, Reactome, or BioCarta. After clicking on the link, all available interactions are listed and can be selected at once or individually for the resulting network:

res_cpdb_selection.jpg

The interaction graph can be visualized by clicking on 'Map and visualize interactions'. The resulting network can be modified and extended dynamically.

res_cpdb.jpg


2.1.7. Uniprot FASTA sequence

The last example of our result list is the Uniprot FASTA sequence for the bovine E2F6 protein (shown below). Of course, the human sequence may be easily retrieved by adding 'homo sapiens' to the query.

res_uniprot.jpg


2.2. Refined query

To check which publications deal with the E2F6 protein, we can use the refined query "E2F6 vindex:pubmed", thereby limiting the search to PubMed entries only. This yields 69 PubMed results in 0.33 seconds (note that the index only contains publications starting from 1970).

query_vindexPubmed.jpg
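When queries are sent from a script, the same restriction can be expressed in two ways: as part of the query string, exactly as typed above, or through Solr's separate filter parameter `fq`, which narrows the result set without influencing the score. The sketch below shows both variants. The function name and the endpoint are hypothetical, and the `fq` variant is an alternative offered by Solr in general, not something this tutorial relies on.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the address of the actual Solr index.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def refined_query_params(keyword, vindex, use_filter=False):
    """Return request parameters that limit a keyword search to one subindex."""
    if use_filter:
        # Keyword in q, subindex restriction in a separate filter query.
        return {"q": keyword, "fq": f"vindex:{vindex}", "wt": "json"}
    # Same form as typed into the search box: both terms in q.
    return {"q": f"{keyword} vindex:{vindex}", "wt": "json"}


if __name__ == "__main__":
    for use_filter in (False, True):
        params = refined_query_params("E2F6", "pubmed", use_filter=use_filter)
        print(SOLR_SELECT_URL + "?" + urlencode(params))
```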


2.3. Additional data types: peptide mass spectra

Our index also contains mass spectra; however, no spectrum was found in relation to the E2F6 protein. As another example, one can search for the APC protein; the fourth hit represents the respective spectrum (alternatively, all available spectra can be found with the query "vindex:mass_spectra"). The result page for this spectrum shows a table of identified peptides. The data can also be reanalyzed by submitting the underlying XML to the Mascot search engine. Finally, to view the spectrum, one can click 'Launch Applet'. This starts the 'mzData viewer', which shows the peptide peaks. One can zoom in by marking the desired spectrum region, and zoom out by dragging the mouse to the left on the spectrum.

res_mzdata.jpg



References

  1. Su AI et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 101(16):6062-7 (2004).
  2. Varambally S et al. Integrative genomic and proteomic analysis of prostate cancer reveals signatures of metastatic progression. Cancer Cell 8(5):393-406 (2005).
  3. Kamburov A et al. ConsensusPathDB - toward a more complete picture of cell biology. Nucleic Acids Res 39:D712-7 (2011).
  4. Engels R et al. Combo: a whole genome comparative browser. Bioinformatics 22(14):1782-3 (2006).

Topic revision: r2 - 31 Aug 2011 - 11:24:01 - FelixDreher
 
