Categorizer

Mirror sites

Asia server: http://ssbio.cau.ac.kr/software/categorizer
North America server: http://chibi.ubc.ca/categorizer

Citation

Categorizer is published in BMC Genomics 2014, 15:1091.

Contact

For any inquiries, please contact Dokyun Na (dna@ssbio.cau.ac.kr) or Joerg Gsponer (gsponer@chibi.ubc.ca).

Download

Pre-compiled Windows version
or
Python source code


Introduction

Categorizer v1.0 is a tool to classify genes into user-defined groups (categories) based on Gene Ontology (GO) annotations and their semantic similarities. Most GO-based analysis tools are designed to identify enrichments of individual GO terms in a set of genes, and they frequently output lists of redundant or highly specific GO terms that can be difficult to interpret. Categorizer instead assigns genes to user-defined categories and calculates p-values for the enrichment of each category. It takes advantage of the hierarchical structure of GO annotations and the semantic similarity between GO terms for a reliable categorization. Categorizer will help experimental and computational biologists analyze genomic and proteomic data according to their specific interests.

For detailed information on semantic similarities, refer to the supplementary information of our paper and the following articles:

Lord PW, Stevens RD, Brass A, Goble CA. 2003. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19:1275-1283

Wu X, Pang E, Lin K, Pei Z-M. 2013. Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE 8:e66745


Installation

Categorizer was implemented in Python, a platform-independent programming language. The program can be run on any operating system where Python and the required libraries are installed. For the user's convenience, we provide a pre-compiled version of Categorizer that runs on the Windows operating system. For other operating systems, the program must be run from the source code.

For Windows users, download the three tools below

Please note that this compiled version uses only a single core due to a compiling issue. If you need to categorize thousands of genes/proteins, we recommend running from the source code, which supports use of multiple cores.


For those who want to run from the source code, please download this


In order to run Categorizer from the source code, you need to install the following software:

Python 2.7 or higher
Numpy 1.8.1 or higher
Scipy 0.13.0 or higher
matplotlib 1.3.1 or higher
wxPython 3.0 or higher
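
If you are unsure whether your installation meets these requirements, a short script can report what is installed. The following is a minimal sketch, not part of Categorizer: the helper names (version_tuple, meets_minimum, check_requirements) are hypothetical, the import names are the usual ones for these packages (wxPython is imported as wx), and the script is written for Python 3.

```python
import importlib
import re

# Minimum versions copied from the requirements list above.
REQUIRED = {
    "numpy": "1.8.1",
    "scipy": "0.13.0",
    "matplotlib": "1.3.1",
    "wx": "3.0",
}


def version_tuple(text):
    """Turn a version string such as '1.8.1' into a tuple of ints."""
    tokens = text.split()
    if not tokens:
        return ()
    parts = []
    for piece in tokens[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(required=REQUIRED):
    """Return a dict mapping each module name to a short status string."""
    report = {}
    for module_name, minimum in required.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            report[module_name] = "missing"
            continue
        installed = getattr(module, "__version__", None)
        if installed is None:
            report[module_name] = "unknown version"
        elif meets_minimum(installed, minimum):
            report[module_name] = "ok ({0})".format(installed)
        else:
            report[module_name] = "too old ({0} < {1})".format(installed, minimum)
    return report
```

Calling check_requirements() and printing the returned dictionary shows, for each library, whether it is missing, too old, or acceptable.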

If you are not familiar with Python or with installing libraries, we recommend installing "Enthought Canopy" (either the free or the academic version), a Python distribution containing many scientific libraries including those listed above.

The zipped file contains the following tools:

APP_Categorizer.py: GUI version of Categorizer
Categorizer.py: command-line version of Categorizer
rebuild.py: a tool to build indexes for semantic similarity scores from your data.
GOTermSeeker.py: a simple tool to search for GO terms containing a certain keyword. This tool may be helpful when creating categories.


Execution

Categorizer is shipped with example files that can be found in the ./data folder.
example_categories.txt
example_gene_association.fb
example_genes.txt
example_background_genes.txt


Categorizer (GUI version)


Please note that the compiled version supports use of only a single core, while running from the source code supports use of multiple cores.

Run the pre-compiled version

If you downloaded the pre-compiled Windows version of Categorizer, double-click CategorizerGUI.exe.


Run from the source code

Run the source code named APP_Categorizer.py.

python APP_Categorizer.py

main_window

To do categorization, you need to provide at least three files (highlighted in yellow): a category file, a gene annotation file, and a gene list file. A background gene list file is optional.


The category file contains a list of biological categories and the GO terms belonging to each category. We provide three category files that we expect to be commonly used: biological processes, enzyme classification, and cellular localization. For instance, the biological processes category file contains 27 categories, which can be copied into a custom category file (see biological_processes.txt in the ./data folder). If you want to create your own categories, please see Category file format for more detailed information.

Click on the button below "Step 1"; a window used to select a category file will show up. After loading, a list of categories and the number of GO terms belonging to each category are shown.


The annotation file contains gene and protein IDs, their names, and related GO terms. You can download a variety of annotation files from GeneOntology. Categorizer is shipped with a Drosophila gene annotation file, which can be downloaded from FlyBase or GeneOntology.

annotation_file_format

Categorizer reads all the columns, but uses only the three marked columns: IDs and names (green), and GO terms (orange).


The gene list file contains a list of gene identifiers. Categorizer loads both gene IDs and gene names (please see the green columns in the above figure), so either identifier may be used.


Categorizer provides an enrichment analysis function that determines which categories are significantly enriched for a given set of genes. For this analysis, a background list of genes is required, which could be a whole genome or a set of genes.


A gene or protein belongs to a particular biological process or set of processes. For example, the protein p53 belongs to both the signaling process category and the transcription category. Categorizer can classify a gene/protein to the single category with the highest similarity score or multiple categories with a similarity score over the user-defined cutoff.

Categorizer is able to utilize multiple cores. This feature is disabled in the pre-compiled version by default due to a compiling issue.


Clicking on the 'RUN' button will begin the computations. Categorizer loads all required files and categorizes user-entered genes. When complete, a window with results will show up.


result_window

Upon completion of categorization, a window like the one in the figure above will show up.

  1. Category statistics (Left)

    On the left side, categorization statistics are shown as a pie chart and a table.
    In this figure, the Metabolism category is the largest one, while protein folding is the smallest. There are also uncategorized genes, which could be due to a lack of annotation information for those genes.

  2. Categorization result (Middle)

    In the middle of the result window, a list of all entered genes and their categories is shown, together with the similarity scores. In this example Categorizer was allowed to classify genes into multiple categories with a cutoff value of 0.3; thus some of the genes in the figure belong to more than one category. For example, Rab26 belongs to both the signaling and transport categories.

  3. Enrichment analysis result (Right)

    If background genes are entered, the enrichment analysis result will be shown. Statistical enrichment is expressed as a p-value, and the log10(p-value) is shown in the graph. Dark red represents a significantly enriched category. The lower bound of the graph can be adjusted by moving the slider bar up or down and then clicking on the Redraw button.

  4. Save

    You can save graphs and result tables from the menu: Menu > Save results.
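
The enrichment p-value described under item 3 can be illustrated with a small calculation. This page does not state which statistical test Categorizer uses; the sketch below assumes a one-sided hypergeometric test, a common choice for this kind of analysis, purely for illustration. The function names are hypothetical and the code requires Python 3.8 or higher.

```python
from math import comb, log10


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: number of background genes
    K: number of background genes in the category
    n: number of genes in the user's gene list
    k: number of genes in the user's list that fall in the category

    ASSUMPTION: this test is chosen for illustration only; it is not
    confirmed to be the test Categorizer applies.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the probabilities of drawing k, k+1, ... category genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def log10_p(p):
    """log10 of a p-value, the quantity plotted in the result window."""
    return log10(p)
```

For example, with 10 background genes of which 5 belong to a category, a list of 5 genes that all fall in that category gives enrichment_p_value(5, 5, 5, 10), which equals 1/252.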



Categorizer (non-GUI version)


Please note that this non-GUI version does not support enrichment analysis.


Run the pre-compiled version

Open a DOS terminal, change to the directory where the Categorizer.exe file is, and enter the command below. Please note that this compiled version supports only a single core.

Categorizer.exe -d [category file] -a [annotation file] -i [gene list file] -m/-s [cutoff]

Example:

Categorizer.exe -d .\data\example_categories.txt -a .\data\example_gene_association.fb -i .\data\example_genes.txt -m 0.3


Run from the source code

python Categorizer.py -d [category file] -a [annotation file] -i [gene list file] -m/-s [cutoff] -cpu [integer]  

Example:

python Categorizer.py -d ./data/example_categories.txt -a ./data/example_gene_association.fb -i ./data/example_genes.txt -m 0.3 -cpu 3
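
If you need to categorize several gene lists with the same settings, the command above can be assembled and launched from a script. The following is a minimal sketch under stated assumptions: build_command and run_all are hypothetical helpers, not part of Categorizer, and the flags are taken only from the usage line above (-d, -a, -i, -m or -s followed by a cutoff, and -cpu).

```python
import subprocess


def build_command(category_file, annotation_file, gene_list_file,
                  mode_flag="-m", cutoff=0.3, cpu=None,
                  python="python", script="Categorizer.py"):
    """Build the argument list for one Categorizer run."""
    if mode_flag not in ("-m", "-s"):
        raise ValueError("mode_flag must be '-m' or '-s'")
    command = [
        python, script,
        "-d", category_file,
        "-a", annotation_file,
        "-i", gene_list_file,
        mode_flag, str(cutoff),
    ]
    if cpu is not None:
        command += ["-cpu", str(int(cpu))]
    return command


def run_all(gene_list_files, **settings):
    """Run Categorizer once per gene list file and collect the exit codes."""
    return [
        subprocess.call(build_command(gene_list_file=path, **settings))
        for path in gene_list_files
    ]
```

Passing the arguments as a list rather than a single string avoids quoting problems when file paths contain spaces.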



Parameters


Additional information

Category file format


The current version of Categorizer contains three example category files: biological processes, cellular localizations, and enzyme functions. However, Categorizer allows users to define their own categories.


Example category file

definition_file_format

The format of a category file is quite simple. Determine a category name and add GO term IDs that belong to the category. In this example file, the Cell cycle category has four GO terms related to cell cycle. # can be used for comments.

Simply speaking, if a gene is annotated with one of the four terms, say "GO:0000910" (cytokinesis), it will be categorized into Cell cycle. If a gene has an annotation that is merely close to the terms defined for Cell cycle, Categorizer calculates pairwise semantic similarity scores between the gene's GO term and the four defined GO terms, and takes the maximal score. If the score obtained for Cell cycle is larger than those obtained for the other categories, the gene will be classified into the Cell cycle category. If multiple categories are allowed, the gene will be classified into every category scoring above the selected cutoff.
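
The assignment rule described above can be sketched in a few lines. This is an illustration of the logic only, not Categorizer's own code: the function names are hypothetical, the similarity function is passed in as a parameter, and the toy scores and the placeholder terms TERM_A, TERM_B and TERM_C are made up for the example.

```python
def score_categories(gene_terms, categories, similarity):
    """For each category, take the maximal pairwise similarity between
    the gene's GO terms and the GO terms defined for that category."""
    scores = {}
    for name, category_terms in categories.items():
        pairs = [similarity(g, c) for g in gene_terms for c in category_terms]
        scores[name] = max(pairs) if pairs else 0.0
    return scores


def assign(gene_terms, categories, similarity, cutoff=None):
    """Return a list of (category, score) pairs.

    cutoff=None: single-category mode, keep only the best-scoring category.
    cutoff=x:    multiple-category mode, keep every category scoring above x.
    """
    scores = score_categories(gene_terms, categories, similarity)
    if cutoff is None:
        if not scores:
            return []
        best = max(scores, key=scores.get)
        return [(best, scores[best])] if scores[best] > 0 else []
    kept = [(name, s) for name, s in scores.items() if s > cutoff]
    return sorted(kept, key=lambda item: -item[1])


# Made-up similarity scores between placeholder terms (illustration only).
TOY_SCORES = {("TERM_C", "TERM_A"): 0.6, ("TERM_C", "TERM_B"): 0.4}


def toy_similarity(a, b):
    """Identical terms score 1.0; other pairs are looked up in TOY_SCORES."""
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


CATEGORIES = {
    "Cell cycle": ["GO:0000910", "TERM_A"],
    "Transport": ["TERM_B"],
}
```

With these toy data, a gene annotated with GO:0000910 is assigned to Cell cycle with a score of 1.0, while a gene annotated only with TERM_C is assigned to Cell cycle in single-category mode and to both categories when a cutoff of 0.3 is used.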

If you have trouble finding the proper GO terms, please use GOTermSeeker.py to find GO terms that contain a keyword of interest.

python GOTermSeeker.py [ontology file] [keyword] [output file name]

Example:

python GOTermSeeker.py ./data/gene_ontology_ext.obo "cell cycle" ./cell_cycle.txt
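
To give an idea of what such a keyword search involves, the sketch below scans the [Term] stanzas of an OBO file and reports the terms whose name contains the keyword. It is not the implementation of GOTermSeeker.py, which may search additional fields; iter_terms and find_terms are hypothetical names, and only the id and name lines of the OBO format are used.

```python
def iter_terms(obo_lines):
    """Yield (id, name) for every [Term] stanza in an OBO file."""
    in_term = False
    term_id = term_name = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins: emit the previous term, if complete.
            if in_term and term_id and term_name:
                yield term_id, term_name
            in_term = (line == "[Term]")
            term_id = term_name = None
        elif in_term and line.startswith("id:"):
            term_id = line[len("id:"):].strip()
        elif in_term and line.startswith("name:"):
            term_name = line[len("name:"):].strip()
    if in_term and term_id and term_name:
        yield term_id, term_name


def find_terms(obo_lines, keyword):
    """Return the (id, name) pairs whose name contains the keyword,
    ignoring case."""
    needle = keyword.lower()
    return [(i, n) for i, n in iter_terms(obo_lines) if needle in n.lower()]
```

An open file object can be passed directly as obo_lines, for example the ontology file used in the command above.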



Annotation file format


The annotation file contains gene IDs and names, and their annotated GO IDs. Species-specific annotation files, as well as integrated annotation files such as UniProt, can be downloaded from GeneOntology.

annotation_file_format

The annotation file format is shown above. Categorizer reads all the information in the file but uses only the three marked columns: the second and third columns for gene/protein IDs and names, and the fifth column for GO terms. Thus, you can enter either a gene/protein ID or a name into Categorizer.

If you create your own annotation file, it must provide these three columns.
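
As an illustration of how these three columns can be used, the sketch below reads a tab-separated annotation file and maps every gene ID and gene name to its set of GO terms. It is not Categorizer's own loader: read_annotations is a hypothetical name, and the sketch assumes the standard GO annotation layout, in which columns are separated by tabs and comment lines start with "!".

```python
def read_annotations(lines):
    """Map each gene ID (column 2) and gene name (column 3) to the set
    of GO terms (column 5) annotated to it."""
    go_terms = {}
    for raw in lines:
        # Skip blank lines and comment lines.
        if not raw.strip() or raw.startswith("!"):
            continue
        columns = raw.rstrip("\r\n").split("\t")
        if len(columns) < 5:
            continue  # malformed row: the three required columns are missing
        gene_id, gene_name, go_id = columns[1], columns[2], columns[4]
        for key in (gene_id, gene_name):
            if key:
                go_terms.setdefault(key, set()).add(go_id)
    return go_terms
```

Because both the ID and the name are used as keys, a lookup works with either identifier.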



Rebuilding semantic similarity scores with your data


Categorizer employs an algorithm that calculates semantic similarity scores from the occurrence of GO terms in the GO annotations of UniProt proteins compiled in 2013. For detailed information, please see the Introduction section and the articles cited there.
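
To show how term occurrence can be turned into a similarity score, the sketch below implements a simple information-content measure of the kind discussed by Lord et al. (Resnik similarity). It is an illustration only: Categorizer's actual scoring is described in the paper and may differ, the function names are hypothetical, and the small ontology and annotation counts in the usage note are made up.

```python
from math import log


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def term_probabilities(annotation_counts, parents):
    """p(t): fraction of all annotations that point to t or to any
    descendant of t."""
    grand_total = sum(annotation_counts.values())
    if grand_total == 0:
        return {}
    totals = {}
    for term, count in annotation_counts.items():
        for ancestor in ancestors(term, parents):
            totals[ancestor] = totals.get(ancestor, 0) + count
    return {t: c / grand_total for t, c in totals.items()}


def resnik_similarity(a, b, probabilities, parents):
    """Information content, -log p(t), of the most informative term that
    is an ancestor of both a and b."""
    common = ancestors(a, parents) & ancestors(b, parents)
    scores = [-log(probabilities[t]) for t in common
              if probabilities.get(t, 0) > 0]
    return max([0.0] + scores)
```

For a toy ontology in which A1 and A2 are children of A, and A and B are children of a root term, with 2 annotations each for A1 and A2 and 4 for B, the term A has probability 0.5, so A1 and A2 have a similarity of log 2, whereas A1 and B share only the root and have a similarity of 0.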

We built indexes to speed up the calculation. If you need to use a different dataset, for example UniProt without IEA annotations or human genes only, you need to rebuild the indexes by using rebuild.exe or rebuild.py.


Run the pre-compiled version

Open a DOS terminal, change to the directory where the rebuild.exe file is, and enter the command below.

rebuild.exe [annotation file] [ontology file]

Example:

rebuild.exe .\data\gene_association.goa_uniprot_noiea.txt .\data\gene_ontology_ext.obo



Run from the source code

python rebuild.py [annotation file] [ontology file]

Example:

python rebuild.py ./data/gene_association.goa_uniprot_noiea.txt ./data/gene_ontology_ext.obo



When done, two index files (go_index.txt and go_prob.txt) will be created. Copy these two files into the Categorizer folder. Categorizer loads these index files automatically, so no further steps are needed once they are copied.