Official sites

Asia server:
North America server:



Categorizer is published in BMC Genomics 2014, 15:1091.


For any inquiries, please contact Dokyun Na or Joerg Gsponer.


Pre-compiled Windows version


Python source codes


Categorizer v1.0 is a tool to classify genes into user-defined groups (categories) based on Gene Ontology (GO) annotations and their semantic similarities. Most GO-based analysis tools are designed to identify enrichment of individual GO terms in a set of genes, and they frequently output lists of redundant or highly specific GO terms that can be difficult to interpret. Categorizer instead assigns genes to user-defined categories and calculates p-values for the enrichment of each category. It takes advantage of the hierarchical structure of GO annotations and the semantic similarity between GO terms for a reliable categorization. Categorizer will help experimental and computational biologists analyze genomic and proteomic data according to their specific interests.

For detailed information on semantic similarities, refer to the supplementary information of our paper and the following articles:

Lord PW, Stevens RD, Brass A, Goble CA. 2003. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19:1275-1283

Wu X, Pang E, Lin K, Pei Z-M. 2013. Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE 8:e66745


Categorizer is implemented in Python, a platform-independent programming language, and can be run on any operating system where Python and the required libraries are installed. For convenience, we provide a pre-compiled version of Categorizer that runs on the Windows operating system. On other operating systems, the program must be run from the source code.


For Windows users, download the three tools below

Please note that this compiled version uses only a single core due to a compiling issue. If you need to categorize thousands of genes/proteins, we recommend running from the source code, which supports use of multiple cores.


For those who want to run from the source code, please download this


In order to run Categorizer from the source code, you need to install the following software:

Python 2.7 or higher
Numpy 1.8.1 or higher
Scipy 0.13.0 or higher
matplotlib 1.3.1 or higher
wxPython 3.0 or higher

If you are not familiar with Python or with installing libraries, we recommend installing "Enthought Canopy" (the free and academic versions are both fine), a Python distribution containing many scientific libraries, including those listed above.

The zipped file contains the following tools:

  • GUI version of Categorizer
  • command-line version of Categorizer
  • a tool to build indexes for semantic similarity scores from your data
  • a simple tool to search for GO terms containing a certain keyword. This tool may be helpful when creating categories.


Categorizer is shipped with example files that can be found in the ./data folder.



Categorizer (GUI version)


Please note that the compiled version supports use of only a single core, while running from the source code supports use of multiple cores.

Run the pre-compiled version

For those who downloaded the pre-compiled Windows version of Categorizer, double-click CategorizerGUI.exe.


Run from the source code

Run the source code named



To do categorization, you need to provide at least three files (highlighted in yellow): a category file, a gene annotation file, and a gene list file. A background gene list file is optional.


  • Step 1 Category file

The category file contains a list of biological categories and the GO terms belonging to each category. We provide three category files that we expect to be commonly used: biological processes, enzyme classification, and cellular localization. For instance, the biological processes category file (biological_processes.txt in the ./data folder) contains 27 categories, which can be copied into a custom category file. If you want to create your own categories, please see Category file format for more detailed information.

Click on the button below "Step 1"; a window used to select a category file will show up. After loading, a list of categories and the number of GO terms belonging to each category are shown.


  • Step 2 Annotation file

The annotation file contains gene and protein IDs, their names, and related GO terms. You can download a variety of annotation files from GeneOntology. Categorizer is shipped with a Drosophila gene annotation file, which can be downloaded from FlyBase or GeneOntology.


Categorizer reads all the columns, but uses only the three marked columns: IDs and names (green), and GO terms (orange).


  • Step 3 Gene list file

This file contains a list of gene identifiers. Categorizer loads both gene IDs and gene names (please see the green columns in the above figure), so either identifier may be used.


  • Step 4 Background gene list file (optional)

Categorizer provides an enrichment analysis function that determines which categories are significantly enriched for a given set of genes. For this analysis, a background list of genes is required, which could be a whole genome or a set of genes.
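The statistical test that Categorizer uses is described in the paper and is not restated here. Purely as an illustration of how a category enrichment p-value can be derived from a gene list and a background list, the sketch below assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis. The function name, its parameters, and the counts in the usage line are hypothetical, and the sketch assumes the gene list is a subset of the background list.

```python
from math import comb  # requires Python 3.8 or higher


def enrichment_pvalue(n_background, n_background_in_cat, n_genes, n_genes_in_cat):
    """One-sided hypergeometric p-value (assumed test, for illustration).

    Probability of seeing at least n_genes_in_cat category members when
    n_genes genes are drawn at random from a background of n_background
    genes, of which n_background_in_cat belong to the category.
    """
    total = comb(n_background, n_genes)
    upper = min(n_genes, n_background_in_cat)
    tail = sum(
        comb(n_background_in_cat, k)
        * comb(n_background - n_background_in_cat, n_genes - k)
        for k in range(n_genes_in_cat, upper + 1)
    )
    return tail / total


# Hypothetical counts: 200 input genes, 20 of them in the category,
# against a background of 13000 genes with 400 in the category.
p = enrichment_pvalue(13000, 400, 200, 20)
```

A small p-value here means the category is over-represented in the gene list relative to the background.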


  • Step 5 Options

A gene or protein belongs to a particular biological process or set of processes. For example, the protein p53 belongs to both the signaling process category and the transcription category. Categorizer can classify a gene/protein to the single category with the highest similarity score or multiple categories with a similarity score over the user-defined cutoff.

Categorizer is able to utilize multiple cores. This feature is disabled in the pre-compiled version by default due to a compiling issue.
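The two classification modes in this step can be sketched as follows, assuming the per-category similarity scores for a gene have already been computed. The function name and the example scores are hypothetical, and treating a score equal to the cutoff as passing is an assumption.

```python
def assign_categories(scores, cutoff, multiple=False):
    """Pick categories for one gene from its per-category similarity scores.

    scores:   dict mapping category name -> similarity score
    cutoff:   minimum score a category must reach (assumed inclusive)
    multiple: if True, return every passing category; otherwise only the best
    Returns a list of (category, score) pairs, highest score first.
    """
    passing = sorted(
        ((cat, s) for cat, s in scores.items() if s >= cutoff),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return passing if multiple else passing[:1]


# Hypothetical scores for a single gene
example = {"signaling": 0.8, "transport": 0.45, "metabolism": 0.1}
multi = assign_categories(example, cutoff=0.3, multiple=True)
single = assign_categories(example, cutoff=0.3)
```

With these made-up scores, the multiple-category mode keeps both signaling and transport, while the single-category mode keeps only signaling. A gene with no score at or above the cutoff stays uncategorized.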


  • Step 6 RUN!

Clicking on the 'RUN' button will begin the computations. Categorizer loads all required files and categorizes user-entered genes. When complete, a window with results will show up.


  • Results


Upon completion of categorization, a result window like the one in the figure above appears.

  1. Category statistics (Left)

    On the left side, categorization statistics are shown as a pie chart and a table.

    In this figure, the metabolism category is the largest, while protein folding is the smallest. Some genes remain uncategorized, which could be due to a lack of annotation information for those genes.

  2. Categorization result (Middle)

    In the middle of the result window, a list of all entered genes and their categories is shown, each with a similarity score. In this example Categorizer was allowed to classify genes into multiple categories with a cutoff value of 0.3; thus some of the genes in the figure belong to more than one category. For example, Rab26 belongs to both the signaling and transport categories.

  3. Enrichment analysis result (Right)

    If background genes are entered, the enrichment analysis result is shown. Statistical enrichment is expressed as a p-value, and the log10(p-value) is shown in the graph. Dark red represents a significantly enriched category. The lower bound of the graph can be adjusted by moving the slider bar up or down and then clicking the Redraw button.

  4. Save

    You can save graphs and result tables from the menu: Menu > Save results.



Categorizer (non-GUI version)


Please note that this non-GUI version does not support enrichment analysis.


Run the pre-compiled version

Open a DOS terminal, change to the directory where the Categorizer.exe file is, and enter the command below. Please note that this compiled version supports only a single core.

Categorizer.exe -d [category file] -a [annotation file] -i [gene list file] -m/-s [cutoff]


Categorizer.exe -d .\data\example_categories.txt -a .\data\example_gene_association.fb -i .\data\example_genes.txt -m 0.3


Run from the source code

python -d [category file] -a [annotation file] -i [gene list file] -m/-s [cutoff] -cpu [integer]  


python -d ./data/example_categories.txt -a ./data/example_gene_association.fb -i ./data/example_genes.txt -m 0.3 -cpu 3




  • -d [category file]: Category file. Please see Category file format
  • -a [annotation file]: Annotation file. This file can be downloaded from GeneOntology or created using the file format described below.
  • -i [gene list file]: Input file. This file contains a list of gene IDs or names.
  • -m [cutoff]: When this option is specified, Categorizer may classify one gene/protein into multiple categories with a semantic similarity score over a cutoff value (0<cutoff<=1).
  • -s [cutoff]: When this option is specified, Categorizer classifies one gene/protein into only the category which has the highest similarity score and which is over the specified threshold.







    Note: -m and -s are mutually exclusive.

  • -cpu [integer]: Specifies number of cores to be used (default=1). This is disabled in the pre-compiled version.
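The option rules listed above (three required files, mutually exclusive -m and -s, an optional -cpu) can be sketched with Python's argparse module. This is an illustration only and not Categorizer's own parser: the destination names are hypothetical, and applying the 0 < cutoff <= 1 range to -s as well as -m, and requiring one of the two, are assumptions.

```python
import argparse


def cutoff_value(text):
    """Convert a cutoff argument to float and enforce 0 < cutoff <= 1."""
    value = float(text)
    if not 0 < value <= 1:
        raise argparse.ArgumentTypeError("cutoff must satisfy 0 < cutoff <= 1")
    return value


def build_parser():
    parser = argparse.ArgumentParser(
        description="Illustrative option parser (not Categorizer's own)"
    )
    parser.add_argument("-d", dest="category_file", required=True)
    parser.add_argument("-a", dest="annotation_file", required=True)
    parser.add_argument("-i", dest="gene_list_file", required=True)
    # -m and -s cannot be combined; exactly one is assumed to be required.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-m", dest="multi_cutoff", type=cutoff_value)
    mode.add_argument("-s", dest="single_cutoff", type=cutoff_value)
    parser.add_argument("-cpu", dest="cpu", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["-d", "./data/example_categories.txt",
     "-a", "./data/example_gene_association.fb",
     "-i", "./data/example_genes.txt",
     "-m", "0.3", "-cpu", "3"]
)
```

Passing both -m and -s to this parser makes argparse print an error and exit, which mirrors the note that the two options are mutually exclusive.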

Additional information

Category file format


The current version of Categorizer contains three example category files: biological processes, cellular localizations, and enzyme functions. However, Categorizer allows users to define their own categories.


Example category file


The format of a category file is quite simple. Determine a category name and add GO term IDs that belong to the category. In this example file, the Cell cycle category has four GO terms related to cell cycle. # can be used for comments.

Simply put, if a gene is annotated with one of the four terms, say "GO:0000910" (cytokinesis), it will be categorized into Cell cycle. If a gene has an annotation that is close to, but not identical to, the terms defined for Cell cycle, Categorizer calculates pairwise semantic similarity scores between the gene's GO term and the four defined GO terms and takes the maximal score. If the score obtained for Cell cycle is larger than those obtained for the other categories, the gene will be classified into the Cell cycle category. If multiple categories are allowed, the gene will be classified into every category scoring above the selected cutoff.
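The scoring rule just described (take the maximal pairwise similarity between a gene's GO terms and a category's GO terms) can be sketched as below. The similarity function is passed in as a parameter because the real scores come from Categorizer's indexes. The toy similarity table, its values, and the placeholder IDs GO:EXAMPLE1 and GO:EXAMPLE2 are made up for illustration.

```python
def category_scores(gene_terms, categories, similarity):
    """Score each category for one gene.

    gene_terms: GO term IDs annotated to the gene
    categories: dict mapping category name -> list of GO term IDs
    similarity: callable (term_a, term_b) -> semantic similarity score
    A category's score is the maximum over all gene-term/category-term pairs.
    """
    return {
        name: max(
            (similarity(g, c) for g in gene_terms for c in cat_terms),
            default=0.0,  # no terms to compare
        )
        for name, cat_terms in categories.items()
    }


# Toy similarity: 1.0 for identical terms, otherwise a made-up lookup table.
TOY_TABLE = {frozenset(("GO:0000910", "GO:EXAMPLE1")): 0.6}


def toy_similarity(a, b):
    if a == b:
        return 1.0
    return TOY_TABLE.get(frozenset((a, b)), 0.0)


categories = {"Cell cycle": ["GO:0000910"], "Transport": ["GO:EXAMPLE2"]}
scores = category_scores(["GO:EXAMPLE1"], categories, toy_similarity)
```

A gene annotated directly with GO:0000910 scores 1.0 for Cell cycle under this toy similarity, which corresponds to the exact-match case in the text.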

If you have trouble finding the proper GO terms, please use the bundled keyword search tool to find GO terms that contain a keyword of interest.

python [ontology file] [keyword] [output file name]


python ./data/gene_ontology_ext.obo "cell cycle" ./cell_cycle.txt
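For readers curious about what such a keyword search involves, here is a minimal sketch. It is not the bundled tool. It assumes the ontology file uses the standard OBO layout, in which each term is a [Term] stanza with "id:" and "name:" lines, and it matches the keyword against term names only. The function name is hypothetical.

```python
def search_terms(obo_lines, keyword):
    """Return (GO ID, name) pairs for terms whose name contains keyword.

    obo_lines: any iterable of text lines, for example an open OBO file
    keyword:   case-insensitive substring to look for in term names
    """
    keyword = keyword.lower()
    hits = []
    in_term = False
    term_id = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are searched.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[4:]
        elif in_term and line.startswith("name: ") and term_id is not None:
            name = line[6:]
            if keyword in name.lower():
                hits.append((term_id, name))
    return hits


# Usage with the ontology file named in the example command above:
# with open("./data/gene_ontology_ext.obo") as handle:
#     for go_id, name in search_terms(handle, "cell cycle"):
#         print(go_id, name)
```

The matching IDs can then be pasted into a custom category file.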



Annotation file format


The annotation file contains gene IDs and names, and their annotated GO IDs. Species-specific annotation files, as well as integrated annotation files such as UniProt, can be downloaded from GeneOntology.


The annotation file format is as shown above. Categorizer reads all the information in the file but uses only the three marked columns: the second and third columns for gene IDs and names, and the fifth column for GO terms. Thus, you can enter either a gene/protein ID or a name into Categorizer.

When annotation files are created, these three columns must be provided.
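As a rough illustration of how the three columns are used, the sketch below reads an annotation file and maps every gene ID and every gene name to its GO terms. It assumes the file is tab-delimited and that comment lines start with "!", as in the GO annotation files distributed by GeneOntology. The function name and the sample rows are hypothetical, and this is not Categorizer's own loader.

```python
from collections import defaultdict


def load_annotations(lines):
    """Map each gene ID and gene name to the set of GO IDs annotated to it.

    Only column 2 (ID), column 3 (name) and column 5 (GO ID) are used.
    """
    go_by_gene = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip rows that lack the required columns
        gene_id, gene_name, go_id = fields[1], fields[2], fields[4]
        for key in (gene_id, gene_name):
            if key:
                go_by_gene[key].add(go_id)
    return go_by_gene


# Made-up rows in the assumed layout
rows = [
    "!comment line\n",
    "DB\tID0001\tgeneA\t\tGO:0000910\n",
    "DB\tID0001\tgeneA\t\tGO:0000001\n",
]
annotations = load_annotations(rows)
```

Because both the ID and the name are used as keys, a gene list may contain either kind of identifier.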



Rebuilding semantic similarity scores with your data


Categorizer calculates semantic similarity scores from the occurrence of GO terms in the GO annotations of UniProt proteins compiled in 2013. For detailed information, please see the Introduction section and the cited articles.
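To give a feel for how term occurrence can drive a similarity score, the sketch below computes information content (IC) from annotation counts and a Resnik-style similarity, one of the measures examined by Lord et al. (2003). It is shown only as an assumed, simplified example and is not Categorizer's exact algorithm. The function names and the tiny ontology in the usage lines are hypothetical.

```python
import math
from collections import Counter


def information_content(annotations, ancestors):
    """Compute IC = -ln(p) for every term seen in the annotations.

    annotations: iterable of GO IDs, one entry per annotation occurrence
    ancestors:   dict mapping term -> set of all its ancestor terms
    An occurrence of a term also counts towards each of its ancestors.
    """
    counts = Counter()
    total = 0
    for term in annotations:
        total += 1
        for t in {term} | ancestors.get(term, set()):
            counts[t] += 1
    return {t: -math.log(n / total) for t, n in counts.items()}


def resnik(a, b, ic, ancestors):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ({a} | ancestors.get(a, set())) & ({b} | ancestors.get(b, set()))
    return max((ic.get(t, 0.0) for t in common), default=0.0)


# Tiny made-up ontology: root -> A -> A1, root -> B
toy_ancestors = {"root": set(), "A": {"root"}, "B": {"root"}, "A1": {"A", "root"}}
toy_ic = information_content(["A1", "A", "B", "B"], toy_ancestors)
close = resnik("A1", "A", toy_ic, toy_ancestors)
distant = resnik("A", "B", toy_ic, toy_ancestors)
```

Rare terms receive a high IC, so two terms that share a rare ancestor score higher than two terms that only share the root. Plain Resnik scores are not bounded by 1, whereas Categorizer's cutoff lies between 0 and 1, so its scores are evidently computed or scaled differently; see the cited articles for details.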

We built indexes to speed up the calculation. If you need to use a different dataset, for example UniProt without IEA annotations or human genes only, you need to rebuild the indexes using rebuild.exe or the corresponding source code.


Run the pre-compiled version

Open a DOS terminal, change to the directory where the rebuild.exe file is, and enter the command below.

rebuild.exe [annotation file] [ontology file]


rebuild.exe .\data\gene_association.goa_uniprot_noiea.txt .\data\gene_ontology_ext.obo



Run from the source code

python [annotation file] [ontology file]


python ./data/gene_association.goa_uniprot_noiea.txt ./data/gene_ontology_ext.obo



When the run finishes, two index files (go_index.txt and go_prob.txt) will have been created. Copy these two files into the Categorizer folder. Categorizer loads these index files automatically, so no further steps are needed.