Gclust databases
Brief summary of the Gclust software
The Gclust software was developed to cluster all predicted protein
sequences in a selected set of genomes. The clusters are homolog groups,
not ortholog clusters (see below for the distinction), and therefore
contain all homologous sequences encoded by the selected genomes. An
ortholog cluster, such as those in KEGG Orthologs or COG at NCBI,
contains only a single sequence for each genome, and that single
representative is usually selected by a criterion called
"bi-directional best hit". By contrast, a homolog group contains all
reliable homologs, which together represent a gene family. Obtaining
good homolog groups requires several techniques (see the section for
computational biologists below), because a very large group containing
unrelated sequences can form if similar sequences are simply added to a
homolog group one after another. Gclust tabulates the BLASTP E-value and
the overlap score (the proportion of homologous regions shared by two
sequences) as a two-dimensional matrix, and selects a suitable E-value
and overlap score for each homolog group so that not too many homologs
are put into the group. The number of organisms covered is also taken
into account. A detailed explanation of the algorithm was presented in
the GIW2005 paper.
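The difference between the two kinds of cluster can be illustrated with a small sketch. It is not part of Gclust; the function names, the toy protein IDs and the E-values are all hypothetical, and hits are assumed to be given as a mapping from (query, subject) pairs to E-values.

```python
# Illustrative sketch only (not Gclust code); all names and data are hypothetical.

def best_hit(hits, genome_of, query, target_genome):
    """Return the subject in target_genome with the lowest E-value for query."""
    candidates = [(evalue, subject) for (q, subject), evalue in hits.items()
                  if q == query and genome_of[subject] == target_genome]
    return min(candidates)[1] if candidates else None

def bidirectional_best_hits(hits, genome_of, genome_a, genome_b):
    """Ortholog-style pairs: each protein is the other's best hit."""
    pairs = []
    for protein in sorted(genome_of):
        if genome_of[protein] != genome_a:
            continue
        partner = best_hit(hits, genome_of, protein, genome_b)
        if partner is not None and best_hit(hits, genome_of, partner, genome_a) == protein:
            pairs.append((protein, partner))
    return pairs

def homolog_partners(hits, protein, max_evalue):
    """Homolog-style list: every subject at or below the E-value cutoff."""
    return sorted({s for (q, s), e in hits.items()
                   if q == protein and s != q and e <= max_evalue})

# Toy data: two genomes, each with two paralogous proteins.
genome_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
hits = {("A1", "B1"): 1e-50, ("A1", "B2"): 1e-20,
        ("A2", "B1"): 1e-30, ("A2", "B2"): 1e-10,
        ("B1", "A1"): 1e-50, ("B1", "A2"): 1e-30,
        ("B2", "A1"): 1e-20, ("B2", "A2"): 1e-10}

orthologs = bidirectional_best_hits(hits, genome_of, "A", "B")
homologs_of_a1 = homolog_partners(hits, "A1", 1e-3)
```

In this toy case the bi-directional best hit keeps only the A1/B1 pair, whereas the homolog list for A1 keeps both B1 and B2.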
Gclust databases
The Gclust software can be used with any set of genomes. We have been
working on photosynthetic organisms, and are interested in finding
proteins conserved in prokaryotic and eukaryotic photosynthetic
organisms. The initial datasets therefore included mainly photosynthetic
organisms, with some non-photosynthetic organisms for comparison. The
datasets CZ16Y, CZ20x, CZ30 and CZ35 are of this kind and differ in the
number of genomes included. The results of computation are now
publicly available through a web interface. The dataset Bact129
includes 132 species of bacteria. The dataset ALL145 includes animals
and plants (including algae) as well as many bacteria and Archaea.
Organellar genomes are also included, but selection is possible only by
organism (nuclear genome and mitochondrial genome, as well as
chloroplast genome, if present). For organellar studies, this web site
also provides datasets including all available chloroplast genomes
(plus cyanobacterial genomes) and many mitochondrial genomes of
photosynthetic organisms. They are named CPBACT8x and Mt23,
respectively.
Uniqueness of the Gclust software
Many researchers use BLASTP to search for homologous sequences in the
non-redundant databases. The results are often difficult to interpret,
because many similar sequences rank high and an identical sequence
sometimes appears as duplicated entries. The Gclust databases are
pre-calculated similarity matrices that show all homologs in the
selected dataset, so users do not need to perform an iterated BLASTP
search.
If you are not satisfied with the provided clusters, ...
it is time to make clusters yourself. Use the Gclust software to
make clusters from a genome set containing your favorite organisms.
Detailed usage instructions for the Gclust software and its
associated software are currently being prepared.
Use of the Gclust software (for computational biologists)
The Gclust software is written in C and runs on any common UNIX
platform, including Mac OS X.
Memory requirements depend on the input data and the mode of operation.
When all 102,513 predicted proteins encoded in four eukaryotic genomes
(including organelles) and 13 prokaryotic genomes were clustered, about
9 GB of memory was used on an SGI Onyx3400. The computation of the
ALL145 dataset required more than 2 weeks on the supercomputer system
of the Human Genome Center at the University of Tokyo.
In a typical flow of database construction (see the
figure below), a set of genomes is defined as a dataset. The GenBank
format files (gbk files) for the selected genomes are retrieved. Then,
Perl scripts are used to prepare a protein sequence file and an
annotation table for the entire dataset. All sequence manipulations by
the scripts internally invoke the SISEQ commands (Sato, N. 2000.
Bioinformatics 16: 180-181). The protein sequence file is used for
all-against-all BLASTP (version 2.2.12) analysis. The results are
parsed by a script to extract significant homology, using an E-value
cutoff of 1e-3 (a3 file).
The a3 file, the annotation table and a definition of organisms are
used as inputs for the Gclust software. Gclust is first run in the
'save' mode to prepare an intermediate file 'data.out'. A 'tapering' or
'ashikiri' option is provided to remove low-homology data while
retaining them for short sequences (the cutoff ranges from 1e-6 for
>100 aa to 1e-3 for <40 aa). In Gclust, homology data are handled in
units called sqlist, each holding a region-to-region similarity, namely
the coordinates of the similar region in both (query and target)
proteins and the E-value. A pair of proteins may therefore have many
sqlist entries, depending on domain structure and repeats.
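The tapering idea can be sketched as a length-dependent E-value cutoff. Only the two end points (1e-6 above 100 aa, 1e-3 below 40 aa) come from the text above. The log-linear interpolation between them, the use of the shorter of the two sequences, and the function names are assumptions made for illustration, and the actual rule in Gclust may differ.

```python
import math

# Illustrative sketch only. End points are from the text; the interpolation
# between 40 and 100 aa and the choice of the shorter sequence are assumptions.

def taper_cutoff(length, short_len=40, long_len=100,
                 short_cut=1e-3, long_cut=1e-6):
    """Return an E-value cutoff that is lenient for short sequences."""
    if length <= short_len:
        return short_cut
    if length >= long_len:
        return long_cut
    # Assumed: interpolate linearly on the log10 scale between the end points.
    frac = (length - short_len) / (long_len - short_len)
    log_cut = math.log10(short_cut) + frac * (math.log10(long_cut) - math.log10(short_cut))
    return 10.0 ** log_cut

def keep_hit(evalue, query_len, target_len):
    """Keep a hit if its E-value passes the cutoff for the shorter sequence."""
    return evalue <= taper_cutoff(min(query_len, target_len))
```

With these assumptions, a hit with an E-value of 1e-4 is kept for a 35 aa protein but discarded for a pair of 300 aa proteins.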

In the second step, Gclust reads the data.out file and performs
clustering according to various options. The most useful is the -clique
option, which produces a good clustering result in a relatively short
time (within one day). In the clique mode, the sqlist data are
converted to match data, which hold binary (i.e., protein-to-protein)
similarity: the best E-value among the sqlist entries, the overlap
score (total overlap region divided by total length), and the domain
structure estimated from homology segments. Normally, the clique mode
requires org_list data, which list the organisms.
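The conversion from region-level data to protein-level match data can be sketched as follows. This is not the Gclust source; the Segment record and function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and merging overlapping regions before summing is an assumption. The overlap score follows the definition given on this page: the total length of homology regions in both sequences divided by the sum of the two sequence lengths.

```python
from collections import namedtuple

# Illustrative sketch only; names and the interval-merging step are assumptions.
Segment = namedtuple("Segment", "q_start q_end t_start t_end evalue")

def covered_length(intervals):
    """Total length covered by inclusive intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def to_match(segments, q_len, t_len):
    """Summarize all segments between two proteins as (best E-value, overlap score)."""
    best_evalue = min(s.evalue for s in segments)
    q_cov = covered_length([(s.q_start, s.q_end) for s in segments])
    t_cov = covered_length([(s.t_start, s.t_end) for s in segments])
    return best_evalue, (q_cov + t_cov) / (q_len + t_len)

# Two overlapping similarity regions between a pair of 200 aa proteins.
segments = [Segment(1, 50, 11, 60, 1e-30), Segment(41, 100, 51, 110, 1e-12)]
best_evalue, overlap = to_match(segments, 200, 200)
```

Here 100 residues are covered on each protein, so the overlap score is 200/400 = 0.5, and the best E-value is 1e-30.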
For each protein, all match data are tabulated in two dimensions, using
E-value and overlap score. Match data are selected one by one starting
from the corner with the best E-value and highest overlap score. Various
criteria are applied, but essentially, a clearly defined cluster of
match data with respect to E-value and overlap score is selected. In
addition, match data are selected to cover as many organisms as possible
but without picking up very low similarity data. After this
purification of match data, an idlist holding the IDs of homologs is
made for each protein. The threshold E-value and overlap score are also
stored. Then, homolog clusters are formed by merging individual idlists.
At this stage, idlists with very different threshold E-values are not
merged. After repeated merging and removal, isolated proteins generated
by the removal step are incorporated again into the most suitable
cluster. Homolog groups are sorted according to the number of entries.
Finally, homolog groups are printed out to a large file as a
concatenated similarity matrix. The matrix may be expressed as 1
(similar) / 0 (dissimilar), E-value, or overlap score, depending on the
output option: 1, r, or s, respectively.
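The merging step can be pictured with a deliberately simplified sketch. It merges idlists that share a member unless their threshold E-values differ by more than a fixed number of orders of magnitude. The parameter max_log_gap, its default value and the greedy merging order are assumptions made for illustration; the later removal and re-incorporation steps are not modeled, so a protein can still appear in two clusters here.

```python
import math

# Illustrative sketch only; max_log_gap and the greedy order are assumptions,
# and the removal/re-incorporation steps of Gclust are not modeled.

def merge_idlists(idlists, max_log_gap=20.0):
    """idlists: list of (set of protein IDs, threshold E-value) pairs."""
    clusters = []  # each entry: (set of IDs, list of log10 thresholds)
    for ids, threshold in idlists:
        log_thr = math.log10(threshold)
        merged_ids, merged_logs = set(ids), [log_thr]
        remaining = []
        for c_ids, c_logs in clusters:
            compatible = all(abs(l - log_thr) <= max_log_gap for l in c_logs)
            if compatible and c_ids & merged_ids:
                merged_ids |= c_ids          # shared member, similar thresholds
                merged_logs = merged_logs + c_logs
            else:
                remaining.append((c_ids, c_logs))
        remaining.append((merged_ids, merged_logs))
        clusters = remaining
    return [sorted(ids) for ids, _ in clusters]

example = [({"a", "b"}, 1e-40), ({"b", "c"}, 1e-35), ({"c", "x"}, 1e-5)]
merged = merge_idlists(example)
```

In this example the first two idlists are merged through the shared member b, while the third stays separate because its threshold differs by about 30 orders of magnitude.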
Using a Perl script, homologtableG4.pl,
the homology matrix can be transformed into a table showing the members
of each homolog group.
Then, the tbsort2 software (written in C) is used to select homolog
groups that are conserved in a selected set of organisms. We call this
"phylogenetic profiling". It may be useful for extracting conserved
proteins of unknown function that might be involved in a pathway or
process shared by the set of organisms. We apply this principle to
extract "chloroplast proteins of endosymbiont origin", or CPRENDOs, but
other uses of phylogenetic profiling are possible.
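The principle of this selection can be sketched in a few lines. This is not tbsort2 and does not use its grp_def or pat_def file formats, which are not described here; the function name, the table layout and the organism names are hypothetical.

```python
# Illustrative sketch only; not tbsort2. Table layout and names are hypothetical.

def select_by_profile(groups, required, forbidden=()):
    """Return IDs of homolog groups present in every required organism
    and absent from every forbidden organism.

    groups: dict mapping group ID -> set of organisms having a member.
    """
    required, forbidden = set(required), set(forbidden)
    return sorted(gid for gid, organisms in groups.items()
                  if required <= organisms and not (forbidden & organisms))

groups = {
    1: {"cyanobacterium", "plant", "red alga"},
    2: {"cyanobacterium", "plant", "non-photosynthetic bacterium"},
    3: {"plant"},
}
selected = select_by_profile(
    groups,
    required={"cyanobacterium", "plant"},
    forbidden={"non-photosynthetic bacterium"},
)
```

Only group 1 is shared by the cyanobacterium and the plant without also occurring in the non-photosynthetic bacterium, which is the kind of pattern used to look for proteins of endosymbiont origin.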
-- additional old explanation --
In the basic mode with the -hom option, single-linkage clustering is
performed with an E-value as a threshold. In this case, all the
homologs that are linked by a single homology relationship are placed
in a single cluster. Such clusters are used as discrete characters to
make a parsimony tree (using the PAUP software) that we call a 'genome
tree'. With the -repeat option, an iterated clustering is performed by
changing the threshold E-value from the lowest (such as 1e-50) to the
highest (such as 1e-3). During the iteration, an abrupt increase in the
number of members of a cluster is taken as a sign of the formation of
an unnatural cluster including distantly related or multidomain
proteins.
An additional criterion is the overlap score, which is defined as the
sum of the lengths of homology regions over the entire sequences
divided by the sum of the lengths of the two sequences being compared.
Another criterion is the complexity of domain structure, which is
estimated from the BLASTP data and is used to eliminate multidomain
proteins during the initial iteration. By these criteria, concise or
natural clusters are extracted and removed from further clustering with
higher E-values. In an additional mode with the -homsub option, the
final clusters are further sub-clustered to maximize similarity within
each subgroup.
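Single-linkage clustering at a fixed E-value threshold, and the effect of relaxing the threshold step by step, can be sketched with a union-find structure. This is not the Gclust implementation; the function names and toy data are hypothetical, and the rule Gclust uses to decide what counts as an abrupt increase is not reproduced.

```python
# Illustrative sketch only; not Gclust code. Names and data are hypothetical.

def single_linkage(proteins, hits, threshold):
    """Cluster proteins linked by any hit with E-value <= threshold."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (a, b), evalue in hits.items():
        if evalue <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())

def largest_cluster_by_threshold(proteins, hits, thresholds):
    """Size of the largest cluster at each threshold, strictest first."""
    return [(t, max(len(c) for c in single_linkage(proteins, hits, t)))
            for t in sorted(thresholds)]

proteins = ["p1", "p2", "p3", "p4", "p5"]
hits = {("p1", "p2"): 1e-60, ("p2", "p3"): 1e-40,
        ("p3", "p4"): 1e-5, ("p4", "p5"): 1e-55}
growth = largest_cluster_by_threshold(proteins, hits, [1e-50, 1e-20, 1e-3])
```

In the toy data, the weak p3/p4 link joins two otherwise separate groups once the threshold reaches 1e-3, so the largest cluster jumps from 3 to 5 members. A jump of this kind is what the iterated mode watches for.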
Example homolog group
Example alignment
Attention! Good targets and unsuited proteins
It is essential that you recognize what you are looking for in the
Gclust database. The Gclust database consists of clusters of homologous
proteins. Some proteins belong to large protein families, while others
are orphans. Some proteins are well characterized by experiments, while
others are still hypothetical. The author of Gclust originally aimed at
recognizing hypothetical proteins conserved across various phyla.
Therefore, a recommended use of the Gclust database is to find
conserved hypothetical proteins. Another, simpler use is to collect all
possible homologs for constructing a phylogenetic tree.
What the database is not intended for is finding a homolog of
transcription factors or kinases. A simple desire to find a homolog of
a functionally important molecule may be met by a sophisticated
phylogenetic analysis of all possible homologs. In the Gclust
databases, some clusters containing large protein families are very
large and are not well resolved. The ten largest clusters include
DNA-binding proteins, RNA-binding proteins, serine/threonine kinases,
histidine kinases, response regulators, and components of ABC
transporters. I agree that these are important functional molecules in
biological systems, but their functional classification is not easy.
There are several reasons why these proteins are not suited to the
Gclust database.
First, the Gclust software clusters structurally similar proteins
using the results of BLASTP. In large clusters consisting of similar
sequences, a more rigorous phylogenetic analysis is necessary to
classify homologs correctly. The clusters in the Gclust database may
not correctly reflect phylogenetic clusters.
Second, various transcription factors and kinases contain
additional functional domains. In the Gclust algorithm, multidomain
proteins are often separated. However, many biologists want to obtain
transcription factors having a similar DNA-binding domain, disregarding
additional domains. In this case, Pfam analysis may be more helpful.
Finally, sequence similarity and functional relatedness are
different things. Proteins with similar sequences may be involved in
different cellular functions or pathways. As a result, a single cluster
of ABC transporters contains various transporters that carry different
molecules. Many biologists are disappointed to find such a situation.
However, transporters are similar to one another even if they transport
different molecules. Structural similarity arising from phylogenetic
relationship may be more apparent than similarity of substrate-binding
sites. In this case, the Gclust clusters do not correspond to a
functional classification of transporters.
Please keep these cautions in mind when using the Gclust database.
References
N. Sato (2009)
Gclust: trans-kingdom classification of proteins using automatic individual threshold setting.
Bioinformatics (on-line access) doi: 10.1093/bioinformatics/btp047.
N. Sato, M. Ishikawa, M. Fujiwara and K. Sonoike (2005)
Mass identification of chloroplast proteins of endosymbiont origin by
phylogenetic profiling based on organism-optimized homologous protein
groups.
Genome Informatics 16: 56-68.
N. Sato (2003)
Gclust: genome-wide clustering of protein sequences for identification
of photosynthesis-related genes resulting from massive horizontal gene
transfer.
Genome Informatics 14: 585-586.
N. Sato (2002)
Comparative analysis of the genomes of cyanobacteria and plants.
Genome Informatics 13: 173-182.
Data sources
GenBank Databases:
NCBI
Unfinished Genome Data in JGI:
JGI
Cyanidioschyzon merolae Genomic Data
CGP
Distribution of data and software
All downloads should be done from the Gclust Download page.
1. Gclust software is available as the source code for UNIX.
The software is distributed for academic use.
The copyright is kept by N. Sato.
Re-distribution of the software is not allowed without permission.
If you agree, you may download the software from the links below.
If you download the software, you are automatically assumed to agree
with this condition.
Many scripts use SISEQ commands. Install the SISEQ package before
using these scripts.
The SISEQ package is available from http://nsato4.c.u-tokyo.ac.jp/old/Siseq.html.
Please read the description in the upper part of this document for the
flow of data processing.
2. Example of data processing. Note that this is an old version. See the latest version in the download page.
(1) GenBank file (test.gbk)
---> test.fa and test.p.table (SISEQ is needed)
gbk2ptable.pl test.gbk AB0012345 test
(2) ---> test.gfa and test.g.table
gclustsort4.pl test
(3) ---> test.pin, test.psq, test.phr
formatdb -i test.gfa -n test
(4) BLASTP (You should know how to use blastall.)
blastall -FF -i test.gfa -d test -p blastp -e 0.01 | bl2ls3.pl - 1e-3 > testa3
Alternatively,
blastall -FF -i test.gfa -d test -p blastp -e 0.01 -o test.result
bl2ls2.pl test.result 1e-3 > testa3
(5) Gclust
gclust testa3 -save -tab=test.g.table -taper
This creates a file data.out.
gclust -read=data.out -hom -thr=1e-20 -out=1
This is a simple clustering using a single cutoff value.
You can use various options ...
gclust -read=data.out -hom -clique -org -regroup2 -out=1rs
This creates three files: testa3.hom.1, testa3.hom.r and testa3.hom.s.
You also need org_file, which defines the organisms. The names of
organisms must be determined in step (1).
(6) Table
homologtableG5.pl testa3.hom.1 prefixTEST > test.tbl
Here, you need a file prefixTEST, which lists the names of organisms.
(7) TBSORT
tbsort2 test.tbl 12345 out grp_def pat_def 1
'out' is the name of the output file. 12345 is the number of clusters
in test.tbl (see its last line). You also need the files grp_def and
pat_def.
(8) Phylogenetic tree (SISEQ is needed)
getgrp.pl testa3.hom.1 list > list.hom
The list file contains cluster numbers, one per line.
makefa3b.pl list test.fa
This creates a directory 'seqs' containing multiple FASTA files.
Then, you may use any alignment software (such as clustalw or muscle)
to create an alignment for each FASTA file.
Finally, you can construct a phylogenetic tree using any software you
like.
Copyright © 2006-8 Sato Lab. All Rights Reserved.
Last update: Jan. 22, 2009.