Gclust

N. Sato (2009)
Gclust: trans-kingdom classification of proteins using automatic individual threshold setting.
Bioinformatics (on-line access) doi: 10.1093/bioinformatics/btp047.

N. Sato, M. Ishikawa, M. Fujiwara and K. Sonoike (2005)
Mass identification of chloroplast proteins of endosymbiont origin by phylogenetic profiling based on organism-optimized homologous protein groups.
Genome Informatics 16: 56-68.

N. Sato (2003)
Gclust: genome-wide clustering of protein sequences for identification of photosynthesis-related genes resulting from massive horizontal gene transfer.
Genome Informatics 14: 585-586.

N. Sato (2002)
Comparative analysis of the genomes of cyanobacteria and plants.
Genome Informatics 13: 173-182.

GenBank Databases: NCBI
Unfinished Genome Data in JGI: JGI
Cyanidioschyzon merolae Genomic Data: CGP

(1) GenBank file (test.gbk) ---> test.fa and test.p.table (SISEQ is needed)
    gbk2ptable.pl test.gbk AB0012345 test
(2) ---> test.gfa and test.g.table
    gclustsort4.pl test
(3) ---> test.pin, test.psq, test.phr
    formatdb -i test.gfa -n test
(4) BLASTP (You should know how to use blastall.)
    blastall -FF -i test.gfa -d test -p blastp -e 0.01 | bl2ls3.pl - 1e-3 > testa3
    Alternatively,
    blastall -FF -i test.gfa -d test -p blastp -e 0.01 -o test.result
    bl2ls2.pl test.result 1e-3 > testa3
(5) Gclust
    gclust testa3 -save -tab=test.g.table -taper
       This creates the file data.out.
    gclust -read=data.out -hom -thr=1e-20 -out=1
       This is a simple clustering using a single cutoff value.
    You can use various options ...
    gclust -read=data.out -hom -clique -org -regroup2 -out=1rs
       This creates three files: testa3.hom.1, testa3.hom.r and testa3.hom.s.
       You also need org_file, which defines the organisms. The organism names must be those determined in step (1).
(6) Table
    homologtableG5.pl testa3.hom.1 prefixTEST > test.tbl
    This step requires a file, prefixTEST, which lists the names of the organisms.
(7) TBSORT
    tbsort2 test.tbl 12345 out grp_def pat_def 1
    'out' is the name of the output file. 12345 is the number of clusters in test.tbl (see the last line). You also need the files grp_def and pat_def.
(8) Phylogenetic tree (SISEQ is needed)
    getgrp.pl testa3.hom.1 list > list.hom
       The list file contains cluster numbers, one per line.
    makefa3b.pl list test.fa
       This creates a directory 'seqs' containing multiple FASTA files.
    Then use any alignment software (such as clustalw or muscle) to create an alignment for each FASTA file.
    Finally, construct a phylogenetic tree with whichever software you prefer.
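
The alignment part of step (8) could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes MUSCLE 3.x command-line syntax (-in/-out; MUSCLE 5 uses -align/-output instead), and the function name align_all, the DRY_RUN variable and the .aln suffix are illustrative choices that are not part of Gclust.

```shell
# Sketch: align every FASTA file that makefa3b.pl wrote into 'seqs'.
# Assumption: MUSCLE 3.x syntax (muscle -in INPUT -out OUTPUT).
# align_all, DRY_RUN and the .aln suffix are hypothetical names, not part of Gclust.
align_all() {
    dir=${1:-seqs}                      # directory to scan; defaults to 'seqs'
    for fasta in "$dir"/*; do
        [ -f "$fasta" ] || continue     # skip non-files (and the unexpanded glob of an empty directory)
        case $fasta in
            *.aln) continue ;;          # do not re-align outputs from an earlier run
        esac
        if [ -n "${DRY_RUN:-}" ]; then
            # Only print the command that would be run
            echo "muscle -in $fasta -out $fasta.aln"
        else
            muscle -in "$fasta" -out "$fasta.aln"
        fi
    done
}
```

Running `DRY_RUN=1 align_all seqs` prints the commands without executing them, which is a convenient check before a long run. To use clustalw or another aligner, replace the muscle command line with that program's own syntax.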

Gclust databases