GTPS stands for Gene Trek in Prokaryote Space. GTPS is a database in which all prokaryotes with
determined genome sequences are re-annotated through various analyses. This page introduces the
re-annotation procedure, including the analysis programs and reference databases used.
Please also refer to the GTPS paper.
Purpose of GTPS
Many complete genomes of Bacteria and Archaea are registered in DDBJ/EMBL/GenBank, which form the INSD
(International Nucleotide Sequence Database). Their annotation information and DNA sequences
can be obtained from the Genome Information Broker (GIB). However, the BLAST thresholds and
BLAST reference databases used during annotation differ between genomes. As a result, the quality
of annotation varies, and genome-wide analyses such as comparative genome analysis are difficult.
We at DDBJ therefore re-annotate ORFs (Open Reading Frames) and RNAs on the genome sequences
of Bacteria and Archaea using a common protocol.
Details of GTPS database constructing procedure

| Activity name | Description |
|---|---|
| 1. Masking of RNA and repeat region | Create mask regions that are excluded from ORF prediction. There are several types of mask region, including non-coding RNA regions and LTR (Long Terminal Repeat) repeat regions. Non-coding RNAs are predicted with both tRNAscan-SE and the Rfam database. The mask regions are generated from both the RNA and the repeat regions. |
| 2. ORF prediction | Predict ORF regions on the genome using Glimmer. |
| 3. Analysis of predicted ORFs | Analyze the predicted ORFs using both BLAST and InterProScan. |
| 4. Comparison between predicted ORF and INSD annotation | Compare the predicted ORFs with the CDSs (coding sequences) of INSD and assign flags such as "Complete match with INSD CDS", "Only 3'end match with INSD CDS" and "There is no INSD CDS matches in 3'end". All predicted ORFs are also checked for whether their frame matches an INSD pseudogene annotation. In addition, INSD CDS regions that Glimmer could not predict are extracted and analyzed with BLAST and InterProScan in the same way as the predicted ORFs. |
| 5. Modification of start position of predicted ORFs | Move the start position of ORFs to shorten them and resolve the overlaps among ORFs. ORFs whose length has changed are analyzed with BLAST and InterProScan in the same way as the predicted ORFs. |
| 6. Grade classification of ORFs | All GTPS ORFs are graded by ORF certainty using the BLAST and InterProScan results. |
| 7. Annotation of gene product name | Assign a gene product name to every GTPS ORF by referring to both the INSD annotation and the BLAST result. Motif names and GO descriptions are also assigned from the InterProScan result. |
| 8. Prediction of IS region and IS name | Map IS sequences, taken from the GIB-IS database, to the genomes. Predict IS regions and assign an IS name to each IS region. |
1. Masking of RNA and repeat region
Create mask regions that are excluded from ORF prediction. There are several types of mask region,
including non-coding RNA regions and LTR (Long Terminal Repeat) repeat regions. Non-coding RNAs are
predicted with both tRNAscan-SE and the Rfam database. The mask regions are generated from both the
RNA and the repeat regions.
Parent activity: GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Extraction of non-coding RNA annotation from INSD | Extract the annotation of non-coding RNA regions whose feature name is rRNA (ribosomal RNA), ncRNA (non-protein-coding RNA), misc_RNA or tmRNA (transfer-messenger RNA). |
| Extraction of repeat annotation from INSD | Extract the annotation of repeat regions whose feature name is LTR (Long Terminal Repeat). |
| Prediction of tRNA using tRNAscan-SE | Predict tRNA regions with tRNAscan-SE and obtain the product name of each tRNA. Run tRNAscan-SE with the division parameter, which is 'Archaea' (-A) or 'Prokaryote' (-P): `tRNAscan-SE <division (-A or -P)> <FASTA file>`. The division information can be obtained from the WABI GIB service. Retrieve the tRNA locations on the genome and the product names from the tRNAscan-SE result. [Figure: example of the resulting annotation] |
| Prediction of non-coding RNA using Rfam | Extract the prokaryote sequences from the Rfam database, then run BLAST with the Rfam sequences as the query and the genome sequence as the reference database: `blastall -p blastn -d <database of genome> -i <sequence of Rfam> -e 1.0e-10 -m 8 -F F`. From the BLAST result, extract as non-coding RNA the alignment regions where the identity is 100% and the full length of the Rfam sequence is aligned. The Rfam name, ID and product name are assigned. [Figure: example of the resulting annotation] |
| Integration of all mask regions | The INSD RNA regions, the INSD repeat regions, the tRNAscan-SE result and the Rfam mapping result are integrated. The resulting mask regions are excluded from ORF prediction. |
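
The integration step can be pictured as a union of coordinate intervals. The following is only a minimal sketch: the page does not say how overlapping or adjacent regions are combined, so merging them into one region, the 1-based inclusive coordinates, and the function name `merge_mask_regions` are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# (start, end) on the genome; 1-based inclusive coordinates are an assumption.
Region = Tuple[int, int]


def merge_mask_regions(*sources: Iterable[Region]) -> List[Region]:
    """Union regions from several sources into non-overlapping mask regions."""
    regions = sorted(r for src in sources for r in src)
    merged: List[Region] = []
    for start, end in regions:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly abuts the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical coordinates standing in for the four sources named above.
insd_rna = [(100, 250), (4000, 5500)]
insd_repeat = [(900, 1200)]
trnascan = [(240, 320)]
rfam = [(1201, 1300)]

mask = merge_mask_regions(insd_rna, insd_repeat, trnascan, rfam)
print(mask)  # [(100, 320), (900, 1300), (4000, 5500)]
```

Sorting first keeps the merge to a single pass, which matters little for one genome but keeps the logic easy to check.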
2. ORF prediction
Predict ORF regions on the genome using Glimmer.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Glimmer with minimum ORF length of 180 bp | Run Glimmer with a minimum ORF length of 180 bp. |
| Glimmer with minimum ORF length of 45 bp | Run Glimmer again with a minimum ORF length of 45 bp. |
| Integration of two times Glimmer | Compare the results of the two Glimmer runs. ORFs whose start and end positions are both the same are integrated into one. If the end position is the same and the start position differs, the longer ORF is adopted and used in the next activity. If the end position differs, both ORFs are adopted and used in the next activity. |
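
The integration rule for the two Glimmer runs can be sketched as grouping ORFs by their end position and keeping the longest one in each group. This is an illustrative sketch, not the GTPS implementation: the `Orf` record, the use of the strand as part of the grouping key, and the neglect of ORFs that wrap around the origin of a circular genome are assumptions.

```python
from typing import Dict, List, NamedTuple, Tuple


class Orf(NamedTuple):
    start: int   # position of the start codon
    end: int     # position of the stop codon
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        return abs(self.end - self.start) + 1


def integrate_runs(run_180: List[Orf], run_45: List[Orf]) -> List[Orf]:
    """Keep one ORF per end position (the longest); keep all distinct ends."""
    best: Dict[Tuple[int, str], Orf] = {}
    for orf in run_180 + run_45:
        key = (orf.end, orf.strand)
        kept = best.get(key)
        # Same start and end: identical, nothing changes.
        # Same end, different start: the longer ORF replaces the shorter one.
        if kept is None or orf.length > kept.length:
            best[key] = orf
    return sorted(best.values(), key=lambda o: (min(o.start, o.end), o.end))


# Hypothetical predictions from the 180 bp and 45 bp runs.
run_180 = [Orf(100, 700, "+"), Orf(2000, 1200, "-")]
run_45 = [Orf(100, 700, "+"), Orf(2300, 1200, "-"), Orf(900, 980, "+")]

integrated = integrate_runs(run_180, run_45)
for orf in integrated:
    print(orf)
```

In this example the identical ORF ending at 700 is kept once, the longer of the two ORFs ending at 1200 is kept, and the short ORF found only by the 45 bp run is carried forward.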
3. Analysis of predicted ORFs
Analyze the predicted ORFs using both BLAST and InterProScan.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Analysis using BLAST | Run BLAST with the predicted ORFs as the query and the amino acid sequences of the DDBJ BCT division as the reference database: `blastall -p blastp -e 0.001 -F F -d <database> -i <amino acid of predicted ORF> -o <result file>` |
| Providing flags from BLAST result | Assign every predicted ORF a BLAST flag that expresses the certainty of the gene. An ORF that meets either of the requirements is regarded as a hit. In the BLAST flags, "function unknown" means that the product name contains a keyword such as 'unknown', 'hypothetical protein', 'probable ORF' or 'predicted protein'. Please refer to the page for details (Japanese only). |
| Analysis using InterProScan | Run InterProScan on all predicted ORFs: `iprscan -cli -altjobs -iprlookup -goterms -seqtype p -format raw -i <amino acid of ORF>` |
| Providing flags from InterProScan result | Assign the InterProScan flags from the result of InterProScan. Please see the list of invalid motifs. |
4. Comparison between predicted ORF and INSD annotation
Compare the predicted ORFs with the CDSs of the INSD annotation and assign flags such as "same with INSD CDS",
"same with INSD CDS in terms of end position" and "There is no INSD CDS whose end position is same with
predicted ORF". All predicted ORFs are also checked for whether their frame is the same as that of a
pseudogene. In addition, INSD CDSs that Glimmer could not predict are extracted and analyzed with
BLAST and InterProScan in the same way as the predicted ORFs.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Comparison between all predicted ORFs and INSD CDS | Compare all predicted ORFs with the INSD CDSs and assign the flags for the predicted ORFs in terms of comparison with INSD CDS. |
| Comparison between all INSD CDS and predicted ORFs | Compare all INSD CDSs with the predicted ORFs and assign the flags for INSD CDS. In these flags, "function unknown" means that the product name contains a keyword such as 'unknown', 'hypothetical protein', 'probable ORF' or 'predicted protein'. Please refer to the page for details. |
| Analysis of ORF which can not be predicted by Glimmer | Extract the INSD CDSs that Glimmer could not predict, that is, the INSD CDSs whose flag is '7-1', '7-2' or 'J'. These CDSs are analyzed with BLAST and InterProScan in the same way as the predicted ORFs. |
5. Modification of start position of predicted ORFs
Move the start position of ORFs to shorten them and resolve the overlaps between ORFs. ORFs whose
length has changed are analyzed with BLAST and InterProScan in the same way as the predicted ORFs.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Extracting certain ORFs from predicted ORFs | Extract the reliable ORFs from all predicted ORFs. Reliability is derived from BLAST and InterProScan: the BLAST flag is 1, 3 or 4, or the InterProScan flag is 1, and the flag for the predicted ORF in terms of comparison with INSD CDS is neither P nor J. |
| Extracting ORF pairs which are overlapped between ORFs | Extract the pairs of ORFs that overlap each other by 30 bp or more. [Figure: overlapping ORF pair] |
| Modification of start position of ORF | Move the start position of an ORF to shorten it and resolve the overlap between ORFs. The start position is corrected so that the motif region of the ORF is not removed. [Figure: correction of the start position] |
| Analysis of ORF whose start position is corrected | ORFs whose start position has been corrected are analyzed with BLAST and InterProScan in the same way as the predicted ORFs. |
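
The 30 bp overlap criterion used to select ORF pairs can be sketched as follows. This is a simple illustration under assumptions: ORFs are given as hypothetical `(id, left, right)` tuples with 1-based inclusive coordinates, strand is ignored, and the pairwise comparison is quadratic, which a real pipeline would likely replace with a sorted sweep.

```python
from itertools import combinations
from typing import List, Tuple

# (id, left, right): leftmost and rightmost genome positions (assumed layout).
Orf = Tuple[str, int, int]


def overlap_length(a: Orf, b: Orf) -> int:
    """Number of positions shared by two ORFs, or 0 if they do not overlap."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


def overlapping_pairs(orfs: List[Orf], min_overlap: int = 30) -> List[Tuple[str, str, int]]:
    """Return (id_a, id_b, overlap) for every pair sharing min_overlap bp or more."""
    pairs = []
    for a, b in combinations(sorted(orfs, key=lambda o: o[1]), 2):
        shared = overlap_length(a, b)
        if shared >= min_overlap:
            pairs.append((a[0], b[0], shared))
    return pairs


# Hypothetical ORFs: only orf1/orf2 share 30 bp or more.
orfs = [
    ("orf1", 100, 400),
    ("orf2", 371, 900),
    ("orf3", 890, 1500),
    ("orf4", 1480, 2000),
]
pairs = overlapping_pairs(orfs)
print(pairs)  # [('orf1', 'orf2', 30)]
```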
6. Grade classification of ORFs
All GTPS ORFs are graded in terms of certainty using the results of BLAST and InterProScan. The
GTPS ORFs consist of the ORFs predicted by Glimmer, the INSD ORFs that Glimmer could not predict, and
the ORFs whose start position was modified to resolve overlaps between predicted ORFs.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Grade classification by BLAST alignment length | All GTPS ORFs are given a "Grade by BLAST alignment length" using the BLAST result. BLAST is run against the amino acid sequences of the whole DDBJ BCT division. |
| Grade classification by annotation of BLAST subject | All GTPS ORFs are given a "Grade by annotation of BLAST subject" using the BLAST result. "Function unknown" means that the product name contains a keyword such as 'unknown', 'hypothetical protein', 'probable ORF' or 'predicted protein'; please refer to the page for details. "Membrane protein" means that the product name contains a keyword such as 'inner-membrane protein', 'outer membrane protein' or 'integral-membrane protein'; please refer to the file for details. |
| Grade classification by comparison with INSD CDS | All GTPS ORFs are given a "Grade by comparison with INSD CDS". |
| Grade classification by InterProScan | All GTPS ORFs are given a "Grade by InterProScan" using the result of InterProScan. |
| Integration of grade information | Integrate all the grade information above and assign each GTPS ORF a grade from AAAA to X. [Figure: grade table] The best grade, given to the most certain ORFs, is AAAA: the grade by BLAST alignment length is A, the grade by annotation of BLAST subject is 1 and the grade by InterProScan is also 1. The grade by comparison with INSD CDS is then appended, so the final grade given to a GTPS ORF looks like AAAA1 or BBB2. |
7. Annotation of gene product names
Assign a gene product name to every GTPS ORF by referring to both the INSD annotation and the BLAST result.
Motif names and GO descriptions are also assigned.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| Providing gene product name from INSD annotation | If the end position of an ORF is the same as that of an INSD CDS, the product name of the CDS is used as the product name of the ORF. [Figure: ORF and INSD CDS with the same end position] However, the INSD name is not quoted as it is; it is processed before use. |
| Providing gene product name from BLAST result | A gene product name is taken from the BLAST result if the cover ratio of both the subject and the query is 70% or more. BLAST is run against all amino acid sequences of the DDBJ BCT division. The name is processed in the same way as a name taken from the INSD annotation. If there are several BLAST hits and one of the subjects has a product name other than 'hypothetical protein', that name is assigned to the ORF. If all subjects are annotated as 'hypothetical protein', 'hypothetical protein' is assigned. If all subjects have an invalid product name such as 'B1306.01 protein' or 'Tgh005', or if BLAST finds no hit, 'predicted in CGM' is assigned to the ORF. |
| Providing annotation from InterProScan result | Assign the motif name and InterPro ID from the InterProScan result. The Gene Ontology description and ID for each InterPro ID are also assigned using both interpro2go and the GO database. |
| Integration of annotation | The flags and annotation from BLAST and InterProScan are integrated. [Figure: flag information for ORF] All information, including the flags, is summarized in a flat file format. |
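
The rules for taking a product name from BLAST hits can be sketched as a small decision function. This is a hedged illustration only: the `Hit` record, the function name, and in particular the `LOCUS_LIKE` pattern used to recognise "invalid" names are assumptions. The page gives only the examples 'B1306.01 protein' and 'Tgh005' and does not define the actual rule, nor does it say what happens when hits mix hypothetical and invalid names.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    product: str          # product name of the BLAST subject
    query_cover: float    # aligned fraction of the query, 0.0 to 1.0
    subject_cover: float  # aligned fraction of the subject, 0.0 to 1.0


# Assumption: an "invalid" name is a single identifier-like token containing
# a digit, optionally followed by " protein". The real GTPS rule is not given.
LOCUS_LIKE = re.compile(r"^\S*\d\S*( protein)?$")


def choose_product_name(hits: List[Hit], min_cover: float = 0.70) -> str:
    """Pick a product name following the rules described in the table above."""
    names = [
        h.product
        for h in hits
        if h.query_cover >= min_cover and h.subject_cover >= min_cover
    ]
    valid = [n for n in names if not LOCUS_LIKE.match(n)]
    if not valid:
        # No usable hit, or only invalid names.
        return "predicted in CGM"
    for name in valid:
        if name.lower() != "hypothetical protein":
            return name
    return "hypothetical protein"


hits = [
    Hit("hypothetical protein", 0.95, 0.92),
    Hit("DNA polymerase III subunit beta", 0.88, 0.81),
    Hit("B1306.01 protein", 0.99, 0.99),
]
print(choose_product_name(hits))  # DNA polymerase III subunit beta
print(choose_product_name([]))    # predicted in CGM
```

Treating hits below the 70% cover ratio the same as "no hit" is a further assumption of this sketch.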
8. Prediction of IS region and IS name
Predict IS (Insertion Sequence) regions by mapping IS sequences to the genomes, and assign an IS name
to each region. The IS sequences are taken from the GIB-IS database.
Parent activity:GTPS database constructing procedure

| Activity name | Description |
|---|---|
| BLAST using both IS sequence and genome | Run BLAST with the genome sequence as the reference database and the IS sequences as the query. The IS sequences are all entries of the GIB-IS database. `blastall -p blastn -e 0.001 -F F -d <database of genome sequence> -i <IS sequence> -o <result file>` |
| Mapping IS sequence to genome | Determine the IS regions on the genome by mapping the IS sequences to the genome. The strand of the IS annotation is direct if the alignment is on the complement side. A sequence is mapped when the alignment covers 90% or more of the full length of the query (cover ratio) and the identity of the aligned region is 90% or more. |
| Integration of overlapped IS regions | IS sequences can be mapped so that they overlap each other. [Figure: diagram of overlapping IS regions] Such regions are integrated into one IS region. |
| Providing IS name | Assign an IS name to each IS region on the genome. The repeat_region feature is used for the annotation. |
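
The mapping thresholds and the integration of overlapping hits can be sketched together. The following is an illustration under stated assumptions: the `IsHit` fields are hypothetical, the cover ratio is approximated as alignment length divided by query length, and collecting the names of all merged hits is a placeholder, since the page does not say how the IS name of an integrated region is chosen.

```python
from typing import List, NamedTuple, Tuple


class IsHit(NamedTuple):
    name: str        # IS name of the query sequence
    query_len: int   # full length of the IS query
    align_len: int   # length of the aligned region
    identity: float  # percent identity of the aligned region
    start: int       # subject (genome) start; may exceed end on the complement
    end: int         # subject (genome) end


def map_is_regions(
    hits: List[IsHit], min_cover: float = 0.90, min_identity: float = 90.0
) -> List[Tuple[int, int, List[str]]]:
    """Filter hits by cover ratio and identity, then merge overlapping ones."""
    kept = sorted(
        (min(h.start, h.end), max(h.start, h.end), h.name)
        for h in hits
        if h.align_len / h.query_len >= min_cover and h.identity >= min_identity
    )
    regions: List[Tuple[int, int, List[str]]] = []
    for left, right, name in kept:
        if regions and left <= regions[-1][1]:
            # Overlaps the previous region: integrate into one IS region.
            p_left, p_right, names = regions[-1]
            if name not in names:
                names = names + [name]
            regions[-1] = (p_left, max(p_right, right), names)
        else:
            regions.append((left, right, [name]))
    return regions


# Hypothetical BLAST hits; the last two fail the cover ratio or identity test.
hits = [
    IsHit("IS1", 768, 768, 99.5, 1000, 1767),
    IsHit("IS1A", 768, 700, 95.0, 1700, 1050),
    IsHit("IS3", 1258, 900, 98.0, 5000, 5899),
    IsHit("IS5", 1195, 1195, 88.0, 9000, 10194),
]
regions = map_is_regions(hits)
print(regions)  # [(1000, 1767, ['IS1', 'IS1A'])]
```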
9. Glimmer
Build the training model and predict ORF regions using
Glimmer 3.02. The procedure below is almost the same as the script g3-iterated.csh included in the
Glimmer package. The differences are the -t option of the long-orfs command, which extracts ORFs for the
training model, and the minimum ORF length parameter. The translation table number and the molecular
form parameters are also changed.
When the paper was published, ORF regions were predicted using both Glimmer2 and RBSfinder. However, we have used only Glimmer3 since 2006, because Glimmer3 can build the RBS training model itself.
Parent activity:2. ORF prediction

| Activity name | Description |
|---|---|
| Making sequence for training model | Extract the genome positions of the ORFs for the training model. These ORFs are long and do not overlap each other. `long-orfs -t 1.08 --no_header <FASTA file of genome> <tag name>.longorfs` |
| Extracting sequences for training model | Extract the sequences for the training model using the genome and the position information from the previous activity. `extract -t <FASTA file of genome> <tag name>.longorfs --nostop > <tag name>.train` |
| Building training model | The training model, an ICM (Interpolated Context Model), is built from the sequences of the previous activity. `build-icm -r <model file> < <tag name>.train` |
| 1st Glimmer | The first prediction of ORF regions is run with the training model from the previous activity. The maximum overlap length between ORFs (-o) is 50 bp and the threshold (-t) is 30. Specify the -l option if the molecular form is linear; it is not needed if the molecular form is circular. The translation table number can be retrieved from the WABI TxSearch service. `glimmer3 -o 50 -t 30 -g <ORF minimum length (180 or 45)> [-l] -z <translation table number (11 or 4)> <FASTA file of genome> <file of training model> <tag name>` |
| Extracting ORF positions from the result of 1st Glimmer | Extract the ORF positions on the genome from the result of the 1st Glimmer. `tail -n +2 <result of 1st Glimmer> > <file of ORF positions on genome>` |
| Building RBS training model | Extract the 25 bp upstream of each ORF predicted by the 1st Glimmer and build a training model for the 6-residue RBS (Ribosome Binding Site). The training model for RBS is a Position Weight Matrix. `upstream-coords.awk 25 0 <file of ORF positions on genome> \| extract <FASTA file of genome> - > <tag name>.upstream; elph <tag name>.upstream LEN=6 \| get-motif-counts.awk > <training model for RBS>` |
| Research for start codon distribution | The start codon distribution is examined using the ORFs predicted by the 1st Glimmer. The distribution of the start codons atg, gtg and ttg is obtained as, for example, 0.810, 0.139 and 0.051. `start-codon-distrib -3 <FASTA file of genome> <ORF positions of 1st Glimmer result>` |
| 2nd Glimmer | Run the 2nd Glimmer using the training model (ICM), the training model for RBS (Position Weight Matrix) and the start codon distribution from the previous activities. Specify the -l option if the molecular form is linear; it is not needed if the molecular form is circular. `glimmer3 -o 50 -t 30 -g <ORF minimum length (180 or 45)> [-l] -i <mask file> -z <translation table number (11 or 4)> -b <training model for RBS> -P <start codon distribution> <FASTA file of genome> <training model> <result file>` |
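
Because the 2nd Glimmer call depends on several per-genome settings (minimum ORF length, molecular form, translation table), it can help to assemble the argument list programmatically. The sketch below only builds the command; the function name and all file names are hypothetical, while the flags are the ones listed in the table above.

```python
import shlex
from typing import List


def glimmer3_second_pass(
    genome_fasta: str,
    icm_model: str,
    tag: str,
    *,
    min_gene_len: int,       # 180 or 45
    translation_table: int,  # 11 or 4
    linear: bool,            # True adds -l; circular genomes omit it
    mask_file: str,
    rbs_pwm: str,
    start_codon_probs: str,  # e.g. "0.810,0.139,0.051" for atg, gtg, ttg
) -> List[str]:
    """Assemble the argument list for the 2nd Glimmer run."""
    cmd = ["glimmer3", "-o", "50", "-t", "30", "-g", str(min_gene_len)]
    if linear:
        cmd.append("-l")
    cmd += [
        "-i", mask_file,
        "-z", str(translation_table),
        "-b", rbs_pwm,
        "-P", start_codon_probs,
        genome_fasta, icm_model, tag,
    ]
    return cmd


# Hypothetical file names for a circular genome using translation table 11.
cmd = glimmer3_second_pass(
    "genome.fasta", "genome.icm", "run2",
    min_gene_len=180, translation_table=11, linear=False,
    mask_file="mask.coords", rbs_pwm="genome.motif",
    start_codon_probs="0.810,0.139,0.051",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming `glimmer3` is on the `PATH`; building a list rather than a shell string avoids quoting problems with file names.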








