Gene Trek in Prokaryote Space (GTPS)

Procedure of the GTPS database construction
GTPS is an acronym for Gene Trek in Prokaryote Space. GTPS is a database in which all prokaryotic organisms with determined genome sequences are re-annotated through various analyses. This page describes the re-annotation procedure, including the analysis programs and reference databases used. Please also refer to the GTPS paper.
Purpose of GTPS
Many complete genomes of Bacteria and Archaea are registered in DDBJ/EMBL/GenBank, which together form the INSD (International Nucleotide Sequence Database). Their annotation and DNA sequences can be obtained from the Genome Information Broker (GIB). However, the BLAST thresholds and BLAST reference databases used for annotation differ from genome to genome. As a result, the quality of annotation varies, which makes genome-wide analyses such as comparative genome analysis difficult. We at DDBJ therefore re-annotate ORFs (Open Reading Frames) and RNAs on the genome sequences of Bacteria and Archaea by using a common protocol.
Details of GTPS database constructing procedure

Activity name: Description
1. Masking of RNA and repeat region: Create mask regions that are excluded from ORF prediction. Mask regions are of several types: one is non-coding RNA regions and another is LTR (Long Terminal Repeat) repeat regions. Non-coding RNA is predicted with both tRNAscan-SE and the Rfam database. The mask regions are generated from both the RNA and the repeat regions.
2. ORF prediction: Predict ORF regions on the genome by using Glimmer.
3. Analysis of predicted ORFs: Analyze the predicted ORFs by using both BLAST and InterProScan.
4. Comparison between predicted ORF and INSD annotation: Compare the predicted ORFs with the CDS (coding sequence) features of INSD and assign flags such as "Complete match with INSD CDS", "Only 3'end match with INSD CDS" and "There is no INSD CDS matches in 3'end". All predicted ORFs are also examined to see whether their frame matches the INSD annotation of pseudogenes. Furthermore, INSD CDS regions that could not be predicted by Glimmer are extracted and analyzed with BLAST and InterProScan, in the same way as the predicted ORFs.
5. Modification of start position of predicted ORFs: Modify the start positions of ORFs, shortening them, to resolve overlaps among ORFs. ORFs whose length is changed are also analyzed with BLAST and InterProScan, in the same way as the predicted ORFs.
6. Grade classification of ORFs: All GTPS ORFs are classified by ORF certainty, using the results of BLAST and InterProScan.
7. Annotation of gene product name: Assign a gene product name to every GTPS ORF by referring to both the INSD annotation and the BLAST result. Motif names and GO descriptions are also provided by referring to the InterProScan result.
8. Prediction of IS region and IS name: Map IS sequences to the genomes. The IS sequences are derived from the GIB-IS database. Predict IS regions and assign an IS name to each IS region.
1. Masking of RNA and repeat region
Create mask regions that are excluded from ORF prediction. Mask regions are of several types: one is non-coding RNA regions and another is LTR (Long Terminal Repeat) repeat regions. Non-coding RNA is predicted with both tRNAscan-SE and the Rfam database. The mask regions are generated from both the RNA and the repeat regions.
Parent activity: GTPS database constructing procedure

Activity name: Description
Extraction of non-coding RNA annotation from INSD: Extract the annotation of non-coding RNA regions whose feature name is rRNA (ribosomal RNA), ncRNA (non-protein-coding RNA), misc_RNA or tmRNA (transfer-messenger RNA).
Extraction of repeat annotation from INSD: Extract the annotation of repeat regions whose feature name is LTR (Long Terminal Repeat).
Prediction of tRNA using tRNAscan-SE: Predict tRNA regions by using tRNAscan-SE and obtain the product name of each tRNA. Run tRNAscan-SE with the division parameter, which is 'Archaea' (-A) or 'Prokaryote' (-P).
   tRNAscan-SE <division (-A or -P)> <FASTA file>
The division information can be obtained from the WABI GIB service. Retrieve the tRNA locations on the genome and the product names from the tRNAscan-SE result. The annotation is as follows.
Prediction of non-coding RNA using Rfam: Extract prokaryote sequences from the Rfam database, then execute BLAST with the Rfam data as the query sequences and the genome sequence as the reference database.
   blastall -p blastn -d <database of genome> -i <sequence of Rfam> -e 1.0e-10 -m 8 -F F
Extract alignment regions from the BLAST result as non-coding RNA when the identity is 100% and the full length of the Rfam sequence is aligned. The Rfam name, ID and product name are assigned.

The annotation is as follows.
Integration of all mask regions: The RNA annotation of INSD, the repeat regions of INSD, the tRNAscan-SE result and the Rfam mapping result are integrated. The resulting mask regions are excluded from ORF prediction.
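The Rfam filtering rule in this section (100% identity over the full length of the Rfam query) can be sketched in Python. This is only an illustrative sketch, not the GTPS implementation: it assumes the standard 12-column tabular output of `blastall -m 8`, and the function name `full_length_perfect_hits` and the `query_lengths` dictionary (Rfam sequence ID to sequence length) are hypothetical.

```python
def full_length_perfect_hits(blast_lines, query_lengths):
    """Yield (query_id, subject_id, left, right, strand) for hits that are
    100% identical and span the whole query sequence.

    blast_lines: iterable of tab-separated lines in `blastall -m 8` format.
    query_lengths: dict of query ID -> query length (an assumption; the
    tabular output itself does not carry the query length).
    """
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        s_start, s_end = int(fields[8]), int(fields[9])
        covers_query = q_start == 1 and q_end == query_lengths[query_id]
        if identity == 100.0 and covers_query:
            # blastn reports minus-strand hits with subject start > subject end.
            strand = "+" if s_start <= s_end else "-"
            yield (query_id, subject_id,
                   min(s_start, s_end), max(s_start, s_end), strand)
```

The regions returned this way would then be added to the mask regions together with the INSD and tRNAscan-SE annotations.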
2. ORF prediction
Predict ORF regions on the genome by using Glimmer.
Parent activity: GTPS database constructing procedure

Activity name: Description
Glimmer with minimum ORF length of 180 bp: Execute Glimmer with a minimum ORF length of 180 bp.
Glimmer with minimum ORF length of 45 bp: Execute Glimmer again with a minimum ORF length of 45 bp.
Integration of the two Glimmer runs: Compare the results of the two Glimmer runs. ORFs whose start and end positions are both the same are merged into one. If the end position is the same but the start position differs, the longer ORF is adopted and used in the next activity. If the end positions differ, both ORFs are adopted and used in the next activity.
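The integration rule for the two Glimmer runs can be illustrated with a short Python sketch. The ORF representation, a `(start, end, strand)` tuple with `end` on the stop codon side, and the function name `integrate_glimmer_runs` are assumptions made for this example. ORFs that wrap around the origin of a circular genome are not handled.

```python
def integrate_glimmer_runs(run_180, run_45):
    """Merge two lists of ORFs given as (start, end, strand) tuples.

    ORFs that share the same end position and strand are collapsed into
    the longer one (identical ORFs collapse into one entry); ORFs with
    different end positions are all kept.
    """
    best = {}
    for start, end, strand in list(run_180) + list(run_45):
        key = (end, strand)
        length = abs(end - start) + 1
        if key not in best or length > abs(end - best[key][0]) + 1:
            best[key] = (start, end, strand)
    # Sort by leftmost genome coordinate for a stable, readable order.
    return sorted(best.values(), key=lambda orf: (min(orf[0], orf[1]), orf[2]))
```

For example, if the 180 bp run reports an ORF at 700..1000 and the 45 bp run reports 640..1000 on the same strand, only the longer ORF 640..1000 is passed to the next activity.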
3. Analysis of predicted ORFs
Analyze the predicted ORFs by using both BLAST and InterProScan.
Parent activity: GTPS database constructing procedure

Activity name: Description
Analysis using BLAST: Execute BLAST with the predicted ORFs as queries and the amino acid sequences of the DDBJ BCT division as the reference database.
   blastall -p blastp -e 0.001 -F F -d <database> -i <amino acid of predicted ORF> -o <result file>
Providing flags from BLAST result: Assign BLAST flags to all predicted ORFs to indicate the certainty of each gene. An alignment is considered a hit when it meets any one of the following requirements.
  • E-value is 1e-40 or less, identity is 30% or more, and the cover ratio of the alignment region to the full length of the query or subject is 70% or more.
  • E-value is 0.0001 or less, identity is 80% or more, and the cover ratio of the alignment region to the full length of the query or subject is 80% or more.
  • Identity is 90% or more and the cover ratio of the alignment region to the full length of the query or subject is 90% or more.
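The three hit criteria above can be written as a single predicate. The following Python sketch is only an illustration: the function name `is_blast_hit` is hypothetical, and "cover ratio of the query or subject" is read here as "either of the two ratios reaches the threshold", which is an assumption about the intended meaning.

```python
def is_blast_hit(evalue, identity, query_cover, subject_cover):
    """Return True if an alignment satisfies any of the three criteria.

    identity, query_cover and subject_cover are percentages (0-100).
    """
    # Assumption: "query or subject" means the larger of the two cover ratios.
    cover = max(query_cover, subject_cover)
    return (
        (evalue <= 1e-40 and identity >= 30 and cover >= 70)
        or (evalue <= 1e-4 and identity >= 80 and cover >= 80)
        or (identity >= 90 and cover >= 90)
    )
```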
BLAST flags
1: Only one INSD CDS is found. Its location on the genome is the same and its annotation is not function unknown*.
2: Only one INSD CDS is found. Its location on the genome is the same and its annotation is function unknown*.
3: Multiple INSD CDSs are found. One of them has the same location on the genome, and some of their annotations are not function unknown*.
4: Multiple INSD CDSs are found. None of them has the same location on the genome, and some of their annotations are not function unknown*.
5: Multiple INSD CDSs are found. One of them has the same location on the genome, and all of their annotations are function unknown*.
6: Multiple INSD CDSs are found. None of them has the same location on the genome, and all of their annotations are function unknown*.
7: No INSD CDS is found.
*function unknown
The product name contains a keyword such as 'unknown', 'hypothetical protein', 'probable ORF' or 'predicted protein'. Please refer to the page for details (Japanese only).
Analysis using InterProScan: Execute InterProScan on all predicted ORFs.
   iprscan -cli -altjobs -iprlookup -goterms -seqtype p -format raw -i <amino acid of ORF>
Providing flags from InterProScan result: Assign InterProScan flags based on the InterProScan result.
InterProScan flags
1: Motif region is 30% or more and the motif is not invalid*.
2: Motif region is less than 30% and the motif is not invalid*.
3: All motifs are invalid*.
4: There is no motif in the ORF.
*invalid
Please see the list of invalid motifs.
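The InterProScan flag assignment can be sketched as follows. The page does not say exactly how the "motif region" percentage is computed, so this Python sketch assumes it is the fraction of the ORF's amino acid length covered by the union of all valid motif regions. The function name, the motif tuple layout and the `invalid_ids` set are hypothetical.

```python
def interproscan_flag(orf_length, motifs, invalid_ids):
    """Return the InterProScan flag (1-4) for one ORF.

    orf_length: ORF length in amino acids.
    motifs: list of (motif_id, start, end), 1-based inclusive positions.
    invalid_ids: set of motif IDs regarded as invalid.
    """
    if not motifs:
        return 4
    valid = [(s, e) for motif_id, s, e in motifs if motif_id not in invalid_ids]
    if not valid:
        return 3
    covered = set()
    for start, end in valid:
        covered.update(range(start, end + 1))  # union of motif positions
    ratio = len(covered) / orf_length
    return 1 if ratio >= 0.30 else 2
```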
4. Comparison between predicted ORF and INSD annotation
Compare the predicted ORFs with the CDSs of the INSD annotation and assign flags such as "same as INSD CDS", "same as INSD CDS in terms of end position" and "there is no INSD CDS whose end position is the same as that of the predicted ORF". All predicted ORFs are also checked to see whether their frame matches that of a pseudogene. Furthermore, INSD CDSs that could not be predicted by Glimmer are extracted and analyzed with BLAST and InterProScan, in the same way as the predicted ORFs.
Parent activity: GTPS database constructing procedure

Activity name: Description
Comparison between all predicted ORFs and INSD CDS: Compare all predicted ORFs with the INSD CDSs and assign the following flags.
Flags for the predicted ORFs in terms of comparison with INSD CDS
1: Both the start and end positions are the same between the predicted ORF and the INSD CDS.
2: Only the end position is the same between the predicted ORF and the INSD CDS. The predicted ORF is longer than the INSD CDS.
3: Only the end position is the same between the predicted ORF and the INSD CDS. The predicted ORF is shorter than the INSD CDS.
4: There is no INSD CDS with the same end position.
P: The start or end position is the same between the predicted ORF and a pseudogene.
J: The start or end position is the same between the predicted ORF and an INSD CDS with a join location, or the predicted ORF overlaps an INSD CDS with a join location and the amino acid frame is the same.
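Flags 1 to 4 above depend only on start and end positions, so they can be sketched in a few lines of Python. This example is an illustration under assumptions: ORFs and CDSs are `(start, end, strand)` tuples with `end` on the stop codon side, the function name is hypothetical, and flags P and J are left out because they need the pseudogene and join annotations.

```python
def insd_comparison_flag(orf, cds_list):
    """Return "1", "2", "3" or "4" for a predicted ORF.

    orf and each entry of cds_list are (start, end, strand) tuples.
    """
    start, end, strand = orf
    for cds_start, cds_end, cds_strand in cds_list:
        if (cds_end, cds_strand) != (end, strand):
            continue  # end position (or strand) differs
        if cds_start == start:
            return "1"
        orf_length = abs(end - start) + 1
        cds_length = abs(cds_end - cds_start) + 1
        return "2" if orf_length > cds_length else "3"
    return "4"
```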

Comparison between all INSD CDS and predicted ORFs: Compare all INSD CDSs with the predicted ORFs and assign the following flags.
Flags for INSD CDS
J: CDS with a join location
7-1: The frame does not match any predicted ORF and the product name is function unknown*.
7-2: The frame does not match any predicted ORF and the product name is not function unknown*.
P: Pseudo CDS
*function unknown
The product name contains a keyword such as 'unknown', 'hypothetical protein', 'probable ORF' or 'predicted protein'. Please refer to the page for details.
Analysis of ORFs which cannot be predicted by Glimmer: Extract the INSD CDSs that could not be predicted by Glimmer, namely those whose flag is '7-1', '7-2' or 'J'. These CDSs are analyzed with BLAST and InterProScan, in the same way as the predicted ORFs.
5. Modification of start position of predicted ORFs
Modify the start positions of ORFs, shortening them, to resolve overlaps between ORFs. ORFs whose length is changed are analyzed with BLAST and InterProScan, in the same way as the predicted ORFs.
Parent activity: GTPS database constructing procedure

Activity name: Description
Extracting certain ORFs from predicted ORFs: Extract reliable ORFs from all predicted ORFs. The certainty is derived from BLAST and InterProScan: the BLAST flag is 1, 3 or 4 or the InterProScan flag is 1, and the flag for the predicted ORF in terms of comparison with INSD CDS is neither P nor J.
Extracting ORF pairs which are overlapped between ORFs: Extract pairs of ORFs that overlap each other by 30 bp or more, as follows.
Modification of start position of ORF: Modify the start position of an ORF, shortening it, to resolve the overlap between ORFs. The start position is corrected so that the motif region of the ORF is not removed.
Analysis of ORF whose start position is corrected: Analyze the ORFs whose start position was corrected by using BLAST and InterProScan, in the same way as the predicted ORFs.
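The overlap extraction step in this section (pairs of ORFs overlapping by 30 bp or more) can be sketched in Python. The ORF representation `(name, left, right)` in genome coordinates and the function name `overlapping_pairs` are assumptions for this example. How the start position is then moved depends on motif positions and is not shown.

```python
def overlapping_pairs(orfs, min_overlap=30):
    """Return (name_a, name_b, overlap_bp) for ORF pairs that overlap by
    at least min_overlap bp.

    orfs: list of (name, left, right), 1-based inclusive, left <= right.
    """
    ordered = sorted(orfs, key=lambda orf: orf[1])
    pairs = []
    for i, (name_a, left_a, right_a) in enumerate(ordered):
        for name_b, left_b, right_b in ordered[i + 1:]:
            if left_b > right_a:
                break  # all later ORFs start even further to the right
            overlap = min(right_a, right_b) - left_b + 1
            if overlap >= min_overlap:
                pairs.append((name_a, name_b, overlap))
    return pairs
```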
6. Grade classification of ORFs
All GTPS ORFs are classified in terms of certainty by using the results of BLAST and InterProScan. The GTPS ORFs consist of ORFs predicted by Glimmer, ORFs from INSD that could not be predicted by Glimmer, and ORFs whose start position was modified to resolve overlaps between predicted ORFs.
Parent activity: GTPS database constructing procedure

Activity name: Description
Grade classification by BLAST alignment length: All GTPS ORFs are classified by using the BLAST result. BLAST is executed against the amino acid sequences of the entire DDBJ BCT division.
Grade by BLAST alignment length
A: The cover ratio of both the query and the subject is 70% or more.
B: The cover ratio of either the query or the subject is 70% or more.
C: Neither A nor B.
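The A/B/C grade above reduces to two threshold checks. A minimal Python sketch, assuming cover ratios are given as percentages and using a hypothetical function name:

```python
def alignment_length_grade(query_cover, subject_cover, threshold=70.0):
    """Return "A", "B" or "C" from the query and subject cover ratios (%)."""
    query_ok = query_cover >= threshold
    subject_ok = subject_cover >= threshold
    if query_ok and subject_ok:
        return "A"
    if query_ok or subject_ok:
        return "B"
    return "C"
```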
Grade classification by annotation of BLAST subject: All GTPS ORFs are classified by using the BLAST result.
Grade by annotation of BLAST subject
1: The annotation of the subject is neither function unknown* nor membrane protein*.
2: The annotation of the subject is not function unknown*, but it is membrane protein*.
3: The annotations of all subjects are function unknown*.
4: No hit is found in the BLAST result.
*function unknown
The product name contains a keyword such as 'unknown', 'hypothetical protein', 'probable ORF' or 'predicted protein'. Please refer to the page for details.
*membrane protein
The product name contains a keyword such as 'inner-membrane protein', 'outer membrane protein' or 'integral-membrane protein'. Please refer to the file for details.
Grade classification by comparison with INSD CDS: All GTPS ORFs are classified in terms of comparison with INSD CDS.
Grade by comparison with INSD CDS
1: A predicted ORF or an ORF whose start position was modified. Both the start and end positions are the same as those of an INSD CDS.
2: A predicted ORF or an ORF whose start position was modified. Only the end position is the same as that of an INSD CDS.
3: A predicted ORF or an ORF whose start position was modified. There is no INSD CDS with the same end position.
4: An ORF that could not be predicted. It is derived from an INSD CDS.
Grade classification by InterProScan: All GTPS ORFs are classified by using the InterProScan result.
Grade by InterProScan
1: Motif region is 30% or more and the description of the motif does not contain the keyword 'unknown'.
2: Motif region is 30% or more and the descriptions of all motifs contain the keyword 'unknown'.
3: No motif region is found.
Integration of grade information: Integrate all of the above grade information and assign a grade from AAAA to X to every GTPS ORF.

The best grade, given to the most certain ORFs, is AAAA. In that case the grade by BLAST alignment length is A, the grade by annotation of BLAST subject is 1 and the grade by InterProScan is also 1. The grade by comparison with INSD CDS is then appended, so the final grade assigned to a GTPS ORF looks like AAAA1 or BBB2.
7. Annotation of gene product names
Assign a gene product name to every GTPS ORF by referring to both the INSD annotation and the BLAST result. Motif names and GO descriptions are also provided.
Parent activity: GTPS database constructing procedure

Activity name: Description
Providing gene product name from INSD annotation: If the end position of an ORF is the same as that of an INSD CDS, the product name of the CDS is used as the product name of the ORF.

However, the INSD product name is not used verbatim; it is processed as follows.
  • Modify the name to the standard expression. For example, "50s ribosomal protein L10" is changed to "50S ribosomal protein L10", replacing "50s" with "50S", and "16S ribobsomal RNA" is changed to "16S ribosomal RNA" to correct the misspelling "ribobsomal". The list of modifications can be obtained from here. The format of the file is "expression before modification"<TAB>"expression after modification".
  • Remove unnecessary descriptions. For example, "<number> aa long" is an unnecessary expression in a gene product name, so "254 aa long hypothetical enoyl-CoA hydratase" is corrected to "hypothetical enoyl-CoA hydratase". The unnecessary expressions to remove are defined by regular expressions and the file can be obtained from here.
  • Remove a leading or trailing period (.) or comma (,). For example, "haemolysin expression-modulating protein." is corrected to "haemolysin expression-modulating protein" by removing the final period. A pair of [ and ], a pair of ( and ), or single quotation marks ' at the first and last positions are also removed. Furthermore, consecutive spaces are replaced with a single space and the backquote character ` is replaced with an apostrophe '.
  • If the INSD annotation is invalid, such as "B1306.01 protein" or "Tgh005", the gene product name is provided by the next activity. Invalid product names are defined by regular expressions and the file can be obtained from here.
  • Compare the name with the list of function unknown names. If the name matches the list, it is renamed to "hypothetical protein". For example, "possible orf", "putative orf" and "probable orf" are renamed to "hypothetical protein". If the name does not match the list, it is assigned to the ORF as it is. The list of function unknown names can be obtained from here.
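The clean-up steps listed above can be illustrated with a small Python sketch. The lists and regular expressions below are hypothetical stand-ins built only from the examples on this page; the actual GTPS list files are not reproduced here, and the function name `clean_product_name` is an assumption.

```python
import re

# Hypothetical stand-ins for the downloadable GTPS lists.
STANDARD_FORMS = {"50s ribosomal": "50S ribosomal", "ribobsomal": "ribosomal"}
UNNECESSARY = [re.compile(r"\b\d+ aa long\b")]
INVALID = [re.compile(r"^B\d+\.\d+ protein$"), re.compile(r"^Tgh\d+$")]
FUNCTION_UNKNOWN = {"possible orf", "putative orf", "probable orf"}
ENCLOSING_PAIRS = (("[", "]"), ("(", ")"), ("'", "'"))

def clean_product_name(name):
    """Return the cleaned product name, or None if the name is invalid
    and must be taken from the BLAST result instead."""
    for before, after in STANDARD_FORMS.items():
        name = name.replace(before, after)
    for pattern in UNNECESSARY:
        name = pattern.sub("", name)
    name = name.replace("`", "'")
    name = re.sub(r"\s+", " ", name).strip()   # collapse repeated spaces
    name = name.strip(".,").strip()            # leading/trailing . and ,
    for opener, closer in ENCLOSING_PAIRS:
        if len(name) >= 2 and name.startswith(opener) and name.endswith(closer):
            name = name[1:-1].strip()
    if any(pattern.search(name) for pattern in INVALID):
        return None
    if name.lower() in FUNCTION_UNKNOWN:
        return "hypothetical protein"
    return name
```

Returning `None` for invalid names keeps the decision of what to do next (falling back to the BLAST result) outside this function.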
Providing gene product name from BLAST result: Try to assign a gene product name from the BLAST result if the cover ratio of both the subject and the query is 70% or more. BLAST is executed against all amino acid sequences of the DDBJ BCT division. The product name is processed in the same way as when it is taken from the INSD annotation. If there are multiple BLAST hits and one of the subjects has a product name other than 'hypothetical protein', that name is assigned to the ORF. If all subjects are annotated as 'hypothetical protein', 'hypothetical protein' is assigned. If all subjects have an invalid product name such as 'B1306.01 protein' or 'Tgh005', or if no hit is found in the BLAST result, 'predicted in CGM' is assigned to the ORF.
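The selection rule for BLAST-derived names can be sketched as below. This Python example makes several assumptions: product names have already been cleaned, an invalid name is represented as `None`, and when several informative names exist the first one in BLAST order is taken (the page does not say which one is chosen). The function name is hypothetical.

```python
def product_name_from_blast(hits, min_cover=70.0):
    """Choose a product name from BLAST hits.

    hits: list of (product_name_or_None, query_cover, subject_cover),
    in BLAST output order; cover ratios are percentages.
    """
    names = [
        name
        for name, query_cover, subject_cover in hits
        if name is not None
        and query_cover >= min_cover
        and subject_cover >= min_cover
    ]
    for name in names:
        if name != "hypothetical protein":
            return name  # assumption: first informative name wins
    if names:
        return "hypothetical protein"
    return "predicted in CGM"
```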
Providing annotation from InterProScan result: Provide the motif name and InterPro ID from the InterProScan result. In addition, the Gene Ontology description and ID corresponding to each InterPro ID are provided by using both interpro2go and the GO database.
Integration of annotation: Flags and annotation from BLAST and InterProScan are integrated. The flag information is summarized as follows.
Flag information for ORF

a
  • N: Normal predicted ORF whose frame does not match an INSD CDS with a join location or a pseudogene
  • P: Predicted ORF whose end position is the same as that of an INSD pseudogene CDS
  • J: Predicted ORF whose end position is the same as that of an INSD CDS with a join location
The above information is derived from the flags for the predicted ORFs in terms of comparison with INSD CDS.
b
  • 1: Glimmer with a minimum ORF length of 180 bp
  • 2: Glimmer with a minimum ORF length of 45 bp
c
  • 1: Identical in the two Glimmer results.
  • 2: Only the end position is the same as in the other Glimmer result, which used a different minimum ORF length. This ORF is longer than the one in the other result.
  • 3: Only the end position is the same as in the other Glimmer result, which used a different minimum ORF length. This ORF is shorter than the one in the other result.
  • 4: The end position differs between the two Glimmer results.
The above flag is derived from the integration of the Glimmer results.
d (This flag is not currently used.)
  • 1: RBS exists.
  • 2: RBS does not exist.
  • 3: Not a target of the RBS search
e
  • 1: Same as INSD CDS.
  • 2: Only the end position is the same as that of an INSD CDS.
  • 3: There is no INSD CDS with the same end position.
f
  • Derived from BLAST flags.
g
  • Derived from InterProScan flags.

All information, including the flags, is summarized in flat file format as follows.
8. Prediction of IS region and IS name
Predict IS (insertion sequence) regions and assign IS names by mapping IS sequences to the genomes. The IS sequences are derived from the GIB-IS database.
Parent activity: GTPS database constructing procedure

Activity name: Description
BLAST using both IS sequence and genome: Execute BLAST with the genome sequence as the reference database and the IS sequences as queries. The IS sequences are all entries of the GIB-IS database.
   blastall -p blastn -e 0.001 -F F -d <database of genome sequence> -i <IS sequence> -o <result file>
Mapping IS sequence to genome: Determine the IS regions on the genome by mapping the IS sequences to it. The strand of the IS annotation is direct if the alignment is on the complementary side. The thresholds for mapping are that the ratio of the alignment length to the full length of the query (cover ratio) is 90% or more and that the identity of the alignment region is 90% or more.
Integration of overlapped IS regions: IS sequences may be mapped with overlaps, as in the following diagram. Such overlapping regions are merged into one IS region.
Providing IS name: Assign an IS name to each IS region on the genome. The repeat_region feature is used and the annotation is summarized as follows.
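The integration of overlapping IS regions described in this section is an interval merge. The Python sketch below assumes each mapped region is a `(left, right, is_name)` tuple that already passed the 90% cover ratio and 90% identity thresholds. How GTPS chooses the name of a merged region is not stated on this page, so the sketch simply collects every contributing name. The function name is hypothetical.

```python
def merge_is_regions(regions):
    """Merge overlapping IS regions on one genome.

    regions: list of (left, right, is_name), 1-based inclusive coordinates.
    Returns a list of (left, right, [is_names]) sorted by position.
    """
    merged = []
    for left, right, name in sorted(regions):
        if merged and left <= merged[-1][1]:
            prev_left, prev_right, names = merged[-1]
            if name not in names:
                names = names + [name]
            merged[-1] = (prev_left, max(prev_right, right), names)
        else:
            merged.append((left, right, [name]))
    return merged
```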
Build a training model and predict ORF regions by using Glimmer 3.02. The following procedure is almost the same as the script g3-iterated.csh included in the Glimmer package. The differences are the -t option of the long-orfs command, used when extracting ORFs for the training model, and the minimum ORF length parameter. The translation table number and molecular form parameters are also changed.
When the paper was published, ORF regions were predicted by using both Glimmer2 and RBSfinder. However, we have used only Glimmer3 since 2006 because Glimmer3 can build RBS training models itself.
Parent activity: 2. ORF prediction

Activity name: Description
Making sequence for training model: Extract the positions on the genome of ORFs for the training model. These ORFs are long and do not overlap each other.
   long-orfs -t 1.08 --no_header <FASTA file of genome> <tag name>.longorfs
Extracting sequences for training model: Extract the sequences for the training model by using the genome and the position information from the previous activity.
   extract -t <FASTA file of genome> <tag name>.longorfs --nostop > <tag name>.train
Building training model: The training model, an ICM (Interpolated Context Model), is built from the sequences of the previous activity.
   build-icm -r <model file> < <tag name>.train
1st Glimmer: The first prediction of ORF regions is executed by using the training model of the previous activity. The maximum overlap length of ORFs (-o) is 50 bp and the threshold (-t) is 30. Specify the -l option only if the molecular form is linear; it is not needed if the molecular form is circular. The translation table number can be retrieved from the WABI TxSearch service.
   glimmer3 -o 50 -t 30 -g <ORF minimum length (180 or 45)> [-l] -z <translation table number (11 or 4)> <FASTA file of genome> <file of training model> <tag name>
Extracting ORF positions from the result of 1st Glimmer: Extract the positions of ORFs on the genome from the result of the 1st Glimmer.
   tail -n +2 <result of 1st Glimmer> > <file of ORF positions on genome>
Building RBS training model: Extract the 25 bp upstream of each predicted ORF from the 1st Glimmer result and build a training model for the RBS (Ribosome Binding Site) of 6 residues. The training model for the RBS is called a Position Weight Matrix.
   upstream-coords.awk 25 0 <file of ORF positions on genomes> | extract <FASTA file of genome> - > <tag name>.upstream; elph <tag name>.upstream LEN=6 | get-motif-counts.awk > <training model for RBS>
The training model for the RBS (Position Weight Matrix) is generated as follows.
Research for start codon distribution: The start codon distribution is examined by using the predicted ORFs of the 1st Glimmer result. The distribution of the start codons atg, gtg and ttg is obtained as values such as 0.810, 0.139 and 0.051.
   start-codon-distrib -3 <FASTA file of genome> <ORF positions of 1st Glimmer result>
2nd Glimmer: Execute the 2nd Glimmer by using the training model (ICM), the training model for the RBS (Position Weight Matrix) and the start codon distribution from the previous activities. Specify the -l option only if the molecular form is linear; it is not needed if the molecular form is circular.
   glimmer3 -o 50 -t 30 -g <ORF minimum length (180 or 45)> [-l] -i <mask file> -z <translation table number (11 or 4)> -b <training model for RBS> -P <start codon distribution> <FASTA file of genome> <training model> <result file>