EMBOSS: cai manual

cai

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Calculate codon adaptation index

Description

cai calculates the Codon Adaptation Index (CAI) for a nucleotide sequence, given a reference codon usage table. The CAI is a simple, effective measure of synonymous codon usage bias. It assesses the extent to which selection has been effective in moulding the pattern of codon usage. In that respect it is useful for predicting the level of expression of a gene, for assessing the adaptation of viral genes to their hosts, and for making comparisons of codon usage in different organisms. The index may also give an approximate indication of the likely success of heterologous gene expression.

Algorithm

The CAI index uses a reference set of highly expressed genes from a species to assess the relative merits of each codon. A score for a gene sequence is calculated from the frequency of use of all codons in that gene sequence.
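
As an illustration of the idea, the following Python sketch follows the published definition of Sharp and Li (1987): each codon gets a relative adaptiveness w (its reference count divided by the count of the most-used synonymous codon), and the index is the geometric mean of w over the codons of the gene. It is not the cai source code. The tiny reference table, the function names and the handling of unknown codons are assumptions made for this example only.

```python
import math
from collections import defaultdict

# Hypothetical reference counts for two amino acids only (NOT Eyeast_cai.cut).
REFERENCE_COUNTS = {
    "AAA": 20, "AAG": 80,  # lysine
    "GAA": 90, "GAG": 10,  # glutamate
}
CODON_TO_AA = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}


def relative_adaptiveness(counts, codon_to_aa):
    """Return w = count / (highest count among synonymous codons)."""
    best = defaultdict(int)
    for codon, n in counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[codon_to_aa[codon]] for codon, n in counts.items()}


def cai(sequence, weights):
    """Geometric mean of w over the in-frame codons of `sequence`.

    Assumption of this sketch: codons missing from the table, or with
    zero weight, are skipped rather than scored.
    """
    seq = sequence.upper()
    logs = []
    for i in range(0, len(seq) - 2, 3):
        w = weights.get(seq[i:i + 3])
        if w:
            logs.append(math.log(w))
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS, CODON_TO_AA)
print(cai("AAGGAA", WEIGHTS))  # only the most-used codons
print(cai("AAAGAG", WEIGHTS))  # only the less-used codons
```

A gene built only from the most-used codon of each amino acid scores 1.0 under this definition, and the score falls towards 0 as rarer codons are used.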

Usage

Here is a sample session with cai


% cai TEMBL:AB009602 
Calculate codon adaptation index
Codon usage file [Eyeast_cai.cut]: 
Output file [ab009602.cai]:


Command line arguments

Calculate codon adaptation index
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-seqall]            seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
   -cfile              codon      [Eyeast_cai.cut] Codon usage table name
  [-outfile]           outfile    [*.cai] Output file name

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-seqall" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -scircular1         boolean    Sequence is circular
   -squick1            boolean    Read id and sequence only
   -sformat1           string     Input sequence format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-cfile" associated qualifiers
   -format             string     Data format

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier	Type	Description	Allowed values	Default
Standard (Mandatory) qualifiers
[-seqall] (Parameter 1)	seqall	Nucleotide sequence(s) filename and optional format, or reference (input USA)	Readable sequence(s)	Required
-cfile	codon	Codon usage table name	Codon usage file in EMBOSS data path	Eyeast_cai.cut
[-outfile] (Parameter 2)	outfile	Output file name	Output file	<*>.cai
Additional (Optional) qualifiers
(none)
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-seqall" associated seqall qualifiers
-sbegin1 -sbegin_seqall	integer	Start of each sequence to be used	Any integer value	0
-send1 -send_seqall	integer	End of each sequence to be used	Any integer value	0
-sreverse1 -sreverse_seqall	boolean	Reverse (if DNA)	Boolean value Yes/No	N
-sask1 -sask_seqall	boolean	Ask for begin/end/reverse	Boolean value Yes/No	N
-snucleotide1 -snucleotide_seqall	boolean	Sequence is nucleotide	Boolean value Yes/No	N
-sprotein1 -sprotein_seqall	boolean	Sequence is protein	Boolean value Yes/No	N
-slower1 -slower_seqall	boolean	Make lower case	Boolean value Yes/No	N
-supper1 -supper_seqall	boolean	Make upper case	Boolean value Yes/No	N
-scircular1 -scircular_seqall	boolean	Sequence is circular	Boolean value Yes/No	N
-squick1 -squick_seqall	boolean	Read id and sequence only	Boolean value Yes/No	N
-sformat1 -sformat_seqall	string	Input sequence format	Any string
-iquery1 -iquery_seqall	string	Input query fields or ID list	Any string
-ioffset1 -ioffset_seqall	integer	Input start position offset	Any integer value	0
-sdbname1 -sdbname_seqall	string	Database name	Any string
-sid1 -sid_seqall	string	Entryname	Any string
-ufo1 -ufo_seqall	string	UFO features	Any string
-fformat1 -fformat_seqall	string	Features format	Any string
-fopenfile1 -fopenfile_seqall	string	Features file name	Any string
"-cfile" associated codon qualifiers
-format	string	Data format	Any string
"-outfile" associated outfile qualifiers
-odirectory2 -odirectory_outfile	string	Output directory	Any string
General qualifiers
-auto	boolean	Turn off prompts	Boolean value Yes/No	N
-stdout	boolean	Write first file to standard output	Boolean value Yes/No	N
-filter	boolean	Read first file from standard input, write first file to standard output	Boolean value Yes/No	N
-options	boolean	Prompt for standard and additional values	Boolean value Yes/No	N
-debug	boolean	Write debug output to program.dbg	Boolean value Yes/No	N
-verbose	boolean	Report some/full command line options	Boolean value Yes/No	Y
-help	boolean	Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose	Boolean value Yes/No	N
-warning	boolean	Report warnings	Boolean value Yes/No	Y
-error	boolean	Report errors	Boolean value Yes/No	Y
-fatal	boolean	Report fatal errors	Boolean value Yes/No	Y
-die	boolean	Report dying program messages	Boolean value Yes/No	Y
-version	boolean	Report version number and exit	Boolean value Yes/No	N
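
For scripted use, the qualifiers listed above can be combined into a non-interactive command line. The Python sketch below only assembles the argument list. The helper name build_cai_command is hypothetical, and the sketch assumes that cai is installed and on the PATH when the command is eventually run.

```python
def build_cai_command(sequence_usa, codon_table="Eyeast_cai.cut", outfile=None):
    """Assemble an argument list for a prompt-free cai run.

    Uses only qualifiers documented above: -seqall, -cfile, -outfile, -auto.
    """
    argv = ["cai", "-seqall", sequence_usa, "-cfile", codon_table]
    if outfile is not None:
        argv += ["-outfile", outfile]
    argv.append("-auto")  # turn off prompts
    return argv


print(" ".join(build_cai_command("TEMBL:AB009602", outfile="ab009602.cai")))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a system where EMBOSS is available.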

Input file format

cai reads a nucleic acid sequence of a gene.

Input files for usage example

Database entry: TEMBL:AB009602

ID   AB009602; SV 1; linear; mRNA; STD; FUN; 561 BP.
XX
AC   AB009602;
XX
DT   15-DEC-1997 (Rel. 53, Created)
DT   14-APR-2005 (Rel. 83, Last updated, Version 2)
XX
DE   Schizosaccharomyces pombe mRNA for MET1 homolog, partial cds.
XX
KW   MET1 homolog.
XX
OS   Schizosaccharomyces pombe (fission yeast)
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae;
OC   Schizosaccharomyces.
XX
RN   [1]
RP   1-561
RA   Kawamukai M.;
RT   ;
RL   Submitted (07-DEC-1997) to the INSDC.
RL   Makoto Kawamukai, Shimane University, Life and Environmental Science; 1060
RL   Nishikawatsu, Matsue, Shimane 690, Japan
RL   (E-mail:kawamuka@life.shimane-u.ac.jp, Tel:0852-32-6587, Fax:0852-32-6499)
XX
RN   [2]
RP   1-561
RA   Kawamukai M.;
RT   "S.pmbe MET1 homolog";
RL   Unpublished.
XX
DR   EnsemblGenomes; SPCC1739.06c; Schizosaccharomyces_pombe.
DR   EnsemblGenomes; SPCC1739.06c.1; Schizosaccharomyces_pombe.
DR   PomBase; SPCC1739.06c.
DR   PomBase; SPCC1739.06c.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..561
FT                   /organism="Schizosaccharomyces pombe"
FT                   /mol_type="mRNA"
FT                   /clone_lib="pGAD GH"
FT                   /db_xref="taxon:4896"
FT   CDS             <1..275
FT                   /codon_start=3
FT                   /transl_table=1
FT                   /product="MET1 homolog"
FT                   /db_xref="GOA:O74468"
FT                   /db_xref="InterPro:IPR000878"
FT                   /db_xref="InterPro:IPR003043"
FT                   /db_xref="InterPro:IPR006366"
FT                   /db_xref="InterPro:IPR012066"
FT                   /db_xref="InterPro:IPR014776"
FT                   /db_xref="InterPro:IPR014777"
FT                   /db_xref="InterPro:IPR016040"
FT                   /db_xref="UniProtKB/Swiss-Prot:O74468"
FT                   /protein_id="BAA23999.1"
FT                   /translation="SMPKIPSFVPTQTTVFLMALHRLEILVQALIESGWPRVLPVCIAE
FT                   RVSCPDQRFIFSTLEDVVEEYNKYESLPPGLLITGYSCNTLRNTA"
XX
SQ   Sequence 561 BP; 135 A; 106 C; 98 G; 222 T; 0 other;
     gttcgatgcc taaaatacct tcttttgtcc ctacacagac cacagttttc ctaatggctt        60
     tacaccgact agaaattctt gtgcaagcac taattgaaag cggttggcct agagtgttac       120
     cggtttgtat agctgagcgc gtctcttgcc ctgatcaaag gttcattttc tctactttgg       180
     aagacgttgt ggaagaatac aacaagtacg agtctctccc ccctggtttg ctgattactg       240
     gatacagttg taataccctt cgcaacaccg cgtaactatc tatatgaatt attttccctt       300
     tattatatgt agtaggttcg tctttaatct tcctttagca agtcttttac tgttttcgac       360
     ctcaatgttc atgttcttag gttgttttgg ataatatgcg gtcagtttaa tcttcgttgt       420
     ttcttcttaa aatatttatt catggtttaa tttttggttt gtacttgttc aggggccagt       480
     tcattattta ctctgtttgt atacagcagt tcttttattt ttagtatgat tttaatttaa       540
     aacaattcta atggtcaaaa a                                                 561
//

Output file format

cai writes the Codon Adaptation Index to the output file.

Output files for usage example

File: ab009602.cai

Sequence: AB009602 CAI: 0.188
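
A minimal Python sketch for reading such output into a script is given below. It assumes that every result line has exactly the form shown above, with one line per input sequence. The function name parse_cai_output is made up for this example.

```python
import re

# Matches lines of the form "Sequence: AB009602 CAI: 0.188".
CAI_LINE = re.compile(r"^Sequence:\s+(\S+)\s+CAI:\s+([0-9.]+)$")


def parse_cai_output(text):
    """Return a dict mapping sequence name to its CAI value."""
    results = {}
    for line in text.splitlines():
        match = CAI_LINE.match(line.strip())
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


print(parse_cai_output("Sequence: AB009602 CAI: 0.188\n"))
```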

Data files

cai requires a reference codon usage table prepared from a set of genes that are known to be highly expressed. The table is specified with the -cfile option and must exist in the EMBOSS data directory. The default codon usage table, Eyeast_cai.cut, is the standard set of codon frequencies for highly expressed Saccharomyces cerevisiae genes. Another table, Eschpo_cai.cut, was prepared from a set of Schizosaccharomyces pombe genes by Peter Rice for the S. pombe sequencing team at the Sanger Centre, and is available in the EMBOSS data directory. You should prepare your own codon usage table for your organism of interest, for example with cusp.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata
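
The search order listed above can be mimicked in a few lines of Python, for example to check which copy of a user-supplied data file would be found first. This is an illustrative sketch, not EMBOSS code. The function names are hypothetical, the sketch covers only the four user directories listed here, and it assumes that the first match wins.

```python
from pathlib import Path


def user_data_search_path(cwd=None, home=None):
    """Return the four user directories in the order listed above."""
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    return [cwd, cwd / ".embossdata", home, home / ".embossdata"]


def find_user_data_file(name, cwd=None, home=None):
    """Return the first existing copy of `name`, or None if there is none."""
    for directory in user_data_search_path(cwd, home):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


print(find_user_data_file("Eyeast_cai.cut"))
```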

Notes

Codons are nucleotide triplets, each of which encodes an amino acid residue in a polypeptide chain. There are four possible nucleotides in DNA: adenine (A), guanine (G), cytosine (C) and thymine (T). There are therefore 64 possible triplets to encode the 20 amino acids plus the translation termination signal. The encoding is thus redundant, with all but two amino acids (methionine and tryptophan in the standard genetic code) coded for by more than one triplet. Organisms often have a particular preference for one of the possible codons for a given amino acid.
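
The redundancy of the code can be checked with a short Python sketch. It builds the standard genetic code (translation table 1, as used by the example entry above) from the conventional 64-character amino acid string in T, C, A, G base order and counts the codons per residue. The variable names are chosen for this example only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4 x 4 x 4 triplets to its residue.
GENETIC_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_residue = Counter(GENETIC_CODE.values())
stop_codons = codons_per_residue.pop("*")
single_codon = sorted(aa for aa, n in codons_per_residue.items() if n == 1)

print(len(GENETIC_CODE), "codons,", len(codons_per_residue), "amino acids,",
      stop_codons, "stop codons")
print("amino acids with a single codon:", single_codon)
```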

Codon preferences reflect a balance between mutational bias and selection for efficiency of translation. In fast-growing microorganisms there are optimal codons that reflect the composition of the genomic tRNA pool and probably help achieve faster translation rates and high accuracy. Such selection is expected to be strong in highly expressed genes, as is the case for Escherichia coli or Saccharomyces cerevisiae. In contrast, codon usage optimization is normally absent in organisms with slower growth rates such as Homo sapiens (human), where codon preferences are determined by mutational biases characteristic of a particular genome.

Various factors are thought to influence codon usage bias in bacteria, including the gene expression level already mentioned, %G+C composition (reflecting horizontal gene transfer or mutational bias), GC skew (reflecting strand-specific mutational bias), amino acid conservation, protein hydropathy, transcriptional selection, RNA stability, and optimal growth temperature.

Various methods have been used to analyze codon usage bias. CAI and methods such as the 'frequency of optimal codons' (Fop) are commonly used to predict gene expression levels. Others such as the 'effective number of codons' (Nc) and Shannon entropy are used to measure codon usage evenness, whereas multivariate statistical methods, including correspondence analysis and principal component analysis, may be used to analyze variations in codon usage between genes.

References

Sharp PM., Li W-H. "The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications." Nucleic Acids Research 1987 vol 15, pp 1281-1295.
Synonymous codon usage in bacteria. Curr Issues Mol Biol. 2001 Oct;3(4):91-7.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name	Description
chips	Calculate Nc codon usage statistic
codcmp	Codon usage table comparison
codcopy	Copy and reformat a codon usage table
cusp	Create a codon usage table from nucleotide sequence(s)
syco	Draw synonymous codon usage statistic plot for a nucleotide sequence

Author(s)

Alan Bleasby
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

Written (March 2001) - Alan Bleasby.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None
