infoassembly

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Display information about assemblies

Description

infoassembly writes statistics for a sequence assembly.

The initial release is a basic version. More output can be added. Please contact the authors with suggestions.

Algorithm

None.

Usage

Here is a sample session with infoassembly


% infoassembly sam::samspec1.4example.sam stdout -auto 


Read distribution:
------------------

READ_CATEGORY	#READS	Av_Read_Length
all reads	6	10.33
first of pair 	1	9.00
second of pair	1	17.00
pair          	2	13.00
unpaired      	4	9.00

Contig stats:
-------------

CONTIG	LENGTH	Mean_QUAL	#READS	Max DEPTH	Mean_DEPTH	%GC
ref	45	0.00	6	0	0.00	0.00

Go to the input files for this example

Example 2


% infoassembly sam::xxx.sam stdout -auto -qual qualsdist.txt 

Phred encoding: Sanger / Illumina 1.9

Read distribution:
------------------

READ_CATEGORY	#READS	Av_Read_Length
all reads	2	35.00
first of pair 	1	35.00
second of pair	1	35.00
pair          	2	35.00
unpaired      	0

Contig stats:
-------------

CONTIG	LENGTH	Mean_QUAL	#READS	Max DEPTH	Mean_DEPTH	%GC
chr20	62435964	25.36	2	2	1.97	0.00

Go to the input files for this example
Go to the output files for this example

Example 3


% infoassembly sam::../data/samspec1.4example.ref.fasta 


Read distribution:
------------------

READ_CATEGORY	#READS	Av_Read_Length
all reads	6	10.33
first of pair 	1	9.00
second of pair	1	17.00
pair          	2	13.00
unpaired      	4	9.00

Contig stats:
-------------

CONTIG	LENGTH	Mean_QUAL	#READS	Max DEPTH	Mean_DEPTH	%GC
ref	45	0.00	6	3	1.82	46.67

Go to the input files for this example

Example 4


% infoassembly sam::../data/samspec1.4example.ref.fasta -gc gcbias.txt 


Read distribution:
------------------

READ_CATEGORY	#READS	Av_Read_Length
all reads	6	10.33
first of pair 	1	9.00
second of pair	1	17.00
pair          	2	13.00
unpaired      	4	9.00

Contig stats:
-------------

CONTIG	LENGTH	Mean_QUAL	#READS	Max DEPTH	Mean_DEPTH	%GC
ref	45	0.00	6	3	1.82	46.67

Go to the output files for this example

Command line arguments

Display information about assemblies
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-assembly]          assembly   (no help text) assembly value
  [-outassembly]       outassembly (no help text) outassembly value
  [-gcbiasmetricsoutfile] outfile    [*.infoassembly] GC bias metrics
  [-qualvaluesdistoutfile] outfile    [*.infoassembly] Distribution of quality
                                  values for the reads

   Additional (Optional) qualifiers:
   -refsequence        seqset     Reference sequences in the assembly
   -windowsize         integer    [100] The size of windows on the genome that
                                  are used to bin reads. (Any integer value)
   -bisulfite          boolean    [N] If this is true, it is assumed that the
                                  reads were bisulfite treated

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-assembly" associated qualifiers
   -cbegin1            integer    Start of the contig/consensus sequences
   -cend1              integer    End of the contig/consensus sequences
   -iformat1           string     Input assembly format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -idbname1           string     User-provided database name

   "-refsequence" associated qualifiers
   -sbegin             integer    Start of each sequence to be used
   -send               integer    End of each sequence to be used
   -sreverse           boolean    Reverse (if DNA)
   -sask               boolean    Ask for begin/end/reverse
   -snucleotide        boolean    Sequence is nucleotide
   -sprotein           boolean    Sequence is protein
   -slower             boolean    Make lower case
   -supper             boolean    Make upper case
   -scircular          boolean    Sequence is circular
   -squick             boolean    Read id and sequence only
   -sformat            string     Input sequence format
   -iquery             string     Input query fields or ID list
   -ioffset            integer    Input start position offset
   -sdbname            string     Database name
   -sid                string     Entryname
   -ufo                string     UFO features
   -fformat            string     Features format
   -fopenfile          string     Features file name

   "-outassembly" associated qualifiers
   -odirectory2        string     Output directory
   -oformat2           string     Assembly output format

   "-gcbiasmetricsoutfile" associated qualifiers
   -odirectory3        string     Output directory

   "-qualvaluesdistoutfile" associated qualifiers
   -odirectory4        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier Type Description Allowed values Default
Standard (Mandatory) qualifiers
[-assembly]
(Parameter 1)
assembly (no help text) assembly value Assembly of sequence reads  
[-outassembly]
(Parameter 2)
outassembly (no help text) outassembly value Assembly of sequence reads  
[-gcbiasmetricsoutfile]
(Parameter 3)
outfile GC bias metrics Output file <*>.infoassembly
[-qualvaluesdistoutfile]
(Parameter 4)
outfile Distribution of quality values for the reads Output file <*>.infoassembly
Additional (Optional) qualifiers
-refsequence seqset Reference sequences in the assembly Readable set of sequences Required
-windowsize integer The size of windows on the genome that are used to bin reads. Any integer value 100
-bisulfite boolean If this is true, it is assumed that the reads were bisulfite treated Boolean value Yes/No No
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-assembly" associated assembly qualifiers
-cbegin1
-cbegin_assembly
integer Start of the contig/consensus sequences Any integer value 0
-cend1
-cend_assembly
integer End of the contig/consensus sequences Any integer value 0
-iformat1
-iformat_assembly
string Input assembly format Any string  
-iquery1
-iquery_assembly
string Input query fields or ID list Any string  
-ioffset1
-ioffset_assembly
integer Input start position offset Any integer value 0
-idbname1
-idbname_assembly
string User-provided database name Any string  
"-refsequence" associated seqset qualifiers
-sbegin integer Start of each sequence to be used Any integer value 0
-send integer End of each sequence to be used Any integer value 0
-sreverse boolean Reverse (if DNA) Boolean value Yes/No N
-sask boolean Ask for begin/end/reverse Boolean value Yes/No N
-snucleotide boolean Sequence is nucleotide Boolean value Yes/No N
-sprotein boolean Sequence is protein Boolean value Yes/No N
-slower boolean Make lower case Boolean value Yes/No N
-supper boolean Make upper case Boolean value Yes/No N
-scircular boolean Sequence is circular Boolean value Yes/No N
-squick boolean Read id and sequence only Boolean value Yes/No N
-sformat string Input sequence format Any string  
-iquery string Input query fields or ID list Any string  
-ioffset integer Input start position offset Any integer value 0
-sdbname string Database name Any string  
-sid string Entryname Any string  
-ufo string UFO features Any string  
-fformat string Features format Any string  
-fopenfile string Features file name Any string  
"-outassembly" associated outassembly qualifiers
-odirectory2
-odirectory_outassembly
string Output directory Any string  
-oformat2
-oformat_outassembly
string Assembly output format Any string  
"-gcbiasmetricsoutfile" associated outfile qualifiers
-odirectory3
-odirectory_gcbiasmetricsoutfile
string Output directory Any string  
"-qualvaluesdistoutfile" associated outfile qualifiers
-odirectory4
-odirectory_qualvaluesdistoutfile
string Output directory Any string  
General qualifiers
-auto boolean Turn off prompts Boolean value Yes/No N
-stdout boolean Write first file to standard output Boolean value Yes/No N
-filter boolean Read first file from standard input, write first file to standard output Boolean value Yes/No N
-options boolean Prompt for standard and additional values Boolean value Yes/No N
-debug boolean Write debug output to program.dbg Boolean value Yes/No N
-verbose boolean Report some/full command line options Boolean value Yes/No Y
-help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose Boolean value Yes/No N
-warning boolean Report warnings Boolean value Yes/No Y
-error boolean Report errors Boolean value Yes/No Y
-fatal boolean Report fatal errors Boolean value Yes/No Y
-die boolean Report dying program messages Boolean value Yes/No Y
-version boolean Report version number and exit Boolean value Yes/No N

Input file format

The input is a standard EMBOSS assembly query.

The major assembly sources are files in SAM and BAM format. infoassembly reads a sequnece assembly.

Input files for usage example

File: samspec1.4example.sam

@CO	example from SAM format specification v1.4
@HD	VN:1.3	SO:coordinate
@SQ	SN:ref	LN:45
r001	163	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r002	0	ref	9	30	3S6M1P1I4M	*	0	0	AAAAGATAAGGATA	*
r003	0	ref	9	30	5H6M	*	0	0	AGCTAA	*	NM:i:1
r004	0	ref	16	30	6M14N5M	*	0	0	ATAGCTTCAGC	*
r003	16	ref	29	30	6H5M	*	0	0	TAGGC	*	NM:i:0
r001	83	ref	37	30	9M	=	7	-39	CAGCGCCAT	*

Input files for usage example 2

File: xxx.sam

@HD	VN:1.0
@SQ	SN:chr20	LN:62435964
@RG	ID:L1	PU:SC_1_10	LB:SC_1	SM:NA12891
@RG	ID:L2	PU:SC_2_12	LB:SC_2	SM:NA12891
read_28833_29006_6945	99	chr20	28833	20	10M1D25M	=	28993	195	AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG	<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<	NM:i:1	RG:Z:L1
read_28701_28881_323b	147	chr20	28834	30	35M	=	28701	-168	ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA	<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<	MF:i:18	RG:Z:L2

Input files for usage example 3

File: samspec1.4example.ref.fasta

>ref
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT

Output file format

infoassembly writes a simple text report.

Output files for usage example 2

File: qualsdist.txt

#QUALITY: Phred quality score
#BASES: Number of bases with the quality score

QUALITY	#BASES
5	1
11	2
14	1
17	2
21	1
22	2
24	1
25	2
26	5
27	53

Output files for usage example 4

File: gcbias.txt

# GC bias metrics as described in GcBiasDetailMetrics class
# in picard  project (picard.sf.net).
# The mean base quality are determined via the error rate
# of all bases of all reads that are assigned to windows of a GC.
GC	WINDOWS	READ_STARTS	MEAN_BASE_QUALITY	NORMALIZED_COVERAGE
20	5	1	8	1.13333
30	5	2	10	2.26667
40	4	0	0	0
50	11	1	0	0.515152
60	4	1	0	1.41667
70	5	0	0	0

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name Description
assemblyget Get assembly of sequence reads

Author(s)

Mahmut Uludag
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None