htseq-count is a Python script, distributed together with the HTSeq Python library developed by Simon Anders at EMBL Heidelberg.
This module uses HTSeq v0.11.2 via the biocontainers HTSeq.count image biocontainers/htseq:v0.11.2-1-deb-py3_cv1.
HTSeq.Count was originally wrapped as a GenePattern module by the staff of VIB BioinformaticsCore and then "Dockerized" by the GenePattern team for use on Docker-enabled GenePattern servers.
GenePattern Module wrapping: Barbara Hill, GenePattern Team; Guy Bottu, VIB BioinformaticsCore
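Special care must be taken to decide how to deal with reads that align to or overlap with more than one feature. HTSeq.Count lets the user choose between three modes, which work as follows. For each position i in the read, a set S(i) is defined as the set of all features overlapping position i. Then consider the set S, which is (with i running through all positions within the read or read pair) either: the union of all the sets S(i), for mode union; the intersection of all the sets S(i), for mode intersection-strict; or the intersection of all non-empty sets S(i), for mode intersection-nonempty. If S contains exactly one feature, the read (or read pair) is counted for that feature. If S is empty, it is labeled no_feature. If S contains more than one feature, it is labeled ambiguous.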
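The set logic described here can be sketched in a few lines of Python. This is a minimal illustration and not htseq-count's actual code: it assumes the three modes are htseq-count's union, intersection-strict and intersection-nonempty, that the per-position sets S(i) have already been computed, and that "count nonunique" is off. The function names resolve_features and classify_read are hypothetical.

```python
from typing import Sequence, Set


def resolve_features(per_position: Sequence[Set[str]], mode: str = "union") -> Set[str]:
    """Combine the per-position feature sets S(i) into the set S for one read."""
    if mode == "union":
        combined: Set[str] = set()
        for features in per_position:
            combined |= features
        return combined
    if mode == "intersection-strict":
        candidates = list(per_position)
    elif mode == "intersection-nonempty":
        # Positions that overlap no feature are ignored in this mode.
        candidates = [features for features in per_position if features]
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not candidates:
        return set()
    combined = set(candidates[0])
    for features in candidates[1:]:
        combined &= features
    return combined


def classify_read(per_position: Sequence[Set[str]], mode: str = "union") -> str:
    """Return the feature ID the read counts for, or a special label.

    Assumption: ambiguous reads are not counted for any feature
    (the behaviour with "count nonunique" set to no).
    """
    s = resolve_features(per_position, mode)
    if not s:
        return "__no_feature"
    if len(s) == 1:
        return next(iter(s))
    return "__ambiguous"
```

For a read that only partly overlaps gene_A, for example `[{"gene_A"}, {"gene_A"}, set()]`, the union and intersection-nonempty modes assign it to gene_A, while intersection-strict labels it `__no_feature`.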
The following figure illustrates the effect of these three modes:
|input file*||Input file(s) in SAM or BAM format.||SAM or BAM format file|
|sample names||Text file with the names of the samples, one per line (optional and only relevant if you request Excel or GCT format). The names in the file must be in the same order as the input SAM/BAM files. If you do not provide a file, the sample names will be deduced from the SAM/BAM file names.||valid file|
|GTF file*||A GTF or GFF file containing a list of gene model annotations. This file can be gzipped.||selection from dynamic list or valid file|
|strandedness*||none: a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. forward: a single-end read has to be mapped to the same strand as the feature; for paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. reverse: the above rules are reversed. Set to 'none' if your input file contains no strand information.|
|output file*||Output file name. The name takes the form Inputfile_Basename.Output_Filename (if you choose GCT format, .gct is added automatically).||HTSeq.counts|
|output format*||raw HTSeq format|
|min qual*||Minimum alignment quality (MAPQ) required to accept a read.||min = 0||10|
|mode*||Mode to handle reads overlapping more than one feature. See above for a full explanation.||
|count nonunique*||Whether to count reads that are not uniquely aligned or are ambiguously assigned to features.||no|
|count secondary*||Whether to count secondary alignments (which are marked in the SAM/BAM file by a 0x100 flag).||no|
|count supplementary*||Whether to count supplementary alignments (which are marked in the SAM/BAM file by a 0x800 flag).||no|
|id type*||GTF/GFF attribute used to group features.||gene_id|
|gene name||GTF/GFF attribute with the name of the gene or some other information that can help to identify the gene in a more user-friendly way than the ID (optional). If you fill this in, an extra column will be added to the output table. For Ensembl data, gene_name is suitable.|
|feature type*||Name in the 3rd column of the GTF/GFF input file that identifies the features to be counted.||exon|
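The strandedness rules in the parameter table above can be written as a small predicate. This is only a sketch under the assumption that strands are given as "+" or "-" and that the option values are the module's none, forward and reverse; the function name strand_compatible and its arguments are hypothetical and do not come from HTSeq.

```python
def flip(strand: str) -> str:
    """Return the opposite strand symbol."""
    return "-" if strand == "+" else "+"


def strand_compatible(read_strand: str, feature_strand: str,
                      strandedness: str = "none",
                      is_second_mate: bool = False) -> bool:
    """Decide whether a read may overlap a feature under the given strandedness.

    read_strand / feature_strand: "+" or "-".
    is_second_mate: True for the second read of a pair; it is expected
    on the opposite strand from the first read.
    """
    if strandedness == "none":
        return True
    if strandedness not in ("forward", "reverse"):
        raise ValueError(f"unknown strandedness: {strandedness}")
    # Express the read in terms of the strand of the first read (or single-end read).
    effective = flip(read_strand) if is_second_mate else read_strand
    if strandedness == "reverse":
        effective = flip(effective)
    return effective == feature_strand
```

With strandedness set to forward, a single-end read on "+" matches a "+" feature, and the second read of a pair on "-" also matches a "+" feature.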
|__no_feature||number of reads (or read pairs) that were labeled no_feature because they could not be assigned to any feature, see Description for more explanation|
|__ambiguous||number of reads (or read pairs) that were labeled ambiguous, because they could have been assigned to more than one feature, see Description for more explanation|
|__too_low_aQual||number of reads (or read pairs) that were skipped because their alignment quality was below the "min qual" parameter|
|__not_aligned||number of reads (or read pairs) in the SAM/BAM file without alignment|
|__alignment_not_unique||number of reads (or read pairs) with more than one reported alignment. These reads are recognized from the NH optional SAM field tag (if the aligner does not set this field, multiply aligned reads will be counted multiple times).|
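How a single SAM record ends up in one of these special counters can be sketched as follows. The sketch relies only on the SAM conventions mentioned in this document (the 0x100 and 0x800 flags, the NH tag) plus the standard 0x4 "unmapped" flag and the MAPQ column. It is a simplified illustration for single-end records, not the code htseq-count runs; the function names and the labels skipped_secondary and skipped_supplementary are hypothetical.

```python
from typing import List, Optional

FLAG_UNMAPPED = 0x4         # read has no alignment
FLAG_SECONDARY = 0x100      # secondary alignment
FLAG_SUPPLEMENTARY = 0x800  # supplementary alignment


def nh_value(optional_fields: List[str]) -> Optional[int]:
    """Return the NH (number of reported alignments) tag value, if present."""
    for field in optional_fields:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == "NH" and parts[1] == "i":
            return int(parts[2])
    return None


def skip_label(sam_line: str, min_qual: int = 10,
               count_secondary: bool = False,
               count_supplementary: bool = False,
               count_nonunique: bool = False) -> Optional[str]:
    """Return a label if the record is not counted, or None if it goes on to feature assignment."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if flag & FLAG_UNMAPPED:
        return "__not_aligned"
    if flag & FLAG_SECONDARY and not count_secondary:
        return "skipped_secondary"       # hypothetical label, not in the output table
    if flag & FLAG_SUPPLEMENTARY and not count_supplementary:
        return "skipped_supplementary"   # hypothetical label, not in the output table
    nh = nh_value(fields[11:])
    # Without an NH tag, multiply aligned reads cannot be recognized here.
    if nh is not None and nh > 1 and not count_nonunique:
        return "__alignment_not_unique"
    if mapq < min_qual:
        return "__too_low_aQual"
    return None
```

A record with flag 0, MAPQ 50 and `NH:i:1` passes all checks, while the same record with `NH:i:3` is labeled `__alignment_not_unique` under the default settings.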
If you request the Excel output format, the module will add a header with the sample name on top of each column with read counts.
If you request GCT format, the module will format the output file accordingly and redirect the summary lines to stdout.txt.
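Separating the summary lines from the per-feature counts is straightforward, because the summary labels all begin with a double underscore. The sketch below assumes the raw output is tab-separated, holds a single sample, and has the count in the last column (so that an optional gene name column is tolerated); split_counts is a hypothetical helper, not part of the module.

```python
from typing import Dict, Iterable, Tuple


def split_counts(lines: Iterable[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split raw count lines into (feature counts, summary counts)."""
    counts: Dict[str, int] = {}
    summary: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        name, value = fields[0], int(fields[-1])
        # Summary rows such as __no_feature start with a double underscore.
        target = summary if name.startswith("__") else counts
        target[name] = value
    return counts, summary
```

It can be applied to an open file handle, for example `counts, summary = split_counts(open(path))`, where `path` points to a raw count file.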
If you have separate .count files output from multiple runs of HTSeq.Count, you can use the GenePattern module MergeHTSeqCounts to combine those files into a single GCT output file.
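To give an idea of what such a merge involves, here is a minimal sketch that combines per-sample count tables into GCT 1.2 text. It is not the MergeHTSeqCounts implementation. It assumes each sample is already available as a dictionary of feature ID to count, fills features missing from a sample with 0, drops the summary rows, and reuses the feature ID as the GCT Description column; all of these choices, and the name merge_to_gct, are assumptions made for illustration.

```python
from typing import Dict


def merge_to_gct(samples: Dict[str, Dict[str, int]]) -> str:
    """Merge per-sample counts into GCT 1.2 text.

    samples maps a sample name to its {feature_id: count} table.
    """
    sample_names = list(samples)
    # Collect every feature seen in any sample, skipping summary rows.
    features = sorted({
        feature
        for table in samples.values()
        for feature in table
        if not feature.startswith("__")
    })
    lines = [
        "#1.2",                                   # GCT version line
        f"{len(features)}\t{len(sample_names)}",  # number of rows, number of samples
        "\t".join(["Name", "Description"] + sample_names),
    ]
    for feature in features:
        row = [feature, feature]  # Description reuses the ID (assumption)
        row += [str(samples[name].get(feature, 0)) for name in sample_names]
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file with a .gct extension.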
Example output files can be found in the module Git repository.
|1||2016-07-15||for htseq-count of HTSeq 0.6.1p2|
|2||2018-02-22||for htseq-count of HTSeq 0.9.1, allows multiple input files|
|2.1||2019-08-26||for htseq-count of HTSeq 0.11.1, uses Python 3 and has other defaults|
|3.0||2020-10-31||Dockerized version of the module using biocontainers/htseq:v0.11.2-1-deb-py3_cv1, adding a Python wrapper that detects the format of alignment files and name-sorts them.|