HTSeq.Count

Summary

Takes alignment files in SAM/BAM format and a feature file in GTF/GFF format (usually with exon annotation), name sorts the alignment file(s) and calculates for each sample the number of reads mapping to each feature. Can optionally produce output in GCT format.

Authorship

htseq-count is a Python script, distributed together with the HTSeq Python library developed by Simon Anders at EMBL Heidelberg.
This module uses HTSeq v0.11.2 via the biocontainers HTSeq.count image biocontainers/htseq:v0.11.2-1-deb-py3_cv1.
HTSeq.Count was originally wrapped as a GenePattern module by the staff of VIB BioinformaticsCore, and then "Dockerized" by the GenePattern team for use on Docker enabled GenePattern servers.
GenePattern Module wrapping: Barbara Hill, GenePattern Team; Guy Bottu, VIB BioinformaticsCore

References

Anders S, Pyl PT, Huber W : HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2), 166-169 (2015) PubMed 25260700

Contact

Please post in the GenePattern Forum for assistance with this module.

Description

Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature. A feature is here an interval (i.e. a range of positions) on a chromosome or a union of such intervals. In the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons. One may also consider each exon as a feature, e.g., in order to check for alternative splicing. For comparative ChIP-Seq, the features might be binding region from a pre-determined list.

Special care must be taken to decide how to deal with reads that align to or overlap with more than one feature. HTSeq.Count allows the user to choose between three modes, which work as follows : For each position i in the read, a set S(i) is defined as the set of all features overlapping position i. Then, consider the set S, which is (with i running through all position within the read or a read pair) either :

the union of all the sets S(i). This is recommended for most cases.
the intersection of all the sets S(i)
the intersection of all non-empty sets S(i)

If S contains precisely one feature, the read (or read pair) is counted for this feature. If S is empty, the read (or read pair) is counted as no_feature. If S contains more than one feature, HTSeq.Count behaves differently based on the setting of the count nonunique parameter :

no (the default) : the read (or read pair) is counted as ambiguous and not counted for any feature. Also, if the read (or read pair) aligns to more than one location in the reference, it is scored as alignment_not_unique.
yes : the read (or read pair) is counted as ambiguous and is also counted in all features to which it was assigned. Also, if the read (or read pair) aligns to more than one location in the reference, it is scored as alignment_not_unique and also separately for each location. Note that in this case the sum of all counts will not be equal to the number of reads (or read pairs), because those with multiple alignments or overlaps get scored multiple times.

The following figure illustrates the effect of these three modes:

Union and Intersection

Parameters

Name	Description	Allowed values	Default
input file*	Input file(s) in SAM or BAM format.	SAM or BAM format file
sample names	Text file with the names of the samples, one per line (optional and only relevant if you request Excel or GCT format). The names in the file must be in the same order as the input SAM/BAM files. If you do not provide a file the sample names will be deduced from the SAM/BAM file names.	valid file
GTF file*	A GTF or GFF file containing a list of gene model annotations.This file can be gzipped.	selection from dynamic list or valid file
strandedness*	none : a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. forward : a single-end read has to be mapped to the same strand as the feature, for paired-end reads the first read has to be on the same strand and the second read on the opposite strand. reverse : the above rules are reversed. *set to 'none' if your input file contains no strand information.*	none forward reverse	forward
output file*	output file name - format will be Inputfile_Basename.Output_Filename (if you choose GCT format .gct will automatically be added)		HTSeq.counts
output format*		raw HTseq format with header added, suitable for import in Excel GCT	raw HTSeq format
min qual*	Minimum quality to accept a read.	min = 0	10
mode*	Mode to handle reads overlapping more than one feature. See above for a full explanation.	union intersection-strict intersection-nonempty	union
count nonunique*	Whether to count reads that are not uniquely aligned or are ambiguously assigned to features.	yes no	no
count secondary*	Whether to count secondary alignments (which are marked in the SAM/BAM file by a 0x100 flag).	yes no	no
count supplementary*	Whether to count supplementary alignments (which are marked in the SAM/BAM file by a 0x800 flag).	yes no	no
id type*	GTF/GFF attribute used to group features.		gene_id
gene name	GTF/GFF attribute with the name of the gene or some other information that can help to identify the gene in a more user-friendly way than the ID (optional). If you fill this in an extra column will be added to the output table. For Ensembl data gene_name is suitable.
feature type*	Name in the 3^th column of the GTF/GFF input file that is used to identify the features that must be counted.		exon

* - required

Input files

HTSeq.Counts takes as input one or more files with mapped reads in SAM or BAM format and a file with annotations in GTF or GFF format.
Example input files can be found in the module Git repository.

Output files

HTSeq.Count creates a tabular output file with the name of a gene (or more generally, the feature selected with the "id type" parameter) on each line and the total number of reads (or read pairs) mapped to exons (or more generally, the feature selected with the "feature type" parameter) belonging to that gene. If you have set the "gene name" parameter, you will get an extra column in the table. If you have given multiple mapping files as input, you will get a separate column with counts for each sample. The output file ends with summary lines that for each sample count :

__no_feature	number of reads (or read pairs) that were labeled no_feature because they could not be assigned to any feature, see Description for more explanation
__ambiguous	number of reads (or read pairs) that were labeled ambiguous, because they could have been assigned to more than one feature, see Description for more explanation
__too_low_aQual	number of reads (or read pairs) that were skipped due to having low quality according to the "min qual" parameter
__not_aligned	number of reads (or read pairs) in the SAM/BAM file without alignment
__alignment_not_unique	number of reads (or read pairs) with more than one reported alignment. These reads are recognized from the NH optional SAM field tag (if the aligner does not set this field, multiple aligned reads will be counted multiple times).

If you request the Excel output format the module will add a header with the sample name on top of each column with read counts.

If you request GCT the module will format the output file accordingly, and will redirect the summary lines to stdout.txt.

If you have separate .count files output from multiple runs of HTSeq.Count, you can use the GenePattern module, MergeHTSeqCounts to combine those files into a single GCT output file.

Example output files can be found in the module Git repository.

Version Comments

Version	Release date	Description
1	2016-07-15	for htseq-count of HTSeq 0.6.1p2
2	2018-02-22	for htseq-count of HTSeq 0.9.1, allows multiple input files
2.1	2019-08-26	for htseq-count of HTSeq 0.11.1, uses Python 3 and has other defaults
3.0	2020-10-31	Dockerized version of the module using biocontainers/htseq:v0.11.2-1-deb-py3_cv1, and adding python wrapper to detect format of and name sort alignment files.