HTSeq.Count

Summary

Takes alignment files in SAM/BAM format and a feature file in GTF/GFF format (usually with exon annotation), name sorts the alignment file(s) and calculates for each sample the number of reads mapping to each feature. Can optionally produce output in GCT format.

Authorship

htseq-count is a Python script, distributed together with the HTSeq Python library developed by Simon Anders at EMBL Heidelberg.
This module uses HTSeq v0.11.2 via the biocontainers HTSeq.count image biocontainers/htseq:v0.11.2-1-deb-py3_cv1.
HTSeq.Count was originally wrapped as a GenePattern module by the staff of VIB BioinformaticsCore, and then "Dockerized" by the GenePattern team for use on Docker enabled GenePattern servers.
GenePattern Module wrapping: Barbara Hill, GenePattern Team; Guy Bottu, VIB BioinformaticsCore

References

  1. Anders S, Pyl PT, Huber W : HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2), 166-169 (2015) PubMed 25260700

Contact

Please post in the GenePattern Forum for assistance with this module.

Links

Description

Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature. A feature is here an interval (i.e. a range of positions) on a chromosome or a union of such intervals. In the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons. One may also consider each exon as a feature, e.g., in order to check for alternative splicing. For comparative ChIP-Seq, the features might be binding region from a pre-determined list.

Special care must be taken to decide how to deal with reads that align to or overlap with more than one feature. HTSeq.Count allows the user to choose between three modes, which work as follows : For each position i in the read, a set S(i) is defined as the set of all features overlapping position i. Then, consider the set S, which is (with i running through all position within the read or a read pair) either :

If S contains precisely one feature, the read (or read pair) is counted for this feature. If S is empty, the read (or read pair) is counted as no_feature. If S contains more than one feature, HTSeq.Count behaves differently based on the setting of the count nonunique parameter :

The following figure illustrates the effect of these three modes:

Union and Intersection

Parameters

Name Description Allowed values Default
input file* Input file(s) in SAM or BAM format. SAM or BAM format file
sample names Text file with the names of the samples, one per line (optional and only relevant if you request Excel or GCT format). The names in the file must be in the same order as the input SAM/BAM files. If you do not provide a file the sample names will be deduced from the SAM/BAM file names. valid file
GTF file* A GTF or GFF file containing a list of gene model annotations.This file can be gzipped. selection from dynamic list or valid file
strandedness* none : a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature.

forward : a single-end read has to be mapped to the same strand as the feature, for paired-end reads the first read has to be on the same strand and the second read on the opposite strand.

reverse : the above rules are reversed.

set to 'none' if your input file contains no strand information.

  • none
  • forward
  • reverse
forward
output file* output file name - format will be Inputfile_Basename.Output_Filename (if you choose GCT format .gct will automatically be added) HTSeq.counts
output format*
  • raw HTseq format
  • with header added, suitable for import in Excel
  • GCT
  • raw HTSeq format
    min qual* Minimum quality to accept a read. min = 0 10
    mode* Mode to handle reads overlapping more than one feature. See above for a full explanation.
    • union
    • intersection-strict
    • intersection-nonempty
    union
    count nonunique* Whether to count reads that are not uniquely aligned or are ambiguously assigned to features.
  • yes
  • no
  • no
    count secondary* Whether to count secondary alignments (which are marked in the SAM/BAM file by a 0x100 flag).
  • yes
  • no
  • no
    count supplementary* Whether to count supplementary alignments (which are marked in the SAM/BAM file by a 0x800 flag).
  • yes
  • no
  • no
    id type* GTF/GFF attribute used to group features. gene_id
    gene name GTF/GFF attribute with the name of the gene or some other information that can help to identify the gene in a more user-friendly way than the ID (optional). If you fill this in an extra column will be added to the output table. For Ensembl data gene_name is suitable.
    feature type* Name in the 3th column of the GTF/GFF input file that is used to identify the features that must be counted. exon
    * - required

    Input files

    HTSeq.Counts takes as input one or more files with mapped reads in SAM or BAM format and a file with annotations in GTF or GFF format.
    Example input files can be found in the module Git repository.

    Output files

    HTSeq.Count creates a tabular output file with the name of a gene (or more generally, the feature selected with the "id type" parameter) on each line and the total number of reads (or read pairs) mapped to exons (or more generally, the feature selected with the "feature type" parameter) belonging to that gene. If you have set the "gene name" parameter, you will get an extra column in the table. If you have given multiple mapping files as input, you will get a separate column with counts for each sample. The output file ends with summary lines that for each sample count :
    __no_featurenumber of reads (or read pairs) that were labeled no_feature because they could not be assigned to any feature, see Description for more explanation
    __ambiguousnumber of reads (or read pairs) that were labeled ambiguous, because they could have been assigned to more than one feature, see Description for more explanation
    __too_low_aQualnumber of reads (or read pairs) that were skipped due to having low quality according to the "min qual" parameter
    __not_alignednumber of reads (or read pairs) in the SAM/BAM file without alignment
    __alignment_not_uniquenumber of reads (or read pairs) with more than one reported alignment. These reads are recognized from the NH optional SAM field tag (if the aligner does not set this field, multiple aligned reads will be counted multiple times).

    If you request the Excel output format the module will add a header with the sample name on top of each column with read counts.

    If you request GCT the module will format the output file accordingly, and will redirect the summary lines to stdout.txt.

    If you have separate .count files output from multiple runs of HTSeq.Count, you can use the GenePattern module, MergeHTSeqCounts to combine those files into a single GCT output file.

    Example output files can be found in the module Git repository.

    Version Comments

    VersionRelease dateDescription
    12016-07-15for htseq-count of HTSeq 0.6.1p2
    22018-02-22for htseq-count of HTSeq 0.9.1, allows multiple input files
    2.12019-08-26for htseq-count of HTSeq 0.11.1, uses Python 3 and has other defaults
    3.02020-10-31Dockerized version of the module using biocontainers/htseq:v0.11.2-1-deb-py3_cv1, and adding python wrapper to detect format of and name sort alignment files.