STAR 2.5.3a aligner

Summary

STAR (Spliced Transcripts Alignment to a Reference) is a fast mapper of RNA-seq reads to a genome. It differs from other mappers such as TopHat in that it gains speed at the expense of consuming more RAM, and in that it incorporates transcriptome annotation at the index build stage rather than at the analysis stage. It can optionally detect non-canonical splices and chimeric transcripts.

Description

STAR starts by searching for sequence substrings that match perfectly between a read and the genome. It can do this very fast by using an uncompressed suffix array. Then it tries to extend these substrings into "seed" alignments, allowing for mismatches and gaps. Finally, it tries to stitch the "seeds" together into mappings, taking into account introns and, where applicable, mate pairs and chimeric transcripts.
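
As an illustration of the first step, the following toy sketch in Python finds the longest prefix of a read that occurs verbatim in a genome, by binary search in a suffix array. It is not STAR's actual implementation: the function names are hypothetical, the suffix array is built naively, and the extension and stitching of seeds are left out.

```python
import bisect

def build_suffix_array(genome):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction, only suitable for toy-sized inputs.
    """
    return sorted(range(len(genome)), key=lambda i: genome[i:])

def longest_exact_prefix(read, genome, suffix_array):
    """Return (length, positions) of the longest prefix of `read` found in `genome`."""
    best_len, best_pos = 0, []
    for k in range(1, len(read) + 1):
        prefix = read[:k]
        # The k-letter prefixes of sorted suffixes are themselves sorted,
        # so a binary search finds the block of suffixes starting with `prefix`.
        keys = [genome[i:i + k] for i in suffix_array]
        lo = bisect.bisect_left(keys, prefix)
        hi = bisect.bisect_right(keys, prefix)
        if lo == hi:
            break  # no suffix starts with this longer prefix
        best_len, best_pos = k, sorted(suffix_array[lo:hi])
    return best_len, best_pos

genome = "ACGTACGGA"
suffix_array = build_suffix_array(genome)
print(longest_exact_prefix("ACGGTT", genome, suffix_array))  # prints (4, [4])
```

In this example the first four bases of the read match the genome at position 4 (0-based); a real aligner would then continue the search from the unmatched remainder of the read.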

STAR always first maps the reads to the genome. If splice junction annotation is available, it is used only afterwards, to decide which potential splice junctions to accept; the criteria are more lenient for annotated junctions.

Two-pass alignment

STAR can be run in 2-pass mode. The 1st pass serves to detect novel junctions, and in the 2nd pass, the detected junctions are added to the annotated junctions, and all reads are re-mapped to finalize the alignments. While this procedure does not significantly increase the number of novel collapsed junctions, it substantially increases the number of reads crossing the novel junctions, by allowing novel splices with shorter overhang. This procedure is especially advantageous in cases where annotations are unavailable or incomplete. There are two ways to do it :
  1. in a single run, by putting STAR.aligner in two-pass mode
  2. in 3 steps :
    1. run STAR.aligner
    2. run STAR.indexer, using the file <basename>.SJ.out.tab of the first run as input. You can merge tab files from runs performed on different samples.
    3. run again STAR.aligner, using the index generated by STAR.indexer as input
The second method has the advantage that the indexing must be done only once and that you can later run STAR.aligner as often as you like without the overhead of the on-the-fly merging of the extra annotation into the index. Also, the first method is slightly less sensitive: if a novel junction is highly expressed in only one sample and weakly expressed (only a few reads with short overhang) in other samples, the per-sample 2-pass approach may detect this junction only in the former sample, whereas the multi-sample 2-pass strategy will detect it in all samples.
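
Merging the junction tables of several samples before running STAR.indexer could be sketched as below. This is a minimal sketch in Python, not part of STAR or GenePattern: the function name is hypothetical, and it assumes the usual 9-column tab-separated layout of SJ.out.tab (chromosome, intron start, intron end, strand, intron motif, annotated flag, number of uniquely mapping reads, number of multimapping reads, maximum overhang).

```python
def merge_sj_tables(tables):
    """Merge the lines of several SJ.out.tab files into one sorted, de-duplicated list.

    `tables` is an iterable of iterables of lines (for example open file handles).
    A junction is identified by (chromosome, start, end, strand). Read counts are
    summed over samples and the largest overhang is kept.
    """
    merged = {}
    for lines in tables:
        for line in lines:
            if not line.strip():
                continue  # skip empty lines
            f = line.rstrip("\n").split("\t")
            key = (f[0], int(f[1]), int(f[2]), f[3])
            unique, multi, overhang = int(f[6]), int(f[7]), int(f[8])
            if key in merged:
                record = merged[key]
                record[6] += unique
                record[7] += multi
                record[8] = max(record[8], overhang)
            else:
                merged[key] = [f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
                               unique, multi, overhang]
    return ["\t".join(str(x) for x in merged[key]) for key in sorted(merged)]

# Hypothetical content of the SJ.out.tab files of two samples.
sample_a = ["chr1\t100\t200\t1\t1\t0\t5\t0\t30\n"]
sample_b = ["chr1\t100\t200\t1\t1\t0\t2\t1\t12\n",
            "chr2\t50\t90\t2\t2\t0\t1\t0\t8\n"]
for row in merge_sj_tables([sample_a, sample_b]):
    print(row)
```

Summing the counts is a design choice of this sketch; simply concatenating the files is another option, since junctions are identified by their coordinates.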

Shared memory

STAR on the V.I.B. GenePattern server has been set to use shared memory. This means that if several runs are started on the same genome, they use the same instance of the index loaded into RAM. The memory is cleared after all runs have terminated. This saves memory space as well as the overhead of loading the index into memory.

This feature is disabled if STAR.aligner is run in two-pass mode, since the run then needs its own copy of the index to merge the extra annotation into. It is also disabled for generating sorted BAM output (and for generating "wiggle" output), since the sorting needs an unpredictable amount of extra memory.

Parameters

Name Description Allowed values Default
input
star index A STAR index. Select a prebuilt index or upload your own as a ZIP file (you can make one using STAR.indexer). selection from dynamic list or valid file
reads pair 1 Unpaired reads or first mate for paired reads, as files in fastA or fastQ format. You can provide several files. valid file(s)
(is required)
reads pair 2 Second mate for paired reads.

The files and the sequences inside the files must be in the same order as for the first mates.

valid file(s)
mapping and reporting of mapped reads
max reads to align Set this if you want to map only a selected set of reads at the top of the input (mainly useful for testing). min = 1
align read end to end Align reads end to end and count all mismatches. The default is instead to use "soft clipping" at both the 5'- and the 3'-end, that is, to make a local alignment and ignore the contribution of the ends of the read to the final score if this improves the score. End-to-end alignment can be useful for ChIP-seq and other DNA sequencing applications with reads that have already been quality-trimmed.
  • yes
  • no
  no
    max number mismatches Maximum number of mismatches per read. Note that a mate pair is counted as one read. min = 0 10
    max fraction mismatches Maximum number of mismatches per read, expressed as the ratio of the number of mismatches to the mapped length of the read. Note that a mate pair is counted as one read. min = 0
    max = 1
    0.3
    min overhang annotated read Minimum length that a read must map on both sides of a splice junction in order to accept a mapping to a splice junction that is annotated in the GTF or tab file. min = 1 3
    min overhang not annotated read Minimum length that a read must map on both sides of a splice junction in order to accept the de novo discovery of a splice junction. min = 1 5
    min intron length Minimum size of intron. A gap in the alignment between a read and the genome that is smaller is considered a deletion, not an intron. min = 1 21
    max intron length Maximum size of intron. If a read aligns to the genome with a gap larger than this, it is considered a chimeric read. The default value of 500,000 is fine-tuned to mammalian genomes; for plant and yeast genomes you will have to decrease it. min = 1 500000
    mates max gap Maximum distance between mate pair reads. If the reads map to the genome farther apart, the fragment is considered to be chimeric. The default value of 500,000 is fine-tuned to mammalian genomes; for plant and yeast genomes you will have to decrease it. min = 1 500000
    secondary mapping mismatches range By default STAR only reports reads that map to multiple locations on the genome when they map with the highest possible score. You can ask STAR to also report secondary mappings with up to this many more mismatches than the primary mappings. min = 0 0
    max multimapping Do not report reads that map to more than that many different locations on the genome. min = 1 10
    min report canonical junction overhang Criterion for reporting de novo predicted splice junctions in the SJ.out.tab file : for canonical splice sites at least one read with at least this overhang is needed. min = 1 12
    min report noncanonical junction overhang Criterion for reporting de novo predicted splice junctions in the SJ.out.tab file : for noncanonical splice sites at least 3 reads with at least this overhang are needed. min = 1 30
    map only reported junctions By default all mapped reads are output in the SAM/BAM file. Set this to yes if you want to bring the SAM/BAM file in agreement with the SJ.out.tab file, by outputting only reads that are not spliced or that map to splice junctions that are annotated or have been de novo predicted.
    • yes
    • no
    no
    postprocessing and supplementary output
    two pass Run STAR in 2-pass mode, that is, run STAR a first time, merge the found splice junctions with the splice junction annotation in the index, and run STAR a second time. Consult the Description section for more explanation.
    • yes
    • no
    no
    detect chimeric transcripts
    • yes
    • no
    no
    output unmapped reads
    • yes
    • no
    no
    quantify genes Write a table with number of reads mapped per gene and a supplementary BAM file with mappings to transcriptome instead of to genome coordinates.
    • yes
    • no
    no
    output wiggle file Output a "wiggle" file for viewing in viewers like IGV or the UCSC genome browser.
    • none
    • file in BedGraph format
    • file in WIG format
    none
    wiggle signal From which bases to generate the signal for the "wiggle" file. By default they are all used, but you can choose to use only the bases at the 5'-end of the 1st read (useful for CAGE/RAMPAGE) or only from the 2nd mate from read pairs.
    • all
    • only 5' from read 1
    • only from read 2
    all
    output
    output format
    • SAM unsorted
    • BAM unsorted
    • BAM sorted by coordinate
    SAM unsorted
    HI flag Number of "hits" with the highest score to label as "primary" in the SAM/BAM file. If you choose "only one", the others are labeled as "secondary" (all reads that map with lower than the highest score are labeled "secondary").
    • only one
    • all
    only one
    output prefix The prefix to use for the output file names. STAR
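
The way some of the thresholds above interact can be sketched in Python as follows. The function names are hypothetical and the logic is only a reading of the parameter descriptions in this table, not STAR's own code; the default values are those listed above.

```python
def passes_mismatch_filter(mismatches, mapped_length,
                           max_number=10, max_fraction=0.3):
    """True if a read (or mate pair) satisfies both mismatch limits."""
    return (mismatches <= max_number
            and mismatches <= max_fraction * mapped_length)

def classify_gap(gap_length, min_intron=21, max_intron=500000):
    """Interpret a gap in the alignment between a read and the genome."""
    if gap_length < min_intron:
        return "deletion"
    if gap_length > max_intron:
        return "chimeric"
    return "intron"

def junction_overhang_ok(left_overhang, right_overhang, annotated,
                         min_annotated=3, min_not_annotated=5):
    """True if the read maps far enough on both sides of the splice junction."""
    required = min_annotated if annotated else min_not_annotated
    return min(left_overhang, right_overhang) >= required

print(passes_mismatch_filter(5, 10))                 # prints False
print(classify_gap(15), classify_gap(1000))          # prints deletion intron
print(junction_overhang_ok(4, 40, annotated=True))   # prints True
print(junction_overhang_ok(4, 40, annotated=False))  # prints False
```

The last two calls show the more lenient treatment of annotated junctions: an overhang of 4 bases is enough for an annotated junction but not for a de novo one.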

    Input files

    STAR.aligner takes as input a set of reads and a genome.

    The reads can be provided in fastQ format or in fastA format. The reads can be spread over several files.

    For mate pair experiments it is necessary to provide two sets of input files, and it is important that the files as well as the reads inside each file are in the same order in both sets, so that STAR can find the corresponding partner of each mate.
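
A quick check that two fastQ files list their reads in the same order could look like the sketch below. It is only an illustration in Python with hypothetical function names; it assumes plain 4-line fastQ records and read names that are identical for both mates apart from an optional trailing /1 or /2.

```python
import io
from itertools import zip_longest

def fastq_names(handle):
    """Yield the read name of each 4-line fastQ record, without '@' and without /1 or /2."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            name = line[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name

def first_order_mismatch(mate1, mate2):
    """Return the index of the first record whose names differ, or None if all agree."""
    pairs = zip_longest(fastq_names(mate1), fastq_names(mate2))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index
    return None

# Toy in-memory files; real input would be opened with open(...).
mate1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
mate2 = io.StringIO("@r1/2\nTTGA\n+\nIIII\n@r3/2\nTTGA\n+\nIIII\n")
print(first_order_mismatch(mate1, mate2))  # prints 1
```

A result of None means that the two files are in sync; a file with missing or extra reads is reported at the first position where the names stop agreeing.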

    The genome must be provided as an index; there is no need for the original sequences or the original annotation. STAR.aligner has access to a series of prebuilt indexes. Alternatively, you can provide an index of your own, which must be in a ZIP file. GenePattern has a tool, STAR.indexer, to make the index from a series of fastA files.

    Output files

    STAR.aligner creates a whole series of output files. At minimum, a run will produce the following :

    By default STAR.aligner does not search for fusion reads, that is, reads that map to regions of the genome that are located on different chromosomes, on different strands of the same chromosome or on widely distant regions of the same chromosome, so that they likely derive from sequencing chimeric transcripts. If you request it, STAR.aligner will output a separate file in SAM format, <basename>.Chimeric.out.sam.
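
The three situations listed above can be expressed as a small test on two aligned segments of the same read or fragment. This is a hedged sketch in Python: the function name and the segment representation are hypothetical, and using a threshold of 500000 for "widely distant" is an assumption borrowed from the default of the max intron length and mates max gap parameters.

```python
def is_chimeric(segment_a, segment_b, max_gap=500000):
    """Decide whether two aligned segments suggest a chimeric transcript.

    Each segment is a tuple (chromosome, strand, start, end).
    """
    chrom_a, strand_a, start_a, end_a = segment_a
    chrom_b, strand_b, start_b, end_b = segment_b
    if chrom_a != chrom_b:
        return True   # different chromosomes
    if strand_a != strand_b:
        return True   # different strands of the same chromosome
    # Distance between the two segments on the same chromosome and strand.
    gap = max(start_a, start_b) - min(end_a, end_b)
    return gap > max_gap

print(is_chimeric(("chr1", "+", 100, 150), ("chr2", "+", 100, 150)))        # prints True
print(is_chimeric(("chr1", "+", 100, 150), ("chr1", "+", 1000, 1050)))      # prints False
print(is_chimeric(("chr1", "+", 100, 150), ("chr1", "+", 900000, 900050)))  # prints True
```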

    You can request STAR.aligner to write files <basename>.Unmapped.out.mate1 (and, for paired reads, <basename>.Unmapped.out.mate2) with the reads that could not be mapped, in the same fastA or fastQ format as the input.

    When you request STAR.aligner to quantify genes it will write 2 supplementary files :

    You can request the output of "wiggle" files, which are useful for visualization of the RNA-seq signal in genome browsers such as the UCSC genome browser or IGV. The signal represents the number of reads crossing each genomic base. There are separate files for the two strands and there are separate files for uniquely mapping and for multimapping reads ; in the latter case the contribution of the multimappers will be divided by the number of loci they map to. You can choose between output in BedGraph or in WIG format. STAR.aligner will write output files named respectively <basename>.Signal.Unique.str1.out.bg, <basename>.Signal.Unique.str2.out.bg, <basename>.Signal.UniqueMultiple.str1.out.bg, <basename>.Signal.UniqueMultiple.str2.out.bg or <basename>.Signal.Unique.str1.out.wig, <basename>.Signal.Unique.str2.out.wig, <basename>.Signal.UniqueMultiple.str1.out.wig, <basename>.Signal.UniqueMultiple.str2.out.wig. Since the generation of the "wiggle" files requires reads sorted by coordinate, asking for "wiggle" output automatically sets STAR.aligner to make sorted BAM output.
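
How such a signal could be computed is sketched below in Python. The function name and the input representation are hypothetical, strands and spliced alignments are ignored for brevity, and it is assumed, going by the file names, that the "UniqueMultiple" signal includes the uniquely mapping reads as well as the multimappers.

```python
def signal_tracks(alignments, genome_length):
    """Compute per-base signal for uniquely mapping reads and for all reads.

    Each alignment is a tuple (start, end, n_loci) with 0-based start,
    exclusive end, and n_loci the number of loci the read maps to.
    """
    unique = [0.0] * genome_length
    unique_multiple = [0.0] * genome_length
    for start, end, n_loci in alignments:
        for position in range(start, end):
            # A multimapper contributes 1/n_loci at each of its loci.
            unique_multiple[position] += 1.0 / n_loci
            if n_loci == 1:
                unique[position] += 1.0
    return unique, unique_multiple

# One unique read covering bases 0-2 and one read mapping to 2 loci covering bases 2-4.
unique, unique_multiple = signal_tracks([(0, 3, 1), (2, 5, 2)], genome_length=6)
print(unique)           # prints [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(unique_multiple)  # prints [1.0, 1.0, 1.5, 0.5, 0.5, 0.0]
```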

    Links

    References

    1. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR : STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21 (2013). PubMed 23104886
    2. Dobin A, Gingeras TR : Optimizing RNA-Seq Mapping with STAR. Methods Mol. Biol. 1415:245-265 (2016). PubMed 27115637

    Author

    The GenePattern interface is made by Guy Bottu, V.I.B.-B.I.T.S.

    The STAR software is developed by a team of programmers headed by Alexander Dobin at Cold Spring Harbor Laboratory.

    Version Comments

    Version Release date Description
    1 2016-08-29 for STAR 2.5.2a