Paired-End Analysis of Transcription Start Sites in Arabidopsis Reveals Plant-Specific Promoter Signatures

Publication Online: http://www.plantcell.org/content/26/7/2746.long

About this study: 
Understanding of plant gene promoter architecture has long been a difficult challenge due to the lack of large-scale data and analysis methods on the topic. In this study we present a publicly available large-scale transcription start site (TSS) dataset in plants using a high-resolution method for analysis of 5' ends of mRNA transcripts. Our dataset is produced using the Paired-End Analysis of Transcription Start Sites (PEAT) protocol, providing millions of TSS locations from wildtype Col-0 Arabidopsis whole root samples. Using this quality-filtered dataset, we first categorize the different shapes taken on by the TSS location distributions into TSS “tag clusters”. We then design a high-resolution machine-learning model that predicts the presence of a TSS tag cluster with an auROC near 0.98 for each cluster shape. We use this model to analyze the transcription factor binding site content of different promoter shapes. We find that while canonical notions of sharp narrow peak TATA-containing promoters vs more broad “TATA-less” promoters have some merit, the model shows that a large compendium of known DNA sequence binding elements is actually necessary and sufficient for accurate promoter prediction in the case of all tag cluster shapes. These elements form promoter signatures for transcription initiation. We present precise results on the usage of these elements, and provide our Plant PEAT Peaks (3PEAT) model which predicts the presence of PEAT tag clusters directly from sequence.

Additional Supplementary Material


3PEAT Transcription Factor Binding Site Scanner (3PEAT-TFBS-Scanner)

This tool scans for Transcription Factor Binding Sites (TFBSs) using Positional Weight Matrices (PWMs) with local background correction. Given an input sequence, it outputs a log-likelihood score for each TFBS, which can then be post-processed.
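To make the idea of PWM scanning with local background correction concrete, here is a minimal Python sketch. It is not the 3PEAT-TFBS-Scanner implementation. The function names, the window size, the pseudocount, and the toy TATA-like matrix are all assumptions chosen for illustration, and only the forward strand is scored.

```python
import math

BASES = "ACGT"

def local_background(seq, center, half_window, pseudocount=1.0):
    """Estimate base frequencies in a window around `center` (hypothetical helper)."""
    lo = max(0, center - half_window)
    hi = min(len(seq), center + half_window + 1)
    window = seq[lo:hi]
    total = sum(window.count(b) for b in BASES) + 4 * pseudocount
    return {b: (window.count(b) + pseudocount) / total for b in BASES}

def scan(seq, pwm, half_window=50):
    """Return one log-likelihood ratio per forward-strand start position.

    `pwm` is a list with one dict per motif column, mapping each base to a
    nonzero probability. Sites containing ambiguous bases score as None.
    """
    seq = seq.upper()
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(b not in BASES for b in site):
            scores.append(None)  # e.g. a site overlapping an N
            continue
        # Background is re-estimated around each site, not genome-wide.
        bg = local_background(seq, start + width // 2, half_window)
        scores.append(sum(math.log(col[b] / bg[b]) for col, b in zip(pwm, site)))
    return scores

# Toy 4-column matrix favouring T-A-T-A (illustrative values only).
def column(consensus):
    return {b: (0.85 if b == consensus else 0.05) for b in BASES}

toy_pwm = [column(b) for b in "TATA"]
toy_scores = scan("GGGCCCTATAGGGCCC", toy_pwm)
best_start = max(range(len(toy_scores)), key=lambda i: toy_scores[i])
print("best-scoring start position:", best_start)
```

Estimating the background from a window around each site, rather than from the whole genome, keeps a motif from scoring highly merely because it sits in a region of matching base composition.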


3PEAT Model Scanning Tool

We incorporated the model described above into an open-source tool, available below, called "3PEAT" (Plant PEAT Peaks). A version of the tool is provided for each PEAT peak shape type (Narrow Peak, Broad with Peak, and Weak Peak). Each version uses the model to identify probable TSS locations in an input sequence. Specifically, given a FASTA sequence, the tool produces a GBrowse custom track that displays, at each position in the sequence, the probability of observing a TSS peak mode for a peak of the given shape.
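Because the installation depends on l1_logreg, an L1-regularized logistic regression solver, the per-position probability is presumably a logistic function of features computed around each position. The sketch below illustrates that general idea only. The feature layout, weights, and intercept are made up, and the function names are hypothetical, not part of 3PEAT.

```python
import math

def tss_probability(features, weights, intercept):
    """Logistic model: P(TSS) = sigmoid(intercept + weights . features)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score_positions(feature_rows, weights, intercept):
    """Apply the model to one feature vector per sequence position."""
    return [tss_probability(row, weights, intercept) for row in feature_rows]

# Illustrative values only: three features per position, e.g. TFBS
# log-likelihood scores. A zero weight mimics a feature dropped by
# the L1 penalty.
example_weights = [1.2, 0.0, -0.4]
example_intercept = -2.0
example_rows = [
    [0.1, 3.0, 0.5],
    [4.0, 1.0, 0.2],
    [0.0, 0.0, 2.5],
]
for position, p in enumerate(score_positions(example_rows, example_weights, example_intercept), start=1):
    print(position, round(p, 3))
```

The L1 penalty drives many weights to exactly zero, so a fitted model of this kind shows directly which features contribute to the prediction for a given peak shape.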


Installation Instructions

  1. Download and install l1_logreg and ensure that l1_logreg_classify is available in your $PATH:
    export PATH="$PATH:/path/to/directory/with/l1_logreg"
  2. Download and extract both tools into a common directory:
    mkdir 3PEAT && cd 3PEAT
    wget http://megraw.cgrb.oregonstate.edu/software/3PEAT/TFBS-Scanner/3PEAT_TFBS_Scanner.tar.gz http://megraw.cgrb.oregonstate.edu/software/3PEAT/Model/3PEAT_GeneScanModel.tar.gz
    tar xvzf 3PEAT_TFBS_Scanner.tar.gz
    tar xvzf 3PEAT_GeneScanModel.tar.gz
  3. Compile 3PEAT-TFBS-Scanner:
    cd LogLikScanner/src
    make    # assumes the archive ships a Makefile in src/
    cd ../..
  4. Run the run_test_example.sh script from the model scanning tool:
    cd Model
    ./run_test_example.sh
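
Before running the test example, it can help to confirm that the l1_logreg classifier is actually reachable through $PATH. The following is a small optional check written for this page, not something shipped with 3PEAT, and the helper name find_missing is hypothetical.

```python
import shutil

def find_missing(executables):
    """Return the names from `executables` that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]

# l1_logreg_classify is the executable named in step 1 above.
missing = find_missing(["l1_logreg_classify"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found on PATH.")
```

If the executable is reported as missing, re-check the export line from step 1 in the same shell session that will run the scanner.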

Please see the LICENSE file included in both archives for copyright and distribution rights.


If any of these tools are used in work that results in a publication, we would appreciate citation of the following article:

Morton T, Petricka J, Corcoran DL, Li S, Winter CM, Carda A, Benfey PN, Ohler U, Megraw M. (2014). Paired-end analysis of transcription start sites in Arabidopsis reveals plant-specific promoter signatures. Plant Cell, 26:2746-60.