Charting the sORF-ome of Arabidopsis thaliana

G Menschaert1, S Verbruggen1, A Lambert2, A Nouwens2, U Dressel2, B Carroll2 and JA Rothnagel2

  1. BioBix, Laboratory of Bioinformatics and Computational Genomics, Ghent University, 9000 Ghent, Belgium
  2. School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Australia.

Small ORFs (≤ 100 AA) are often missed in genome annotations. To overcome this limitation, Hanada et al. (2013) performed a large-scale study in Arabidopsis thaliana, using high in silico coding potential as the criterion to identify putative intergenic coding regions. We have extended their study by employing ribosome profiling, the transcriptome-wide sequencing of ribosome-protected mRNA fragments, which has greatly facilitated the detection of putative small coding ORFs. Using publicly available (Juntawong et al. 2014, Liu et al. 2013) and in-house generated Ribo-seq data, we charted the sORF-ome of Arabidopsis. We exploited the ability of lactimidomycin to stall ribosomes at sites of translation initiation to identify cognate and alternative translation initiation sites at sub-codon to single-nucleotide resolution. Based on this ribosome-protected fragment (RPF) signal, we were able to validate the recent discovery of miPEPs in primary miRNA transcripts of Arabidopsis (Lauressergues et al. 2015). Furthermore, we discovered many other putative coding sORFs in different categories of ncRNAs. The latest Arabidopsis thaliana annotation (Araport11, June 2016) also reports 726 novel transcribed regions (based on tissue-specific RNA-seq libraries from 113 datasets). In these novel transcribed regions, many (s)ORFs were also clearly delineated. We also corroborated these findings with matching mass spectrometry data. To this end, we first evaluated several extraction protocols on flower buds and other plant tissues for use in LC-MS/MS. The methods described by Hernandez and Vierling (1993) and Conlon and Salter (2007) provided the highest protein concentrations. The Conlon and Salter method produced the greatest number of mass spectra for proteins <20 kDa. The Hernandez and Vierling method, despite producing a high protein yield, resulted in the lowest number of spectra.
This proteogenomics approach greatly enhances the identification of bona fide translatable sORFs in plants.
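As an illustration of the sORF size criterion used above (no more than 100 amino acids), the following is a minimal sketch of how candidate sORFs might be enumerated in a transcript sequence. It is not the pipeline used in this study: the function name `find_sorfs` and its parameters are hypothetical, it scans only the forward strand, and it assumes ATG start codons, whereas the lactimidomycin-based profiling described here also identifies alternative initiation sites.

```python
# Hypothetical helper, for illustration only; not the authors' method.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_aa=100, min_aa=1):
    """Return (start, end, frame) tuples for ATG-initiated ORFs on the
    forward strand that encode between min_aa and max_aa residues.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    For each stop codon, only the most upstream in-frame ATG is reported.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # residues, excluding the stop codon
                if min_aa <= n_aa <= max_aa:
                    hits.append((start, i + 3, frame))
                start = None
    return hits


if __name__ == "__main__":
    # Toy sequence containing one short ORF (ATG AAA TGG TAG) in frame 2.
    print(find_sorfs("CCATGAAATGGTAGCC"))
```

In practice such purely sequence-based candidates would still need supporting evidence, such as RPF signal at the initiation site or matching peptide spectra, before being considered translated.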
References: Hanada et al. (2013) Proc Natl Acad Sci U S A. 110:2395–2400. Juntawong et al. (2014) Proc Natl Acad Sci U S A. 111:E203–E212. Liu et al. (2013) Plant Cell. 25:3699–3710. Lauressergues et al. (2015) Nature. 520:90–93. Hernandez and Vierling (1993) Plant Physiol. 101:1209–1216. Conlon and Salter (2007) Methods Mol Biol. Humana Press, pp. 379–383.