RTAnalyzer: Web application to identify new retrotransposons

December 2006


1 RTAnalyzer

1.1 Abstract

To recognize sequences of retrotransposed origin, RTAnalyzer detects the signature of L1 retrotransposition and determines whether the query sequence was inserted by L1. A genomic database can be searched for a sequence of interest to see whether it is repeated because of L1 retrotransposition. The input sequence is used to BLAST the database, and each hit is then analyzed together with its 5' and 3' flanking sequences. RTAnalyzer looks for a target site duplication, a polyA tail at the 3' end and an endonuclease cleavage site at the 5' end, and combines them into an overall retrotransposition score.

1.2 Background

Mobile elements have contributed substantially to shaping mammalian genomes. In humans, LINE1 (L1) retrotransposons account for around 17% of the genome (reviewed in [7,8]). The vast majority (99.8%) of L1s are not mobile. Nevertheless, approximately 60 to 100 L1 elements are still capable of retrotransposing in human and 3000 in mouse [9,10]. L1 contains two ORFs, one of which has endonuclease and reverse transcriptase activity [11]. These proteins have a cis preference for the L1 mRNA [12], but are also able to mobilize other RNAs (trans activity) such as SINEs [13,14]. L1s are also responsible for the insertion of many processed pseudogenes. Sequences derived from mRNAs, devoid of introns and carrying a polyA, can be found throughout the genome and are usually easily recognized as retrotransposed processed pseudogenes [15]. The polyA tail is usually regarded as important for retrotransposition through the L1 mechanism [14]; for this reason a polyA is normally found at the 3' end of the insertion. Another L1 trademark, the target site duplication (TSD), is explained by the endonuclease cleavage sites, which are typically 15 nt apart on the two strands. The TSD is created when the overhangs are filled in on each side of the insertion (figure 2). Finally, L1 endonuclease has some preference for a cleavage site with two pyrimidines followed by four purines, especially TT/AAAA [7].

The study of small non-coding RNA genes is often complicated by the presence of pseudogenes. Many such RNAs, like U snRNAs, have many copies of their gene in the genome, but also have pseudogenes [16]. Pseudogene study can also be impaired by the size of the inserted elements: in many cases a 5' portion of variable length is missing from the inserted RNA sequence because reverse transcription stopped prematurely. Combined with accumulated mutations, this truncation often means that the insertion is missed by a standard BLAST search. Relaxing the BLAST criteria can lead to numerous hits, most of which might be false positives. The tool we designed can be especially efficient in this regard because it takes other characteristics into consideration.

Moreover, many major mammalian genome sequencing projects are currently under way, and consequently there are many new elements retrotransposed by L1 to look for and analyze. RTAnalyzer should prove to be an efficient tool for this purpose. This software will be useful for in-depth analysis of non-autonomous retrotransposons in sequenced mammals to help us understand retrotransposition in our evolutionary tree.

1.3 Algorithm and implementation

1.3.1 Implementation

RTAnalyzer searches for and scores retrotransposons in whole genome sequences. The software was written in Perl version 5.8.8 using freely available CPAN modules [1,2]. The web application runs on a server with Apache 2.0.54 and Gentoo Linux 2.6.17-gentoo-r4 [3,4]. The server communicates with a compute cluster to distribute the retrotransposon search tasks to three slave nodes.

1.3.2 Search algorithm

Fig. 1. Schematic representation of search algorithm

Homology (Blast hit) and extremities (E5 and E3) search

Because the calculation of the RetroScore depends on the exact positions of the 5' and 3' extremities of the homologous sequence, the goal of steps 1 and 2 of the RTAnalyzer algorithm is to determine these extremities accurately. Although BLAST normally determines them properly, it sometimes leaves out up to 10, or even 20, base pairs that have a lower percentage identity.

To find retrotransposed elements across a whole genome sequence, RTAnalyzer first performs a homology search using Blast [5]. The user specifies the initial sequence to search (ISS) in the genome. After this initial step, the program tries to find the 3' extremity of the ISS in the retrieved Blast hit with Matcher (a local alignment program provided by EMBOSS [6]). A 3' sequence of the ISS is extracted iteratively and aligned against the 3' half of the Blast hit plus a 3' offset. At each iteration, the extracted 3' ISS sequence is shifted one nucleotide upstream until the Matcher score exceeds the minimum score cutoff. A similar procedure is applied for the 5' extremity (see fig. 1).

Target Site Duplication (TSD) search

Next, RTAnalyzer identifies Target Site Duplications (TSD). To do so, it extracts the sequence found just before the 5' extremity (5' TSD mask) and tries to align it against the sequence between e3e and he+offset. A partial TSD score is calculated from this alignment (see section 1.3.3, RetroScore algorithm). The operation is repeated with the 5' TSD mask shifted one nucleotide upstream each time, until the maximum distance (REd) specified by the user is reached. The alignment with the best TSD score is kept.

PolyA tail search

To complete the retrotransposon signature, a polyA tail must be identified. The sequence between e3e and r3s is first extracted. Then, starting from its 5' end, RTAnalyzer looks for consecutive adenosines that start the polyA tail. The tail is extended until consecutive non-As are reached. At that point, RTAnalyzer checks whether the following nucleotides in a look-ahead window contain a minimum percentage of As; if so, the polyA tail keeps extending.

Consensus endonuclease cleavage site (Target) identification

Finally, we extract the sequence of the consensus endonuclease cleavage site. This sequence overlaps the 5' TSD.
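
The polyA tail search described earlier in this section can be illustrated with a short Python sketch. It is not the RTAnalyzer implementation (which is written in Perl); the function name find_polya and the parameters min_run, window and min_a_frac, along with their default values, are assumptions chosen for illustration. For simplicity the sketch opens the look-ahead window at the first non-A base rather than after several consecutive non-As.

```python
def find_polya(region, min_run=4, window=5, min_a_frac=0.6):
    """Return (start, end) of the polyA tail in `region` as a half-open
    interval, or None when no run of `min_run` adenosines is found.
    All parameter names and defaults are illustrative assumptions."""
    seq = region.upper()
    start = seq.find("A" * min_run)      # first run of consecutive As
    if start == -1:
        return None
    end = start + min_run
    while end < len(seq):
        if seq[end] == "A":              # plain extension through As
            end += 1
            continue
        ahead = seq[end:end + window]    # look-ahead window at the interruption
        # Dividing by `window` (not len(ahead)) is deliberately conservative
        # when fewer than `window` bases remain.
        if ahead.count("A") / window >= min_a_frac:
            end += len(ahead)            # A-rich enough: absorb the window
        else:
            break
    while seq[end - 1] != "A":           # drop trailing non-A bases
        end -= 1
    return start, end


# Hypothetical region downstream of the 3' extremity.
region = "GCAAAAAAGAAAATAAACCGTGT"
tail = find_polya(region)
if tail is not None:
    start, end = tail
    print(region[start:end], region[start:end].count("A"), "As")
```

The start and end positions, together with the count of As inside the interval, are the quantities that enter the polyA term of the RetroScore.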

1.3.3 RetroScore algorithm

RTAnalyzer calculates a RetroScore for each signature found.

RetroScore is calculated using the following formula:

RetroScore = (TSD * 0.6) + (polyA * 0.3) + (Target * 0.1)

TSD = ((AS / ASmax) * 100) - ( 10 * sqrt( REd) ) - ( 2 * (PR3d) )

Where AS is the alignment score, ASmax the maximum possible alignment score using the EDNAFULL alignment matrix, REd is the distance between 5' TSD and extremity and PR3d is the distance between polyA tail and 3' TSD.
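
As a rough illustration of how the TSD search and this score fit together, the following Python sketch slides a 5' mask upstream and scores every placement in the 3' region. It makes several assumptions that are not taken from RTAnalyzer: an ungapped comparison replaces the Matcher local alignment, only the EDNAFULL values for unambiguous bases are used (5 for a match, -4 for a mismatch), and the names best_tsd, upstream, downstream, mask_len, max_red and polya_end are hypothetical.

```python
import math

MATCH, MISMATCH = 5, -4  # EDNAFULL values for unambiguous bases


def ungapped_score(a, b):
    """Score two equal-length sequences position by position (no gaps)."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def best_tsd(upstream, downstream, mask_len=15, max_red=10, polya_end=0):
    """Return (score, REd, position in downstream, mask) for the best TSD.

    upstream   : genomic sequence ending at the 5' extremity of the hit
    downstream : genomic sequence starting after the 3' extremity
    polya_end  : index in `downstream` where the polyA tail ends
    """
    best = None
    as_max = MATCH * mask_len                 # ASmax: perfect alignment score
    for red in range(max_red + 1):            # shift the mask upstream
        end = len(upstream) - red
        start = end - mask_len
        if start < 0:
            break
        mask = upstream[start:end]
        for pos in range(len(downstream) - mask_len + 1):
            a_score = ungapped_score(mask, downstream[pos:pos + mask_len])
            pr3d = abs(pos - polya_end)       # distance polyA tail to 3' TSD
            score = (a_score / as_max) * 100 - 10 * math.sqrt(red) - 2 * pr3d
            if best is None or score > best[0]:
                best = (score, red, pos, mask)
    return best


# Hypothetical flanks: a 15-nt TSD right before the hit and right after a polyA.
tsd = "AAGTTCATTGACCTG"
score, red, pos, mask = best_tsd("GGCC" + tsd, "AAAAAAAAAA" + tsd + "CCGG",
                                 polya_end=10)
print(round(score, 2), red, pos, mask)
```

Because both distance penalties are subtracted from the normalized alignment score, a perfect duplication that sits directly against the extremity and the polyA tail scores highest.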

polyA = ((((A / (ae - as)) * sqrt(ae - as)) / 5) * 100) - (2 * PE3d)

Where A is the number of As found in the polyA, ae is the polyA end position, as is the polyA start position and PE3d is the distance between the 3' extremity end position and the polyA start position.

Finally, the target score is equal to 100 if the sequence extracted is found in the target sequence list, otherwise, the target score is 0.
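
The polyA term, the target term and the weighted sum can be checked numerically with a small Python sketch. The function names are hypothetical, the TSD score is taken as an already computed number, and the target list used here contains only TTAAAA, an assumption based on the TT/AAAA consensus mentioned in the Background; the actual list used by RTAnalyzer is not given in this document.

```python
import math

# Hypothetical target list; the real RTAnalyzer list is not documented here.
TARGET_SITES = {"TTAAAA"}


def polya_score(n_a, a_start, a_end, pe3d):
    """polyA term: fraction of As, scaled by sqrt(length) / 5, minus 2 * PE3d."""
    length = a_end - a_start
    if length <= 0:
        return 0.0
    return ((n_a / length) * math.sqrt(length) / 5) * 100 - 2 * pe3d


def target_score(site, targets=TARGET_SITES):
    """Target term: 100 if the extracted site is in the list, otherwise 0."""
    return 100.0 if site.upper() in targets else 0.0


def retroscore(tsd, polya, target):
    """Weighted sum: 60% TSD, 30% polyA, 10% target."""
    return 0.6 * tsd + 0.3 * polya + 0.1 * target


# Example: a 25-nt tail made only of As, directly after the 3' extremity.
polya = polya_score(n_a=25, a_start=0, a_end=25, pe3d=0)
print(retroscore(tsd=100.0, polya=polya, target=target_score("TTAAAA")))
```

With these formulas a pure polyA tail of 25 nt that starts at the 3' extremity reaches a polyA term of 100, and the TSD term dominates the total through its 0.6 weight.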

1.4 Acknowledgements

We thank J. Brassard and JC Houde for programming. This work was supported by grants from Genome Quebec to JPP and GB and by grants from the Canadian Institutes of Health Research (CIHR; EOP-38322) to JPP. The RNA group is supported by a grant from CIHR (PRG-80169) and Université de Sherbrooke. JP was the recipient of a predoctoral fellowship from FRSQ. JPP holds the Canada Research Chair in Genomics and Catalytic RNA.

1.5 References

1. The Perl directory [http://www.perl.org/].
2. CPAN: Comprehensive Perl archive network [http://cpan.org/].
3. The Apache Software Foundation [http://www.apache.org/]
4. gentoo linux [http://www.gentoo.org/]
5. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
6. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16:276-277.
7. Ostertag, E.M. and Kazazian, H.H., Jr. (2001) Biology of mammalian L1 retrotransposons. Annu. Rev. Genet., 35, 501-538.
8. Kazazian, H.H., Jr. (2004) Mobile elements: drivers of genome evolution. Science, 303, 1626-1632.
9. Goodier,J.L., Ostertag,E.M., Du,K. and Kazazian,H.H.,Jr. (2001) A novel active L1 retrotransposon subfamily in the mouse. Genome Res., 11, 1677-1685.
10. Brouha,B., Schustak,J., Badge,R.M., Lutz-Prigge,S., Farley,A.H., Moran,J.V. and Kazazian,H.H.,Jr. (2003) Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl Acad. Sci. USA, 100, 5280-5285.
11. Feng, Q., Moran, J.V., Kazazian, H.H., Jr. and Boeke, J.D. (1996) Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell, 87, 905-916.
12. Wei,W., Gilbert,N., Ooi,S.L., Lawler,J.F., Ostertag,E.M., Kazazian,H.H., Boeke,J.D. and Moran,J.V. (2001) Human L1 retrotransposition: cis preference versus trans complementation. Mol. Cell. Biol., 21, 1429-1439.
13. Weiner,A.M. (2000) Do all SINEs lead to LINEs? Nature Genet., 24, 332-333.
14. Dewannieux,M., Esnault,C. and Heidmann,T. (2003) LINE-mediated retrotransposition of marked Alu sequences. Nature Genet., 35, 41-48.
15. Esnault,C., Maestre,J. and Heidmann,T. (2000) Human LINE retrotransposons generate processed pseudogenes. Nature Genet., 24, 363-367.
16. Lander ES et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.

2 Supplementary Material

2.1 Example

If you input the following sequence:


(which corresponds to the hY4 small RNA) in the "Blast sequence" section, give your query a name (in the appropriate field) and enter your email address, you can already submit your query. You should receive an email within an hour or so. You can then copy the password provided in the email, follow the link, and enter your "job Id" and the password to access your results. This sequence should normally give about 100 valid hits and 65 invalid ones (i.e. without a good L1 signature).

Many options are available to visualize your results. You can sort them by any of the column headers (HitId, Chr, Status, NCBI, Start, End, evalue, %Identity, Total Score). You can also click on the NCBI link to access the corresponding sequence in GenBank, or on the HitId to view that particular hit in detail (including the colored sequence indicating the elements of the signature). If you want to keep track of your results, compare them with other queries or run any further analysis, you can export them to an Excel file. The file contains both valid and invalid results, but you can still estimate retrotransposition by sorting according to RetroScore.

2.2 Important notes

If you plan on looking at the retrotransposition of a longer RNA (> 300 nt), we suggest that you change a few parameters. Note that RTAnalyzer was designed to analyze signatures for relatively small RNAs (less than 300 nt), but by tweaking some of the advanced parameters you might nevertheless be able to obtain some results. Click on "show advanced parameters". Decrease the "Blast evalue cutoff" to 0.01 and increase the extremity "mask length" to 50 (or at least 5% of your query length).

Also, note that if you are looking for an mRNA, one of the best signature elements is the absence of intronic sequences compared with the original gene (although some pre-mRNAs do not contain introns, the majority of mammalian mRNAs are spliced).

You can ask questions and start topics of discussion concerning RTAnalyzer on the riboclub forum accessible from the riboclub homepage.