Identifying the Origin of Replication in a Pseudomonas Megaplasmid

Abstract

Origins of replication are critical for plasmid replication within bacterial hosts. Pseudomonas syringae pv. lachrymans 107 contains a megaplasmid that is largely uncharacterized but has previously been shown to elicit phenotypic costs. The plasmid is readily transferable among Pseudomonads, with the exception of the clinically relevant P. aeruginosa. Here I begin to investigate whether host range explains the lack of megaplasmid conjugation into P. aeruginosa. I developed a small computational pipeline that calculates GC skew and identifies repetitive motifs. The two methods pointed to the same region as the likely origin of replication.

Introduction

Pseudomonas is a diverse genus of Gram-negative bacteria that includes two medically and agriculturally important pathogens: P. syringae and P. aeruginosa. P. syringae is a plant pathogen that causes economic loss in a wide variety of fruits, vegetables, and ornamental plants across the globe¹,²,³. The most recent outbreak of this phytopathogen is currently threatening kiwi production around the world, from Italy to New Zealand⁴. P. aeruginosa is a significant opportunistic pathogen of humans and a major driver of morbidity and mortality in cystic fibrosis patients, burn victims, and the immunocompromised⁵. P. syringae and P. aeruginosa are model organisms for plant pathogenicity, biofilms, antibiotic resistance, and quorum sensing, and are thus important at both the economic and basic research levels⁵,⁶,⁷,⁸.

Bacterial traits are often influenced by extrachromosomal elements called plasmids: circular, double-stranded fragments of DNA that can undergo horizontal transfer and often carry phenotypically important traits. Our lab has discovered a unique plasmid, pMPPla107 (pMP), of nearly 1 Mb in P. syringae pv. lachrymans 107⁹. The role of pMP under natural conditions remains unknown¹⁰. We have previously shown that acquisition of pMP by previously naïve strains results in phenotypic defects in biofilm formation, antibiotic resistance, and thermotolerance¹¹. I have also discovered an unknown agent in the supernatants of various Pseudomonad cell suspensions that specifically targets and inhibits the growth of Pseudomonads containing pMP.

It would be interesting to understand the effects of pMP on P. aeruginosa, but this first requires introducing the plasmid into that species. Conjugation of the plasmid has been successful across a number of Pseudomonas spp.; however, P. aeruginosa resists conjugation. There are several possible explanations. In this project I address the issue of plasmid host range and of potential genes that are unnecessary for replication and conjugation but necessary for resistance. Using computational methods, namely GC skew calculations and identification of repetitive motifs, I predicted a potential origin of replication. Identifying the origin of replication will allow us to use restriction enzymes to cut the plasmid into smaller plasmids that retain the genes associated with replication. These smaller plasmids can then be tested for conjugation into P. aeruginosa.

Methods

GC Skew

GC skew was calculated using an R script and the seqinR package¹². The seqinR package reads the FASTA file containing the pMP sequence data into an R variable. The variable must be named ‘myqeq’ for the GC skew script to recognize it. The script then calculates GC skew, (G − C)/(G + C), in a sliding window of a chosen size and plots this value across the entire sequence. I used a window of 5000 nucleotides and a step of 5000 nucleotides, so consecutive windows do not overlap.
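The windowed calculation described above can be sketched in a few lines. The analysis itself used an R script with seqinR; the Python version below is only an illustrative equivalent, and the function name gc_skew_windows and its parameters are assumptions rather than names from the original script.

```python
def gc_skew_windows(seq, window=5000, step=5000):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C).

    window_start is 1-based. Windows with no G or C report a skew of 0.0.
    """
    seq = seq.upper()
    results = []
    # Only full-length windows are evaluated; a trailing partial window is skipped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start + 1, skew))
    return results


# Toy example: a G-rich window followed by a C-rich window.
toy = gc_skew_windows("GGGGCCCC", window=4, step=4)
print(toy)  # [(1, 1.0), (5, -1.0)]
```

A sign change between neighboring windows, as in the toy output, is the kind of shift that is looked for when plotting skew along a real sequence.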

Repetitive Motif Finder

The motif finder program is a custom Python script. The script opens a designated input file and an output file. It then loops through each line of the input and appends the line to a sequence string if the line does not start with ‘>’. I also used string slicing to restrict the search to the part of the sequence where the GC skew figure indicated a potential origin of replication.
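As a minimal sketch of this step, assuming a single-record FASTA file, the header filtering and slicing could look like the following. The function names read_fasta_sequence and slice_region are hypothetical and are not taken from the original script.

```python
def read_fasta_sequence(lines):
    """Join every non-header line into one uppercase sequence string.

    `lines` can be an open file handle or any iterable of strings.
    """
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def slice_region(seq, start, end):
    """Return the subsequence from 1-based `start` to 1-based inclusive `end`."""
    return seq[start - 1:end]


# Toy example using a list of lines in place of a file.
record = [">pMP_example\n", "ACGT\n", "acgt\n"]
sequence = read_fasta_sequence(record)
print(sequence)                      # ACGTACGT
print(slice_region(sequence, 3, 6))  # GTAC
```

Converting 1-based coordinates to Python's 0-based slices in one helper keeps the region boundaries consistent with positions read off a plot.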

The script has two functions. The first, motif_count, takes three parameters (dna, kmer size, and min percentage). It begins by counting the total number of kmers and storing the result in total_kmers. If a minimum percentage was selected, a minimum count is also calculated. The function then initializes a dictionary that holds the majority of the data. A for loop moves along the sequence in a window whose size is determined by k. This loop also creates a list in which motif positions are stored, and it stores a regular expression for the current kmer in kmer_find. This variable is needed by the following for loop, which searches the DNA sequence for the motif and uses .span() to find the position of each match. Because .span() returns a tuple, it is converted with the list function so that 1 can be added to the starting and ending positions. This ensures that the first nucleotide is labeled as position 1 and not 0. This second for loop then exits and the dictionary within a dictionary is built. After the dictionaries are built, a final for loop removes any motifs that do not meet the cutoff set by the min percentage parameter. Finally, a small second function creates a list of the motifs, and Counter from collections is then used to return the top ten motifs.
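The counting logic can be illustrated with a condensed sketch. This is not the script from the repository: it replaces the regular expression search with direct slicing, counts overlapping occurrences, and reports 1-based inclusive end positions. The top_n parameter and the return format are assumptions made for the example.

```python
from collections import Counter, defaultdict


def motif_count(dna, k, min_percentage=0.0, top_n=10):
    """Return the top_n k-mers as (kmer, count, positions) tuples.

    positions holds 1-based (start, end) pairs, end inclusive.
    min_percentage is a percentage of the total number of k-mers.
    """
    dna = dna.upper()
    total_kmers = len(dna) - k + 1
    if total_kmers <= 0:
        return []
    min_count = total_kmers * min_percentage / 100.0

    counts = Counter()
    positions = defaultdict(list)
    # Slide one nucleotide at a time so overlapping k-mers are counted.
    for i in range(total_kmers):
        kmer = dna[i:i + k]
        counts[kmer] += 1
        positions[kmer].append((i + 1, i + k))

    # Drop k-mers below the cutoff, then keep the most frequent ones.
    kept = [(kmer, n) for kmer, n in counts.most_common() if n >= min_count]
    return [(kmer, n, positions[kmer]) for kmer, n in kept[:top_n]]


hits = motif_count("ACGACGACG", k=3)
print(hits[0])  # ('ACG', 3, [(1, 3), (4, 6), (7, 9)])
```

Counts from a regular expression search can be lower for self-overlapping motifs, because re.finditer reports only non-overlapping matches.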

Programs:

All scripts can be found in a github repository: https://github.com/basmith89/BrianPLP599_scriptsn_such

Results

GC skew calculated across the pMP sequence showed a dramatic shift at ~590 kbp (Fig. 1). These data suggest that this position contains the origin of replication and should therefore also contain characteristic features such as repetitive motifs for DnaA boxes and replication-associated loci like oriC. There is a second, smaller peak at ~790 kbp. Because it is smaller than the first, I did not pursue this location when investigating repetitive motifs.

Once the range expected to contain the origin of replication has been identified, a motif finder can be used to scan it for repetitive motifs. I examined a 100 kbp range surrounding the 590 kbp position using a kmer size of 9 nucleotides. The motif finder script returned the top ten hits. The first was ‘NNNNNNNNN’ with 460 counts, which reflects runs of ambiguous bases in the sequence rather than a true motif. It was followed by ‘CCAAATTGG’ and ‘CCCAAATTG’, each found 8 times (Fig. 2). These two motifs are offset from each other by a single nucleotide, and their occurrences fall in a small area, within 1000 nucleotides of each other (Output file). Furthermore, these motifs are found in the same 590 kbp region where the GC skew script predicted the origin to be. These data suggest that this region is the origin of replication.

Discussion

In this analysis I used two separate techniques to achieve one goal. Using GC skew calculations across the plasmid sequence together with repetitive motif identification, I inferred an origin of replication within a poorly characterized plasmid sequence. Both of these techniques have previously been shown to infer origins of replication¹³. However, in that approach the repetitive motif of the organism's DnaA boxes must already be known.

Therefore, I have developed a purely computational system in which previously unknown origins can be found using GC skew analysis and verified with repetitive motifs. These data will then allow groups to identify regions where the origin of replication is likely to occur and to follow up using molecular techniques. For example, I intend to use restriction enzymes to cut areas surrounding the origin region to create smaller plasmids that still retain the ability to replicate within a host. Additionally, if open reading frames are known within this nucleotide region, further confirmation can be obtained using tools like BLAST to search for replication-associated loci such as oriC.

Future work includes BLAST analysis within this discovered region and the creation of minimal plasmids that replicate successfully, followed by attempts to conjugate them into P. aeruginosa. The custom script should also be modified to provide an easier user interface, for example by accepting command-line arguments for each parameter rather than requiring the user to edit the script for each run. Packaging the scripts together may also be beneficial: one new master script would run both the R and Python scripts and ask the user for all parameters when necessary. This would allow a broader spectrum of users to apply these methods.
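One possible shape for the command-line interface proposed above is sketched below using Python's standard argparse module. Every argument name and default here (fasta, --output, --kmer-size, --min-percentage, --start, --end) is hypothetical and chosen only for illustration. None of them exist in the current script.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical motif finder command line."""
    parser = argparse.ArgumentParser(
        description="Find repetitive k-mer motifs in a FASTA sequence."
    )
    parser.add_argument("fasta", help="input FASTA file")
    parser.add_argument("-o", "--output", default="motifs.txt",
                        help="file to write motif counts to")
    parser.add_argument("-k", "--kmer-size", type=int, default=9,
                        help="motif length in nucleotides")
    parser.add_argument("--min-percentage", type=float, default=0.0,
                        help="minimum share of all k-mers, in percent")
    parser.add_argument("--start", type=int, default=None,
                        help="1-based start of the region to search")
    parser.add_argument("--end", type=int, default=None,
                        help="1-based inclusive end of the region to search")
    return parser


# Parsing an explicit argument list, as the shell would supply it.
args = build_parser().parse_args(
    ["pMP.fasta", "-k", "9", "--start", "540000", "--end", "640000"]
)
print(args.fasta, args.kmer_size, args.start, args.end)
```

With an interface like this, the 100 kbp window around the predicted origin could be changed per run without editing the script.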

References

(1) Mercado-Blanco J. and Bakker P.A.H.M. June 2007. Antonie van Leeuwenhoek. 92:367-389.

(2) Scortichini M. et al. 2012. Mol. Plant Path. 13(7), 631-640.

(3) Ferrante P. and Scortichini M. 2010. Plant Path. 59, 954-962.

(4) Butler M.I. et al. Feb. 27, 2013. PLoS One.

(5) Sadikot R.T. et al. June 1, 2005. Am J Respir Crit Care Med. 171(11) 1209-1223.

(6) Morales E. et al. May 23, 2012. BMC Health Services Research. 12:122.

(7) Joardar V. et al. Sep. 2005. J. Bacteriol. 18 6488-6498.

(8) Fouts D.E. et al. Feb 19, 2002. PNAS. 99(4): 2275-2280.

(9) Baltrus D.A. et al. July 14, 2011. PLoS Pathogens.

(10) Romanchuk A. et al. May 2014. Plasmid. 73 16-25.

(11) Dougherty K. et al. July 21, 2014. PLoS One.

(12) Charif, D. and Lobry, J.R. 2007

(13) Gao F. and Zhang C. February 2008. BMC Bioinformatics

PLS 599 2014

Smith - Final Paper