OriV-Finder

OriV-Finder Analysis Pipeline

To systematically identify the oriVs in plasmids, a bioinformatics analysis pipeline was developed, as shown in the figure below.

Step 1: The query genome is annotated, and the coding sequences (CDSs) and intergenic sequences (IGSs) are extracted for subsequent processing.

Step 2: Detection of replication initiation protein (RIP) homologs among the CDSs

Step 3: Detection of oriV features in IGSs

OriV-similar sequences: BLASTN is used to identify sequences similar to curated oriV regions collected from the DoriC database and the literature;
ColE1-like regions: The Infernal program is used to search for ColE1-like regions;
Iteron sequences: A K-mer sliding window approach combined with Shannon entropy scoring is applied to detect potential iteron sequences;
AT-rich region: A modified Z-curve-based algorithm is utilized to identify potential AT-rich regions;
Conserved motifs: Common conserved motifs associated with oriVs are identified.
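
The iteron detection step can be illustrated with a minimal sketch. It is not OriV-Finder's actual implementation: the function names, the copy-number threshold, and the entropy cutoff are assumptions chosen for illustration. The sketch slides a window of length k along an IGS, collects k-mers that recur, and uses the Shannon entropy of each k-mer's base composition to discard low-complexity repeats such as poly-A runs.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def repeated_kmers(seq, k=11, min_copies=3, min_entropy=1.5):
    """Return {kmer: [start positions]} for k-mers that occur at least
    min_copies times and whose entropy is at least min_entropy.

    min_copies and min_entropy are illustrative thresholds (assumptions),
    not values taken from OriV-Finder.
    """
    positions = {}
    # Slide a window of length k one base at a time.
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return {
        kmer: starts
        for kmer, starts in positions.items()
        if len(starts) >= min_copies and shannon_entropy(kmer) >= min_entropy
    }
```

Here k=11 matches the default minimum iteron length listed in the advanced options; a fuller scan would repeat the search for each k up to the maximum iteron length.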

Step 4: Priority-based scoring system for potential oriVs

Based on the detection results, OriV-Finder employs a priority-based scoring system to identify potential oriVs. Each IGS is assigned a type by this scoring system, and the IGSs with the highest priority are output as potential oriVs.

Type 1 (Highest Priority)
Blast hit presence: If an oriV-similar sequence is detected within an IGS, the IGS is designated as Type 1 (Evidence: "Blast hit");
ColE1-like region presence: If a ColE1-like region is detected within an IGS, the IGS is designated as Type 1 (Evidence: "ColE1 like");
RIP flanking IGSs: For each identified RIP homolog, the CDS and its four flanking IGSs (two upstream and two downstream) are evaluated, and the region with the highest score among these five segments is designated as Type 1 (Evidence: "RIP" or "nearby RIP").
Type 2 (Secondary Priority)
No RIP homologs: In the absence of RIP homologs, for each partition protein (such as ParA or ParB), the six IGSs flanking the partition protein gene (three upstream and three downstream) are evaluated. The IGS with the highest comprehensive score among these six candidates is designated as Type 2. Note: This result should be treated with caution, and further analysis is needed.
Type 3 (Lowest Priority)
No RIP or partition protein: If neither a RIP homolog nor a partition-related protein is present, the IGS with the highest score is designated as Type 3. Note: This result should be treated with caution, and further analysis is needed.
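
The priority rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's code: the IGS class, its fields, and the function names are hypothetical; IGS i is assumed to lie immediately upstream of CDS i on a circular plasmid; only IGSs (not the RIP CDS itself) are considered as candidates; and the evidence labels for Type 2 and Type 3 are placeholders, since the text above does not name them.

```python
from dataclasses import dataclass


@dataclass
class IGS:
    score: float              # combined feature score (hypothetical field)
    blast_hit: bool = False   # oriV-similar sequence found by BLASTN
    cole1_like: bool = False  # ColE1-like region found


def flanking(cds, n_igs, width):
    """Indices of the `width` IGSs on each side of CDS `cds` (circular)."""
    return [(cds + off) % n_igs for off in range(1 - width, width + 1)]


def best_of(igss, indices):
    """Index of the highest-scoring IGS among `indices`."""
    return max(indices, key=lambda i: igss[i].score)


def assign_types(igss, rip_cds=(), par_cds=()):
    """Return {igs_index: (type, evidence)} following the priority rules."""
    n = len(igss)
    types = {}
    # Type 1: direct sequence evidence inside the IGS.
    for i, igs in enumerate(igss):
        if igs.blast_hit:
            types[i] = (1, "Blast hit")
        elif igs.cole1_like:
            types[i] = (1, "ColE1 like")
    # Type 1: best of the four IGSs flanking each RIP homolog.
    for cds in rip_cds:
        types.setdefault(best_of(igss, flanking(cds, n, 2)), (1, "nearby RIP"))
    if not rip_cds:
        if par_cds:
            # Type 2: best of the six IGSs flanking each partition gene.
            for cds in par_cds:
                best = best_of(igss, flanking(cds, n, 3))
                types.setdefault(best, (2, "nearby partition protein"))
        else:
            # Type 3: globally highest-scoring IGS.
            types.setdefault(best_of(igss, range(n)), (3, "highest score"))
    return types
```

Using setdefault means an IGS that already carries Type 1 evidence is never downgraded by a later, lower-priority rule.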

OriV-Finder Quick Start Guide

1. Upload Sequence

OriV-Finder provides users with two methods to submit their nucleic acid sequences:

File Upload: Users can upload sequence files in supported formats.
Text Input: Users can directly enter the nucleic acid sequence into the text box.

Note: Users can click the "Example" button to automatically load an example sequence into the text box.

Warning: The maximum file upload size is 20MB.

2. Configure Advanced Options (Optional)

Based on the location of the replication initiation protein, OriV-Finder can identify potential replication origins. Users may adjust the following parameters as needed:

Advanced Options Configuration Interface

Maximum mmseq E-value
Default: 1e-5
Recommended Setting: ≤ 1e-5

Minimum mmseq Bitscore
Default: 50
Recommended Setting: ≥ 50

Minimum mmseq Alignment Length (Alnlen)
Default: 100
Recommended Setting: ≥ 100

Maximum Iteron Length
Default: 22
Recommended Setting: = 22

Minimum Iteron Length
Default: 11
Recommended Setting: = 11
Note: Increasing this value can reduce the time required to identify iterons.

Gaussian_filter sigma
Default: 5
Recommended Setting: 5 ± 0.5
Note: An increase in the sigma value improves the smoothness of the Smoothed GC-profile, whereas a decrease diminishes its smoothness.
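
To illustrate how sigma affects smoothing, the following pure-Python sketch applies a Gaussian kernel to a series of per-window GC values. It is an illustration only: the function name, the truncate parameter, and the circular wrap-around are assumptions, and OriV-Finder may rely on a library filter instead.

```python
import math


def gaussian_smooth(values, sigma=5.0, truncate=4.0):
    """Smooth a circular series with a normalized Gaussian kernel.

    sigma sets the kernel width; the kernel is cut off at
    truncate * sigma positions on each side (an assumed convention).
    """
    radius = int(truncate * sigma + 0.5)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]  # weights sum to 1
    n = len(values)
    # Wrap around the ends because plasmids are usually circular.
    return [
        sum(w * values[(i + d) % n] for d, w in zip(offsets, kernel))
        for i in range(n)
    ]
```

With a larger sigma, an isolated peak is spread over more neighboring positions and its height drops, which is the smoothing effect described in the note above.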

3. Submit Task and Retrieve Results

Upon task submission, users receive a unique job ID. Processing typically takes 5-10 minutes; users can either wait for the results or use the job ID to retrieve them later through the 'Historical Task Query' feature.

Important: Results are stored for only seven days. Users can view and download the relevant results on the results page.

4. How to interpret the results

Whole-sequence visualization

The whole-sequence visualization starts at the point of maximum GC disparity. The image includes the GC disparity curve and the K-mer cumulative score graph, and marks the locations of replication initiation proteins and replication origins with red and green rectangles, respectively. Users can zoom in and out of the image by scrolling the mouse wheel, or select and view specific datasets of interest by clicking on the legend above.

Plasmids with certain replication modes exhibit strand bias, with the origin of replication typically located near the lowest point of the GC disparity curve. Additionally, plasmids with certain replication modes contain iterons at their origin of replication, which can appear as peaks in the K-mer cumulative score.
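
As a minimal sketch of the GC disparity curve, the code below computes the cumulative count of G minus C along the sequence, which is the standard Z-curve definition of GC disparity. The function names are illustrative and not part of OriV-Finder.

```python
def gc_disparity(seq):
    """Cumulative (G - C) count at each position of seq."""
    curve, total = [], 0
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        curve.append(total)
    return curve


def disparity_extremes(seq):
    """Return (index of minimum, index of maximum) of the GC disparity."""
    curve = gc_disparity(seq)
    return curve.index(min(curve)), curve.index(max(curve))
```

For plasmids that show strand bias, the minimum index gives a first approximation of where the origin may lie, while the maximum index corresponds to the point where the whole-sequence visualization starts.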

Replication origin visualization

The replication origin visualization module further displays the locations, level, and supporting evidence of replication origins through both GC-profile and smoothed GC-profile, with AT-rich and GC-rich regions marked in pink and green, respectively. This module also includes a sequence schematic that illustrates the distribution of motifs and potential iterons or RNA primers, providing a comprehensive overview of the replication initiation landscape.

When multiple results of the same Type appear, OriV-Finder ranks and outputs the results in descending order based on their scores.

Replication initiation protein information

The detailed replication initiation protein information module includes the MMseq alignment results between the hypothetical replication initiation protein and the seed protein, as well as the conserved domain features of the replication origin protein. In the HTML table, users can click on amino acids to expand the full protein sequence and use the copy button to easily copy the protein sequence for further analysis. In the domain visualization section, hovering the mouse over a specific domain will display detailed information about that domain.

MMseq Alignment Results

Conserved Domain Features

Integrated SeqViz

This module visualizes the plasmid in both circular and linear formats. RIPs are marked in red, oriVs in green, hypothetical proteins in gray, other CDSs in blue, and ncRNAs and regulatory RNAs in yellow. Users can easily view the distribution of replication initiation proteins and replication origins. Additionally, the replication origin section provides detailed annotations of potential iterons, DnaA boxes, AT-rich regions, oriV-similar sequences, and conserved site information related to the replication origins of different plasmid types. Users can select a sequence and press CTRL+C (Windows) to copy it for further analysis.

Note that some single-strand origins (sso) of RCR plasmids have also been incorporated into the BLAST pipeline and are labeled as "sso" in the BLAST hit name.

5. Get more detailed results

Click the Download button in the upper left corner of the results page to download a compressed file of the results. The archive contains the annotated plasmid genome in GBFF format, which can be opened in a local genome browser for further analysis, along with All_IGSs.csv, which holds the detailed scoring for each intergenic sequence, RIP.csv, and other files.

When there is no RIP match in the genome, OriV-Finder outputs the highest-scoring intergenic sequence as the oriV. Although some of these results have structural homologs of the replication initiation protein nearby, they should be treated with caution and require further analysis.

No RIP Example➔

It is recommended that users submit protein sequences near these origins of replication to AlphaFold3 for structure prediction, and then use Foldseek to search protein structure databases. This approach can further confirm structural similarities between these proteins and replication-related proteins. When combined with Whole-sequence visualization results, this process can further validate the likelihood of the region being an oriV.

Supplemental Materials

OriV-Finder identification results of RIP and ColE1 homologs in PLSDB

Download Table(XLSX)

Potential RIPs identified through OriV-Finder analysis

Download Table(XLSX)

Detailed information on collected RIPs (Database version: 2025-02-13)

Download (ZIP)

Comparison of OriV-Finder and PlasmidFinder in their capacity to identify oriV-related sequences

Download Table(XLSX)

Comparison of OriV-Finder and Ori-Finder 2022 in their capacity to identify oriVs

Download Table(XLSX)

OriV-Finder Docker Image User Guide

Before you begin, please ensure that Docker is installed.

Download Docker Image

This Docker image includes the complete OriV-Finder tool and all its dependencies, which can be run on any system with Docker installed without requiring extra environment configuration.

System requirements

An operating system with Docker installed (Linux, macOS, or Windows)
At least 8GB of available memory
At least 20GB of available disk space

1. Load the Docker image

gunzip -c orivfinder-ready.tar.gz | docker load

2. Verify that the image has been loaded correctly

docker images | grep orivfinder-ready

You should see output similar to the following:

orivfinder-ready    latest    74fb4641bdf2    Created: ...    Size: 11.6GB

3. Run the OriV-Finder container and mount a local directory

mkdir -p data
docker run -it --name orivfinder -v "$(pwd)/data":/app/data orivfinder-ready

Now, you can place your input files in the local data directory. These files will be accessible inside the container at /app/data.

To run OriV-Finder within the container:

python oriVfinder.py --fasta /app/data/input.fasta --output_dir /app/data/output

For more options, please run python oriVfinder.py -h to see all parameters.
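
To process several plasmids in one session, a small wrapper can call the command above once per input file. The sketch below is a hypothetical helper, not part of OriV-Finder: the function names, the `*.fasta` filename pattern, and the choice of one output subdirectory per input are assumptions. It uses only the `--fasta` and `--output_dir` options shown above and is meant to be run inside the container from the directory containing oriVfinder.py.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, output_root, script="oriVfinder.py"):
    """Build one oriVfinder.py command per *.fasta file in input_dir."""
    commands = []
    for fasta in sorted(Path(input_dir).glob("*.fasta")):
        # Assumption: each input gets its own output subdirectory.
        out_dir = Path(output_root) / fasta.stem
        commands.append(
            ["python", script, "--fasta", str(fasta), "--output_dir", str(out_dir)]
        )
    return commands


def run_all(input_dir="/app/data", output_root="/app/data/output"):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in build_commands(input_dir, output_root):
        subprocess.run(cmd, check=True)
```

Calling run_all() with its defaults processes every FASTA file placed in the mounted data directory; printing the result of build_commands() first is a safe way to preview what would be executed.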