GC-Profile -- A web-based tool for visualizing and analyzing the variation of GC content in genomic sequences

Version: 2.0
Release Date: Oct. 20, 2004; Revision Date: Nov. 15, 2005
Authors: Feng Gao and Chun-Ting Zhang
Copyright: Bioinformatic Center, Tianjin University, China
Contact: Chun-Ting Zhang, Dept. Physics, Tianjin Univ., Tianjin, China, 300072. ctzhang@tju.edu.cn

References:
Gao F, Zhang C-T (2005) GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences (submitted)
Zhang C-T, Gao F, Zhang R (2005) Segmentation algorithm for DNA sequences. Physical Review E, 72: 041917(1-6).

Preamble
------------
GC-Profile executable files are freely available to both academic and commercial users, provided that the application is properly cited. However, redistribution is not allowed without written permission of the authors. The program for MS Windows platforms has been scanned with the 2002 version of Norton AntiVirus and found to be free of viruses.

Running
------------
GC-Profile is a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences. The algorithm is implemented in C++, and the output graphs are generated with gnuplot (http://www.gnuplot.info/).

To run the program:

    GCProfile genome_file [options]

where genome_file is the sequence in FASTA format.

Options are:

    -t n      Use n as the threshold for segmentation (default: 1000).
    -g n      Use n as the gap size to be filtered. If n > 1, gaps of n bp
              are filtered. If 0 < n < 1, n is a fraction of the input
              sequence length; for example, n = 0.01 means that gaps shorter
              than 1% of the input sequence length are filtered
              (default: 0.01).
    -i n      Use n as the minimum length (default: 3000 bp).
    -s n      Use n as the size of the output graph.
    -l        Label the coordinates of the obtained segmentation points on
              the cumulative GC profile.
    -z        Plot the z' curve instead of the -z' curve.
    -m        Use multiplot mode, in which all plots are placed on the same
              page.
    -d file   Plot the file containing the density of genes, CpG islands,
              etc.
    -c file   Plot the file containing the coordinates of special points.
    -h        Display the options of the program.

Note: To prevent meaningless output, limits have been placed on some input parameters, such as the threshold and the minimum length.

The input files should be prepared in the following formats.

The genome file (in FASTA format):

    >chicken chr28
    GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
    GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTG
    GGGTGGGAATTCTTCGGTTGGGAATTCTTGGGGCACTGGGATCTTTTTGA
    GGTTGGAAATGAAATGGCTGTACTGTAATATTGTTGGGGAGGGAATTTGG
    ... ... ...
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The file given with -d, containing the density of genes, CpG islands, etc.:

    #the density distribution of CpG islands along chicken chr28 which was calculated in 10 kb long, non-overlapping windows
    1    2
    2    2
    3    3
    4    5
    5    2
    6    1
    :
    :
    :
    468  0
    469  0
    470  0
    471  0
    472  0
    473  0

Note: the first column is the window number, and the second column is the count of CpG islands within that window.

The file given with -c, containing the coordinates of special points:

    # the coordinates of horizontally transferred genes in Vibrio vulnificus CMCP6 chr I from HGT-DB (http://www.fut.es/~debb/HGT/)
    6582
    8998
    10627
    11483
    12372
    :
    :
    :
    953656
    954303
    955133
    993043

Note: coordinates are listed in ascending order, one per line. Here, each listed coordinate is the middle position of the corresponding gene.

Also note: in the -d and -c files, empty lines and lines containing only white space (spaces and tabs) are ignored. The hash symbol (#) starts a comment line; lines starting with a hash are ignored.

Speed
--------
The computation time increases with the length of the sequence to be segmented and with the number of times the segmentation procedure is applied.
For most prokaryotic genome sequences, the program finishes within one minute on a 3.20 GHz Dell PWS650 workstation with 2 GB RAM. For human and other higher eukaryotic genome sequences, it usually finishes within 10 minutes with typical parameters.

Example 1: Visualization of isochore organization of eukaryotic genomes
--------
Chicken chromosome 28 is used as an example.

(i) Download the package from http://tubic.tju.edu.cn/GC-Profile/. Place all the files in the same directory; otherwise, specify the path of each file.

(ii) Prepare the input files according to the formats defined above.

(iii) Run the program as follows:

    GCProfile GGA28.fa -t 300 -d GGA28.CpG -m

Intermediate results of the procedure are printed to the screen:

    ***************************************************************************
    GC-Profile
    http://tubic.tju.edu.cn/GC-Profile/
    Bioinformatic Center, Tianjin University, Tianjin, China
    Please cite: Gao F & Zhang C-T, GC-Profile: a web-based tool for
    visualizing and analyzing the variation of GC content in genomic
    sequences. (submitted)
    ***************************************************************************
    Sequence length: 4731479(bp)
    Sequence length (excluding gaps): 4040210(bp)
    1. Segmenting the DNA sequence...
    Halting parameter = 300
    Filtered gap size = 47314 bp
    Minimum length = 3000 bp
    Gaps: 4231480-4731479 Gap size: 500000
    Contig: 1-4231479
    Segmentation strength: 4503.772 Position: 2021042 Segmentation times: 1
    Segmentation strength: 3766.841 Position: 722658 Segmentation times: 2
    Segmentation strength: 419.029 Position: 52132 Segmentation times: 3
    Segmentation strength: 170.950 Position: 35385 Segmentation times: 4
    Segmentation strength: 178.373 Position: 299345 Segmentation times: 4
    Segmentation strength: 502.761 Position: 1595651 Segmentation times: 3
    Segmentation strength: 294.287 Position: 1422803 Segmentation times: 4
    Segmentation strength: 954.208 Position: 1879393 Segmentation times: 4
    Segmentation strength: 354.600 Position: 1757620 Segmentation times: 5
    Segmentation strength: 358.133 Position: 1698610 Segmentation times: 6
    Segmentation strength: 122.257 Position: 1601834 Segmentation times: 7
    Segmentation strength: 34.224 Position: 1728952 Segmentation times: 7
    Segmentation strength: 21.521 Position: 1847150 Segmentation times: 6
    Segmentation strength: 68.945 Position: 2012280 Segmentation times: 5
    Segmentation strength: 11612.922 Position: 2644230 Segmentation times: 2
    Segmentation strength: 22.821 Position: 2580658 Segmentation times: 3
    Segmentation strength: 2100.044 Position: 3447846 Segmentation times: 3
    Segmentation strength: 599.101 Position: 3228500 Segmentation times: 4
    Segmentation strength: 910.317 Position: 3028793 Segmentation times: 5
    Segmentation strength: 396.452 Position: 2687837 Segmentation times: 6
    Segmentation strength: 37.646 Position: 2678710 Segmentation times: 7
    Segmentation strength: 527.293 Position: 2733050 Segmentation times: 7
    Segmentation strength: 40.306 Position: 2717895 Segmentation times: 8
    Segmentation strength: 294.212 Position: 2802625 Segmentation times: 8
    Segmentation strength: 267.126 Position: 3053997 Segmentation times: 6
    Segmentation strength: 824.285 Position: 3363805 Segmentation times: 5
    Segmentation strength: 123.369 Position: 3252205 Segmentation times: 6
    Segmentation strength: 519.445 Position: 3389525 Segmentation times: 6
    Segmentation strength: 18.926 Position: 3364431 Segmentation times: 7
    Segmentation strength: 39.032 Position: 3390215 Segmentation times: 7
    Segmentation strength: 493.638 Position: 4025924 Segmentation times: 4
    Segmentation strength: 1060.182 Position: 3720812 Segmentation times: 5
    Segmentation strength: 141.020 Position: 3492716 Segmentation times: 6
    Segmentation strength: 136.457 Position: 3816746 Segmentation times: 6
    Segmentation strength: 427.887 Position: 4120934 Segmentation times: 5
    Segmentation strength: 25.896 Position: 4109834 Segmentation times: 6
    Segmentation strength: 211.979 Position: 4210353 Segmentation times: 6
    2. Calculating and outputing GC profile coordinates
    10% output!
    20% output!
    30% output!
    40% output!
    50% output!
    60% output!
    70% output!
    80% output!
    90% output!
    100% output!

(iv) The program outputs are as follows:

(1) Coordinates, sizes and G+C contents of the segmented domains as an HTML table (GGA28_GCcontent.html)

    Halting parameter = 300.00
    Filtered gap size = 47314 bp
    Minimum length = 3000 bp

    Start (bp)   Stop (bp)   Length (bp)   GC content (%)
    1            35385       35385         49.62
    35386        52132       16747         58.43
    52133        299345      247213        44.28
    ...          ...         ...
    4025925      4120934     95010         39.79
    4120935      4210353     89419         44.78
    4210354      4231479     21126         52.71
    4231480      4731479     500000        -

(2) Number, coordinates, segmentation strength, segmentation times and segmented contig of the segmentation points as an HTML table (GGA28_SegPoints.html)

    No.   Segmentation points   Segmentation strength   Segmentation times   Segmented contig
    1     35385                 170.95                  4                    1-4231479
    2     52132                 419.03                  3                    1-4231479
    3     299345                178.37                  4                    1-4231479
    ...   ...                   ...
    55    3941472               296.09                  7                    1-4231479
    56    4025924               493.64                  4                    1-4231479
    57    4120934               427.89                  5                    1-4231479
    58    4210353               211.98                  6                    1-4231479

(3) The coordinate files of the cumulative GC profile (GGA28.GCprofile), the G+C content of the segmented domains (GGA28.SegGC) and the segmentation points (GGA28.SegP).

Output graphs can be generated with gnuplot (http://www.gnuplot.info/) by loading the gnuplot command file (GGA28_GCprofile.plt) directly. Users need to install gnuplot, a free graphics program that can be downloaded from http://www.gnuplot.info/. The gnuplot executables, wgnuplot.exe (Windows) or gnuplot (Linux and IRIX), are also included in the package. To see the output graphs, type the following at the gnuplot prompt:

    load 'GGA28_GCprofile.plt'

Precompiled binary distributions are also available for your convenience. If you have problems using these binary distributions to output graphs, you should compile gnuplot yourself. You can also plot these data files with other scientific plotting software, such as Origin, Matlab, S-Plus or SPSS.

Example 2: Identification of genomic islands in prokaryotic genomes
--------
Chromosome I of Vibrio vulnificus CMCP6 is used as an example. The process is similar to Example 1, so it is described only briefly.

(i) Download the package from http://tubic.tju.edu.cn/GC-Profile/. Place all the files in the same directory; otherwise, specify the path of each file.

(ii) Prepare the input files according to the formats defined above.
(iii) Run the program as follows:

    GCProfile Vvu.fa -t 100 -i 1000 -c Vvu.HGT -m -l

(iv) The program outputs are as follows:

(1) Coordinates, sizes and G+C contents of the segmented domains as an HTML table (Vvu_GCcontent.html)

(2) Number, coordinates, segmentation strength, segmentation times and segmented contig of the segmentation points as an HTML table (Vvu_SegPoints.html)

(3) The coordinate files of the cumulative GC profile (Vvu.GCprofile), the G+C content of the segmented domains (Vvu.SegGC), the segmentation points (Vvu.SegP) and the coordinates of horizontally transferred genes (Vvu.TagP).

Output graphs can be generated with gnuplot (http://www.gnuplot.info/) by loading the gnuplot command file (Vvu_GCprofile.plt) directly. Users need to install gnuplot, a free graphics program that can be downloaded from http://www.gnuplot.info/. The gnuplot executables, wgnuplot.exe (Windows) or gnuplot (Linux and IRIX), are also included in the package. To see the output graphs, type the following at the gnuplot prompt:

    load 'Vvu_GCprofile.plt'

If you have problems using these binary distributions to output graphs, you should compile gnuplot yourself. You can also plot these data files with other scientific plotting software, such as Origin, Matlab, S-Plus or SPSS.
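
Checking inputs before a run (illustrative sketch)
--------
The following Python sketch is not part of GC-Profile. It only illustrates two rules stated under Running: how a -g value maps to a gap size in bp, and which lines of a -c coordinate file are read. The function names effective_gap_size and read_coordinates are hypothetical. Rounding down for fractional -g values is an assumption, inferred from Example 1, where a 4731479 bp sequence with the default -g 0.01 reports "Filtered gap size = 47314 bp". The README does not say how n = 1 or n <= 0 is handled, so the sketch rejects those values.

```python
def effective_gap_size(n, sequence_length):
    """Return the gap size in bp implied by a -g value.

    n > 1      : n is taken as the gap size in bp.
    0 < n < 1  : n is a fraction of the input sequence length.
    """
    if n > 1:
        return int(n)
    if 0 < n < 1:
        # Assumption: the product is rounded down (matches Example 1).
        return int(n * sequence_length)
    # Assumption: other values are not covered by the README.
    raise ValueError("-g expects n > 1 or 0 < n < 1")


def read_coordinates(lines):
    """Read a -c style coordinate file: one integer coordinate per line.

    Empty lines, white-space-only lines and lines starting with '#'
    are skipped, as described under Running. Leading white space before
    '#' is tolerated here, which is an assumption.
    """
    coordinates = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        coordinates.append(int(text))
    return coordinates


if __name__ == "__main__":
    # Default -g value applied to the chicken chr28 length from Example 1.
    print(effective_gap_size(0.01, 4731479))

    # First lines of the coordinate file shown under Running.
    sample = [
        "# the coordinates of horizontally transferred genes",
        "",
        "6582",
        "8998",
        "10627",
    ]
    coords = read_coordinates(sample)
    print(coords)
    # The README lists coordinates in order, so a sorted check is a
    # cheap way to catch a malformed file.
    print(coords == sorted(coords))
```

To check a real file, pass an open file object, for example read_coordinates(open("Vvu.HGT")), since iterating over a file yields its lines.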