This post describes the content and creation of systematic correction files. A wave-like pattern is sometimes seen in the probe distribution across chromosomes, and this artifact can reduce the accuracy of calls. A systematic correction based on the GC content of the probe sequence and its genomic neighborhood can be applied to minimize the problem.
The correction file is a tab-delimited text file consisting of a header line followed by one row per probe. The number of columns can vary. An example is shown below:
Location gc4kb gc100kb gc1mb gcprobe
chr1:1301-1351 0.3825 0.3780 0.3767 0.4615
chr1:1350-1399 0.3731 0.3781 0.3769 0.4625
The first column gives the probe location as chromosome:start-end. The gc4kb column contains the fraction of G and C bases in the genomic sequence extending 2 kb upstream and 2 kb downstream of the probe center, 4 kb in total. Similarly, the gc100kb and gc1mb columns contain the GC fraction of 100 kb and 1 Mb windows centered on the probe. The gcprobe column is the GC fraction of the probe sequence itself. For example, a probe with sequence AAGGCCTT will show 0.5 in the gcprobe column, because 4 of its 8 bases are G or C.
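As an illustration of the column definitions above, here is a minimal Python sketch that computes a GC fraction and reads a correction file into per-probe records. The function names (gc_fraction, read_correction_rows) are hypothetical and are not part of the web tool. The sketch also assumes that every base, including any N, counts toward the denominator, which the tool may handle differently.

```python
import csv
import io


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (case-insensitive).

    Assumption: every character counts toward the denominator.
    Returns 0.0 for an empty sequence.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def read_correction_rows(handle):
    """Yield one dict per probe from a tab-delimited correction file.

    The GC columns are read by header name, so the file may contain
    any number of them.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        chrom, span = row["Location"].split(":")
        start, end = (int(x) for x in span.split("-"))
        gc = {name: float(value) for name, value in row.items()
              if name != "Location"}
        yield {"chrom": chrom, "start": start, "end": end, "gc": gc}


# The example rows from this post, written with explicit tab separators.
example = (
    "Location\tgc4kb\tgc100kb\tgc1mb\tgcprobe\n"
    "chr1:1301-1351\t0.3825\t0.3780\t0.3767\t0.4615\n"
    "chr1:1350-1399\t0.3731\t0.3781\t0.3769\t0.4625\n"
)
rows = list(read_correction_rows(io.StringIO(example)))

print(gc_fraction("AAGGCCTT"))   # 0.5
print(rows[0]["gc"]["gcprobe"])  # 0.4615
```

Reading the columns by header name rather than by position keeps the parser working when a file carries a different set of GC windows.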
These files are created using an internal web tool that calculates the GC fraction from the genomic sequence in FASTA files downloaded from the UCSC genome browser. After a BED file containing probe locations is loaded into the web tool, the probe sequence is extracted from the FASTA files and the Gs and Cs are counted to calculate the fraction. This is repeated for extended sequences around the center of the probe.
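The procedure described above can be sketched roughly as follows. This is not the web tool's code, only an illustration under stated assumptions: the chromosome sequence is already loaded as a Python string, probe coordinates are 0-based and half-open as in BED, the probe center is the integer midpoint, windows are clipped at chromosome ends, and the flank sizes are inferred from the column names. All function and variable names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Flank on each side of the probe center, in bases (assumed from the
# column names: 2 kb either side gives a 4 kb window, and so on).
FLANKS = {"gc4kb": 2_000, "gc100kb": 50_000, "gc1mb": 500_000}


def window_gc(chrom_seq: str, start: int, end: int, flank: int) -> float:
    """GC fraction of the window reaching `flank` bases either side of
    the probe center, clipped to the ends of the chromosome."""
    center = (start + end) // 2
    lo = max(0, center - flank)
    hi = min(len(chrom_seq), center + flank)
    return gc_fraction(chrom_seq[lo:hi])


def correction_row(chrom: str, chrom_seq: str, start: int, end: int) -> str:
    """Format one tab-delimited row: Location, window columns, gcprobe."""
    values = [window_gc(chrom_seq, start, end, f) for f in FLANKS.values()]
    values.append(gc_fraction(chrom_seq[start:end]))  # the probe itself
    fields = [f"{chrom}:{start}-{end}"] + [f"{v:.4f}" for v in values]
    return "\t".join(fields)


# Toy 24-base "chromosome": a GGCC probe flanked by AT repeats.
toy = "AT" * 5 + "GGCC" + "AT" * 5
print("\t".join(["Location", *FLANKS, "gcprobe"]))
print(correction_row("chrT", toy, 10, 14))
```

On the toy sequence every window is clipped to the full 24 bases, so the three window columns share the same value while gcprobe is 1.0000. On a real genome, the sequence would typically be read from an indexed FASTA file rather than held entirely in memory.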
The systematic correction web tool can produce correction files for various genomes, including human, mouse, and even wheat, and for builds such as hg18/19/38 and mm8/9/10. More can be added on request.