The so-called “Raw Data” for the data analysis in Nexus Copy Number can be intensity values for array probes (Affymetrix CEL files), logRatios after normalization (Agilent Feature Extraction results files and Illumina final report files), or even copy number state values (Affymetrix OncoScan data). Let’s take a look at how Nexus Copy Number handles the different “Raw Data” and what is common in the overall data processing workflow.
Affymetrix CEL files (intensity values) need to be normalized to generate logRatios before copy number analysis can proceed. A normal reference file generated from the HapMap pooled samples is used for this purpose. Different reference files are generated for the different Affymetrix array types (500K, SNP 6.0, and CytoScan HD). Each CEL file from the test samples is compared against the corresponding reference file to get logRatios for all the probes on the array. A systematic correction step follows to flatten the systematic waviness in the data, which results mainly from the GC content of the probes and the samples, as well as from other factors. Finally, the data is analyzed with the default SNP-FASST2 segmentation algorithm to get the copy number calls.
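To make the normalization step concrete, here is a minimal sketch of how a per-probe logRatio can be derived from a test intensity and a reference intensity. It assumes the common log2(test / reference) definition; the function name `log_ratios` and the handling of non-positive intensities are illustrative choices, not a description of what Nexus Copy Number does internally.

```python
import math


def log_ratios(test_intensities, reference_intensities):
    """Return log2(test / reference) for each probe.

    Assumption: both inputs are aligned per probe. Probes with a
    non-positive intensity cannot be log-transformed and yield None.
    """
    ratios = []
    for test, ref in zip(test_intensities, reference_intensities):
        if test <= 0 or ref <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(test / ref))
    return ratios


# A probe at twice the reference intensity gives +1, at half gives -1.
example = log_ratios([200.0, 100.0, 50.0], [100.0, 100.0, 100.0])
```

With this definition, a probe that matches the reference has a logRatio of 0, so gains and losses show up as positive and negative deviations respectively.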
LogRatio data, such as Agilent Feature Extraction results files or Illumina final report files, are directly processed by FASST2 (for aCGH arrays) or SNP-FASST2 (for SNP arrays) to get copy number calls, usually preceded by the systematic correction step for possible data waviness.
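The exact systematic correction used by Nexus Copy Number is not described here, so the following is only a generic illustration of the idea: if waviness tracks GC content, one simple approach is to group probes by GC content and remove each group's median offset. The function name `gc_correct`, the `bin_width` parameter, and the binning strategy are all assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import median


def gc_correct(log_ratios, gc_percents, bin_width=5):
    """Subtract the per-GC-bin median from each probe's logRatio.

    Assumption: gc_percents holds each probe's GC content as a
    percentage (0-100). The overall median is added back so that the
    global baseline of the sample is preserved.
    """
    bins = defaultdict(list)
    for lr, gc in zip(log_ratios, gc_percents):
        bins[int(gc // bin_width)].append(lr)

    bin_medians = {b: median(values) for b, values in bins.items()}
    overall = median(log_ratios)

    return [
        lr - bin_medians[int(gc // bin_width)] + overall
        for lr, gc in zip(log_ratios, gc_percents)
    ]


# Two low-GC probes biased upward, two high-GC probes biased downward.
corrected = gc_correct([0.1, 0.3, -0.2, 0.0], [30, 32, 60, 62])
```

After correction, the two GC groups in the example share the same median, which is the flattening effect the systematic correction step aims for.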
The Affymetrix OncoScan data files, Copy_Number.txt and Assays.txt, are a special case. Here the “Raw Data” are probe copy number state values, which are linear values rather than logRatios. Nevertheless, Nexus Copy Number processes these data in the same way as logRatios, i.e. systematic correction followed by copy number segmentation with SNP-FASST2.
As described above, the last step of the data analysis workflow is copy number segmentation, which produces the copy number calls for the copy number segments. Ultimately, this step decides whether a group of probes should form a distinct segment or stay with the current segment of neighboring probes. The default copy number segmentation algorithm in Nexus Copy Number is FASST2 (for aCGH arrays) or SNP-FASST2 (for SNP arrays), both of which are based on the popular Hidden Markov Model (HMM) approach. However, unlike a conventional HMM, they do not use integer copy number states (e.g. 0, 1, 2, 3, and 4). Instead, the states are defined by the copy number calling thresholds, which are based on logRatios (linear values for Affymetrix OncoScan data) and can be easily adjusted by the end user. The number of segments is determined by the Significance Threshold setting, which is equivalent to a P-value cut-off and is also adjustable by the user. The probability that a group of probes belongs to a certain copy number state is compared to this Significance Threshold, and a new segment is generated if the Significance Threshold is surpassed. The value of the new segment, which is the median of the logRatios (linear values for Affymetrix OncoScan data) of the probes in the segment, is then compared to the calling thresholds (one copy gain, one copy loss, gain with two or more copies, and homozygous loss) to get the corresponding copy number call.
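The final calling step described above, taking the median of a segment's probes and comparing it to the calling thresholds, can be sketched as follows. This covers only the threshold comparison, not the HMM-based segmentation itself. The threshold values in `THRESHOLDS` are illustrative placeholders rather than the software's defaults, and the function name `call_segment` is hypothetical.

```python
from statistics import median

# Placeholder logRatio thresholds (assumed values, user-adjustable in spirit).
THRESHOLDS = {
    "high_gain": 0.7,         # gain with two or more copies
    "gain": 0.2,              # one copy gain
    "loss": -0.2,             # one copy loss
    "homozygous_loss": -1.0,  # homozygous loss
}


def call_segment(probe_log_ratios, thresholds=THRESHOLDS):
    """Return (segment value, call) for the probes in one segment."""
    value = median(probe_log_ratios)
    if value >= thresholds["high_gain"]:
        call = "gain with two or more copies"
    elif value >= thresholds["gain"]:
        call = "one copy gain"
    elif value <= thresholds["homozygous_loss"]:
        call = "homozygous loss"
    elif value <= thresholds["loss"]:
        call = "one copy loss"
    else:
        call = "no change"
    return value, call


segment_value, segment_call = call_segment([0.25, 0.3, 0.35])
```

Using the median rather than the mean keeps the segment value robust to a few outlier probes, which is consistent with the median-based segment value described above.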
In summary, there are three major steps in the data analysis workflow from “Raw Data” to copy number calls: