This video teaches the basic principles of array CGH including how arrays measure copy number, signal intensities, probe mapping, experiment/reference sample ratios, etc.
So let’s get started with the basic principles of DNA copy number estimation.
The idea is that what you’re trying to measure is the number of copies of DNA. You have probes that are placed across the genome at different positions, and these probes are placed on a microarray. Each of them has a particular position, and they can be at any point as long as you have the probe mapping.
But then the experiment is that you take the test sample, and you color that DNA in green, your reference sample in red (these are just made up colors of course) and you combine them and hybridize onto an array CGH platform.
So if you have the same amount, that is, two copies (diploid) in both your test and reference, you have equal amounts of green and red, which form these yellow spots. So everywhere on the genome ends up being yellow.
Now, say you have more of a particular area. Like here, we have more of the P arm of chromosome one, and the Q arm has a deletion. There’s more green for the probes that map to the gained location, so those probes end up being greener. And the probes where we have less material in our test sample than in the reference end up being red. So we end up with this red, yellow, and green profile.
Now, look at the actual numbers for that, the measurements, not the colors. In the experiment channel you end up with some measure, let’s say 150, and the control will be 100. So you have a ratio of three over two, or, if we are working in log 2 space, the log 2 of this is about +0.57 (more precisely, 0.585). For the next probe, you have 300 over 200, same ratio, 0.57. Then, where you have equal amounts in both channels, the log ratio becomes zero. And if you have less in your test than in your control, you end up with a negative log ratio. Typically that’s plotted along the genome like this, so you end up saying, “Ah, these probes that are above zero, at 0.57, this is the area of gain. This is normal, they’re right at zero. And the losses are where we have negative numbers.” So these are the very simple, basic concepts of array CGH.
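The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `log2_ratio` and the fourth probe (50 over 100) are made up for the example, and the intensities are the ones quoted in the talk.

```python
import math


def log2_ratio(test_intensity, reference_intensity):
    """Return log2(test / reference) for a single probe."""
    return math.log2(test_intensity / reference_intensity)


# (test, reference) intensity pairs. The first three come from the talk;
# the last one is a hypothetical probe with less signal in the test channel.
probes = [(150, 100), (300, 200), (100, 100), (50, 100)]

log_ratios = [log2_ratio(test, ref) for test, ref in probes]

for (test, ref), lr in zip(probes, log_ratios):
    if lr > 0:
        state = "gain"
    elif lr < 0:
        state = "loss"
    else:
        state = "normal"
    print(f"{test}/{ref}: log2 ratio = {lr:+.3f} ({state})")
```

Note that 150 over 100 and 300 over 200 give the same log ratio, because only the ratio between the channels matters, not the absolute intensity.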
Now, there’s a lot of interest in next generation sequencing and in using that type of data for copy number estimation. And sometimes it’s quite simple and analogous.
So with NGS technologies, you have relatively short reads. And if you have corrected for different biases, amplification biases, GC biases, the number of reads that you get at a particular position in the genome should indicate the relative amount of DNA there. So you bin up the genome into multiple bins and count: we have six reads here, six reads here, seven, and so on and so forth. And you plot that. Let’s say you had 6x coverage on average, and in one bin you ended up with 14, so this part might be a gain, again, if you’ve corrected for biases.
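As a rough sketch of the binning idea, the snippet below counts read start positions per fixed-size bin and compares each bin to the typical depth. Everything here is a toy assumption: the read positions are synthetic, the helper names are made up, the median is used as the "typical" depth (the talk just says average), and no bias correction is attempted.

```python
from collections import Counter
from statistics import median


def count_reads_per_bin(read_starts, bin_size, n_bins):
    """Count how many reads start inside each fixed-size genomic bin."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def relative_depth(bin_counts):
    """Divide each bin count by the median count across all bins."""
    typical = median(bin_counts)
    return [c / typical for c in bin_counts]


# Synthetic read start positions: five 1000 bp bins holding 6, 6, 7, 14, 6 reads.
bin_size = 1000
reads_in_bin = [6, 6, 7, 14, 6]
read_starts = [
    i * bin_size + k * 10
    for i, n in enumerate(reads_in_bin)
    for k in range(n)
]

bin_counts = count_reads_per_bin(read_starts, bin_size, n_bins=5)
ratios = relative_depth(bin_counts)

for i, (count, ratio) in enumerate(zip(bin_counts, ratios)):
    print(f"bin {i}: {count} reads, {ratio:.2f}x the typical depth")
```

The bin with 14 reads stands out at more than twice the typical depth, which is the kind of signal that might indicate a gain once biases have been handled.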
One way of correcting for biases would be to do, like in array CGH, a comparative approach. So you compare a test to a reference. Say this is your reference and this is your test sample, and you use the same protocol for both measurements. You can then say, well, in each bin, I’m counting nine here versus six. So you can create a log ratio. An area like here, where the test has almost no reads versus the high counts in the reference, might indicate a loss. So you can transform that data into something that looks like array CGH data. Essentially, you can treat the NGS data as if it came from arrays, BAC arrays, or anything.
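A minimal sketch of that test-versus-reference comparison follows. The function name, the bin counts other than the nine-versus-six pair, and the pseudocount of 0.5 (added so that a bin with no reads does not produce log of zero) are all assumptions for illustration. The sketch also assumes both samples were sequenced to the same overall depth, so no library-size scaling is applied.

```python
import math


def bin_log2_ratios(test_counts, ref_counts, pseudocount=0.5):
    """Return log2(test / reference) per bin, with a small pseudocount."""
    if len(test_counts) != len(ref_counts):
        raise ValueError("test and reference must have the same number of bins")
    return [
        math.log2((t + pseudocount) / (r + pseudocount))
        for t, r in zip(test_counts, ref_counts)
    ]


# Toy per-bin read counts: nine versus six, equal counts, and a bin where
# the test sample has no reads at all.
test_counts = [9, 6, 0, 6]
ref_counts = [6, 6, 7, 6]

bin_ratios = bin_log2_ratios(test_counts, ref_counts)

for t, r, lr in zip(test_counts, ref_counts, bin_ratios):
    print(f"test={t} ref={r} log2 ratio={lr:+.2f}")
```

The resulting list has the same shape as a vector of array CGH log ratios, so downstream steps can treat it the same way.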
Everything else I’m going to talk about could then be applied to the NGS-derived data. So here’s a more realistic example, where the probes no longer land perfectly on 0.57. There’s some noise involved here. If you look at this data, one way of deciding where the gains and losses are is to say, well, these probes are around 0.57, that must be a gain. This guy’s low down here, that must be a loss. Gain. Normal. And you can kind of go through the genome like that.
On the other hand, you could say, well, I’m not so sure about that. That single probe might be noise in the system, so I might have a complete gain here. Similarly, this point might actually be noise and part of the normal region. And this guy there might be noise, so the whole thing could be a loss. How would you know whether that’s noise or not? Frankly, you wouldn’t necessarily know. But you can hope that, with some statistics based on the distribution of the probes, you can have a certain confidence about whether this really is noise and whether it should be incorporated. So we’ll get to that in the next session.
So before looking at statistical approaches, a simple, very crude approach, which was initially proposed, would be to just count the number of probes that are above the zero line. So say you take a window of five probes and slide it along and say, “Okay, well, I have five probes in a row above the zero line, I call that a gain.” So here I get five and I get a gain. And then you move the window along, and that’s pretty much it.
It’s not the best method because it can make bad calls. In a case like this one, you have five probes that are only slightly above zero, and that becomes a gain. And in this other case, because you have an outlier, you’ll never get five probes in a row, and potentially real gains will not be called. So statistical methods have been developed that try to give you a better solution for making calls, and that’s going to be part of the next session.
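The window rule and its two failure modes can be sketched as below. This is a guess at the simplest reading of the rule (every probe in a five-probe window must be above zero), not a published implementation; the function name and all the log ratio values are made up for illustration.

```python
def call_gains_by_window(log_ratios, window=5, threshold=0.0):
    """Flag probes covered by any window whose probes all exceed the threshold."""
    is_gain = [False] * len(log_ratios)
    for start in range(len(log_ratios) - window + 1):
        if all(v > threshold for v in log_ratios[start:start + window]):
            for i in range(start, start + window):
                is_gain[i] = True
    return is_gain


# A clean gain: five probes near 0.57, flanked by probes at zero.
clean = [0.0, 0.60, 0.55, 0.58, 0.61, 0.57, 0.0]

# Failure 1: five probes barely above zero are still called a gain.
barely_above = [0.02, 0.01, 0.03, 0.02, 0.01]

# Failure 2: one outlier breaks the run, so a likely real gain is missed.
with_outlier = [0.60, 0.55, -0.10, 0.58, 0.61, 0.57]

clean_calls = call_gains_by_window(clean)
barely_calls = call_gains_by_window(barely_above)
outlier_calls = call_gains_by_window(with_outlier)

print(clean_calls)
print(barely_calls)
print(outlier_calls)
```

Because the rule looks only at the sign of each probe, it ignores both how far above zero the probes are and how noisy they are, which is what the statistical methods aim to address.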
Check out this webinar recording relating to Copy Number Estimation from Exome and Genome Sequencing Data