Master’s Thesis

Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis

Author: Daphne M. van Beek
Email: [email protected]
Student number: 1314114
Thesis supervisors: Prof.dr.ir. M.J.T. Reinders, Dr. M.M. Weiss, Dr. E.A. Sistermans
Date: October 23, 2012

Thesis Committee: Prof.dr.ir. M.J.T. Reinders, Dr.ir. J. de Ridder, Dr. C.P. Botha, MSc, Dr. E.A. Sistermans, Dr. M.M. Weiss, Ir. J. Nijkamp

Information and Communication Theory Group
Preface

This report was made as part of the Master’s Thesis project of the master Computer Science, track Bioinformatics, at the Delft University of Technology. The main focus of this document lies on the paper written as the result of my research on the detection of copy number variations using next generation sequencing data. In the future it is likely that next generation sequencing techniques will replace the array-comparative genomic hybridization technique that is currently used in clinics. For next generation sequencing data to replace this technique, the minimal coverage required for competitive detection should be known; this was the focus of my work. Besides the main paper, a supplement is provided to give additional information about the research that was done. A work document is also included, in which the progress and observations made during the project are registered. The Master’s thesis project was done at the Bioinformatics Lab at Delft University of Technology in collaboration with the department of Clinical Genetics of the VU Medical Center in Amsterdam.
Acknowledgements Thanks to Janneke Weiss, Marcel Reinders and Erik Sistermans for their advice, supervision and all the interesting discussions. Thanks to Desiree Steenbeek, Daoud Sie, Quinten Waisfisz and Bauke Ylstra for the tips and their help on both aCGH and NGS issues I encountered during the project. Thanks to Roy Straver for the collaboration in the starting phase of our thesis projects. It was great that we could merge our ideas and even ’invented’ our own, new alignment method (now we only have to make sure the code works)!
Master’s Thesis, October 23, 2012, Pages 1–21
Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis

D.M. van Beek 1,*, M.M. Weiss 2, D. Sie 3, B. Ylstra 3, M.J.T. Reinders 1 and E.A. Sistermans 2

1 Delft Bioinformatics Laboratory, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
2 Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands
3 Department of Pathology, VU University Medical Center, De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands
Defence date: October 23, 2012 Supervisors: M.J.T. Reinders, M.M. Weiss and E.A. Sistermans
ABSTRACT

Motivation: Copy Number Variations (CNVs) are known to be involved in various disease phenotypes. In the clinic, CNVs are currently detected by performing aCGH analysis. Next Generation Sequencing (NGS) techniques are a promising replacement for this analysis due to their higher resolution. To make NGS-based CNV detection applicable in the clinic, the method should be competitive with aCGH. In terms of costs, this could be achieved by shallow whole genome sequencing (<0.5x coverage). It is not yet known what minimal coverage is required for competitive, reliable analysis. Here we present a thorough comparison of the CNVs detected by aCGH and by NGS at different coverages of NGS data.

Results: Analysis of NGS data with a coverage of 0.4x after alignment and pre-processing produces a list of CNVs that largely overlaps with the regions detected by aCGH and shows few additional detected CNVs. A lower coverage generates many more CNVs, which are probably erroneously detected. Dilution experiments suggest that an even lower coverage (0.23x) can probably be used. In-depth analysis also revealed that it is essential to consider the read-depth plots of both test and control sample, besides their log2 ratio, as these provide valuable information about the origin of a CNV. To reduce the number of false positives considerably, we introduce a rule-filter that removes on average 80% of the unreliable CNVs. The best choice of parameters for the CNV detection tools varies considerably and depends largely on the coverage of both test and control sample. Therefore, we conclude that for application in the clinic a quality control system should be developed.

Contact:
[email protected]
1 INTRODUCTION
*To whom correspondence should be addressed.
© All rights reserved.

Copy number changes may indicate (potential) pathogenic insertions and deletions that can be explanatory towards a specific phenotype (for example intellectual disability or forms of hereditary cancer [Vissers et al., 2009], [Campbell et al., 2008]). CNVs
in specific regions of the genome may influence the regulation and expression of genes, which makes detection of these aberrations important in localizing causal genes of a patient’s phenotype.

There are numerous methods available to detect copy number variations (CNVs) in the genome. The current standard for detecting CNVs in the clinic is array-comparative genomic hybridization (aCGH) analysis [Pinkel et al., 1998], [Holstege, 2010]. Next Generation Sequencing (NGS) is becoming less expensive and more broadly applicable, and can achieve a higher resolution: NGS makes it possible to detect smaller CNVs and is not limited to predefined probe positions. Therefore, the use of NGS data for the detection of these disease-related CNVs is an attractive option to consider. The NGS methods developed so far have mainly focused on the detection of aberrations in cancer samples and on sequencing data with high coverage (>10x). To improve the detection of disease-related CNVs, other methods need to be evaluated. The CNVs that need to be detected are generally smaller (the aCGH resolution is around 50 kb, but smaller CNVs may be involved) and can be situated anywhere on the genome.

Early methods for the detection of CNVs using NGS read data were read-depth based and mainly focused on the detection of copy number changes and chromosomal rearrangements in cancer [Campbell et al., 2008]. Numerous aberrations (both insertions and deletions) can be observed over a chromosome of a genome affected by cancer. Read-depth methods assume that reads are mapped to the genome according to a distribution (for example, a Poisson distribution [Yoon et al., 2009]). As a duplicated region exists twice, it will have twice as many reads sequenced; alignment to a reference genome will thus result in a higher number of reads aligned in this region (and the other way around for a deletion) [Bailey et al., 2002].
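The read-depth principle described above can be illustrated with a minimal sketch. The window size, read length and the 3-sigma cutoff below are illustrative choices of ours, not values taken from the cited literature:

```python
import math

def expected_reads_per_window(coverage, read_length, window_bp):
    # Mean number of reads starting in a window under uniform sampling:
    # coverage = reads_per_base * read_length, so reads per base is
    # coverage / read_length, and per window that times window_bp.
    return coverage / read_length * window_bp

def depth_call(observed, lam):
    # Crude read-depth call: a normal region is expected to show ~lam reads,
    # a duplication roughly twice that, a deletion roughly half or less.
    # Under a Poisson model the variance equals the mean, so we use a
    # Poisson-style z-score as the decision signal.
    z = (observed - lam) / math.sqrt(lam)
    if z > 3:
        return "gain"
    if z < -3:
        return "loss"
    return "neutral"

lam = expected_reads_per_window(coverage=1.0, read_length=50, window_bp=10000)
print(lam)                       # ~200 reads expected per 10 kb window at 1x
print(depth_call(400, lam))      # duplicated region: twice the expected depth
print(depth_call(95, lam))       # deleted region: about half the expected depth
print(depth_call(210, lam))      # within normal fluctuation
```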
Later read-depth methods also focused on finding CNVs in other types of tissue, as it became known that CNVs play a large role in the development of diseases. Paired-end methods using NGS were mentioned for the first time in 2007 [Korbel et al., 2007] and were also mainly used for CNV detection in cancer samples. Currently the paired-end approach is the most-used NGS method in the detection of CNVs,
as the detection itself is less dependent on the natural variation of coverage that occurs when using whole genome sequencing. Paired-end sequencing generates a read at each end of a longer DNA strand (200 - 500 bases). As the length of the total strand is known, the distance between the mapping positions of the two paired reads should match this known length. If this distance is longer on the reference genome, a deletion is located in the sample; if shorter, an insertion has taken place. Other aberrations can also be detected; if one of the paired reads is mapped to the reference genome in the opposite direction, its corresponding region is inverted.

To compete with aCGH, the costs of the NGS method should be similar or lower. Paired-end sequencing takes twice as long and is about twice as expensive as single-end sequencing. This is the main reason why we chose to work with single-end whole genome shallow sequencing. Shallow sequencing means that not the whole genome is covered, but only a small fraction, reducing sequencing costs. Currently, the cost of shallowly sequencing 20% of the genome (indicated by 0.2x) is comparable to that of one aCGH analysis. With 0.2x coverage, on average one read per 255 base pairs is aligned. When looking at CNV regions that are larger than this minimum, using less than 1x coverage is still reliable. To decide whether aCGH can be replaced with NGS in the clinic, it is necessary to know the minimum coverage needed to detect the same CNVs that are currently detected using the aCGH Agilent 180k oligo microarray. The probes on this microarray are 60 base pairs in length and on average one probe per 17851 base pairs is encountered, covering about 3.5% of the genome. As it is possible to achieve a higher resolution using NGS, and NGS is not limited by predefined probe positions, we are also interested in whether we can detect CNVs smaller than the smallest CNVs found by aCGH (which can detect CNVs of >50 kb in length).
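The spacing arithmetic above follows from a one-line relation. The 51 bp read length is inferred from the quoted one-read-per-255-bases figure and is our assumption:

```python
def mean_read_spacing(coverage, read_length):
    # Average distance (in bases) between consecutive aligned read starts:
    # coverage = reads_per_base * read_length, so spacing = read_length / coverage.
    return read_length / coverage

# With 51 bp single-end reads, 0.2x coverage gives one read start
# per ~255 bases on average; 0.4x halves that spacing.
print(mean_read_spacing(0.2, 51))
print(mean_read_spacing(0.4, 51))
```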
To achieve this, a suitable CNV detection tool should be selected that meets our requirements.
2 RESULTS

2.1 Approach
A schematic overview of the CNV detection pipelines is displayed in figure 1 and is discussed in this section. In short, the NGS pipeline (figure 1A) consists of the alignment of single-end reads to a reference genome and a pre-processing step that reduces read-depth biases resulting from repetitive regions (reducing the effect of read-towers by application of the REad-Tower RemOval filter; RETRO) and PCR duplicates. After reducing these biases, the mapped data is analyzed by a CNV detection tool. The output is further processed by a rule-filter to reduce the number of false positives in the data.

Array CGH (figure 1B) is a microarray technology that makes use of a control and a test sample, each labeled with a different fluorescent dye. After hybridization of the sample DNA to the probes on the microarray, differences in color brightness indicate which sample has more DNA attached, and copy number changes can be determined from this information. Aberrated regions are detected by a tool that segments and merges similar log2 ratios; we applied Nexus copy number [Darvishi, 2010].

To replace the current aCGH practice by NGS methods, it is important to establish the minimal coverage at which reliable results are generated. To determine this minimal coverage one can
dilute a high coverage data set (>100x) and find the dilution at which regions found in the high coverage data set are no longer detected. Alternatively, we can compare regions detected using low coverage data to the aCGH results. Non-tumor human samples of over 100x coverage sequenced on an Illumina HiSeq2000 were not available online (we restricted ourselves to the Illumina HiSeq2000, as sequencing platforms influence sequencing results through different biases and error rates [Lam et al., 2012]) and we chose not to use data from another platform, as we do not know to what extent this would influence our CNV detection. Sequencing a sample at 100x coverage was too expensive, which left us with the alternative low coverage approach of defining the aCGH output as ’standard of truth’.

The available NGS sample data is processed by four different CNV detection tools. These tools all use different algorithms and can be divided into two groups: read-depth based and control-based. The read-depth based tools look at the positioning of the reads after mapping and can distinguish between insertions and deletions in the sample by a deviating number of reads aligned in a specific region. The control-based tools also apply this read-depth method, but in addition look at a control sample that is sequenced in the same manner as the test sample. As it is assumed that some of the biases in the test sample will also occur in the control sample, the control sample is used to reduce the effects of these biases, mainly by comparing the ratio between the test and control sample reads. To choose an optimal NGS tool for our purpose, scoring methods were developed to compare the output of the aCGH and NGS methods. Four different NGS CNV detection tools were evaluated; two are based on the read-depth and two on the control-based approach. For the control-based tools we experimented with several control samples at different coverages as input.
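The dilution idea mentioned above amounts to binomial thinning of the aligned reads: each read is kept independently with probability target/current coverage. A minimal sketch with hypothetical data:

```python
import random

def dilute(reads, current_coverage, target_coverage, seed=42):
    # Simulate a shallower sequencing run by keeping each aligned read
    # independently with probability target/current (binomial thinning).
    keep_p = target_coverage / current_coverage
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < keep_p]

reads = list(range(100000))            # stand-ins for aligned reads
subset = dilute(reads, current_coverage=3.6, target_coverage=0.4)
print(len(subset) / len(reads))        # close to 0.4 / 3.6, i.e. about 0.111
```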
To reliably compare aCGH output to NGS output, the control sample used in the aCGH experiments was also used here. This control sample was sequenced in three different runs, so it can also be used to see whether there are noticeable differences in the results when the control sample coverage is changed (see figure 8 for details). An external control sample was used as input as well: a sample pool of DNA obtained from pregnant women (in which about 5% of the reads are of fetal origin [Lo et al., 1997]), to determine whether it is possible to use an external (non-commercial) control sample.

After extensively comparing NGS and aCGH outcomes, we came to the conclusion that the aCGH output could not be considered a ’standard of truth’. The aCGH method did not find all CNVs, and some of the detected aCGH regions turned out to be wrongly detected after close inspection. Therefore, we manually annotated both the aCGH and NGS outcomes and calculated statistics to determine distinguishing features of the detected regions that are considered true CNVs. Using these features a rule-filter was created, which aims to reduce the number of false positives generated by the CNV detection tool. In the case of large aberrations, the NGS tool may return detected regions that overlap. These regions are merged after application of the rule-filter to make sure the number of detected regions reported by the tool is correct.

There are multiple factors that can influence how a sample is sequenced, for example the temperature and the chemical composition of the sample to be sequenced. Small differences in these factors can introduce a bias in the read-depth, but as the exact extent of these influences is not known, we needed to determine the importance of sequencing the control sample on the same lane as
the sample. This is done by comparing results between multiplexed and stand-alone test and control samples.

[Figure 1: Next Generation Sequencing pipeline (A): read data → alignment (BWA) → pre-processing → CNV detection → rule-filtering → validation against manually reannotated calls and aCGH calls; reference data R1, R2, R3 and R1,2,3. Array CGH pipeline (B): labeling of sample and reference → hybridization on microarray → scanning microarray → analysis by Feature Extraction → calls generated by Nexus.]

Fig. 1. Next generation sequencing and aCGH pipelines for the detection of CNVs. The steps described are explained in more detail in the methods and results sections. In figure A, orange indicates a list of choices that is evaluated. For example, four CNV detection tools are tested and an optimal tool and optimal settings are chosen based on ROC-analysis; the alignment of the reads is optimized to allow for SNPs. A manual annotation of all the CNVs generated by aCGH and the NGS pipeline is performed and the results are used to develop a rule-filter to reduce false positives. In figure B, the general overview of the aCGH technology is displayed.

2.2 Comparing aCGH and NGS results

We would like the CNV regions detected based on NGS data to resemble the regions detected by aCGH as closely as possible. Therefore we define the following situations. True positives (TP) are defined as the number of CNV regions found by aCGH that are also found by the NGS CNV detection tools (an overlap of the detected regions is observed). If multiple NGS regions overlap one aCGH region, it is counted as one TP. False positives (FP) are regions detected by the NGS tool, but not by aCGH. False negatives (FN) are aCGH regions not found by the NGS tool. The first and second selection criteria are slightly correlated: the more regions the NGS tool detects (TP and FP combined), the higher the chance that the detected regions overlap with the aCGH data by chance (TP).

2.3 DWAC-seq is the best CNV detection tool for analyzing the NGS data

Two types of tools are evaluated in this section: read-depth based (CNVnator [Abyzov et al., 2011], RDXplorer [Yoon et al., 2009]) and control-based (DWAC-seq [Koval and Guryev, 2011], CNV-seq [Xie and Tammi, 2009]). We compare these four CNV detection tools for the NGS data using the following criteria:

1. Calls as reported by the aCGH analysis should be found using the NGS data.
2. The numbers of false positives and false negatives should be low, as a large number indicates that the tool’s performance is not good enough.
3. Run time should be no more than a day.
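The TP/FP/FN bookkeeping defined in section 2.2 can be sketched as a small interval-overlap routine. This is a minimal illustration with hypothetical regions, not the scoring code used in the thesis:

```python
def overlaps(a, b):
    # Two (chrom, start, end) regions overlap if they share at least one base.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def score(ngs_calls, acgh_calls):
    # TP: aCGH regions hit by at least one NGS call (multiple overlapping
    # NGS calls on one aCGH region count once). FP: NGS calls hitting no
    # aCGH region. FN: aCGH regions hit by no NGS call.
    tp = sum(1 for a in acgh_calls if any(overlaps(a, n) for n in ngs_calls))
    fp = sum(1 for n in ngs_calls if not any(overlaps(n, a) for a in acgh_calls))
    fn = len(acgh_calls) - tp
    return tp, fp, fn

acgh = [("chr1", 100, 200), ("chr2", 50, 80)]
ngs = [("chr1", 150, 160), ("chr1", 190, 250), ("chr3", 10, 20)]
print(score(ngs, acgh))  # (1, 1, 1): both chr1 NGS calls count as one TP
```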
These four CNV detection tools are applied to four samples (see Methods for details) and table 1 shows an overview of the outcomes for each sample. The first striking observation is that the total number of calls¹ differs considerably between the tools (from about 50 to as many as 10,000). A high number of false positives is not trustworthy and likely does not indicate true CNV regions². For that reason DWAC-seq and RDXplorer are the best candidates, as they generate the lowest numbers of calls. Between the two, DWAC-seq has the highest number of true positives (48% over 25%) and is therefore selected as the best CNV detection tool. The second observation is that samples 1 and 2 behave differently from samples 3 and 4: the total number of calls for samples 3 and 4 is a lot higher (twice as high or more). This is due to the difference in coverage; samples 1 and 2 have a coverage of 0.4 and 0.5 fold respectively, while samples 3 and 4 have a much higher coverage of 3.6 and 1.4 fold respectively. The detected CNV regions within the aCGH data have been subjected to clinical analysis. Some of these regions have been indicated as relevant for diagnosis; they are classified into the following types: Type II, probably/possibly benign; Type III, probably/possibly pathogenic; and Type IV, pathogenic. These regions should thus not be missed when using the NGS data for CNV detection. The calls overlapping with these regions are given between parentheses in the columns of table 1. As can be seen, CNVnator performs best: it finds all of these regions. DWAC-seq performs second best, missing only two regions. As the number of false positives is very high for CNVnator, we still choose DWAC-seq (see supplement section 6.9 for more details about the two missed clinical calls).
2.4 Finding the best parameter settings for DWAC-seq
DWAC-seq is optimized by considering two parameters of the tool that can be varied: the threshold, T, that the method uses to decide whether the signal is strong enough to make a CNV call, and the size of the window, W, which is defined in number of reads (instead of number of bases) and drives the minimal size of the regions that can be detected (see Methods for details). We varied the threshold between 0.1 and 0.5 and the window size between 100 and 5000 reads, and for each setting we compared the results with the detected regions according to the aCGH data. The resulting ROC-curves and the accompanying analysis can be found in the Supplement (6.6). These ROC-curves show that the method is not very sensitive to the setting of parameter T, whereas varying the window size showed large variability over all samples. Based on these observations, we set T equal to 0.25 and chose the smallest window size for which the number of true positives is nearly maximal while the number of false positives is still relatively low (W = 100 reads per window). The window size is dependent on the coverage of the sample, but we also observed that the window size depends heavily on the coverage of the control (table 5). This is because DWAC-seq uses the control sample to determine the location of the windows (see supplement section 6.5.3). As can be seen in table 5, when the coverage of the control sample increases from 1.3 to 4.5 (3.5x higher coverage), the number of calls rises considerably (from 69 to 211 for sample 3) when the window size is kept constant. Note that the window size is defined in number of reads, which means that when the coverage increases, a window size fixed in number of reads decreases in terms of bases. This implies that smaller regions can be detected, which in turn increases the chance of false positives. Hence, the window size needs to be set dependent on the coverage of the control sample (i.e. a larger window size when the coverage is higher).

Previous studies revealed that on average 12 copy number variants are present per individual [Feuk et al., 2006]. We have used this to reason that when a method finds much higher numbers, these are likely to be false. In the aCGH results, the average number of CNVs found is ~28, which complies with the findings of Feuk et al. As the resolution of NGS is higher than that of aCGH, we expect to find more aberrated regions. Additionally, DWAC-seq tends to separate large aberrated regions into smaller ones, which also generates more detected regions. With the chosen settings of DWAC-seq, we find around 50 aberrated regions, which we think is in good agreement with what we would expect reasoning from the findings of Feuk et al.

¹ To indicate a detected CNV region, the term ’call’ is used.
² As the real CNV regions are not known, care should be taken with drawing the conclusion that false positive calls are not real CNVs. However, as the aCGH analysis and also two of the four CNV detection tools indicate a low number of CNV regions, it is likely that those regions are not true CNV regions.

Table 1. Number of true positives (TP), total number of calls (TP + FP) and the run time for the four selected NGS CNV detection tools. The control sample used by DWAC-seq and CNV-seq is AR1 (supplement S2). The test samples are aligned to the reference genome using the strict setting (see section 5.2). The first columns indicate the total number of calls made when using the aCGH data (CGH calls) and the number of regions that were found to be relevant for diagnosis after clinical analysis (Clinical calls). The results of CNVnator are calculated for a window size of 1 and 10 kb respectively. Between the parentheses are the numbers of calls that overlap with the clinically relevant calls as found using the aCGH data. For CNVnator and RDXplorer only calls larger than 1 kb are kept. CNV-seq calculates an optimal window size for each sample. The (t) column states the total number of call regions that are found and the (c) column the number of calls after combining regions. The run time is an approximation of running the algorithm on data with a coverage of approximately 1x.

True positives (TP):
Sample | CGH calls | Clinical calls | CNVnator 1 kb | CNVnator 10 kb | RDXplorer 100 bp | DWAC-seq 100 bp | CNV-seq
1      | 27        | 1              | 21 (1)        | 17 (1)         | 1 (0)            | 17 (1)          | 15 (1)
2      | 27        | 1              | 27 (1)        | 27 (1)         | 3 (0)            | 13 (1)          | 17 (0)
3      | 35        | 4              | 34 (4)        | 30 (4)         | 8 (1)            | 25 (2)          | 26 (2)
4      | 24        | 5              | 24 (5)        | 20 (4)         | 9 (0)            | 18 (5)          | 19 (4)

Total calls (TP + FP):
Sample | CNVnator 1 kb | CNVnator 10 kb | RDXplorer | DWAC-seq | CNV-seq (t) | CNV-seq (c)
1      | 771           | 449            | 68        | 47       | 648         | 115
2      | 872           | 61             | 69        | 42       | 920         | 149
3      | 4876          | 586            | 150       | 66       | 6947        | 469
4      | 3445          | 571            | 119       | 44       | 10026       | 718

Run time (1x): CNVnator ± 4 hours; RDXplorer ± 10 hours; DWAC-seq ± 2 hours; CNV-seq ± 30 min.

2.5 The composition of the control sample has a large influence on the found CNVs

To reliably compare the aCGH and NGS results, the control sample that is used in the aCGH analysis is also used as a control sample
when using DWAC-seq. Hence, the higher the coverage of the control sample, the more accurate the results produced by DWAC-seq. Control sample A (see table ??) that is used for both pipelines is a commercial product of Kreatech Diagnostics [Kre, 2010]. The control sample is genomic DNA isolated from whole blood samples of 100 anonymous female donors. To reliably perform CNV detection, the control sample should be diverse enough to exclude CNVs that are particular to the heritage of just one person. As DWAC-seq uses the control sample to detect regions with an aberrated copy number, variations can be found when the test sample varies, but also when the control sample varies. An example is depicted in figure 2. An explanation for the decrease in the read-depth of the control sample can be that a deletion (with respect to the reference genome) is located at this position in the control samples. Note that this deletion should then be consistent over the 100 persons from which the control sample is created, indicating a population bias of the control sample. Hence, when the test sample is not of the same population we will make false calls, as is the case in this example. Ideally, the control for these experiments should consist of DNA of a large number of individuals from all sorts of backgrounds. Creating such a control sample is, however, very expensive. Therefore, we conclude that it is essential to visualize the read-depths of both the test sample and the control sample for every call made by the CNV detection tools.
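The observation in section 2.4 that a window fixed in number of reads shrinks in bases as the control coverage rises can be made concrete with a small calculation. The 51 bp read length is our assumption, not a value stated in the thesis:

```python
def window_span_bp(window_reads, control_coverage, read_length=51):
    # DWAC-seq defines windows in control-sample reads, so the physical
    # span of one window depends on the control coverage:
    # reads per base = coverage / read_length.
    return window_reads * read_length / control_coverage

# Same W = 100 reads at the two control coverages mentioned in section 2.4:
print(window_span_bp(100, 1.3))  # ~3900 bases per window
print(window_span_bp(100, 4.5))  # ~1100 bases per window: smaller regions, more FPs
```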
2.6 The strictness of the alignment of the reads does not have a large influence on the found CNVs

Most experiments are performed using both the strict (no mismatches allowed) and lenient (one mismatch allowed) alignment methods (see Methods for details). Both alignment methods are compared to see which one obtains the better result. The run times of the two alignments do not differ much. Table 2 gives an overview of the CNV detection results for both alignment methods. Ideally, for one of the alignment methods to be superior to the other, the number of unique false negatives should be low and the number of unique true positives high. ’Unique’ in this context means that the call is not found by the other alignment method. One low and one high coverage sample (samples 2 and 4) display a better performance for the lenient alignment: the number of unique false negatives is lower than for the strict approach and the number of unique true positives is higher or the same. The other two samples (samples 1 and 3), however, show the opposite result. Based on this and on the fact that a lenient approach allows for SNPs, the lenient setting is chosen as optimal. It is interesting to see what causes the difference between the two alignment methods. A large number of the calls classified as unique lie in a region in which the test sample has very low coverage (the control sample contains the expected number of reads in the region, but the test sample contains no reads at this position), or they lie in a region bordering another CNV. Calls that have a very low coverage compared to the average in both test and control sample are considered unreliable. As most unique calls can be considered unreliable due to low coverage, this is probably the reason why some calls are found using one setting and not the other. The mediocre performance of the lenient alignment settings for sample 3 may be explained by the large number of total calls that are made, but additional factors (the window size was less ideal for this sample, for example) may also play a role.

Fig. 2. A CNV (false positive) call made by DWAC-seq. This CNV is located on chromosome 1. Shown are the log2 ratio (top panel), the read-depth of the test sample (2nd panel from the top), the read-depth of the control sample (3rd panel from the top) and the DWAC-seq output signal (bottom panel). The DWAC-seq output signal is the test/ref ratio corrected by the median for each segment (the region between two breakpoints). The decrease in read-depth in the control sample indicates that all control subjects appear to have a deletion at this position with respect to the reference genome, indicating a population bias. From the log2 ratio (top panel) alone one cannot discern the origin of the displayed CNV.

2.7 Detailed comparison of aCGH and NGS results

After the optimization of DWAC-seq, we can compare the results for the aCGH and NGS data in more detail. We do so by plotting the read-depth of both the test and control sample and the log2 ratio for the detected regions. The read-depth is calculated by dividing the mapped reads into bins of 1000 bases. The log2 ratio is calculated as log2(RD_test / RD_ref). Figure 3 shows an example of such a plot. It shows a region detected by DWAC-seq (bottom panel) that is caused by a deletion in the test sample (2nd panel from the top), which also results in a negative log2 ratio (top panel). This is an example of a TP, as the region is detected based on both the aCGH (orange lines) and the NGS data (red lines). Close inspection of a number of these plots, however, revealed that: 1) some FNs are no true CNVs, i.e. the NGS clearly does not indicate an aberration and the aCGH seems to be misanalyzed (figure 6 shows an example); 2) not all FP calls are false, i.e. they clearly show an aberration in the NGS data which is not picked up on the basis of the aCGH data (figure 5 shows such an example); and 3) not all TP calls are true CNVs, i.e. the aCGH data is misanalyzed (figure 4 shows such an example).

Fig. 3. A true positive found in sample 1 using the optimized DWAC-seq settings. For an explanation of the figure see figure 2. The red vertical lines indicate the start and stop positions of the region detected based on the NGS data; the yellow vertical lines indicate the start and stop positions of the region as detected based on the aCGH data. The figure clearly displays a deletion in the test sample. Comparing the read-depths of the test sample and the control sample leads to an understanding of the resulting log2 ratio.
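The per-bin read-depth and log2 ratio described in section 2.7 can be computed with a sketch like the following. The pseudocount guarding empty bins is our addition for illustration, not part of the thesis pipeline:

```python
import math
from collections import Counter

def binned_depth(read_starts, bin_size=1000):
    # Count aligned reads per fixed-size genomic bin (section 2.7 uses 1 kb bins).
    return Counter(pos // bin_size for pos in read_starts)

def log2_ratios(test_depth, control_depth, pseudo=0.5):
    # log2(RD_test / RD_control) per bin; a small pseudocount keeps empty
    # bins (a known source of unreliable calls) from dividing by zero.
    bins = set(test_depth) | set(control_depth)
    return {b: math.log2((test_depth[b] + pseudo) / (control_depth[b] + pseudo))
            for b in sorted(bins)}

# Hypothetical read start positions for a test and a control sample:
test = binned_depth([120, 880, 1500, 1600, 1700])
ctrl = binned_depth([100, 900, 1450, 2600])
print(log2_ratios(test, ctrl))  # bin 0 neutral, bin 1 positive, bin 2 negative
```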
Table 2. Results of the comparison of the strict (no mismatch allowed) and lenient (1 mismatch per read allowed) alignment methods for the four samples. The alignment type is noted for each sample as ’0’ or ’1’ for allowing 0 and 1 mismatches respectively. The number stated is the number of calls made for each class TP, FP and FN. The number between parentheses gives the number of unique calls for that sample and alignment type that are not called by the other alignment type. DWAC-seq is run with control sample AR1, which has a coverage of 1.2x for the strict alignment and 1.3x for the lenient setting.

         | Sample 1          | Sample 2          | Sample 3          | Sample 4
Coverage | 0.318x  | 0.431x  | 0.378x  | 0.510x  | 2.804x  | 3.634x  | 1.243x  | 1.374x
Class    | 0       | 1       | 0       | 1       | 0       | 1       | 0       | 1
TP       | 20 (3)  | 23 (4)  | 16 (2)  | 17 (4)  | 44 (1)  | 34 (0)  | 18 (2)  | 21 (2)
FP       | 28 (14) | 25 (12) | 26 (15) | 23 (13) | 28 (9)  | 40 (19) | 26 (4)  | 29 (7)
FN       | 10 (1)  | 12 (3)  | 14 (4)  | 11 (1)  | 10 (0)  | 11 (1)  | 7 (3)   | 6 (2)
Clinical | 1/1     | 1/1     | 1/1     | 1/1     | 2/4     | 2/4     | 5/5     | 5/5
Total    | 60      | 60      | 57      | 51      | 82      | 68      | 51      | 56
Fig. 4. A true positive found in sample 2 using the optimal DWAC-seq settings. The log2 ratio is relatively variable, which is the result of the low read-depth of both test and control sample. The resulting call is, therefore, not very reliable, as both the test and control show a similar decrease in read-depth, and no clear increase or decrease can be observed from the log2 ratio.
Fig. 5. An example of a call found by DWAC-seq that does not overlap with an aCGH call (sample 4). The plots show a deletion in the test sample around 65.2 Mb, indicated by the vertical red lines. Even though this call does not overlap with the aCGH data, it is considered to be a true deletion, due to the clear decrease in test sample read-depth compared to the control sample. The call has been missed by aCGH due to its limited resolution.
From these inspections we conclude that using the aCGH as 'standard of truth' does not verify the results properly, because with the increased resolution of the NGS data a better-motivated detection can be made. This also hints at an overinterpretation of the aCGH data because of its lack of resolution. An unreliable CNV can be recognized by a variable log2 ratio in the region; this high variability is generally caused by a low read-depth in both test and control sample, which makes the decision on whether the region is aberrated unreliable. In the extreme, a number of detected regions contain a high number of "empty" bins (i.e. bins with no reads), again causing unreliable calls. It can also happen that repetitive regions overlap with the bins, which makes the mapping of reads in such a region unreliable and consequently the detection of aberrations. It is important to note that calls made by aCGH in these regions are also unreliable, as the repetitiveness of the region will also influence the hybridization of the DNA on the probes of the array. The visual inspections also revealed that looking only at the log2 ratio is not enough to draw an accurate conclusion about a call; the read-depth of the test and control sample provides important additional information. Taken together, the aCGH calls are no longer considered to be the 'standard of truth'. Another approach is considered in section 2.8.
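The log2 ratio referred to throughout is the log-transformed ratio of normalized read-depths per bin. A minimal sketch of how it could be computed (the function name and the pseudocount guard are our own assumptions, not part of the pipeline):

```python
import math

def log2_ratios(test_counts, control_counts, pseudocount=0.5):
    """Per-bin log2 ratio of normalized read-depths (test vs. control).

    Counts are normalized by each sample's total so that samples of
    different coverage become comparable; a pseudocount guards against
    division by zero, and bins empty in both samples yield None.
    """
    test_total = sum(test_counts)
    control_total = sum(control_counts)
    ratios = []
    for t, c in zip(test_counts, control_counts):
        if t == 0 and c == 0:
            ratios.append(None)  # no evidence at all in this bin
            continue
        t_norm = (t + pseudocount) / test_total
        c_norm = (c + pseudocount) / control_total
        ratios.append(math.log2(t_norm / c_norm))
    return ratios

# With equal totals, a bin at half the control depth gives a ratio near -1,
# as expected for a heterozygous deletion in the test sample.
ratios = log2_ratios([10, 5, 0, 25], [10, 10, 10, 10])
print(ratios)
```

This also illustrates why a variable log2 ratio at low read-depth is untrustworthy: with few reads per bin, small absolute count differences translate into large ratio swings.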
Detection of CNVs using Shallow Whole Genome NGS Data to replace aCGH Analysis
Fig. 6. An aCGH call, depicted by the vertical yellow lines, that is not found by DWAC-seq (sample 2). The plots show that a large number of bins in the plotted region are empty (indicating that no reads have been mapped to this area). The variance in the read-depth is high in both the test sample and the control sample, and the log2 ratio does not show a clear increase or decrease. Although the log2 ratio of the aCGH data indicates a deletion (see figure 7), here too the evidence in the aCGH data is not very convincing (only a few probes, which also show a variable signal). We conclude that in this case the NGS analysis is more reliable than the aCGH data and that this region can not be considered a CNV.
2.8 Filtering the calls generated by DWAC-seq on quality improves the reliability of the calls

As the aCGH calls are too inconsistent to use, a more reliable 'standard of truth' was constructed by manually annotating all calls generated by both the aCGH and the NGS pipelines. Table 3 shows an overview of the results of the detections based on the NGS data when compared to this manually annotated set of regions. The annotation of the calls reveals distinguishing features between the positive and negative set: bins that remain empty because no reads can be mapped in the region, and bins with a very low coverage compared to the average coverage of the chromosome. We used these features to develop a rule-filter that is applied as a post-processing step after running the DWAC-seq analysis (see Methods for details). The results of the manual annotation and the rule-filtering are displayed in table 3. Using this rule-filter, 'good' calls are distinguished from 'bad' calls based on two considerations: 1) homozygous deletions result in complete coverage in the control sample and mostly empty bins in the test sample, and 2) good calls will generally have a low percentage of empty bins, as coverage is needed to accurately define a call as a CNV. Hence, if the coverage is 100%, it is likely that the call is correct, and when the coverage drops, the reliability of the call drops as well.
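The two considerations can be sketched as a simple rule-filter; the function and the threshold value below are illustrative assumptions, not the exact rules used in this study:

```python
def passes_rule_filter(test_bins, control_bins, max_empty_frac=0.2):
    """Rule-filter sketch: keep a call only when its bins can be trusted.

    Rule 1: a homozygous deletion may have (mostly) empty bins in the
    test sample as long as every bin of the control sample is covered.
    Rule 2: any other call is kept only if the fraction of empty test
    bins is small. The 0.2 threshold is an illustrative placeholder.
    """
    n = len(test_bins)
    empty_test = sum(1 for b in test_bins if b == 0)
    control_fully_covered = all(b > 0 for b in control_bins)

    if control_fully_covered and empty_test > 0:
        return True   # consistent with a homozygous deletion
    return empty_test / n <= max_empty_frac

# Empty test bins over a fully covered control: plausible deletion.
print(passes_rule_filter([0, 0, 1, 0], [8, 9, 7, 10]))  # True
# Many empty bins in both samples: unreliable region, filtered out.
print(passes_rule_filter([0, 0, 1, 0], [0, 3, 0, 2]))   # False
```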
Fig. 7. The aCGH data for the region discussed in figure 6 (top to bottom: log2 ratio (blue); test sample signal (green); control sample signal (red) and p-value of the log2 ratio (black)). The blue line (top panel) indicates the threshold for calling a gain, the pink for a loss. There are clearly two probes with a log2 ratio below the threshold value. Comparing this figure with figure 6, we conclude that the region can not be considered as a true CNV, as the coverage is low (which could be an indication of repetitive sequences) and a large number of empty bins are observed around the region of interest.
Table 3 shows that the rule-filtering step eliminates, on average, 80% of the negative CNV set from the calls. Of the total number of calls, an average of 30% is deleted, reducing the number of false positives considerably. Some calls from the positive set are also deleted by the filtering, but this is limited to 6% on average. Making the rules stricter results in more calls from the positive set being deleted, which can be an option if the total number of calls is still too large. After the rule-filtering, the overlapping calls are merged. This merging yields a number of calls that gives a more reliable overview of what is actually found by DWAC-seq: it reduces the total number of passed calls because of small overlapping regions at the beginning and end of certain calls, which we want to merge because they represent one big CNV. After the filtering and merging, the resulting calls range in length between 17 kb and 33.7 Mb, which is the detection range of the NGS pipeline when applied to the four test samples available. The actual detection limits can lie below and above these numbers; it may be possible to detect regions smaller than 17 kb and larger than 33.7 Mb, but this can not be shown with the data available. The detection limit depends on the size of the window that is used, which in turn depends on the coverage of the control sample (in this case around 4000 bases per window when assuming a uniform distribution of the reads). The lower detection limit lies somewhere between 4 and 17 kb, as it is not possible to detect regions smaller than one window.
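The merging of overlapping calls described here amounts to standard interval merging; a minimal sketch (names are ours):

```python
def merge_calls(calls):
    """Merge overlapping (or touching) calls given as (start, end) pairs.

    Small overlaps at the beginning and end of adjacent calls represent
    one larger CNV, so such calls are collapsed into a single region.
    """
    merged = []
    for start, end in sorted(calls):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))  # extend
        else:
            merged.append((start, end))
    return merged

# Two overlapping calls collapse into one region; the third stays apart.
print(merge_calls([(10_000, 21_000), (20_000, 27_000), (50_000, 55_000)]))
# → [(10000, 27000), (50000, 55000)]
```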
Table 3. Results of the rule-filter applied on the detected regions from the NGS data, when comparing them to the manually annotated set of calls. The manual annotation splits the combined calls based on the NGS and aCGH data into three classes: positive set (actual CNVs), negative set (no CNVs, or too low coverage), and an undetermined set (class hard to determine). The numbers in the columns 'Pos.' (positive), 'Neg.' (negative) and 'Und.' (undetermined) are the detected regions assigned to the corresponding set. All calls generated by aCGH and NGS are assigned to a class. The 'Total' row displays the total number of calls and how they are manually annotated. The 'Passed' row shows how many calls of the three sets pass the filter for this sample, the 'Deleted' row the calls that do not pass the filter. The total number of calls after rule-filtering and merging is given with each sample name.

Sample                   Class    Count        Pos.  Neg.  Und.
Sample 1 (Merged: 30)    Total    60           23    12    25
                         Passed   43 (71.7%)   23    3     17
                         Deleted  17 (28.3%)   0     9     8
Sample 2 (Merged: 23)    Total    51           17    10    26
                         Passed   30 (58.8%)   12    1     17
                         Deleted  21 (41.2%)   3     9     9
Sample 3 (Merged: 43)    Total    85           51    11    23
                         Passed   71 (83.5%)   50    4     17
                         Deleted  14 (16.5%)   1     7     6
Sample 4R1 (Merged: 40)  Total    57           27    9     21
                         Passed   43 (75.4%)   25    2     16
                         Deleted  14 (24.6%)   2     7     5

2.9 Determining the minimal coverage
To determine the minimal coverage at which the NGS data still reliably detects aberrated regions, we set up a simulation in which the coverage is artificially reduced before applying the detection tools. This is done using samples 3 and 4, of which sample 4 is sequenced in two different runs (sample 4R1 and sample 4R2). The original coverages of 3.6x, 1.2x and 1.0x, respectively, are reduced by uniform sampling, halving the coverage at each step, down to a coverage between 0.1x and 0.45x. The resulting calls are analyzed and the observations can be found in table 4. The lower coverages show an increase in the total number of calls (an increase in false positives). As the same control sample is used, the positioning of the windows and the window size are exactly the same for all runs. There does not seem to be a large decrease in the quality of the output (the number of false negatives) for coverages between 0.1x and 0.35x, but a large increase in false positives is observed. Note that a large part of the false positives appear on the X-chromosome, which could be due to the use of a female control sample while samples 3 and 4 are male. The clinically relevant CNVs (i.e. those annotated with Type II to IV) are found at all coverages displayed in table 4, with the exception of two calls in sample 3 that were also missed in the previous experiments (see the supplement for a description of the missed CNVs). The lowest coverage tested in the previous experiments is 0.43x after mapping and pre-processing. Even lower coverage could be possible: the reduction experiments show that a reasonably low number of false positives is still encountered at lower coverage. There seems to be a turning point for sample 3, sample 4R1 and sample 4R2 between 0.11-0.23x, 0.17-0.35x and 0.13-0.26x,
respectively. This turning point is determined by looking at the number of calls after merging; this number is kept below 50. We conclude from this reduction experiment that the lowest coverage that can be used for the detection of CNVs with reduced NGS data lies at 0.23x.
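The uniform sampling used in the reduction experiments can be sketched as follows (a simplified stand-in; real input would be aligned reads rather than integers):

```python
import random

def downsample_reads(reads, fraction, seed=0):
    """Uniformly subsample reads to simulate a lower sequencing coverage.

    Each read is kept independently with probability `fraction`, so a
    sample at coverage c yields roughly coverage c * fraction. As noted
    in the Discussion, this ignores run-specific effects such as
    read-towers, whose height does not scale with sequencing depth.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible experiment
    return [read for read in reads if rng.random() < fraction]

reads = list(range(100_000))            # stand-in for aligned reads
half = downsample_reads(reads, 0.5)     # e.g. one halving step
print(len(half) / len(reads))           # close to 0.5
```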
2.10 The coverage for the control sample should be close to the coverage of the test sample
Sample 4 and the control sample are sequenced in multiple runs (see figure 8 for more details), which allows us to compare the results of two independent runs. Table 5, first row group, shows the results; one can notice that the performance for both runs of sample 4 is more or less the same. Even when the data of the two runs is combined, the performance stays the same. We also analyzed the runs using all measurements of the combined control sample, AR1,2,3 (which increases the control sample coverage to 4.5x). The result is shown in the second row group of table 5, which shows that the false positives increase strongly. This is probably because the window size becomes incorrect: the window size is expressed in the number of reads aligned per window, and the coverage of the control sample drastically increased. Therefore we tried two alternative settings for the window size, 350 and 700. The number of false positives does drop, but at the same time the number of true positives also drops, so the performance is eventually not improved by using the combined data for the control sample. The number of merged calls (last column in table 5) also shows that combining data for the control sample does not help, as the number of merged calls when using the same coverage for the control sample as for the test sample is closest to the expected number of calls (somewhere between 30 and 50, see section 2.4).
2.11 Multiplexing test and control sample
As multiple factors can influence the process of sequencing, we expect that the best results will be obtained when the test sample and the control sample are sequenced in the same lane (multiplexing; for details see figure 8). The settings in which the corresponding test and control sample are multiplexed are indicated in table 6. We expect that multiplexing reduces biases that can occur due to differences in sequencing conditions, such as differences in temperature or in the chemical composition of the genetic material. Another bias can result from GC-content: GC-rich regions are known to influence the number of reads that are sequenced in a region [Benjamini and Speed, 2012]. In table 6, for sample 4R2 the exact opposite of what we expect is observed: a higher number of false negatives than when using a control sample from another sequencing run (so no multiplexing). Sample 4R1 does show the results we expect: a low number of false negatives and a relatively low number of false positives when the control sample is multiplexed with the test sample. As stated before, the numbers of TP, FP and FN are based on the aCGH analysis and are not very reliable. Looking at the results of the rule-filtering, for both samples the least data is deleted when using control sample AR2, and the lowest number of calls is passed. After merging, the number of calls for sample 4R2 in combination with control sample AR2 is very low, even lower than the number of aCGH calls that are made. We conclude that the advantages of multiplexing test and control samples can not be directly derived from table 6. It seems that
Table 4. Detection results for samples 3 and 4 when the coverage is artificially reduced by uniform sampling. AR1 is used as control sample (coverage of 1.3x). 'X:' gives the number of false positives called on the X-chromosome. The total number of calls made on the aCGH data can be calculated by adding the numbers of TP and FN. There is a difference between the sum of TP, FP and FN and the sum of the passed and filtered calls; this is explained by the fact that a TP is counted only once, even if more than one NGS call overlaps an aCGH call. Samples 1 and 2 are added for reference purposes.

Sample 3
                Coverage  TP  FP           FN  Passed  Filtered  Merged
No reduction    3.63x     24  40 (X: 8)    11  71      14        43
Reduction 1a    0.45x     24  38 (X: 7)    11  61      20        36
Reduction 1b    0.45x     27  38 (X: 9)    8   69      23        35
Reduction 2a    0.23x     23  44 (X: 11)   12  72      18        46
Reduction 2b    0.23x     24  56 (X: 14)   11  78      21        54
Reduction 3a    0.11x     23  107 (X: 45)  12  127     27        100
Reduction 3b    0.11x     24  105 (X: 45)  11  130     23        100

Sample 4R1
                Coverage  TP  FP           FN  Passed  Filtered  Merged
No reduction    1.24x     17  29 (X: 7)    7   43      14        40
Reduction 1a    0.35x     12  49 (X: 11)   12  61      17        46
Reduction 1b    0.35x     12  50 (X: 13)   12  59      21        43
Reduction 2a    0.17x     13  62 (X: 16)   11  79      21        61
Reduction 2b    0.17x     11  68 (X: 18)   13  82      20        64
Reduction 3a    0.09x     14  209 (X: 39)  10  237     26        214
Reduction 3b    0.09x     15  194 (X: 42)  9   207     27        186

Sample 4R2
                Coverage  TP  FP           FN  Passed  Filtered  Merged
No reduction    1.05x     17  26 (X: 7)    7   37      17        28
Reduction 1a    0.26x     16  52 (X: 22)   8   62      18        47
Reduction 1b    0.26x     13  54 (X: 28)   11  61      20        48
Reduction 2a    0.13x     12  143 (X: 32)  12  142     27        129
Reduction 2b    0.13x     13  127 (X: 26)  11  132     23        119
Reduction 3a    0.07x     14  187 (X: 53)  10  207     29        182
Reduction 3b    0.07x     12  165 (X: 42)  12  173     32        153

Sample 1        0.43x     15  25 (X: 1)    12  43      17        30
Sample 2        0.51x     16  23 (X: 0)    11  30      21        23
Table 5. The effect of changing the coverage of test and control sample. As sample 4 is sequenced in two runs and the control sample in three (see figure 8 for details), the data can be used separately and merged to see the effects of the changing coverage on the results. Combinations are made using the two runs of sample 4, control sample AR1 and control sample AR1,2,3 (see section S2). For control sample AR1,2,3 three different window sizes are used.

                          TP  FP   FN  Passed       Filtered     Merged
Control AR1
  Sample 4R1              17  29   7   43 (75.4%)   14 (24.6%)   31
  Sample 4R2              17  26   7   37 (68.5%)   17 (31.5%)   28
  Sample 4R1,2            18  34   6   45 (73.8%)   16 (26.2%)   34
Control AR1,2,3, W = 100
  Sample 4R1              22  240  2   211 (75.1%)  70 (24.9%)   161
  Sample 4R2              22  271  2   240 (75.2%)  79 (24.8%)   171
  Sample 4R1,2            22  190  2   177 (76.6%)  54 (23.4%)   131
Control AR1,2,3, W = 350
  Sample 4R1              18  63   6   60 (67.0%)   28 (33.0%)   47
  Sample 4R2              18  61   6   51 (59.3%)   35 (40.7%)   36
  Sample 4R1,2            18  54   6   51 (65.4%)   27 (34.6%)   40
Control AR1,2,3, W = 700
  Sample 4R1              12  30   12  35 (64.8%)   19 (35.2%)   27
  Sample 4R2              10  27   14  29 (56.9%)   22 (43.1%)   19
  Sample 4R1,2            13  24   11  30 (62.5%)   18 (37.5%)   22
the use of a non-multiplexed control sample can perform as well as a multiplexed sample. As the extent of the bias resulting from sequencing on different runs is not yet known in detail, more data should be generated to determine the necessity of multiplexing test and reference sample.

Table 6. True positives (TP), false positives (FP), false negatives (FN), and calls that passed (P) and failed (F) the rule-filtering for all combinations of the two runs of sample 4 (sample 4R1 and sample 4R2) and the three runs of the control sample (AR1, AR2 and AR3), as well as all three runs taken together (AR1,2,3, using W = 350). The merged (M) column displays the calls that passed the filter after the merging step. Run 3 of the control sample has a coverage that is about twice as high as runs 1 and 2, so its window size was set at W = 200. Entries marked with an asterisk are the settings in which sample 4 was multiplexed with the control sample (sample 4R1 with AR1, sample 4R2 with AR2).

                     Sample 4R1                  Sample 4R2
Control sample   TP  FP  FN  P   F   M       TP  FP  FN  P   F   M
AR1              17  29  7   43  14  31*     17  26  7   37  17  28
AR2              14  21  10  31  14  22      12  16  12  28  14  17*
AR3              17  35  7   44  17  36      18  61  6   43  24  29
AR1,R2,R3        18  63  6   58  30  47      18  61  6   52  34  36

3 DISCUSSION

3.1 The control group should be genetically diverse

The control sample used in the analyses is of commercial origin and is suitable because it contains a pool of 100 test subjects, which reduces the influence of individual CNVs that may occur in one of the subjects. As seen in section 2.5, the control sample contains inconsistencies, but these can be exposed by visualizing the read-depth of both the test sample and the control sample along with the log2 ratio. However, it would be better to construct a new control sample. As the ancestry of the test subjects in the control group is not known, it is possible that the test subjects chosen were not diverse enough, resulting in a population bias. A newly constructed control sample should therefore cover a larger variety of people.

3.2 Alignment

Allowing mismatches when aligning reads to the reference genome results in less data loss than aligning in a stricter fashion and allows for SNPs. The analysis itself showed only a small preference for this setting; more samples would make this conclusion more robust.

3.3 Read-tower removal

After aligning the reads to the reference genome, one of the filtering steps was to reduce the influence of extremely large stacks of reads (mostly near centromere and telomere regions), which we refer to as read-towers. We created a filter that aims to delete these towers by using the fact that the reads in a tower largely overlap (details can be found in the Methods, section 5.3). These towers probably result from the way the reference genome is constructed. Repetitive regions are avoided in the reference genome, which is why the centromere and telomere regions are not included. When one of the repetitive regions located in the centromeres and telomeres is included in the reference genome, this results in the alignment of a large number of reads that were sequenced all over the centromeres and telomeres but have only one location to be mapped to.

3.4 Limitations of aCGH and NGS analysis

Both aCGH and NGS detection of CNV regions have some limitations [Holstege, 2010]. A duplication can occur in tandem with the original, but it can also be located at another position or on another chromosome, which could be of interest. Neither the aCGH nor the NGS pipeline distinguishes between these types of duplication, as both map against a reference genome. The NGS read data can be used to determine the orientation of the duplication, but if the orientation is the same as the original it does not offer more information than the aCGH analysis.
An aberration that does not result in a copy number change will not be detected by either method. Examples are a translocation, where a region is transferred to another chromosome or another location, and an inversion, where the region is inverted. As both methods look at small sequences of DNA, neither distinguishes the origin of a sequence. As DWAC-seq determines a CNV region by calculating breakpoints (changes) in coverage using the statistics of the whole chromosome, the detection of large chromosomal aberrations will be limited (detection of whole-chromosome aberrations will not be possible). If such detection is diagnostically relevant, other algorithms should be applied, for example WISECONDOR [Straver et al., 2012], which detects large (whole-chromosome or subchromosomal) copy number changes.
3.5 Reliability of the proposed rule-filtering
Four samples were used to generate rules to reduce the number of false positives. Besides the fact that the rule-filtering was not tested on an independent test sample, it was probably also constructed with too little data to reliably state that these rules work optimally on other test samples. More samples are needed to make these rules suitable for all patients.
3.6 Reducing coverage experiments
Simulating a reduced coverage by uniformly sampling reads does not take into account all effects that occur when sequencing a sample. The read-towers are a good example of why uniform sampling is probably incorrect, as the height of a tower shows no correlation with the sequencing depth.
3.7 Effects of changing the coverage of test and control sample
It is not easy to determine an optimal window size that takes into account the coverage of both the test sample and the control sample. DWAC-seq sets its windows by counting the number of reads in the control sample, fixing the window, and afterwards counting the reads in the same window in the test sample. When the coverage of the control sample is increased, multiplying the window size by the factor of the increase in coverage is not enough. An explanation could be that other parameters also influence the determination of the best settings; in this specific case the threshold may not be optimal, for example. The difference in output can also be caused by the increasing difference in coverage between the test sample and the control sample: the normalization is not good enough to correct for the differences between the high-coverage control sample and the low-coverage test sample.
3.8 Recognizing clinical calls

To determine whether a detected aCGH region is relevant, an analyst investigates whether the CNV overlaps (part of) a gene or another biologically relevant genomic region. In such cases, multiple literature databases are consulted to see whether publications are available that link the gene to the symptoms of the patient. Hence, the clinical relevance of a call can not be stated directly upon detection of a region, as this relevance is not captured in the signal. Of the detected regions found by aCGH, some can be dismissed as noise, but there is no well-defined protocol giving an exact definition of this noise; it largely depends on the experience of the analyst. This makes it hard to determine what makes a detected region 'good', as it involves interpretation by the analyst. It would be interesting to investigate the differences in reasoning between clinicians. Clearly, this also influences the examination of the use of NGS data to make CNV calls. We advise to set up a larger screen and include multiple independent clinical evaluations of the detected regions to come to a grounded conclusion.
3.9 Recommendations for the clinic

For application of the proposed method in the clinic, some factors need to be considered and precautions need to be taken. The first hypothesis, that the control sample should be sequenced multiplexed with the test sample, which could eliminate temperature and GC-content biases, still holds. It is, however, expensive to sequence a (relatively) high coverage control sample with each test sample. In this paper we have shown that it is possible to use a previously sequenced control sample for the analysis. However, before applying this in the clinic, the differences between multiplexing the control sample and using a pre-sequenced standard control sample should be studied further. When a choice is made for a pre-sequenced standard pool, the extent of the variances that occur during sequencing should be described and kept in mind. In the experiments described, a female control sample is used for test samples of both male and female patients. For future applications we advise using a male control for male patients, as the analysis of the X- and Y-chromosomes will then be more reliable. In the male samples used in this study, relatively many aberrations on the X-chromosome were found. The number of false positives on the X-chromosome for male patients is expected to decrease considerably when using a male control sample. To accurately see whether a CNV originates from the test or the control sample, the read-depth data as well as the log2 ratio have to be used. Using only the log2 ratio can create the illusion that a CNV occurs in the test sample, while it is actually situated in the control sample. Another factor that can be taken into account using all three plots is the read-depth in the CNV region. If this is very low, the CNV is not reliable; this often results in a very variable log2 ratio. Most of these cases will be removed by the rule-filtering.
The parameters of DWAC-seq are optimized using the data that was available, which results in a bias towards the four samples that were used. To reduce the effects of this bias, the determination of the best parameters should be performed again once a standard coverage for the test and control sample is determined for use in the clinic. The proposed rule-filtering was based on only four test samples, again resulting in a bias towards these specific samples. More samples should be analyzed to see if the rules are correct and whether there is a possibility to fine-tune them. When introducing the technique in the clinic, both the aCGH and NGS analyses should run in parallel for some period of time. This will result in a lot of data that can be used to increase the knowledge of the exact behavior of the NGS method. Using this data, a training and test set can be made and the rule-filter can be validated and fine-tuned. For detection of whole-chromosome aberrations, additional detection strategies should be included in the NGS pipeline. A quality control system should be developed to check the output of the NGS pipeline. As shown in the results section, the optimal window size is not very stable and depends on the coverage of both test and control sample. To make sure the right parameters are used, the total number of detected regions should be low, so that a low number of false positives can be expected.
4 CONCLUSION

We have shown that shallow NGS data can be used to replace aCGH data for clinical diagnostics. We proposed a pipeline that uses DWAC-seq to detect copy number aberrations and a rule-filtering that removes unreliable calls based on the number of reads mapped to these areas. The study is a first proof-of-principle of adopting NGS for diagnostic purposes. Before application in the clinic, many more samples need to be analyzed, especially to reveal the right settings (parameters and coverage) but also to optimize the rule-filtering. An important conclusion from this work is also that, even for aCGH data, the visualization of the signals for the test and control samples is extremely informative about the underlying phenomena that resulted in a detected aberration. This also showed that the gain in resolution of the NGS data with respect to the aCGH data is its greatest benefit and will eventually improve clinical diagnostics substantially.
5 METHODS

5.1 Data

The blood samples of the patients are prepared for sequencing using the TruSeq protocol of Illumina [Illumina, Inc., 2011]. A single-end 50 bp run is performed on an Illumina HiSeq 2000 sequencer. Four test samples are used, with coverage ranging from 0.2x to 2.4x (see table S2), and two control samples were available as input for the two control-based tools. The first control sample contains a pool of 27 samples of healthy pregnant women. The second control sample is the same control sample used for the aCGH experiments and is of commercial origin: control DNA of 100 anonymous females created by Kreatech Diagnostics [Kre, 2010]. This second control set is more reliable than the first, as in the first set around 5% of the reads will be of fetal origin [Lo et al., 1997], and with it the detection of CNVs is not limited to pregnant women or their unborn children. The second control sample is sequenced multiplexed with one of the test samples, as shown in figure 8. This results in multiple control samples and two runs of sample 4 (see table S2 for more details).
Fig. 8. Illustration of the way sample 4 (S) is multiplexed with the control sample (A). In two of the three runs (R1 and R2), sample 4 (red) is multiplexed with the control sample (green). In one run only the control sample is sequenced. Besides control sample A, an external control sample is used, depicted by (B). This control sample is a pool of DNA of 27 pregnant women.
5.2 Mapping

Alignment of the data is done using BWA [Li and Durbin, 2009]. The reads are aligned in two different ways: the first allows no mismatches ('strict' alignment), the second allows one mismatch per read ('lenient' alignment). The reference genome used for alignment is the 1000 Genomes Project GRCh37 reference genome [1000 Genomes Project, 2009]. This reference genome is chosen for its compatibility with BWA, as it contains only one copy of the Y-chromosome; the Ensembl reference genome assembly [Ensembl, 2012] contains the Y-chromosome in two parts, which caused problems when running BWA and samtools. BWA places reads that can be mapped to multiple regions of the genome at one of those regions randomly and assigns a mapping quality (MQ) of zero to these reads. These non-unique reads are removed using samtools [Li et al., 2009] to reduce their influence on the read-depth. Samtools is also used to remove PCR duplicates, to exclude a possible effect of PCR on our data analysis.
5.3 Removal of read-towers

The aligned data sets showed large stacks of reads at specific locations, especially around the centromeres and telomeres, which we call read-towers. The towers consist of many reads (in some samples over 10,000) that overlap to a very large extent. First the towers are detected by grouping reads whose start positions lie at most N bases from each other. Towers are deleted when the number of reads in a tower exceeds a threshold Ttower. The algorithm is described in Algorithm 1. For successful removal of only towers, the setting of the parameters is crucial. Both depend on the coverage of the sample, and supplement section 6.4 shows a way to calculate them for a given coverage. Here the following settings are used: N = 4, and Ttower is calculated for each sample separately as described in supplement section 6.4.
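A runnable version of Algorithm 1 might look as follows; this is a sketch under the assumption that reads are reduced to (chromosome, start) pairs and already position-sorted:

```python
def remove_read_towers(reads, n_shift, t_tower):
    """Delete read-towers from aligned, position-sorted reads.

    Consecutive reads on the same chromosome whose start positions lie
    at most `n_shift` bases apart form one candidate tower; when a
    tower holds more than `t_tower` reads, only its first read is kept
    (as in Algorithm 1). Reads are simplified to (chromosome, start).
    """
    kept, tower = [], []
    for read in reads:
        same_chrom = tower and read[0] == tower[-1][0]
        if same_chrom and read[1] - tower[-1][1] <= n_shift:
            tower.append(read)
        else:
            # Flush the previous group: an oversized tower keeps one read.
            kept.extend(tower[:1] if len(tower) > t_tower else tower)
            tower = [read]
    kept.extend(tower[:1] if len(tower) > t_tower else tower)
    return kept

reads = [("chr1", 100), ("chr1", 102), ("chr1", 103), ("chr1", 104),
         ("chr1", 5000), ("chr2", 100)]
print(remove_read_towers(reads, n_shift=4, t_tower=3))
# → [('chr1', 100), ('chr1', 5000), ('chr2', 100)]
```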
5.4
CNV detection tools
Four tools were used for detecting CNVs in the NGS data. The tools are selected in a way that both read-depth and control-based methods can be studied. CNVnator [Abyzov et al., 2011] and RDXplorer [Yoon et al., 2009] are based on read-depth analysis and only require the test sample. CNVnator requires setting a window size and does not need additional parameters. RDXplorer needs the reference genome that is used for alignment of the sample as input. Again a window size needs to be set, but only a size of 10 and 100 bases are readily available. Both tools run per chromosome and do not calculate statistics for the whole genome. DWAC-seq [Koval and Guryev, 2011] and CNV-seq [Xie and Tammi, 2009] are both based on the use of a control sample. DWAC-seq has multiple parameters that can be set. Also here a window size needs to be set, but it is important to note that the window is dynamic in terms of the number of bases as the window size is specified by the number of reads per window in
Data: Aligned and sorted reads and their positions
Define: Minimum shift (N) and Threshold (Ttower)
Result: Reads located in read-towers are deleted
for read in sample do
    if chromosome of current read == chromosome of previous read AND start position of current read - start position of previous read <= N then
        save read in list
    else
        if size of list > Ttower then
            delete reads in list except for the first read
        end
        print reads in list to output
        start new list with current read
    end
end
Algorithm 1: Algorithm to remove read-towers that appear after the reads are aligned to the reference genome (see supplement figure S1 for an example).
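Algorithm 1 can be sketched in Python. This is a minimal illustration, not the production script; the default Ttower of 10 is an arbitrary placeholder, since in practice Ttower is computed per sample as described in supplement section 6.4.

```python
from collections import namedtuple

# A minimal read representation: chromosome name and alignment start position.
Read = namedtuple("Read", ["chrom", "start"])

def retro_filter(reads, n=4, t_tower=10):
    """Sketch of the read-tower filter (Algorithm 1), assuming `reads` are
    sorted by chromosome and start position. Reads whose start positions lie
    within `n` bases of the previous read are chained into a stack; if a stack
    grows beyond `t_tower` reads, all but its first read are dropped."""
    kept, stack = [], []

    def flush(stack):
        # Keep only the first read of a tower, or the whole stack otherwise.
        if len(stack) > t_tower:
            kept.append(stack[0])
        else:
            kept.extend(stack)

    for read in reads:
        if stack and read.chrom == stack[-1].chrom and read.start - stack[-1].start <= n:
            stack.append(read)
        else:
            flush(stack)
            stack = [read]
    flush(stack)
    return kept
```

For example, a chained stack of 12 reads one base apart collapses to a single read at Ttower = 10, while isolated reads pass through untouched.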
the control sample. The other parameters that need to be set are a threshold and a confidence level used in finding the breakpoints. The breakpoints are validated by bootstrapping over a sliding window, allowing breakpoints only if the confidence level is met. This is the only tool that filters on uniqueness of a read placement in the alignment (this is not used here, as the data is preprocessed to contain only uniquely aligned reads). CNV-seq needs no additional parameters, although some can be changed, such as the calling threshold, the p-value and the window size. The algorithm tries to determine the window size for each sample using the log2 threshold and the p-value. All four tools differ in complexity, user friendliness and processing of the data (see table S3 for all settings used). Some need additional preprocessing of the input data, and some required adaptation of the code. Other tools that were considered (cn.MOPS [Klambauer et al., 2012], CNAnorm [Gusnanto et al., 2012] and BIC-seq [Xi et al., 2011]) were developed with another goal in mind (mostly detecting copy number changes in tumor material), and we were either not able to run them correctly, or they gave results that were uninterpretable.
5.5 Determining correctness of results
The copy number changes (calls) found in the aCGH data are considered the 'standard of truth' and are used to classify the NGS calls as true positives, TP (NGS calls overlap with aCGH calls), false positives, FP (NGS calls do not overlap aCGH calls), false negatives, FN (no NGS call is found for a region that is called by aCGH), or true negatives, TN (neither aCGH nor NGS calls a region). Determining the overlap between detected regions is not trivial. Therefore four different methods are considered to determine the right class. These methods are illustrated in figure 9 and discussed in more detail in the supplement. To find the best settings for DWAC-seq, the region-based nucleotide method is used. This method takes into account that regions detected by NGS are often smaller than, and may together overlap, one aCGH-detected region. The best parameter settings were determined by analyzing the Receiver Operating Curve (ROC, see supplement section 6.6 for details). The correctness of the regions detected by DWAC-seq is determined using the region-based method, because it can easily be compared to the detected regions produced by both the aCGH and NGS pipelines.
[Figure 9: four panels illustrating the scoring methods — 1) nucleotide-based method, 2) region-based nucleotide method, 3) gene-based method and 4) region-based method — labeling the aCGH and NGS call tracks (and, for panel 3, a gene track) with TP/TN/FP/FN per position or per call.]
Fig. 9. Four methods that are used to determine whether NGS calls are right or wrong for each position of the genome. Top panel: the nucleotide-based method. This method considers each nucleotide separately and labels each position with the resulting class. Panel in 2nd row: region-based nucleotide method. In this method, the class is determined for the longest possible stretch of overlapping calls. The overlap should be larger than 50% of the smallest call, otherwise the nucleotide-based method is used. Panel in 3rd row: gene-based method. This method is similar to the nucleotide-based method, but only considers the part of the genome that lies in gene regions. The RefSeq gene track from UCSC is used to determine these regions (downloaded July 30, 2012). Bottom panel: region-based method. This method scores by counting how many CNV regions overlap. It does not score per nucleotide but per call, which means that summing all false and true positives and negatives does not result in the total number of nucleotides in the genome.
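The nucleotide-based method (top panel of figure 9) can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study; it assumes calls are given as half-open (start, end) intervals on a single chromosome.

```python
def nucleotide_based_counts(acgh_calls, ngs_calls, genome_length):
    """Per-position comparison of aCGH calls (standard of truth) and NGS
    calls, a sketch of the nucleotide-based method. Calls are half-open
    (start, end) intervals on one chromosome of length `genome_length`."""
    def mask(calls):
        # Boolean per-position mask: True where a call covers the position.
        m = [False] * genome_length
        for start, end in calls:
            for pos in range(start, end):
                m[pos] = True
        return m

    a, n = mask(acgh_calls), mask(ngs_calls)
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for in_acgh, in_ngs in zip(a, n):
        if in_acgh and in_ngs:
            counts["TP"] += 1   # both call this position
        elif in_ngs:
            counts["FP"] += 1   # only NGS calls it
        elif in_acgh:
            counts["FN"] += 1   # only aCGH calls it
        else:
            counts["TN"] += 1   # neither calls it
    return counts
```

A per-position mask is used for clarity; for genome-scale data an interval-based sweep would be preferred.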
5.6 Rule-filtering
The regions detected in the NGS data with DWAC-seq are filtered to determine which calls are reliable and which are not (and thus are probably not true copy number changes). By filtering on the reliability of the calls (i.e. only keeping the reliable calls) the number of false positives is limited. The filtering is based on a set of rules derived from manually annotating the aCGH and NGS calls by looking at the signals of the test and control samples. The combined set of aCGH and NGS calls is divided into three classes: a positive set (true CNV), a negative set (false CNV; see the note below), and an undetermined set (uncertain whether true or false CNV). The undetermined set contains calls that are hard to classify into the positive or negative set because of differences in signal (for example, one part shows an increase in signal while the rest of the call has a very variable log2 ratio because of low coverage).
We looked at several statistics for each class to find distinguishing properties. These statistics included, amongst others: the read-depth of the test and control samples compared to the average signal of the chromosome and the whole genome; the variance of the log2 ratio compared to the average variance of the whole chromosome and genome; and the variance of the read-depth signal. An important statistic turned out to be the number of 'empty bins'. An empty bin is a bin (a part of the genome, in our case of 1000 bases) that contains no mapped reads, and can be a measure for unmappable regions due to repetitive sequences. Regions in the negative class appeared to have many empty bins. The positive set also contains CNVs with empty bins, but these CNVs are homozygous deletions or duplications, having a decent control sample coverage but almost no coverage in the test sample. The rule-filter for the NGS calls then becomes:
• If neither the test nor the control sample contains empty bins, assign the call to the positive set.
• If both the test and the control sample contain empty bins: if the total percentage of empty bins in both samples is less than 20%, assign the call to the positive set; otherwise assign it to the negative set.
• If the test OR the control sample contains empty bins but the other sample does not: if the percentage of coverage of the control sample is higher than 50%, assign the call to the positive set; otherwise assign it to the negative set.
Note that this negative set is slightly different from the negatives obtained when comparing aCGH and NGS calls: here one of the methods has called this region (positive), but after manual annotation it was decided that there is not enough evidence to call it.
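The empty-bin rules can be sketched as a small decision function. This is a minimal illustration under two stated assumptions: 'total percentage of empty bins in both samples' is interpreted as the sum of the two percentages, and all inputs are percentages in the range 0–100.

```python
def classify_call(empty_pct_test, empty_pct_control, control_coverage_pct):
    """Sketch of the empty-bin rule filter. `empty_pct_*` is the percentage
    of 1000-bp bins in the call without mapped reads; `control_coverage_pct`
    is the percentage of the call covered in the control sample. Returns
    'positive' (reliable call) or 'negative' (probably not a true CNV)."""
    if empty_pct_test == 0 and empty_pct_control == 0:
        return "positive"  # rule 1: no empty bins in either sample
    if empty_pct_test > 0 and empty_pct_control > 0:
        # rule 2: both samples contain empty bins; 'total percentage'
        # interpreted here as the sum of both percentages (an assumption)
        return "positive" if empty_pct_test + empty_pct_control < 20 else "negative"
    # rule 3: only one of the two samples contains empty bins
    return "positive" if control_coverage_pct > 50 else "negative"
```

Note that rule 3 keeps homozygous deletions: the test sample may have many empty bins while the control sample is still well covered.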
5.7 Merging of calls
When a large aberration is present, DWAC-seq sometimes detects it as multiple, slightly overlapping calls. After application of the rule-filter, overlapping calls are therefore merged to give a more reliable view of the true number of calls made by the CNV detection tool. In the Results section the effect of the merging is often displayed in a separate 'merged' column, to show the difference between the calls that pass the filter and the actual number of calls after merging.
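The merging step amounts to a standard interval merge; a minimal sketch, assuming calls are (start, end) tuples on one chromosome:

```python
def merge_overlapping_calls(calls):
    """Merge overlapping (start, end) calls into single regions, a sketch of
    the post-filter merging step for DWAC-seq calls."""
    merged = []
    for start, end in sorted(calls):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Two slightly overlapping calls for one large aberration thus collapse into a single region, so the 'merged' column counts aberrations rather than raw calls.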
ACKNOWLEDGEMENT We would like to acknowledge Desiree Steenbeek for assistance with all data requests regarding aCGH, Quinten Waisfisz for assistance with the generation of the NGS data, Daoud Sie and Bauke Ylstra for the inspiration ([Smeets et al., 2011]) and Roy Straver for the collaboration on the development of the RETRO filter and inspiring brain-storm sessions.
REFERENCES 1000 Genomes Project. Sanger FTP 1000 Genomes reference, October 2009. Alexej Abyzov, Alexander E. Urban, Michael Snyder, and Mark Gerstein. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research, 21:974–984, 2011. Agilent Technologies. Agilent Scan Control 8.5.1, a. Agilent Technologies. Feature Extraction 11.0, b. Jeffrey A. Bailey, Zhiping Gu, Royden A. Clark, Knut Reindert, Rhea V. Samonte, Stuart Schwartz, Mark D. Adams, Eugene W. Myers, Peter W. Li, and Evan E. Eichler. Recent segmental duplications in the human genome. Science, 297(5583):1003–1007, August 2002. Yuval Benjamini and Terence P. Speed. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, February 2012. Peter J. Campbell, Philip J. Stephens, Erin D. Pleasance, Sarah O'Meara, Heng Li, Thomas Santarius, Lucy A. Stebbings, Sarah Edkins, Claire Hardy, Jon W. Teague, Andrew Menzies, Ian Goodhead, Daniel J. Turner, Christopher M. Clee, Michael A. Quail, Antony Cox, Clive Brown, Richard Durbin, Matthew E. Hurles, Paul A.W. Edwards, Graham R. Bignell, Michael R. Stratton, and P. Andrew Futreal.
Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genetics, 40(6):722–729, 2008. CERN. ROOT package. URL root.cern.ch. K. Darvishi. Application of Nexus copy number software for CNV detection and analysis. Current Protocols in Human Genetics, April 2010. Ensembl. Ensembl human genome sequence FTP, January 2012. ENZO Life Sciences, Inc. Product data sheet ENZ-42671 CGH labeling kit for oligo arrays. Lars Feuk, Andrew R. Carson, and Stephen W. Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7:85–97, February 2006. Arief Gusnanto, Henry M. Wood, Yudi Pawitan, Pamela Rabbitts, and Stefano Berri. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics, 28(1):40–47, 2012. Henne Holstege. Gebruik van array in prenatale diagnostiek in Nederland 2010 [Use of arrays in prenatal diagnostics in the Netherlands 2010]. 2010. John Hunter, Darren Dale, and Michael Droettboom. Matplotlib. Illumina, Inc. De novo assembly using Illumina reads, 2010. URL http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly_ecoli.pdf. Illumina, Inc. TruSeq™ RNA and DNA sample preparation kits v2, April 2011. URL http://www.illumina.com/documents//products/datasheets/datasheet_truseq_sample_prep_kits.pdf. Günter Klambauer, Karin Schwarzbauer, Andreas Mayr, Djork-Arné Clevert, Andreas Mitterecker, Ulrich Bodenhofer, and Sepp Hochreiter. cn.MOPS: Mixture of Poissons for discovering copy number variations in next generation sequencing data. Nucleic Acids Research, 40, 2012. Jan O. Korbel, Alexander E. Urban, Jason P. Affourtit, Brian Godwin, Fabian Grubert, Jan Fredrik Simons, Philip M. Kim, Dean Palejev, Nicholas J. Carriero, Lei Du, Bruce E. Taillon, Zhoutao Chen, Andrea Tanzer, A.C. Eugenia Saunders, Jianxiang Chi, Fengtang Yang, Nigel P. Carter, Matthew E. Hurles, Sherman M. Weissman, Timothy T. Harkins, Mark B. Gerstein, Michael Egholm, and Michael Snyder. Paired-end mapping reveals extensive structural variation in the human genome. Science, 328:420–426, October 2007. Slavik Koval and Victor Guryev. DWAC-seq: dynamic window approach for CNV detection using sequencing tag density, February 10, 2011. User Guide Megapool Reference DNA EA-100M (male) & EA-100F (female). Kreatech Diagnostics, Vlierweg 20, 1032 LG Amsterdam, version 1.0 edition, March 2010. Hugo Y.K. Lam, Michael J. Clark, Rui Chen, Rong Chen, Georges Natsoulis, Maeve O'Huallachain, Frederick E. Dewey, Lukas Habegger, Euan A. Ashley, Mark B. Gerstein, Atul J. Butte, Hanlee P. Ji, and Michael Snyder. Performance comparison of whole-genome sequencing platforms. Nature Biotechnology, 30:78–82, June 2012. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25:2078–9, 2009. Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009. Y.M. Lo, N. Corbetta, P.F. Chamberlain, V. Rai, I.L. Sargent, C.W.G. Redman, and J.S. Wainscoat. Presence of fetal DNA in maternal plasma and serum. The Lancet, 350(9076):485–487, 1997. Daniel Pinkel, Richard Segraves, Damir Sudar, Steven Clark, Ian Poole, David Kowbel, Colin Collins, Wen-Lin Kuo, Chira Chen, Ye Zhai, Shanaz H. Dairkee, Britt-Marie Ljung, Joe W. Gray, and Donna G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20:207–211, October 1998. James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, and Jill P. Mesirov. Integrative Genomics Viewer. Nature Biotechnology, 29:24–26, 2011. Serge J. Smeets, Ulrike Harjes, Wessel N. van Wieringen, Daoud Sie, Ruud H. Brakenhoff, Gerrit A. Meijer, and Bauke Ylstra. To DNA or not to DNA? That is the question, when it comes to molecular subtyping for the clinic! Clinical Cancer Research, 17:4959–4964, June 2011. R. Straver, H. Holstege, E.A. Sistermans, C.B.M. Oudejans, and M.J.T. Reinders. Detection of fetal copy number aberrations by shallow sequencing of maternal blood samples. Unpublished, 2012. Lisenka E.L.M. Vissers, Bert B.A. de Vries, and Joris A. Veltman. Genomic microarrays in mental retardation: from copy number variation to gene, from research to diagnosis. Journal of Medical Genetics, 47:289–297, 2009. Ruibin Xi, Angela G. Hadjipanayis, Lovelace J. Luquette, Tae-Min Kim, Eunjung Lee, Jianhua Zhang, Mark D. Johnson, Donna M. Muzny, David A. Wheeler, Richard A. Gibbs, Raju Kucherlapati, and Peter J. Park. Copy number variation
detection in whole-genome sequencing data using the Bayesian information criterion. Proceedings of the National Academy of Sciences, 108(46):E1128–E1136, 2011. Chao Xie and Martti T. Tammi. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10(80), 2009. Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, Kenny Ye, and Jonathan Sebat. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research, 19:1586–1592, 2009.
6 SUPPLEMENT
6.1 Data generation on aCGH All data used in this research were derived from blood samples. The samples were analyzed using 4x180k Agilent microarrays following the VU medical center's standard operating procedure (SOP). The DNA concentration is verified before analysis. Labeling is done using the ENZO DNA labeling kit [ENZO Life Sciences, Inc.]. After preparation, the array is scanned using 'Agilent Scan Control' [Agilent Technologies, a] and processed with 'Feature Extraction' [Agilent Technologies, b].
6.2 NGS data generation The samples were sequenced on an Illumina HiSeq 2000 and prepared using the TruSeq protocol of Illumina [Illumina, Inc., 2011]. The reads were filtered on Illumina chastity [Illumina, Inc., 2010] to ensure high read quality. The chastity filter looks at the cluster of the read on the flow cell and calculates a score per base (the ratio of the intensity of the highest signal divided by the sum of the two greatest signals). It is a measure that indicates how well the cluster can be distinguished from the surrounding clusters; if there has been interference, the read will not pass the filter. For all samples the read length was the same (51 base pairs). When coverage is mentioned, the following equation is used:

Coverage = (Number of reads × Length of read) / Size of genome
Unless stated otherwise, the coverage is calculated after quality filtering, PCR duplicate removal and tower removal. The size of the genome is determined by counting all A, T, C and G bases in the chromosomes of the 1000 Genomes Project reference genome GRCh37 (hg19), resulting in 3,095,677,409 bases. The samples contain certain CNVs that are of special interest (clinical calls) due to potentially pathogenic properties; these are described in table S1. All samples were obtained in the clinic. The patients had symptoms of intellectual disability and the array results were generated to find potential phenotype-related CNVs.
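The coverage definition above can be written directly as a one-line function; the defaults use the 51-bp read length and the GRCh37 base count stated in the text.

```python
def coverage(read_count, read_length=51, genome_size=3_095_677_409):
    """Coverage as defined in the supplement:
    (number of reads x read length) / size of genome."""
    return read_count * read_length / genome_size
```

For example, roughly 60.7 million 51-bp reads correspond to 1x coverage of GRCh37.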
6.3 Alignment Alignment of the reads is done using BWA [Li and Durbin, 2009] version 0.6.1. Two settings were used: conservative alignment, which allows no mismatches, and less stringent alignment, which allows one mismatch per read. Allowing one mismatch permits mapping of reads containing SNPs. The reference genome used for alignment is obtained from the 1000 Genomes Project: GRCh37, found on the Sanger FTP [1000 Genomes Project, 2009]. PCR duplicates can be observed as a side-effect of the construction of the library for sequencing. The duplicates are the result of the amplification bias that is introduced when using the Polymerase Chain Reaction, but they can also be the result of simply sequencing the same region twice. Due to the uncertainty of their origin, exact duplicate reads were removed using Samtools rmdup [Li et al., 2009] version 0.1.18. BWA aligns reads that can be mapped to more than one location on the genome to one of these locations randomly and gives such a read a mapping quality (MQ) of 0. As the origin of these reads is unknown and their random mapping does not provide additional information for our purpose, these non-unique reads are deleted.
This is done using Samtools, making use of the MQ of 0. The resulting BAM-files contain only uniquely mapped reads. The coverage after mapping and filtering for all test and control samples is listed in table S2. The Integrative Genomics Viewer [Robinson et al., 2011] is used to check the alignment and read-depth between the described postprocessing steps.
6.4 Read-tower removal
After the removal of the PCR duplicates and non-unique reads, the formation of several peaks in the read-depth can be observed (figure S1). This phenomenon, referred to as read-towers, consists of stacks of largely overlapping reads that are situated mostly near centromeres and telomeres, but can also be observed at other locations on several chromosomes. The read-tower formation strongly influences the statistics of the sample: towers may contain more than 10,000 reads, which distorts the normalization of the rest of the chromosome. The centromeres and telomeres contain large repeat regions. These repeat regions cannot be assembled correctly and are not included in the reference genome. A possible explanation for the ill-understood phenomenon of read-tower formation is that the reference genome still contains part of a repeat region: this region is sequenced many times, but the reads have only one spot to be aligned to, so the stack at this location becomes very high. The positioning of the read-towers varies strongly between samples, so an approach had to be developed to reduce the influence of the towers based on the statistics of the data. 6.4.1 RETRO filter The REad-Tower RemOval filter, RETRO, was developed to reduce the influence of the read-towers on the data. As the samples are shallow sequenced and have a low coverage, the filter can make use of the two distinguishing properties of the read-towers: the reads mostly overlap, and many reads are assembled in a small region (the height of the tower). The pseudo-code of the filtering can be found in Algorithm 1 (in the main text). In words: reads with a starting position within N base pairs after the current read are added to the current stack of reads in the tower. This step is repeated with the next read until a read is encountered with a starting-point distance larger than N.
The stack that is created after this process is kept if the number of reads in the stack is below the threshold T; otherwise all reads but the first are deleted from the data set. 6.4.2 Optimizing parameters To determine the optimal settings for the RETRO-filter, first the chance that a tower could occur in a random data set is calculated. We assume that the reads are spread uniformly over the genome. To determine the chance for any read to be within N base pairs from the next read, the distance between the current and next read is modeled using the Poisson distribution. The expected distance between any two successive reads can be calculated by dividing the length of the genome by the total number of reads. The rate λ (expected value) can then be described as in Equation S1:

λ = Read count / Genome length    (S1)
Table S1. Overview of the CNVs of interest (clinical calls) for the clinic that are located in the available samples. The column ’Detected’ states whether the region is detected by DWAC-seq with the use of the best parameters for the sample. One CNV in sample 3 was found in 7 separate calls using aCGH and manually merged.
Location | Type | Length | # Probes | Probe Median | Gene | Detected
Sample 1: 22q12.3 | Deletion | ~65 kb | 4 | -0.9748 | LARGE | Yes
Sample 2: 3q12.3 | Duplication | ~60 kb | 4 | 0.5125 | ZPLD1 | Yes
Sample 3: 5p15.1-p13.3 | Triplication | ~12.9 Mb | 752 (7 reg.) | 0.9771 | Multiple | Yes
Sample 3: 5p13.2 | Duplication | ~78 kb | 5 | 0.4226 | C1QTNF3-AMACR | No
Sample 3: 5p13.2 | Duplication | ~148 kb | 11 | 0.4518 | No genes | No
Sample 3: 15q24.3 | Deletion | ~182 kb | 12 | -0.9625 | SCAPER | Yes
Sample 4: 4p16.1 | Duplication | ~110 kb | 6 | 0.6687 | SORCS2 | Yes
Sample 4: 5q15-q23.3 | Deletion | ~34 Mb | 1960 | -0.9252 | Multiple | Yes
Sample 4: 7p14.1 | Deletion | ~238 kb | 15 | -0.9128 | C7orf10 | Yes
Sample 4: Xp11.22 | Duplication | ~69 kb | 6 | 0.8931 | Multiple | Yes
Sample 4: Xq21.2 | Homozygous Deletion | ~70 kb | 4 | -3.9688 | DACH2 | Yes
Table S2. Coverage (number of times the genome is sequenced) of the samples right after alignment (’Aligned’) and after quality filtering, PCR duplicate and tower removal (’Filtered’). Two different alignment settings are used (0/1 MM). Control A is sequenced on three different lanes, in two of these lanes multiplexed with sample 4 (see 8 for details). Control B is an external control set created from data of the VUmc project to detect aberrations of fetal DNA in the blood of the mother, and contains data of all healthy samples from two runs (27 samples in total).
Sample | Aligned (0MM) | Aligned (1MM) | Filtered (0MM) | Filtered (1MM)
Sample 1 | 0.352 | 0.482 | 0.318 | 0.431
Sample 2 | 0.418 | 0.572 | 0.378 | 0.510
Sample 3 | 3.191 | 4.097 | 2.804 | 3.634
Sample 4R1 | 1.392 | 1.559 | 1.243 | 1.374
Sample 4R2 | - | 1.190 | - | 1.049
Sample 4R1,2 | - | 2.749 | - | 2.402
Control AR1 | - | 1.449 | - | 1.282
Control AR2 | - | 1.037 | - | 0.919
Control AR3 | 1.550 | 2.743 | 1.164 | 2.397
Control AR1,2,3 | - | 5.229 | - | 4.489
Control B | 2.885 | 3.205 | 2.498 | 2.567
The chance for any read to have the next read starting within N base pairs after the starting point of the current read is described by Equation S2:

P(n ≤ N) = Σ_{i=0}^{N} λ^i e^{−λ} / i!    (S2)

Here n describes the actual distance between any two successive reads and N is the parameter, set in advance, that indicates the maximum distance between the starting points of the reads, which is used to create stacks as described above. The chance that a stack occurs with a height (the number of reads in the stack) of T or higher is given by Equation S3:

P(t ≥ T) = P(n ≤ N)^{T−1}    (S3)
In the above equations T is the threshold used for deleting a read-tower from the data and t is the current stack height. The −1 corrects for the fact that the calculation starts with the current read, so a stack always has a height of at least 1. Combining all of the above results in Equation S4:

P(t ≥ T) = ( Σ_{i=0}^{N} λ^i e^{−λ} / i! )^{T−1}    (S4)
Using Equation S3, the expected number of removals of read-stacks that were formed randomly (instead of being part of an actual read-tower) can be approximated for each combination of N and T. As it is preferred to remove only read-towers (and not stacks of reads that may occur randomly), this expected number of removals of randomly created stacks should be as close to zero as possible; the chance to have zero reads as part of a random read-tower should be as great as possible. Equations S5 and S6 describe this:

E = P(t ≥ T) · R    (S5)

with E the expected number of towers found in a sample containing R reads, and

C = P(t ≥ T)^R    (S6)

with C the chance for a read-tower to occur anywhere in the sample containing R reads. Settings should be picked keeping E
Fig. S1. The formation of read-towers can be clearly observed in IGV. The reads that are stacked contain a large overlap and are mainly positioned around the centromere and telomere regions. The lower part of the figure shows the individual reads that occur in this read-tower. The colors indicate that the read is aligned to the reference genome (depicted as the straight grey line from left to right) allowing one mismatch (so the base on the read is different from the base on the reference genome). Above the line that represents the reference genome, the read-depth signal is depicted. This signal is cut off by IGV because of the very large number of reads aligned in this region.
between 0 and 1, making the chance that a random set of reads is removed very low while keeping the script sensitive enough to remove abnormal behavior.
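Equations S1 through S6 can be evaluated numerically to pick N and Ttower. This is a sketch that follows the formulas as written, with λ = read count / genome length as in Equation S1; function and parameter names are illustrative.

```python
import math

def p_within(lam, n):
    """P(n <= N): Poisson sum of Equation S2 for rate `lam` and shift `n`."""
    return sum(lam**i * math.exp(-lam) / math.factorial(i) for i in range(n + 1))

def expected_random_towers(read_count, genome_length, n, t):
    """E = P(t >= T) * R (Equations S1, S3/S4 and S5)."""
    lam = read_count / genome_length          # Eq. S1
    p_tower = p_within(lam, n) ** (t - 1)     # Eq. S3/S4
    return p_tower * read_count               # Eq. S5

def chance_of_tower(read_count, genome_length, n, t):
    """C = P(t >= T)^R (Equation S6, as written in the supplement)."""
    lam = read_count / genome_length
    return (p_within(lam, n) ** (t - 1)) ** read_count
```

Given a sample's read count, one can scan combinations of N and T and keep those for which E lies between 0 and 1, as stated above.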
6.5 CNV detection Detection of CNVs is done using several NGS tools, each of which is discussed below in more detail. Most tools take the sample BAM-file [Li et al., 2009] as input. Some tools also need a BAM-file of the control sample that is used to determine CNVs. The parameters that can be tuned for each tool can be found in table S3. 6.5.1 CNVnator CNVnator [Abyzov et al., 2011] version 0.2.5 uses the read-depth of the test sample to find CNVs in the data. The tool depends on the ROOT package ([CERN] version 5.32/01) and Samtools. Using an intermediary .root file the tool analyzes the data and returns a list of CNVs. The tool divides the data into same-size bins and counts the reads in each bin, creating a read-depth signal. This signal is then corrected for bias introduced by GC-content. Partitioning of this signal is performed using a mean-shift technique (described in [Abyzov et al., 2011]). Adjacent regions are merged if their signals are similar. A CNV is called if the segment's mean read-depth signal deviates more than 25% from the genomic average. 6.5.2 RDXplorer RDXplorer [Yoon et al., 2009] version 2.0, release 3 is run at a window size of 100 bases. The tool uses these windows to assemble a read-depth signal by counting the mapped reads in each window. Only two window sizes can be chosen: 10 and
100 bases. The testing in the paper is performed using high-coverage data (30x), under the assumption that at this coverage the number of reads positioned in a window approaches a normal distribution. Applying the tool to low-coverage data still assumes this distribution, which might cause misbehavior of RDXplorer. For each window a GC-content correction is applied using r̃_i = r_i · m / m_GC, where r_i is the read count in window i, m the median of the read count over all windows and m_GC the median of the windows with similar GC-content as window i. This corrected read-depth is then analyzed using event-wise testing. The read count of a window is converted into a Z-score ((r̃_i − μ)/σ) and the upper- and lower-tail probabilities are determined. A window is called by looking at a set of consecutive windows. For each window in this set, the upper- and lower-tail probabilities that lie below (FPR/(L/l))^{1/l} are listed, and the maximum of the resulting probabilities is selected as an 'unusual event'. The FPR can be set in advance; L is the number of windows in the chromosome and l the number of windows in the consecutive set. The size of the consecutive set is increased in multiple iteration steps. Some filtering steps are performed to remove small deletions and duplications before returning the results. 6.5.3 DWAC-seq DWAC-seq [Koval and Guryev, 2011] version 0.56 is used to detect CNVs in the sample data using a control sample. The input data should be in BAM-file format. As the tool works on each chromosome separately, a script was made to analyze the whole genome. The output contains 5 different files; the file with detected CNVs is sorted by relevance. The start and stop position of each CNV, as well as the number of reads in the test and control sample, is stated. The score given to each CNV is the ratio of test and control counts divided by the median of the average reads per segment in between breakpoints.
The window size is dynamic and determined by a constant number of reads per window. Using static windows has the disadvantage that the same window is used for low- and high-covered regions. The use of a dynamic window reduces the influence of GC-content, as it automatically increases the window when encountering low-coverage regions. The windows are set by counting the number of reads in the control sample. After setting the window, the number of reads in the corresponding window of the test sample is counted. The next step is the segmentation of the data; this is done using the CUSUM Monte Carlo method. This method finds breakpoints by bootstrapping: by taking random values of the read-depths of a sliding window (containing multiple windows), a cumulative sum is calculated. Each element is corrected by a weight (the average ratio of the read-depth in the set of windows), and when the cumulative sum reaches a threshold value, a breakpoint is found. Regions between breakpoints with similar signal are merged and the resulting CNV regions are fine-tuned by looking at all the relevant start and stop positions. 6.5.4 CNV-seq CNV-seq [Xie and Tammi, 2009] version 0.2-7 is selected because of its capacity to include a control sample in the analysis of the test sample. The read-depth data of both test and control sample are analyzed using a sliding window defined by a number of bases. A statistical model is used to reduce the effect of sequencing biases such as GC-content. The size of the sliding window is calculated by the tool itself on a per-sample basis. The number of reads in a specific
window depends on the total number of reads, the length of the genome and the size of the window, and it is assumed that the numbers of reads are Poisson distributed. The test/control ratio per window is corrected by the total read-depth of both samples. The probability of this copy number ratio occurring randomly is calculated and the p-value is computed (see [Xie and Tammi, 2009] for the exact procedure), after which the calls are given in the output file.
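The window-level GC correction r̃_i = r_i · m / m_GC described for RDXplorer in section 6.5.2 can be sketched as follows. This is an illustration, not RDXplorer's own code; representing GC-content as discrete bin labels is an assumption made here for simplicity.

```python
from statistics import median

def gc_correct(read_counts, gc_bins):
    """Median-based GC correction: scale each window count r_i by m / m_GC,
    where m is the genome-wide median count and m_GC the median count over
    windows sharing the same GC-content bin. `gc_bins[i]` is a hashable GC
    label (e.g. rounded GC percentage) for window i; all-zero GC bins would
    need special handling in real data."""
    m = median(read_counts)
    by_gc = {}
    for count, gc in zip(read_counts, gc_bins):
        by_gc.setdefault(gc, []).append(count)
    m_gc = {gc: median(v) for gc, v in by_gc.items()}
    return [r * m / m_gc[gc] for r, gc in zip(read_counts, gc_bins)]
```

Windows in under-represented GC bins are scaled up and those in over-represented bins scaled down, flattening the GC bias before event-wise testing.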
6.6 Finding the best parameters for DWAC-seq
To find the parameters that result in good performance of DWAC-seq, four scoring methods were developed (figure 9). Two parameters were analyzed and optimized: the calling threshold T and the window size W. These parameters were varied while the other parameters were kept at their defaults. The threshold was varied between 0.1 and 0.5, the window size from 100 to 5000 reads per window. Varying these parameters allowed us to construct a Receiver Operating Curve (ROC), from which the optimal setting could be derived. The ROC plots the false positive rate against the true positive rate, defined as follows:

False positive rate = FP / (FP + TN)
True positive rate = TP / (TP + FN)
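Given the counts produced by any of the four scoring methods, the two ROC coordinates follow directly:

```python
def roc_point(tp, fp, fn, tn):
    """Return (false positive rate, true positive rate) for one
    parameter setting, as used to build the ROC."""
    return fp / (fp + tn), tp / (tp + fn)
```

Each DWAC-seq parameter setting contributes one such point; the setting whose point lies closest to (0, 1) is considered optimal.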
The plots in figures S2 and S3 show the four different scoring measures used to score the parameter settings. These measures all have advantages and disadvantages. In theory the datapoint closest to (0,1) is considered optimal. Not all scoring methods show a ROC behavior that approaches this ideal point. The region-based method shows in both figures S2 and S3 a diagonal and not a curve. This shows that using this measure, no trade-off can be made between the number of false positives and true positives. The gene-based method is based on the idea that only CNVs in genetic reasons will be relevant in the clinic. Expected is that not a lot of CNVs lie in these regions, which is the explanation of the abrupt leveling of the curve in both plots. From these observations we conclude that these two measures seem least reliable to evaluate the CNV detection tools. Looking at figure S2, the nucleotide-based and region-based nucleotide methods both show a behavior that approaches the (0,1) point. The most optimal threshold setting seems to be 0.25. This can be concluded from both evaluation methods, although the nucleotide-based method bends more pronounced towards the (0,1) point. Experiments with changing window sizes using values of 100, 500, 1000, 2000, 3000, 4000 and 5000 (an fixing T =0.8), shows that a higher window size results in a lower the number of calls. The results of this experiment can be found in figure S3. The same observation can be made when we focus on the different evaluation methods. In this case the nucleotide-based method is very variable and difficult to use for concluding an optimum window size. In general, the window size shows very variable results, which is due to differences in coverage, but also because of the type, length and depth of the CNVs that can occur. It seems that there is no definite optimum window to determine over the samples. 
When looking at the best scoring window size per sample, a window size of 1000 seems to give the best results. When running the data with the best parameters T = 0.25 and W = 1000, the number of calls made is low: for samples 1 to 4 it is 1, 4, 9 and 8 calls. The numbers of aCGH calls that overlap with these calls are 3, 4, 5 and 9. Hence, almost all calls are true positives, but many calls are missed. Therefore, we decided to decrease the window size to 100, where more calls are generated (and the ROC performance is still relatively good). These calls are further analyzed and filtered to obtain a data set with probable CNVs. It should be noted that this setting was chosen using control sample AR1. Changing the coverage of either the test or the control sample has a large influence on the choice of the window size.
6.7
Visualization
Visualization of the read-depths of the test sample and control sample is done using Matplotlib 1.1.0 [Hunter et al.]. The genome is divided into bins of 1000 base pairs and the number of reads in both test and control sample is counted (a read is counted in the bin containing its start position). The read-depths of the bins are divided by the total number of reads in the sample. When an empty bin occurs (no reads are aligned in this region), the bin is skipped in the calculation of the log2 ratio signal (the bin is deleted from the log2 ratio signal). The resulting plot contains the read-depths of the test and control sample and the log2 ratio. Apart from this analysis, the number of empty bins over a region is also plotted; this clearly shows which regions are unmappable. The visualization of the aCGH output is done in a similar way. The signals of the test and control sample (green and red fluorescence signals) are plotted. The actual analysis of the aCGH is performed by looking at the log2 ratio (ratio_i = log2(signal_g,i / signal_r,i)). Thresholds for calling are also plotted at -1.25 ('big loss'), -0.3 ('loss'), 0.3 ('gain') and 1.0 ('high gain'). We also plot the probe density, to show which regions of our calls contain (almost) no probes. If the NGS tool calls a region in which no probes are located, that could explain the differences in output. For comparative purposes, the aCGH plots are displayed in figures S4 to S7.
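The binning and log2-ratio computation described above can be sketched as follows. This is a minimal stand-in for the actual pipeline (which reads aligned read positions from file); the function and variable names are my own.

```python
import math
from collections import Counter

def log2_ratio(test_starts, ctrl_starts, bin_size=1000):
    """Bin read start positions into fixed-size bins, normalize each
    sample by its total read count, and compute the per-bin log2 ratio.
    Bins that are empty in either sample are skipped (deleted from the
    log2 ratio signal), mirroring the handling of unmappable regions."""
    test = Counter(p // bin_size for p in test_starts)
    ctrl = Counter(p // bin_size for p in ctrl_starts)
    n_test, n_ctrl = len(test_starts), len(ctrl_starts)
    out = {}
    for b in sorted(set(test) | set(ctrl)):
        if test[b] == 0 or ctrl[b] == 0:
            continue                       # empty bin: drop from the signal
        rd_t = test[b] / n_test            # read-depth, normalized by total reads
        rd_c = ctrl[b] / n_ctrl
        out[b * bin_size] = math.log2(rd_t / rd_c)
    return out

# tiny example: three usable bins on a toy genome
print(log2_ratio([100, 150, 1200, 2500], [120, 1300, 1350, 2600]))
```

The returned positions and ratios can be passed straight to Matplotlib's plot function, alongside the raw per-bin read-depths.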
6.8
Rule-filtering
Manual annotation of the combined calls generated by analyzing the aCGH and NGS data resulted in three sets of CNVs: a positive, a negative, and an undetermined set. The annotation was performed by looking at the signal information derived from the test and control sample. The separation between the negative and positive set mainly lies in the coverage in the region of the CNV: if there is (almost) no coverage, the CNV cannot be trusted. Based on this observation, rules were developed that can separate the two classes. These rules can be found in the main text. Mainly, the rule filter deletes calls that contain empty bins and have less than 20% coverage compared to the overall coverage of the chromosome. If empty bins are located in only one sample but not in the other, a homozygous deletion could be detected; in that case, if the coverage of the control sample is above 50% of the expected coverage, the call is kept. The filter effectively deletes low-coverage and unreliable calls from our set, reducing the number of false positives considerably.
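A minimal sketch of such a rule filter, assuming per-bin read counts are available for each call; the field names and the exact form of the thresholds are assumptions based on the description above, not the actual filter code.

```python
def keep_call(call, chrom_mean_cov):
    """Sketch of the rule filter (field names and threshold details are
    assumptions).  `call` holds per-bin read counts over the called
    region for the test and control sample."""
    mean = lambda xs: sum(xs) / len(xs)
    test_empty = 0 in call["test_bins"]
    ctrl_empty = 0 in call["ctrl_bins"]
    # below 20% of the chromosome-wide coverage counts as low coverage
    low_cov = mean(call["test_bins"]) < 0.2 * chrom_mean_cov
    if (test_empty or ctrl_empty) and low_cov:
        # exception: empty bins in the test sample only, with the control
        # sample above 50% of the expected coverage, may indicate a
        # homozygous deletion -> keep the call
        if test_empty and not ctrl_empty and mean(call["ctrl_bins"]) > 0.5 * chrom_mean_cov:
            return True
        return False
    return True

# a well-covered call is kept, a near-empty one is filtered out
print(keep_call({"test_bins": [25, 30, 28], "ctrl_bins": [27, 31, 29]}, 30))
print(keep_call({"test_bins": [0, 1, 0], "ctrl_bins": [0, 2, 1]}, 30))
```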
6.9
Relevance of CNV calls
The tools that are used all apply different algorithms to produce CNV calls. The number of calls based on NGS data
Detection of CNVs using Shallow Whole Genome NGS Data to replace aCGH Analysis
Table S3. The four NGS CNV detection tools and the parameters to be set. A '-' means that this parameter cannot be varied for this tool. 'Default' means that the parameter could be changed, but only the default setting is used.

Setting tuned       CNVnator                       RDXplorer        DWAC-seq      CNV-seq
Input file          BAM file for each chromosome   BAM              BAM           Tab-sep. file
Window size         -                              Default          Tuned, 100    Calculated
Calling threshold   -                              -                Tuned, 0.25   Default
Sensitivity         -                              -                Default       Default
Size genome         -                              Build HG19       -             Build HG19
Alignment ref.      -                              Alignment ref.   -             -
[Figure S2 panels: Nucleotide-based Method, Region-based Nucleotide Method, Region-based Method, Gene-based Method. Legend: samples 1-4, including the 1MM, 25% and 50% variants.]
Fig. S2. ROC curves for all samples with different threshold settings. The four plots at the top show the results of the different scoring measures described in figure 9. The thresholds used are 0.1, 0.2, 0.25, 0.3, 0.4 and 0.5. From left to right in the plots, T decreases: the point in the lower left corner corresponds with T = 0.5, the point in the upper right corner with T = 0.1.
is larger than the number of calls based on the aCGH data. We have introduced several methods to compare the two sets of calls. The question that remains, however, is: which CNVs are most important to find? For that reason we annotated the aCGH calls for their clinical relevance. The aCGH calls classified as 'clinically relevant' are all found using DWAC-seq, with the exception of two calls from sample 3. The signals of these two regions can be found in figure S8. When manually re-annotating these calls, call 1 would be assigned to the negative class, due to the low coverage in both test and control sample, which results in a variable log2 ratio that shows no clear duplication or deletion at this location. The log2 ratio of call 2 also does not show a clear duplication or deletion; closer evaluation of the read-depths and log2 ratio in the region shows no aberration. From this we conclude that when evidence for a duplication or deletion is lacking in the read-depth data, the NGS CNV detection tool will not call such areas, and one might question the original call in the aCGH data.
The most important discriminative factor for deciding whether a CNV is true is the depth of the call: deeper CNVs are considered more reliable. For biological interpretation of the results, other factors have to be included. One of the first steps is to exclude CNVs situated in regions known for biological variance. As CNVs are a naturally occurring phenomenon and can induce natural variance within the human population, a method can be developed to filter these from the results (for example by using a database containing these naturally occurring CNVs). Another way to make sense of the call data is to look at regions with a known biological function: a CNV situated in a gene has a higher probability of being pathogenic than a CNV in a region with unknown function. If the CNV lies in a gene known to be involved in some disease phenotype, it has an even higher probability of being the cause of a certain phenotype. Using this information, a type can be defined for each CNV that is found: Type I for calls that are not malignant; Type II for calls that are probably not harmful; Type III for calls that are probably malignant; and Type IV for calls that are pathogenic.
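The four types could be assigned with a simple decision rule along the lines of the sketch below; the exact criteria per type are an assumption made for illustration only.

```python
def cnv_type(in_known_variant_db, in_gene, gene_disease_linked):
    """Illustrative mapping of a CNV onto the four classes described
    above; the precise criteria per type are an assumption."""
    if in_known_variant_db:
        return "Type I"    # known natural variant, not malignant
    if not in_gene:
        return "Type II"   # region of unknown function, probably not harmful
    if gene_disease_linked:
        return "Type IV"   # disease-associated gene, pathogenic
    return "Type III"      # in a gene, probably malignant

print(cnv_type(False, True, True))   # → Type IV
```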
D.M. van Beek et al
[Figure S3 panels: Nucleotide-based Method, Region-based Nucleotide Method, Region-based Method, Gene-based Method. Legend: samples 1-4, including the 1MM, 25% and 50% variants.]
Fig. S3. ROC curve for all samples when applying DWAC-seq to the NGS data and for different window sizes. The four plots at the top show the results of the different scoring measures described in figure 9. Window sizes that are used are 100, 250, 500, 1000, 2000, 3000, 4000 and 5000 control sample reads per window. From left to right in the plots, W increases: the point in the lower left corner corresponds with W = 100, the point in the upper right corner with W =5000.
Fig. S4. The aCGH output corresponding with the NGS plot in figure 2. Top to bottom: log2 ratio (blue); test sample signal (green); control sample signal (red) and p-value of the log2 ratio (black). The blue line indicates the threshold for calling a gain, the pink line for a loss. As aCGH only considers 3 or more consecutive probes a call, this region is not shown in the aCGH output. The log2 ratio shows a duplication, while we know from the NGS data that a deletion has taken place in the control sample.
Fig. S5. The aCGH output corresponding with the NGS plot in figure 3. See figure S4 for an explanation of the panels. The red and yellow vertical lines display the start and stop positions of the NGS and aCGH call respectively. Both NGS and aCGH show a deletion at this location.
Fig. S6. The aCGH output corresponding with the NGS plot in figure 4 is plotted in this figure. The log ratio of the probe data shows a clear deletion, while the NGS data shows very variable behavior in the log2 ratio due to low coverage in this region. We concluded that this call should not be considered a CNV.
Fig. S7. The aCGH output corresponding with the NGS plot in figure 5 is plotted in this figure. The log ratio shows a deletion in one probe, but this is filtered out of the aCGH results due to the minimum of 3 probes required per call. The NGS data does not consider this call as noise, but clearly indicates a deletion in the test sample.
Fig. S8. The NGS data of the two clinical calls of sample 3 that are not found by DWAC-seq (see table S1 for more details about these calls). Call 1 is located from 34,062,999 to 34,141,109; call 2 from 34,234,590 to 34,382,802. The yellow lines indicate the start and stop positions derived from the aCGH.
Work Document
Daphne van Beek
Master's Thesis Project
Detection of Copy Number Variation using Shallow Whole Genome Sequencing Data to replace Array-Comparative Genomic Hybridization Analysis
October, 2012
Computer Science, track Bioinformatics
Supervisors: Marcel Reinders (TUD), Janneke Weiss (VUmc), Erik Sistermans (VUmc)
Table of Contents

Copy number variation detection with array CGH
Copy number variation detection using Next Generation Sequencing
Research questions
Next Generation Sequencing CNV detection tools
Overview of the array CGH process
Alignment
Improving the alignment
Control Sample
BIC-seq pipeline
Comparing NGS and aCGH results
DWAC-seq
Read-towers
CNVnator
Dilution experiment
Alternative alignment
CNV-seq
RDXplorer
Results DWAC-seq/RDXplorer/CNVnator
    FP-rate scoring 1; TP-rate scoring 1; FP-rate scoring 2; TP-rate scoring 2
    Samples: 19, 20, KG, KGDI50, KGDI25, Daphne1, 19_1MM, 20_1MM, KG_1MM, KGDI50_1MM, Daphne1_1MM
Optimal window size
    Samples: 19, 20, KG, KGDI50, DGDI25, KG_1MM, KGDI50_1MM, 19_1MM, 20_1MM
Visualization of the data
Comparing DWAC-seq calls with array CGH plots
    True positive: Actual CNV
    True positive: Not a CNV
    False positive: Actual CNV found by NGS and 1-probe array CGH
    False positive: Not a CNV
    False positive: Actual CNV not found by array CGH
    False positive: Actual call, but is it because of an increase in sample RD or a decrease in reference RD?
    False positive: Not a real CNV
Empty bins
Probe density
Filtering done on low coverage data
Array CGH data plots
Dilution experiments
Changes in coverage of test and control sample
Multiplexing of the control sample with the test sample - important or not?
Control sample versus control sample
References
Copy number variation detection with array CGH

(Holstege 2010) Array-CGH: CNV analysis
• No aberrations without a net copy number change.
• A duplication cannot be characterized further; follow up with FISH. (Janneke adds: What exactly do you mean here? I think the point is that with a duplication you do not know where the duplicated piece of DNA is located in the genome. It can lie in tandem (directly adjacent) or somewhere else entirely. This cannot be seen with arrays or sequencing, but it can with FISH. Whether a duplication lies in tandem or elsewhere influences the interpretation. However, we often do not know this, because the duplications are usually too small to analyze with FISH.) (Erik adds: With sequencing it is ultimately possible to determine the orientation of the duplication and where it is located. With shallow sequencing this may be difficult because there is not enough data, but in the future this limitation will no longer exist.)

SNP-array: copy number analysis & genotyping
• Detection of genetic imbalance (loss of heterozygosity: no AB calls, all SNPs are AA or BB; copy-number-neutral LOH: no AB calls, but a duplication can still be detected, see directly below).
• A duplication is detected through 'skewed' AB calls (e.g. 75% B and 25% A).
• Copy number analysis: single channel; only the intensities of the test sample are measured per SNP probe and then compared with reference controls, e.g. a HapMap control group.
• Homozygosity mapping: is the stretch of homozygosity pathogenic?
• Determination of uniparental disomies (UPDs): a chromosome pair derived from one parent, without an aberrant number of chromosomes. Not detectable with array-CGH.
• Determination of the origin of CNVs/UPDs.
• Mosaicism (Vissers, de Vries et al. 2009).

Resolutions:
• Karyotyping: CNVs of > 5-10 Mb
• CGH arrays: CNVs > 5-10 kb
• High-resolution CGH arrays: CNVs > 20 bp (van de Wiel, Brosens et al. 2009) (Tucker, Montpetit et al. 2011)

Working out the array analysis. The procedure is as follows: after the analysis, raw data files appear on the computer. These data files are read into Nexus. To simulate this myself I created my own project (Daphne_Test) to see how this works. I first read in Agilent array data. The arrays have room for 4 different samples and I read in all four. Per raw_output.txt file there are 16 different 'lanes'. There is also an Imp_raw_output.txt, which appears to show no differences from raw_output.txt, and a Norm_Imp_raw_output.txt, which contains deviating, normalized values. The processing of the raw normalized files into separate .txt files per sample (i.e. per person) is done by an R script that parses the large raw file into the separate files. The way normalization is done differs per chosen profile. Settings profile loaded: "Settings Agilent 180k (inc dye flip)_20-09-2011"
The settings profiles are standardized so that they are always the same. The compared values are specified below.

Setting                                     Agilent 180k              Affy 6.0 CNV            Affy 6.0 LOH
Parameters - Signal ratio                   rProcessedSignal/         -                       -
                                            gProcessedSignal
Remove Flagged Spots - Flags                IsSaturated,              -                       -
                                            IsFeatNonUnifOL
Combine Replicates Within Array - Type      Mean                      -                       -
Systematic Correction - Type                None                      None                    None
Recenter Probes - Type                      Median                    Median                  Median
Combine Replicates Between Arrays - Type    None                      Mean                    Mean
Analysis - Type                             FASST2 Segmentation       SNPRank Segmentation    SNP-FASST2 Segmentation
Significance Threshold                      1.0E-4                    1.0E-5                  5.0E-7
Max Contiguous Probe Spacing (Kbp)          1000                      1000.0                  1000.0
Min number of probes per segment            3                         10                      3
High Gain                                   1.0                       0.6                     0.7
Gain                                        0.3                       0.3                     0.1
Loss                                        -0.3                      -0.3                    -0.15
Big Loss                                    -1.25                     -1.0                    -1.1
3:1 Sex chromosome gain                     1.2                       1.2                     1.2
4:1 Sex chromosome gain                     1.7                       1.7                     1.7
Homozygous Frequency Threshold              -                         0.95                    0.85
Homozygous Value Threshold                  -                         0.8                     0.8
Heterozygous Imbalance Threshold            -                         0.4                     0.4
Minimum LOH Length (KB)                     -                         5000                    5000
Minimum SNP Probe Density (Probes/MB)       -                         0.0                     0.0
Robust Variance Sample QC Calculation -
Percent outliers to remove                  3.0                       3.0                     3.0
The final file after parsing contains the following fields: LogRatio, LogRatioError, PValueLogRatio, gProcessedSignal, rProcessedSignal, gProcessedSigError, rProcessedSigError, gMedianSignal, rMedianSignal, gBGMedianSignal, rBGMedianSignal, gBGPixSDev, rBGPixSDev, gBGSubSignal, rBGSubSignal, gBGMeanSignal, rBGMeanSignal. The values in it are thus extracted from the original files. I still have to look into this conversion; I do not think it is on the VUmc server (I cannot find it). Soon I will also look at how the array analysis is done in the lab and see whether I can trace the digital processing somewhere. This Thursday I will visit Berber (room 390) to discuss data (one sample for which both SNP-array data and whole genome sequencing data are available; the sequencing is less than 1x, but could already be usable to get started).

Started on the literature that still had to be read. (Wood, Belvedere et al. 2010): They perform an experiment to sequence DNA fixed in formalin-fixed paraffin-embedded samples. This is often DNA of low quality, which is not the case with our samples. They also perform an experiment in which they compare aCGH arrays with copy number analysis by sequencing. The steps taken are:
• Normalizing the copy number data for both sample and reference
• Calculating the log2 ratio of the normalized tumour:normal read counts for each window
• The Bioconductor DNAcopy package was used for calling
• Pairwise comparison of aCGH data and sequencing data by calculating the Pearson correlation, using an average aCGH copy number for every genomic window of sequence reads
2.9 billion bp in the whole human genome → 2.5 million reads of 45 bp → 0.0387x coverage. The paper also gives nice overviews of the mean length, number and size of the smallest CNVs found versus the number of reads obtained. This data is simulated: reads were randomly deleted.
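The coverage figure quoted from Wood et al. follows directly from reads × read length / genome size:

```python
genome_bp = 2.9e9      # whole human genome
n_reads = 2.5e6        # reads obtained
read_len = 45          # bp per read

coverage = n_reads * read_len / genome_bp
print(round(coverage, 4))              # 0.0388, i.e. the ~0.0387x quoted above

# conversely, at 0.2x coverage on the same genome:
print(round(0.2 * genome_bp / 1e6))    # 580 (million bp sequenced)
```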
The coverage that was tested and the resolution (the smallest CNV found) at that coverage:
0.04x → 15 kb
0.013-0.021x → 0.9 & 1.5 Mb
0.029-0.036x → 89-200 kb
To compete with aCGH arrays, CNVs of 5-10 kb and larger must at least be detected. Because a coverage of 0.036x is considerably lower than the data I will receive (0.2x coverage), my data might already be sufficient to reach the desired resolution. The experiment presented in this paper could be set up with the data I receive. One problem arises here: I do not think reference data is being sequenced. Ideally the reference data would be the same as the reference used for the microarrays. Because I do not know which reference data the VUmc uses (I believe a large subset of 100 men and 100 women) and whether this data has ever been sequenced, I have to ask Janneke about this. The Wood 2010 paper used UCSC version hg19. 0.2x coverage → 580 million base pairs; I do not yet know the read length (approx. 45 bp?). A 180k microarray from Agilent costs 300 euros, excluding chemicals, sampling costs and working hours.

Normalization of microarrays can be done in several ways. First you take the values of the two channels (green and red) and subtract the background. After this step the zero and negative values must be removed, because a correction may not result in a negative number, and a zero value cannot be used to calculate a ratio. The sample value is divided by the reference value, which yields the 'expression ratio'. This expression ratio is then converted into the log2 ratio. As a result, if the sample and reference have the same expression, the log2 ratio is 0. If the reference is much smaller than the sample, the log2 ratio is positive (1 indicates that the expression of the sample is twice that of the reference). A negative log2 ratio indicates the opposite. Once the log2 ratio has been calculated, normalization can be applied. First, mean normalization: this is simple; the mean over all log2 ratios is calculated for one experiment and subtracted from every probe/expression value. Median normalization works across all experiments on the array (there are often 4 areas on the array onto which different experiments can be loaded). It is calculated by computing, for each probe, the median over the four experiments. Of all these medians (one per probe) another median is calculated, MM. Each probe value is then multiplied by (MM/A), where A is the median of the expression values in the experiment the probe belongs to. For one array with four areas, 1 MM and 4 A-values thus have to be calculated. Finally there is also a standard-deviation normalization, where the normalized expression value equals (X-mu)/sigma. If a similar system is to be applied to sequence data, a log2 ratio must also be calculated for the data. For this a reference is needed, and as described earlier this requires careful thought. Affy 6.0 → 1.8 million polymorphic & non-polymorphic markers, giving a resolution of 10-20 kb (which is about a gene).

Started unravelling the microarray pathway used here. Multiple arrays are of course used. First the aCGH array: Agilent 180k. The scanner produces files that are read into Feature Extraction.
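The three normalizations described above can be sketched as follows, assuming positive expression values and four sub-array experiments; the function names are my own.

```python
from statistics import mean, median, pstdev

def mean_normalize(experiment):
    """Subtract the experiment-wide mean from every probe value."""
    m = mean(experiment)
    return [x - m for x in experiment]

def median_normalize(experiments):
    """Cross-experiment median normalization as described above:
    MM = median of the per-probe medians, A = per-experiment median;
    every probe value is scaled by MM / A."""
    per_probe = [median(col) for col in zip(*experiments)]   # median over the 4 experiments
    mm = median(per_probe)
    return [[x * mm / median(exp) for x in exp] for exp in experiments]

def sd_normalize(experiment):
    """Standard-deviation (z-score) normalization: (x - mu) / sigma."""
    mu, sigma = mean(experiment), pstdev(experiment)
    return [(x - mu) / sigma for x in experiment]

print(mean_normalize([1, 2, 3]))                 # → [-1, 0, 1]
```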
(BMC Bioinformatics 2009, 10:80; http://www.biomedcentral.com/1471-2105/10/80)
Copy number variation detection using Next Generation Sequencing
(Xie and Tammi 2009), CNV-seq: Random sampling in shotgun sequencing results in uneven coverage, which may lead to observed coverage ratios that falsely imply CNV. A statistical model is therefore essential to assess the probability of a ratio occurring by random chance, given no copy number variation.

The average number of reads in a region of a genome depends on:
• the total number of reads sampled
• the length of the genome
• the length of the region

This relationship is used to compute a minimum sliding window size that achieves a desired minimum confidence level for the observations. The approach is analogous to a comparison between microarray probes and a single sequence set, but it avoids two major problems of the hybridization-based approach: hybridization efficiency varies among microarray probes, and the copy number of probe sequences in the assembled reference genome may not represent the true one (regions containing multiple copies are difficult to assemble).

The reads of reference and query are mapped, and a sliding window is used to analyze the mapped regions: the number of reads is counted in each window for each individual, yielding ratios. These ratios are compared to the probability of a random occurrence, given no copy number variation. The mean number of reads for the X and Y genomes in a sliding window determines the distribution of the ratios. Because of the uneven coverage, the number of reads in a window is approximately Poisson distributed, Po(λ), where the mean number of reads per window λ is given by

λ = N·W / G    (1)

where N is the total number of sequenced reads, G is the size of the genome and W is the size of the sliding window (W << G).
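As a quick sketch of the relation λ = N·W/G, the expected read count per window and the window size needed for a target mean can be computed directly. The numbers below are illustrative, not the thesis data, and the function names are my own:

```python
# Minimal sketch of lambda = N*W/G and its inversion for the window size.
# Read and genome counts below are illustrative, not the thesis data.

def mean_reads_per_window(n_reads, genome_size, window_size):
    """Expected read count (lambda) in a sliding window of window_size bp."""
    return n_reads * window_size / genome_size

def min_window_for_mean(n_reads, genome_size, target_mean):
    """Smallest window (bp) whose expected read count reaches target_mean."""
    return target_mean * genome_size / n_reads

lam = mean_reads_per_window(40_000_000, 3_100_000_000, 10_000)  # ~129 reads
w = min_window_for_mean(40_000_000, 3_100_000_000, 100)         # 7750 bp
```

This is the same trade-off the paper describes: for a fixed number of reads, a higher desired confidence (larger expected count per window) forces a larger window and therefore a coarser resolution.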
CNV-seq combines the advantages of aCGH and sequencing; the authors also assessed CNV detection in sequenced individual genomes (Dr. J. Craig Venter [24], among others). An implementation of the method is available at http://tiger.dbs.nus.edu.sg/CNV-seq. The method is based on a robust statistical model that allows confidence assessment of copy number ratios and is conceptually derived from aCGH: instead of hybridizing two sets of genomic fragments to a whole-genome microarray, CNV-seq uses two sets of shotgun reads, one from each target individual, mapped by sequence alignment onto a template genome, and CNVs are detected with the sliding window approach described above.

The predicted copy number ratio r in each sliding window is

r = z · N_Y / N_X    (2)

where z is the ratio of read counts in the window and N_X and N_Y are the total numbers of reads for genomes X and Y respectively. Assuming an independent distribution of the read counts, we can obtain a probability p of a copy number ratio being r or more divergent from the 1:1 ratio by random chance. For this, the distribution of the read count ratio z is needed; it is given by the Gaussian ratio distribution [25]. Computing this distribution is cumbersome, but it can be transformed to another variable t by the Geary-Hinkley transformation [26]:

t = (μ_Y·z − μ_X) / sqrt(σ_Y²·z² + σ_X²)    (3)

where μ_X, μ_Y and σ_X², σ_Y² are the means and variances of the read counts for X and Y respectively. The new variable t approximately follows a standard Gaussian distribution, and the best theoretical minimum window size W is calculated as described in the paper. (Straver 2012)

Note: CGH arrays work with log2 ratios to indicate the difference in abundance between the reference and the query.

Sequencing depth = the average number of nucleotides contributing to a portion of an assembly; it tells us how many times a nucleotide is sequenced on average.
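Equations (2) and (3) can be sketched in a few lines. This is my own illustration, assuming Poisson read counts so that the variance equals the mean (σ² = μ = λ) in each genome; the function name and numbers are not from the paper:

```python
# Sketch of equations (2)-(3): copy number ratio for a window and a
# two-sided p-value against a 1:1 ratio, assuming Poisson read counts
# (variance = mean = lambda) for each genome.
import math

def cnv_ratio_pvalue(x, y, n_x, n_y, lam_x, lam_y):
    """x, y: window read counts; n_x, n_y: total reads;
    lam_x, lam_y: mean reads per window, for genomes X and Y."""
    z = x / y                          # observed read count ratio in the window
    r = z * n_y / n_x                  # predicted copy number ratio, eq. (2)
    # Geary-Hinkley transformation, eq. (3); Poisson: variance = mean
    t = (lam_y * z - lam_x) / math.sqrt(lam_y * z * z + lam_x)
    # t is approximately standard normal; two-sided tail probability
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return r, p

r, p = cnv_ratio_pvalue(200, 100, 1_000_000, 1_000_000, 100, 100)
```

A window with twice the expected reads in one genome (as above) yields r = 2 and a very small p, while equal counts give r = 1 and p = 1.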
Nucleotide depth = the number of sequences that contributed information about the nucleotide (i.e. the number of reads that describe the nucleotide).

LOESS correction: approximate a function by using low-order polynomials.

(Biodiscovery 2011)
Preprocessing step: arrive at measurements that represent the ratio of signals from the experiment compared to the control sample at multiple locations along the genome of interest. Preprocessing can include removal of flagged spots, background correction, normalization, and combining replicates. Log2 ratios are calculated.

Copy number estimation: arrange the ratios along the chromosome. Thresholding and definition of gain, loss, high copy gain, high copy loss.

Rank segmentation algorithm: to improve the thresholding approach stated above, multiple probes that are relative neighbors of each other are used to create a moving average value. This moving average can be used to make calls, but it is sensitive to window size and outliers. This is where the Rank Segmentation algorithm comes in; it has one parameter, the significance threshold. After the segmentation process the genome contains small segments, and each segment has a cluster value: the median log-ratio value of all the probes in that region. A maximum contiguous probe spacing (in kbp) is specified to prevent calls being made in areas with no probes over regions larger than specified.

SNPRank segmentation algorithm: similar to rank segmentation, but also takes B-allele frequency values from SNP arrays into account.

Assignment definition so far:
• How deep must be sequenced to replace the CGH array?
• Can the sequencing be used to observe loss of heterozygosity?
To complete this assignment, I will use 2 pilot samples at the start of the research. These have been sequenced … x and can be 'diluted' to simulate lower resolutions.

Spent a day shadowing Janneke.
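The moving-average-plus-thresholding idea described above can be sketched briefly. The window size and the gain/loss thresholds here are illustrative (not the Nexus defaults), and the function names are my own:

```python
# Minimal sketch of moving-average smoothing of probe log2 ratios
# followed by threshold-based gain/loss calling.

def moving_average(log_ratios, k):
    """Mean of each window of k consecutive probe log2 ratios."""
    return [sum(log_ratios[i:i + k]) / k
            for i in range(len(log_ratios) - k + 1)]

def make_calls(smoothed, gain=0.3, loss=-0.3):
    """Classify each smoothed value as a gain, a loss or normal."""
    return ["gain" if v >= gain else "loss" if v <= loss else "normal"
            for v in smoothed]

ratios = [0.0, 0.0, 0.6, 0.6, 0.6, 0.0, -0.6, -0.6, -0.6, 0.0]
calls = make_calls(moving_average(ratios, 3))
```

The sketch also shows the sensitivity mentioned above: a single outlier probe shifts k consecutive window means, which is exactly what the rank segmentation approach is meant to be robust against.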
• Meeting with the clinical genetics department and physicians
  o 2 presentations: Gea Beunders & Els Voorhoeve on a new syndrome caused by a deletion in the AUTS2 gene; Janneke Weiss on incidental findings, which occur in about 2% of cases. A number of patients were discussed here.
• Work meeting of the DNA and protein diagnostics department
  o Mainly general matters were discussed (changes to work instructions, a look back at 2011 and the outlook for 2012).
• Explanation by Desiree Steenbeek of the analysis of the arrays, by which I mean the analysis after loading the raw data and applying a settings profile to it.
Research - Questions
Construction of the research question for this project: next generation sequencing of samples can replace CGH arrays. How deep must be sequenced to find the following aberrations:
1) At minimum, the aberrations that are found with the CGH array
2) Demonstrating LOH
3) Demonstrating the genomic haplotype (different SNPs in the genome)
These points are worked through from top to bottom. The data for this part of the research will consist of two pilot samples. These two samples have been sequenced and analyzed with a CGH array. The array analysis can be used as the minimal result that must be achieved for question 1: the aberrations found with the CGH array must at least also be demonstrated with the sequence data. The first step in this research is therefore to look at how the CGH analysis is performed. The CGH analyses that were run on the pilot samples are then examined. The second step is to perform a copy number analysis on the sequencing data (this experiment will be run some time in January). The results of this CNV analysis will be compared with the CGH analysis. Once it can be checked whether the results from the CGH array analysis can be recovered in the sequence analysis, it can be determined how much data is needed to reach this resolution. In this way, the sequencing depth needed to achieve the same result as the array is determined, and a cost/benefit analysis can be performed. Step 1 of the research question is then completed. Step 2 starts by looking at the depth at which sequencing must be done to observe LOH. The final step, step 3, will require the most sequencing depth. A haplotype can be determined quickly by looking only at tag-SNPs, but these must be visible, and it must be stated with what certainty the assignment of a particular haplotype takes place.
If sequencing errors occur at exactly the position of a tag-SNP, the result is wrong. Occurrence of loss of heterozygosity can be caused by a deletion, chromosome non-disjunction or a mitotic crossing-over.
Chapters to be included in the final report:
• CGH arrays
• CNV determination with sequence data
• Ethical, legal and social issues
Molecular haplotype determination using allele-specific PCR and pyrosequencing technology: http://www.sciencedirect.com/science?_ob=MiamiImageURL&_cid=272501&_user=499882&_pii=S0888754303001770&_check=y&_origin=&_coverDate=30-Sep-2003&view=c&wchp=dGLbVlt-zSkWb&md5=1ebcc59037fef5af551470f02ae7a6d6/1-s2.0-S0888754303001770-main.pdf
SOAPsnp: http://www.ncbi.nlm.nih.gov/pubmed/19420381?dopt=Abstract
The frequency of SNPs in each individual is 8x10^-4 per nucleotide, or about one SNP every 1.25 kb in the human genome. If a SNP is non-synonymous, it can be potentially pathogenic. A haplotype is a set of single-nucleotide polymorphisms (SNPs) on a single chromosome of a chromosome pair that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region.
Locus → Specific location of a gene or DNA sequence on a chromosome
Allele → One of two or more forms of a gene or group of genes (genetic locus). Variation can be introduced; the change is not always observable.
From "The International HapMap Project" by the International HapMap Consortium, Nature 426, 789-796 (December 18, 2003): A haplotype is made up of a particular combination of alleles at nearby SNPs. Tag-SNPs can identify a whole haplotype, as the tag-SNPs differentiate between all haplotypes. (The figure at www.nature.com/nature/journal/v426/n6968/fig_tab/nature02168_F1.html may make this concept clearer.) New haplotypes can be formed by mutation or by recombination of parental haplotypes.
Linkage disequilibrium = frequency of the haplotype - (frequency of allele 1 * frequency of allele 2)
Thesis project proposal: There are many patients in whom the occurrence of mental retardation cannot yet be genetically explained. The resolution of aCGH arrays is limited; currently copy number variations of 30-50 kb can be detected. Pathogenic CNVs are often mentioned in relation to ID, and in some patients the pathogenic CNVs are smaller than the detection limit of current methods. Next-generation sequencing data can be used to achieve a better resolution, even when the genome is not fully sequenced. This shallow sequencing is inexpensive and has the potential to replace current methods. Using NGS data, three problems are defined:
• What is the sequencing coverage needed to detect CNVs at the resolution that is currently reached using aCGH arrays?
  o The resolution of the aCGH arrays that are currently used is 5-10 kb; this should be matched. How can the results of the aCGH arrays be compared to the sequencing results?
  o How can structural variations be determined in the sequenced data? Which corrections should be made, and when are calls made?
  o What is the resolution of the NGS-based method, and how does this resolution relate to the sequence depth? Is this dependent on the sliding window used in the segmentation algorithm?
• What is the sequencing coverage needed to detect loss of heterozygosity?
  o How do you detect loss of heterozygosity in the sequencing data?
  o What length of LOH can still be determined at which sequence depth?
• Is it possible, with shallow sequencing data, to determine the haplotype?
  o How do you detect haplotypes in sequencing data? Is it possible to reliably detect the tag-SNPs that are involved?
  o How many SNPs are needed to assemble a realistic overview of the haplotype?
The last major questions really fall under one common denominator: the SNP detection capacity of the sequencing.
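As a quick numeric check of the linkage disequilibrium definition noted earlier (D = frequency of the haplotype minus the product of the two allele frequencies), with made-up frequencies:

```python
# Worked example of the linkage disequilibrium formula noted above:
# D = freq(haplotype) - freq(allele 1) * freq(allele 2).
# The frequencies are invented for illustration.

def linkage_disequilibrium(p_haplotype, p_allele1, p_allele2):
    return p_haplotype - p_allele1 * p_allele2

d = linkage_disequilibrium(0.30, 0.50, 0.40)  # 0.30 - 0.20 = 0.10
```

D = 0 corresponds to alleles occurring together no more often than expected by chance; a positive D, as here, indicates the association that makes tag-SNPs informative.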
Next Generation Sequencing CNV detection tools
Depth-of-coverage based methods, as we don't have paired-end data and most of the tools make use of PE reads.
(Campbell, Stephens et al. 2008): Paired-end sequencing.
(Chiang, Getz et al. 2009): Copy number ratio = # aligned reads / # reads aligned to normal sample. Read again, as this could be a useful application.
(Yoon, Xuan et al. 2009): Event-wise testing, also a potentially useful method.
SEQanswers:
• CNVnator, from the 1000 Genomes people
• CNAseg – a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Uses paired-end data. Ivakhno, Royce, Cox. http://www.ncbi.nlm.nih.gov/pubmed/20966003
• Accurate and exact CNV identification from targeted high-throughput sequence data. Nord, Lee, King, Walsh. Not yet able to read because the link did not work (a problem with the Linux PDF viewer at the TU). http://www.ncbi.nlm.nih.gov/pubmed/21486468
• CNVseq: manual at http://tigr.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf The paper states that single-end data can be used, but on SEQanswers I read that this is not the ideal method to use for cancer data, as it does not deal well with the complex changes seen in tumour samples: "It ignored focal amplifications within a chromosomal gain, calling it all one gain. It also didn't like homozygous deletions". I need to find out whether this is the case with my samples as well.
• As reference genome: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp
• SVDetect?
• CNANorm?
• GC-content correction is important when using sequence data for CNV detection (already built into some tools; according to SEQanswers, FREEC is a standalone tool for it). A LOESS-style correction is no unnecessary luxury either.
• DNAcopy from Bioconductor for tumour samples
• Segseq?
• http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics%20application?Biological_domain=Copy_number_estimation – a list of more programs that I can still look at.
Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next generation sequence data. Gusnanto, Wood, Pawitan, Rabbitts, Berri: http://www.ncbi.nlm.nih.gov/pubmed/22039209?dopt=Abstract

Which of the following programs can be used? The requirements are that the input sequence data is single-read, that a correction for GC content is made (e.g. FREEC), and that smoothing is applied (LOESS style?).

Tool            | Single-end input?                    | GC-content correction? | Smoothing? | Other remarks | Year
CNVnator        | Yes, but paired-end is also possible | Yes                    | ?          |               | Feb 2011
CNAseg          | I think it is possible, but their quality score is based on paired-end | Yes | ? |    | Okt 2010
CNVseq          | Yes                                  | Yes                    | Yes, LOESS |               | Aug 2008
Nord/Lee/King   | Paired-end combined with single-end  |                        |            |               | 2011
SVDetect        | Mapped paired-end reads & chromosome size data of the reference genome used for the alignment | | | | Feb 2010
CNANorm         | Yes                                  | Yes                    | Yes        |               | Okt 2011
Segseq (Chiang) | Yes                                  |                        |            | Between one and eight million aligning reads (compared to approximately half a billion reads for full coverage) are enough to provide genome-wide CNA at 50 kb resolution. | Jan 2009
CopySeq         | Yes, but paired-end mapping is also possible | Yes              | Yes        |               | Nov 2010
Golden Helix    |                                      |                        |            | Commercial    |
(Xi, Hadjipanayis et al. 2011): use an algorithm called BIC-seq. With it they found CNVs, the smallest being 40 bp, at a coverage of 10x. They state that Segseq and CNV-seq are well constructed because they perform a correction by using a reference genome, but go wrong when they assume that reads are uniformly distributed over the genome. The reads will not be uniformly distributed for the following reasons: GC-rich DNA is sequenced more readily than GC-poor DNA; there can also be a skewed distribution when the DNA library is prepared; and finally there are reads that can be mapped to multiple locations and can therefore end up in the wrong place. Since this algorithm has a high resolution and was developed in 2011, it is a good method to follow. CNANorm and CNVnator are also on the list of tools that I would like to investigate.
(Alkan, Coe et al. 2011): A nice review of the aCGH and SNP arrays. If I had found this at the start, my information gathering would have been less fragmented. The method of Yoon and CNVnator come up again here. The list I have to choose from is now as follows:
• CNANorm: (Gusnanto, Wood et al. 2012)
• CNVnator: (Abyzov, Urban et al. 2011)
• (Xi, Hadjipanayis et al. 2011)
• (Yoon, Xuan et al. 2009)
• cn.mops: (Klambauer, Swarzbauer et al. 2012)
cn.MOPS is described in a paper I received from Marcel today. According to this paper, the cn.MOPS method is better than the method of Yoon. I went through the technical specifications of the tools to decide which one to try first. Since CNANorm has good references on SEQanswers and moreover has the best documentation available, I will start with it. I saw that for most programs I have to convert the SOAP output into BAM or SAM files, so I installed SAMtools.
CNANorm and cn.MOPS are both based on R, and BIC-seq is a Perl script, so these programs are the easiest to run. CNVnator and DWAC-seq require other programs, which I can also install if needed.
At the moment I have BIC-seq working; this tool is from 2011. Since there are also two tools on the list that are even more recent (2012), I would have liked to look at those as well. The problem with this is the following. The output of SOAP is a text file (in my case called .soap). This text file can be converted into a SAM file with soap2sam.pl, a small script that can be downloaded with SAMtools. The SAM file must then be converted into a BAM file, which is used as input for both cn.mops and CNVnator. The BAM file is generated, but quite a lot goes wrong: an incorrect header turns out to be created. Moreover, I cannot perform this conversion on the UI, and running it on my Mac takes too long. BWA is an aligner that is often mentioned in the literature together with SOAP. I chose this aligner because its standard output is a SAM file, which may also be easier for my purpose; in any case I wanted to see whether I could create the required BAM files with it. I got cn.mops installed, but unfortunately the program does not work on the data I have: the coverage is too low, and the program finds, even with varying settings, not a single CNV in the data. DWAC-seq, CNVnator and RDXplorer have in any case been selected for the further analyses; I will keep testing other tools. I also received a tool from the NKI. I have to do many of the analyses on my own laptop, because the UI has outdated versions of software. Some things I manage to compile; others I unfortunately cannot get working. Many tools are built on older software packages, which means I sometimes have to adapt parts of the code so that the correct input and output are produced.
Meeting with Marcel:
• Make an overview of the two pipelines you are going to use: microarray + sequencing
• Look at paraffin and frozen microarray experiments and how they are compared
• Look at the microarray versus sequencing comparison experiments and make sure the corresponding paper is sent around to Janneke, Erik and Marcel

Overview of the array CGH process
The scanner sends the scanned data to 'Feature Extraction Agilent 10.5'. I went through its manual and looked at the algorithms that are executed; I also went through the Nexus manual. In short, this is what I found:
• The output file of Feature Extraction consists of three tables: FEPARAMS, STATS and FEATURES.
• FEPARAMS: settings used to run Feature Extraction
• STATS: statistics resulting from the computations on the data (e.g. mean, SD, etc.)
• FEATURES: results of the computations on each feature
• I looked at all fields; the information provided is very extensive
A number of algorithms are used by Feature Extraction to get from the raw scan data to the output file. The raw scan data is a photo of the array, of which there are two. The steps that are performed are listed below.
1. The grid is placed so that all probe locations can be defined.
2. The grid is adjusted iteratively to obtain a perfect fit (looking at the spots in the corners).
3. The spot centroids are localized; these are the centers of the spots.
4. Features are defined using the CookieCutter method (places an ellipse with a standard radius on the centroid) or the WholeSpot method (performs a spot size calculation using a noise model).
5. The radius of the background is determined.
6. Outlier pixels are defined and removed.
7. The mean signal of a spot is calculated.
8. The mean signal of the local background is calculated.
9. Determine whether a feature must be flagged as 'saturated'; this is measured with a threshold.
10. Determine whether the feature is a non-uniformity outlier (if the measured variance is larger than the product of the estimated variance and the confidence interval multiplier).
11. Determine whether the feature is a population outlier.
12. Subtract the background from the feature signal.
13. Calculate the background spatial shape to determine the surface.
14. Adjust the background.
15. Calculate statistics based on the negative controls.
16. Calculate the error in the computation of the signal (both the background-subtracted and the detrended signal/spatial shape).
17. Check whether the signal is significant relative to the background signal (the maximum p-value is specified by the user in advance).
18. Check whether the background-subtracted signal is sufficiently above the background.
19. The surrogate signal is calculated (can be used as the lowest limit of detection).
20. Apply multiplicative detrending (an algorithm to counter variation over the slide; variation can arise from, for example, a difference in reaction time at the edges of the array, the 'dome effect').
21. Dye bias correction: determine normalization features.
22. Dye bias correction: calculate normalization.
23. Determine whether the surrogate feature must substitute low-intensity signals.
24. Calculate the dye-normalized signal.
25. Calculate the processed signal (mainly dependent on whether the surrogate was used or not).
26. Calculate the log ratio of the feature.
27. Calculate the p-value and error of the log ratio of the feature.
28. Gridding tests (QC).
Nexus uses the output of these steps as input, with the settings shown in the table given earlier in this work document (10 January). The starting point in Nexus is the set of settings used per sample; these settings determine what exactly happens to the data (the settings can be found in the table above). For the Agilent 180k arrays, the settings mean the following:
• Spots flagged in Feature Extraction as 'IsSaturated' or 'IsFeatNonUnifOL' are not used in further calculations and are removed from the dataset. 'IsSaturated': a feature is saturated if 50% of the pixels in a feature are above the saturation threshold. 'IsFeatNonUnifOL': states whether a feature is a non-uniformity outlier. A feature is non-uniform if the pixel noise of the feature exceeds a threshold established for a 'uniform' feature.
• Recenter probes: moves the center of the probe distribution to zero. This step is performed with Agilent 180k data.
• Combine replicates within array: says how replicates on the array are treated. For Agilent 180k data, the mean over these probes is taken.
• Combine replicates between arrays: says how replicates are treated when multiple arrays are run. For Agilent 180k this is not performed.
• Robust variance sample QC calculation – percent outliers to remove: this setting removes the 1.5% of probe data that is highest and lowest. 3 is standard.
• The segmentation algorithm used is FASST2.
• Threshold value for a High Gain: 1
• Threshold value for a Gain: 0.3
• Threshold value for a Loss: -0.3
• Threshold value for a Big Loss: -1.25
• 3:1 sex chromosome gain: values above this limit are considered a 3:1 sex chromosome gain: 1.2
• 4:1 sex chromosome gain: values above this limit are considered a 4:1 sex chromosome gain: 1.7
• Significance threshold: determines how sensitive the segmentation algorithm is. For Agilent 180k this is 1.0E-4.
• Maximum contiguous probe spacing: defines the maximum spacing between probes before a segment is broken. This is useful so that no call is made in regions without probes (e.g. the centromere). The default is 1000, which is also used here.
• Min number of probes per segment: the number of probes needed to form a new segment. For tumour samples this can be around 5, to ignore small CNVs, but for normal samples 3 is best; here 3 is used as well.
• Furthermore, the default setting for the background correction is that no correction is done, and the normalization is set to 'Divide by mean' by default.
• An account has been created for me on the Delft UI to put the final data on; I have read through the tutorial. I searched for papers in which microarray data are compared, but unfortunately have not been able to find much yet.
Alignment
I went through SOAP2 and looked at how to run it. I also have some output data from SOAP2 and looked at how to make sense of it. I checked how many reads were sampled and then made an alignment with SOAP2, with the settings such that no mismatches are allowed. The results are shown in the table below. Remarks for this table:
• Lane 5 is the lane that, besides my two samples, also contained another sample.
• Lanes 6, 7 and 8 were used only for my two samples.
• I first aligned the data of lane 5 and of lanes 6, 7 and 8 separately with SOAP, to see whether the quality of the bad lanes also affected the alignment. Afterwards I merged the aligned reads and the unmapped reads for each sample.
• In this first experiment I did not take into account the filtering performed by Illumina (chastity >= 0.6; chastity = the ratio of the highest of the four (base type) intensities to the sum of the highest two. This filter makes clear which clusters lie so close together that it is hard to say which signal comes from which cluster; http://illumina.ucr.edu/ht/documentation/file-formats). I may still have to perform this filtering; I will look into this further.
• The size of the entire reference was used to calculate the coverage before and after the alignment: 3095677412 base pairs.

Sample ID            | Total reads | Mapped reads | Unmapped reads | % mapped | Coverage before/after
Sample 19, lane 5    | 33496678    | 26193128     | 7303550        | 78.20%   | 0.5518/0.4315
Sample 19, lanes 6-8 | 6397323     | 4625781      | 1771542        | 72.31%   | 0.1054/0.0762
Sample 19, total     | 39894001    | 30818909     | 9075092        | 77.25%   | 0.6572/0.5077
Sample 20, lane 5    | 38177902    | 29862485     | 8315417        | 78.22%   | 0.6290/0.4920
Sample 20, lanes 6-8 | 9041462     | 6642420      | 2399042        | 73.47%   | 0.1490/0.1094
Sample 20, total     | 47219364    | 36504905     | 10714459       | 77.31%   | 0.7779/0.6014
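The coverage column can be reproduced as reads × read length / reference size. The 51 bp read length used below is an assumption (it is the read length mentioned later in these notes, and it is consistent with the coverage values in the table):

```python
# Sketch reproducing the coverage column of the table above:
# coverage = (number of reads * read length) / reference size.
# READ_LENGTH = 51 bp is assumed; the reference size is from the notes.

REFERENCE_SIZE = 3_095_677_412  # bp, entire reference genome
READ_LENGTH = 51                # bp, assumed read length

def coverage(n_reads, read_length=READ_LENGTH, genome_size=REFERENCE_SIZE):
    return n_reads * read_length / genome_size

before = coverage(33_496_678)  # all reads of sample 19, lane 5 -> ~0.5518
after = coverage(26_193_128)   # mapped reads only              -> ~0.4315
```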
I filtered all reads, keeping only the reads with the tag 'N' in the ID field (Illumina's chastity filtering), and ran these files through SOAP2 again. The results of this SOAP2 run are shown below.
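The chastity filtering step itself can be sketched briefly. The exact header layout below is an assumption based on the Illumina ':Y:'/':N:' filtered-flag convention, so it would need to be adjusted to the actual files:

```python
# Minimal sketch of the chastity filter described above: keep only reads
# whose FASTQ header carries the 'N' (= not filtered) flag. The header
# layout is assumed, based on the Illumina ':Y:'/':N:' convention.

def passes_chastity(header):
    """True if the read was not flagged by Illumina's chastity filter."""
    return ":N:" in header

def filter_fastq(lines):
    """Yield the 4-line records of reads that pass the filter."""
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        if passes_chastity(record[0]):
            yield record

reads = [
    "@HWI-ST1:1:1101:2:3 1:N:0:ACGT", "ACGTACGT", "+", "IIIIIIII",
    "@HWI-ST1:1:1101:2:4 1:Y:0:ACGT", "TTTTACGT", "+", "IIIIIIII",
]
kept = list(filter_fastq(reads))  # only the first record survives
```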
As can be seen in the table above, the coverage has gone down due to the filtering. I had hoped that the bad reads were of such low quality that they would not be mapped anyway; since the coverage (and, for that matter, the percentage mapped) is lower than before, this turns out not to be the case. The filtering is therefore good to perform, although it does mean that a higher coverage is needed. The question remains, of course, whether it also contributes to better CNV detection.
Because of the problem with converting SOAP2 output into a proper .sam file, I switched to another aligner: I now use BWA, which Daoud uses at the VU. BWA needs SAMtools in $PATH, but so far the indexing of the reference genome seems to go well. I hope I will not need SAMtools, but that remains to be seen. In the end the problem turned out to lie not with the aligner but with the reference that was used, so I adjusted the reference. BWA is so easy to use and so well documented, however, that I have decided to stick with it.
SAMtools finally runs well and BWA looks good. It took a while before I found the right settings; unfortunately the error messages that came back were not very helpful, but that is the risk you run when using tools developed in this field, since not many people use them. The SAM files produced by BWA have a complete header rather than a home-made one; I expect these will be accepted by the tools, and I will test this as soon as the run is finished.
Because so much data disappears during alignment, a solution was sought. Since the strict alignment I perform allows no deviations from the reference, reads will never be mapped across SNPs. Given that there are many SNPs in the genome, this is important.
For this reason I ran all samples through the pipeline once more, this time with 1 mismatch allowed. This results in a 20-30% increase in the number of reads. Today I did the finishing touches on the 1-mismatch data. Earlier I had required a quality of at least 20; however, this only allows mismatches at the end of the read.
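Filtering alignments on a minimum quality, as described here, can be sketched as follows. The function assumes SAM-formatted lines (MAPQ in the fifth column); the threshold and the example records are illustrative:

```python
# Sketch of a minimum mapping-quality filter on SAM lines.
# MAPQ 0 typically marks reads that map to multiple locations.

def filter_mapq(sam_lines, min_q=1):
    """Keep header lines and alignments whose MAPQ is at least min_q."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):         # header lines pass through untouched
            kept.append(line)
            continue
        mapq = int(line.split("\t")[4])  # MAPQ is the 5th SAM column
        if mapq >= min_q:
            kept.append(line)
    return kept

sam = [
    "@HD\tVN:1.0",
    "read1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t0\tchr1\t200\t0\t5M\t*\t0\t0\tACGTA\tIIIII",  # MAPQ 0: multi-mapper
]
kept = filter_mapq(sam, min_q=1)  # read2 is removed
```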
Since SNPs can of course also occur at the beginning of a read, I changed the quality requirement to 1. All q's of 0 are then thrown away, and the rest is kept; this removes the reads that can be mapped to multiple locations. The reads with a low score can, however, contain sequencing errors as well as SNPs, and this must be kept in mind. It is one of the main reasons we started looking at an alternative way of aligning; until then I will keep working with this data. I have also made a start on writing a tighter materials & methods.
Improving the alignment
For the detection of copy number variations at low coverage, it is important to be very certain about the location where the reads are aligned. This is why we chose to allow zero or one mismatch. One mismatch can be a SNP, but of course it also increases the chance that the read was simply sequenced incorrectly. Because of the low coverage it is not always possible (for some regions it is) to determine whether it concerns an error or a SNP; this is usually done by taking 3 reads that overlap the location, where the base with the highest count is taken as the one at that location. In our data it does not happen often that 3 reads with a mismatched base overlap. So what we mainly want is to align the reads to the reference genome while still allowing the SNPs. By now there are many databases that contain all known SNP positions. We (Roy and I) have devised a way to achieve this. For this we do not change the alignment tool itself, but the reference that is used: for every SNP, we add a fragment of 100 base pairs to the reference, in which we change the SNP location into every possible base that can occur there. If SNPs lie close to each other (a read can span two of them), we will take that into account as well.
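The fragment-generation part of this idea can be sketched as follows. The naming scheme and layout are my own illustration, and adjacent SNPs (which the full method would need to handle) are not covered in this sketch:

```python
# Sketch of the SNP-aware reference idea described above: for every known
# SNP, emit extra 100 bp reference fragments in which the SNP position is
# replaced by each alternative base. Adjacent SNPs are not handled here.

FLANK = 50  # 100 bp fragment: 50 bp on either side of the SNP

def snp_fragments(reference, snps):
    """Yield (name, fragment) per alternative allele of each SNP.

    reference : genome sequence as a string
    snps      : list of (position, [alternative bases]), 0-based positions
    """
    for pos, alts in snps:
        start = max(0, pos - FLANK)
        end = min(len(reference), pos + FLANK)
        for alt in alts:
            fragment = reference[start:pos] + alt + reference[pos + 1:end]
            # the name records the genomic start, so that read positions on
            # the fragment can later be shifted back onto the real genome
            yield ("snp_%d_%s_from_%d" % (pos, alt, start), fragment)

ref = "A" * 200
frags = list(snp_fragments(ref, [(100, ["C", "G"])]))  # two 100 bp fragments
```

Recording the genomic start position in the fragment name is what makes the second step possible: reads aligned to a fragment can have their start positions shifted back onto the real genome coordinates.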
The most important thing is that the alignment of a read at these locations remains unique, so all possible combinations must be handled accurately. When this reference is then used to align data, many reads will align to these thousands of 100 bp fragments. Since we know exactly where these small fragments belong on the genome, we can easily adjust the start positions of the reads, giving a dataset of aligned reads in which all possible SNPs are allowed. We have written part of the code for this, but have not gotten around to testing it further. We hope to be able to do this sometime in the future; perhaps we will have time left at the end of our master's thesis projects.
Control sample
In this chapter two different references are mentioned: a reference genome that is used for the alignment of the NGS reads, and another that, like the sample, consists of reads and is used as the 'normal' in various tools.
Talked with Marcel; we discussed the proposal and the references. The references still need to be discussed with Janneke, because I still have some questions about them. I know, for example, that the Agilent 180k array probes are built on the hg18 reference, while hg19 is the newest update of the human genome on NCBI. The plan is to align against hg19, but what bias does that introduce? Can I simply do this? I also need the reference that is used on the microarrays. It is also an idea to check whether coverage information is available for hg19, although this is of course an assembled reference. This could mean that not all sequencing was done by Illumina, and processing data from different platforms could be difficult. After this conversation with Marcel I went to the VU and, with Daoud's help, picked up the sequenced samples.
Unfortunately, Daoud told me that something went wrong during the sequencing. The original three lanes with the two multiplexed samples had too high a density in the lane,
which made detection very difficult. However, some surplus sample had also been loaded into a lane that did yield good results. Judging by the amounts of data, more data comes from this one lane than from the three full lanes combined. I received the reference data from Daoud, who was present at the PhD defence of W. Meuleman. I have transferred this data to the TU and it is ready for assembly. The reference data was obtained last Friday. I shortened this data, since the reads had a length of 101 base pairs; the data has now been trimmed to 51 bp reads. The original reference files contain no information about the quality of the reads in terms of chastity filtering. The string quality fields are present, though, so I could still look into those. I then assembled the shortened reads with SOAP2. The results are shown in the table below.
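The shortening of the 101 bp reads to 51 bp described above can be sketched as a plain FASTQ trimmer (both the sequence line and the quality line are cut to the same length; FASTQ records are four lines each):

```python
def trim_fastq(in_path, out_path, length=51):
    """Trim every read in a FASTQ file to `length` bases; the quality
    string is trimmed to match so records stay consistent."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            if i % 4 in (1, 3):  # sequence line or quality line
                line = line[:length]
            fout.write(line + "\n")
```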
On the question of which reference to use, it came up that nowadays arrays are used whose probes are based on hg19. This is not the case for my samples, however, so the problem remains. I also briefly checked whether the mapping differs when, instead of mapping two separate samples, the filtered reads are first pooled and then mapped in one go. Here the XX dataset results in 7920916 mapped / 5232800 unmapped (60.2%). This is exactly the same as adding up the two separate files; for XY the result is also exactly equal. So for the assembly it does not matter how many reads are available. We need to check whether the sample comes from a female or a male patient in order to choose the correct reference. Sample 19 is examined first; both cases are male. Using SAMtools I converted the .fq files into .sam files. Had a meeting with Janneke. The calls that Nexus makes on Agilent 180k hg18 arrays can be converted to hg19 by means of the UCSC browser. Janneke says that especially the numbering of hg19 deviates considerably from hg18, because more and more pieces have been added. This certainly matters when comparing calls. If the calls can be converted properly via the UCSC genome browser, we do not think it will be necessary to redo the experiments on hg19 arrays; should that turn out to be needed after all, it remains a possibility. Janneke also stressed that it is important that I get my data, so she will discuss that with Erik. They will also discuss the samples I already have, for which the coverage turned out rather low.
A question that came up is whether the reference has ever been sequenced at a higher coverage; if I am going to analyse samples with high coverage I do need a reference, and I still have to find out to what extent the coverage of sample and reference must match for good CNV detection. I will also look into whether a general public reference is available somewhere. I have written a small script to convert the hg18 files into hg19 files.
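A sketch of the kind of conversion step involved: turning call records into BED lines that the UCSC liftOver tool accepts. The input format here, including comma thousands separators in the positions, is an assumption; BED uses 0-based start coordinates:

```python
def calls_to_bed(records):
    """Convert call records like ('chr5', '741,513', '847,231') into
    BED lines for the UCSC liftOver tool. Thousand-separator commas in
    the positions are removed; starts are shifted to 0-based."""
    bed = []
    for chrom, start, stop in records:
        s = int(start.replace(",", ""))
        e = int(stop.replace(",", ""))
        bed.append("%s\t%d\t%d" % (chrom, s - 1, e))
    return bed
```

The resulting BED file can then be lifted from hg18 to hg19 with the liftOver tool and the appropriate chain file.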
The online conversion of the 11DA file did not go smoothly. At a 100% sequence match, 17 records were not transferred. Of those, 7 are really from the first column, 3 are from the min size column and the last 7 from the max size column. I set the threshold to 0.05, because then all fields are transferred. This is important to keep in mind, though! There were messages both that a region partially no longer exists, and that it was split or duplicated in the new version (so those cannot be found even with a low threshold). One thing did change in the new file: the locations used to be written with commas as thousands separators, and now there is no separator any more. This did not seem necessary to me anyway.

There is still something wrong with the BAM files. After the meeting with Marcel, in which it came up to also take a good look at the coverage at the interesting locations (to see whether there is any coverage at all), I tried to load my BAM files into IGV, IGB and SeqMonk, all three viewers for assemblies. The error was the same in all three viewers: there turned out to be a double entry for the Y chromosome in the header. Because the viewers work with indexed files, they do not accept this. The double entry arises because in the reference genome I used (hg19) the Y chromosome is included in two parts. These are separated by location, but this is not reflected in the BAM files, so things already go wrong when the SAM files are created by BWA. My solution is to look at the reference genome compiled by the 1000 Genomes project. This reference (also hg19) contains the chromosomes plus a number of unique regions, whereas hg19 also includes many pieces twice, because those pieces are common variants. This reference has a single Y chromosome and therefore does not have the problem described above.
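Checking a SAM header for the kind of duplicate reference entry that tripped up the viewers can be sketched as:

```python
from collections import Counter

def duplicate_sq_names(header_lines):
    """Return reference names that occur more than once among the @SQ
    lines of a SAM header; indexed viewers such as IGV reject files
    with duplicate sequence names."""
    names = Counter()
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    names[field[3:]] += 1
    return [name for name, count in names.items() if count > 1]
```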
Furthermore, Roy and I spoke with Henne about the filtering. It came up that I could perhaps use part of Roy's data as a reference and take one sample to test with, at least until I have more data. That way I may already be able to see how CNVs are extracted when more data is available. I composed a new reference from Roy's healthy samples; it should be kept in mind that ~5% of that data originates from a baby. After alignment and filtering the coverage is 2.5, much higher than the earlier samples. A new sample was obtained as well; its coverage after filtering and alignment is 2.8. This is much higher than at the start and can therefore be used for, among other things, dilution experiments.

BIC-seq pipeline

Worked on setting up the BIC-seq pipeline. I tried quite a few things for this (there were options to use samtools), but in the end I was forced to write my own small Python script. I processed one of the samples together with the reference so that a correct input form for BIC-seq was generated. Now that this works, it is relatively easy to generate the input for BIC-seq. The following steps have been made so far and work:
This pipeline is run with the sample data and the reference data. This yields twice 24 .seq files, which serve as input for BIC-seq. A final step is generating a .config file
(generateBICconfig.py) that lists the locations of the files. I also ran BIC-seq today and it seems to work. Tomorrow I will tweak the .config file a little so that the output of BIC-seq is generated in a more logical order; at the moment the order of the output is unstructured and hard to recognise.

I took a further look at the BIC-seq calls without repeat-mask filtering. Of the 35 calls:
• 26 are recognised as a variation in DGV.
• 9 are not recognised.
• Of these 9 unrecognised calls, 6 can possibly be explained by repeat regions and/or dups, self chains.
• 3 calls are therefore either 'new' CNVs or false positives.

Concluding from the experiments with BIC-seq, applying a repeat masking generates far fewer calls than without this masking. More calls are found without it, but whether those calls are more reliable cannot be said, because too few samples have been processed to make a sound statement about this. The percentage that overlaps with the microarray calls is, however, higher when the repeat mask is applied.

                  Overlap with         Overlap with known   'New' CNVs or
                  microarray calls     CNVs (DGV)           false positives
S19 with mask     2/7  = 28.6%         6/7  = 85.7%         1/7  = 14.3%
S19 without mask  4/35 = 11.4%         26/35 = 74.3%        9/35 = 25.7%
S20 with mask     6/20 = 30%           14/20 = 70%          6/20 = 30%
S20 without mask  7/66 = 10.6%         48/66 = 72.7%        19/66 = 28.8%

Unfortunately, the pathogenic CNVs have not yet been found in the data as I have received it now.

Comparing NGS and aCGH results: BIC-seq

I measured quantitatively how many of the calls of the microarray and NGS sequencing experiments agree. For this I looked at the regions of CNV that are predicted by both systems at the various locations. If there is an overlap between two regions, a percentage overlap is calculated with the following formula (reconstructed here from the table values below):

    percentage overlap = 100 * overlap length / ((microarray region length + NGS region length) / 2)

The percentage is thus calculated by taking the mean of the two region lengths.
If the shortest region were taken, a one hundred percent overlap could arise even though that is impossible, because the pieces are not equal. The largest region could be taken instead, but it can also happen that the CNV in reality does not have that length and would then never reach 100% overlap. The results are briefly shown below.

Sample 19: Number of microarray calls: 80. Number of NGS calls: 109.

Match  Chr.  MA start   MA stop    MA region  NGS start  NGS stop   NGS region  Percentage
                                   length                           length      overlap
1      chr1  72759756   72818395   58639      72755901   72765700   9799        17.3704667
2      chr5  741513     847231     105718     790101     838800     48699       63.07506994
3      chr6  29843119   29905312   62193      29855001   31316800   1461799     6.602528097
4      chr6  32455074   32526380   71306      32102501   32586900   484399      25.66330277
Sample 20: Number of microarray calls: 28. Number of NGS calls: 217.

Match  Chr.  MA start    MA stop     MA region  NGS start   NGS stop    NGS region  Percentage
                                     length                             length      overlap
1      chr1  72759756    72818395    58639      72764501    72766300    1799        5.953208246
2      chr1  72759756    72818395    58639      72766901    72811600    44699       86.51028663
3      chr1  149004461   149259523   255062     149046601   149674800   628199      48.21275729
4      chr4  49289703    49581867    292164     49287401    49294200    6799        3.008409095
5      chr5  786372      847231      60859      801101      802700      1599        5.120240802
6      chr6  32440447    32492390    51943      32276301    32676700    400399      22.9662512
7      chr7  110967075   111243380   276305     110997901   111203600   205699      85.35157385
8      chr9  43421478    43946569    525091     43506001    43672900    166899      48.23740227
For sample 19, 4/80 microarray calls were recovered in the NGS data. For sample 20 it is considerably more: 8/28. The score is very low at the moment. One of the main reasons for this may be the conversion of the calls to hg19. The array data will be analysed again, since Janneke prefers to have everything in hg19 rather than hg18. I will spend the last part of today writing a Python script that links all my small scripts so the pipeline can be executed in one go.

The calls that BIC-seq made on sample 19 after the full filtering result in 7 calls. Two of these calls agree (16 and 61% of the CNV) with an MA call. Six of the 7 calls lie in a region where a known CNV is located. So it looks like BIC-seq does work, but finding the CNVs of interest may need more coverage. I will experiment a bit with the settings to see whether the results change much. I ran BIC-seq with the same settings on sample 19 but without the repeat masking (duplicate removal and quality filtering were still performed). This yields 35 calls, of which 4 overlap (6, 16, 54 and 33% overlap). I am checking the calls against the DGV to see whether these calls still fall in regions where CNVs are present. Of the 7 calls from the previous experiment, 6 can be found among these 35 calls (1 call is given as 2 calls in the set of 35); 1 call cannot be found. The question that arises is therefore whether it is wise to apply the repeat mask at all: it may remove too much data. The best way to test this is to try the experiment on data with higher coverage. Furthermore, the BIC-seq CNV caller was run yesterday on the sample 20 data, which is now available.
Again two versions were examined: one with quality filtering and removal of duplicates (denoted 'qd'), while the other (denoted 'mask') additionally received the repeat filtering on top of these two.

Mask: 20 calls, of which 6 agree with the microarray calls (28 in total). The location of 14 of the 20 calls agreed with CNV locations in the DGV; 6 of them are 'new' or false positives. 13 of the 20 calls are also found as overlaps in the QD calls, of which 5 are also found in the microarray calls.
QD: 66 calls, of which 7 agree with the microarray calls (28 in total). The location of 48 of the 66 calls agreed with CNV locations in the DGV; 19 of them are 'new' or false positives.

Today I ran the statistics on the output data of RDXplorer and CNVnator. CNV-seq is not running yet (although the input files are already waiting on the UI). The ROC plots have been made and I discussed them with Marcel. We decided not to focus too much on these other tools. I will, however, look into why the false positives are called; is this perhaps due to high GC content? I will also make choices and draw certain conclusions:
- Dilution, what can be seen from it?
- Allowing 1 mismatch, is this desirable?
- Make a choice of threshold.
- Postprocessing, would this matter?
- Explain and justify the window size, make a choice.
- Make a choice of tool.

In addition, make a number of plots that can explain a lot:
- Log ratio scores (log ratio array CGH vs. log ratio seq) of interesting regions.
- For the same regions a double plot with logr N and logr M above and below.
- logr of the microarray on x and the tool score on y (can be compared with the first-mentioned plot to see the effect of the tool).

DWAC-seq

It is now possible to make good .bam files of all samples. I did this with BWA and samtools. The good .bam files are sorted and can then be used as input for dwac-seq, a small tool that predicts CNVs using dynamic windows. I have been able to run dwac-seq. This tool looks at the data per chromosome, so no correction is made for deviations across the whole genome. Because the tool works per chromosome, I wrote a small tool that analyses all chromosomes in one go and puts the output in a folder. Today I worked on a visualisation for the results of dwac-seq; these are now easily produced with MATLAB.
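The per-chromosome wrapper can be sketched as follows; the exact dwac-seq command line used here is hypothetical:

```python
import os
import subprocess

def run_per_chromosome(tool_cmd, sample_bam, reference_bam, out_dir,
                       chromosomes=None):
    """Run a per-chromosome CNV caller (here dwac-seq; the command-line
    arguments are assumptions) once for every chromosome and collect
    the output files in one directory."""
    if chromosomes is None:
        chromosomes = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY"]
    os.makedirs(out_dir, exist_ok=True)
    for chrom in chromosomes:
        out_path = os.path.join(out_dir, chrom + ".txt")
        with open(out_path, "w") as out:
            subprocess.run(tool_cmd + [sample_bam, reference_bam, chrom],
                           stdout=out, check=True)
```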
I have also been busy running BWA on the UI with the 'new' reference. It turns out the entire assembly process runs much faster; the time now matches what is described in the tool's manual, whereas earlier with the hg19 reference it took 4 times as long.

Read towers

Looked at the data in IGV. This is now possible! The headers are now stable. Peaks can be observed right next to centromeres. The exact origin of these peaks still has to be argued, and the peaks that were found must be studied carefully. Online I found several explanations that could fit the phenomenon. First of all it could be an error in the PCR, with parts of the genome already overamplified in that step. This shows as a blocky structure: many exactly identical reads on top of each other. In the data, however, the reads occasionally shift by a nucleotide, so this is not the explanation. A next explanation could be that the pericentromeric region contains many LINEs and SINEs; in the literature this is often linked to repeat copy numbers. This results in 'towers'. A number of solutions were found that can be tried:
• Filter out reads where the proportion of reads coming from exact overlaps is above a cutoff (e.g. 10%).
• Samtools rmdup: removes duplicates.
• Remove repetitive reads based on the XT tag of the BWA output (XT:A:R). This method is stated to remove noise better than samtools rmdup.
• Samtools view -q 30: selecting on quality. As these regions might have low quality, this could be a solution. It is stated at SEQanswers that this method removes the peaks, but also removes the whole pericentromeric region.

The first experiment carried out is samtools rmdup. This reduces the number of reads in the peak regions considerably: from around 3000 to around 100. The peaks are currently tackled as follows:
• Assembly of the reads to the hg19 reference genome using BWA.
• Removal of the reads that can be mapped to multiple locations by applying a quality filter (remove all reads with an MQ of 0). After this step no XT:A:R tags are left, because these automatically receive an MQ of 0.
• Removal of duplicates using samtools rmdup, which removes any PCR duplicates.
• Application of a repeat mask. The reads in the regions that are masked in the repeat-masked reference from Ensembl are removed from the data in this step.

The steps described above work well to remove part of the noise. I am not yet entirely sure about the last step, though, since it may prevent any CNVs from being detected at all. This can happen if many regions are removed and there is thus no data left to base predictions on.
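The quality filtering and repeat-mask steps above can be sketched in Python on plain SAM text; this is a simplified stand-in for the BWA/samtools pipeline and assumes the masked intervals per chromosome are sorted and non-overlapping:

```python
import bisect

def filter_sam(lines, min_mapq=1, masked=None):
    """Filter aligned SAM records: drop reads with mapping quality 0
    (multi-mapping reads, which covers BWA's XT:A:R reads) and reads
    whose start position falls in a repeat-masked interval.
    masked: dict chromosome -> sorted list of (start, end) intervals.
    Header lines (starting with '@') are passed through unchanged."""
    masked = masked or {}
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        chrom, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        if mapq < min_mapq:
            continue  # multi-mapping / low-quality placement
        intervals = masked.get(chrom, [])
        i = bisect.bisect_right([s for s, _ in intervals], pos) - 1
        if i >= 0 and pos <= intervals[i][1]:
            continue  # read starts inside a masked region
        yield line
```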
The coverage that remains is as follows (all filtering applied / filtering but no repeat masking applied / original coverage after BWA mapping):
- Sample 19: 2267272 / 4805421 / 7744879 reads, i.e. 0.037x / 0.079x / 0.12x
- Sample 20: 2675994 / 5649881 / 9068120 reads, i.e. 0.044x / 0.093x / 0.15x
- XX reference: 624510 / 1355559 / 2449583 reads, i.e. 0.011x / 0.022x / 0.04x

Of course I worked with the CNV tools over the past week to see whether the filtering had an effect. The regions I expect based on the microarrays still have not come up. In my view this is due to the coverage being too low: after the filtering very little data remains, and not much can be said about the genome. With this amount of coverage I cannot say for certain whether it is the data or the tools that keeps me from generating results.

The IGV tracks that are needed can now be loaded; it turned out we had the wrong one. The tracks that were selected:
- DGV structural variants (CNVs, indels from DGV)
- Segmental dups (duplications)
- Selfchain (alignment of the genome against itself)
- Simple repeats (tandem repeats)
- Interrupted repeats (part of RM)
- Repeat masker
- Microsatellite (short repeats)

See the BIC-seq pipeline chapter for the results using the repeat mask.

Jeroen Laros of the LUMC gave insight into what they do to filter out the towers. They compute a mean coverage and multiply it by a factor f, for which 2.5 is often chosen. The value obtained is a threshold: the reads of all regions where the coverage is higher than this threshold are removed. This results in the removal of the read towers.
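This LUMC-style threshold can be sketched on a per-position depth profile; here the mean is taken over covered positions only, which matters for the very low-coverage samples (the rest of the pipeline, actually removing the reads at those positions, is omitted):

```python
def tower_positions(depth, f=2.5):
    """Return the positions whose coverage exceeds f times the mean
    coverage. Positions with depth 0 are excluded from the mean, since
    at shallow coverage the genome-wide mean would fall below 1."""
    covered = [d for d in depth if d > 0]
    if not covered:
        return []
    threshold = f * sum(covered) / float(len(covered))
    return [i for i, d in enumerate(depth) if d > threshold]
```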
There is, however, a side note I make here: the mean coverage for the samples I have (with low coverage) would come out below 1, because many positions have no overlapping read at all. To apply this filter, I computed the mean coverage over only the positions that do have overlap. For sample 19 this brought the mean coverage to 1.2, which is a reasonable value. The question remains whether an f of 2.5 is not too low; it would mean that if a position 'has' more than 3 reads, all of those reads are removed. In that case a lot of valuable data would be affected as well. Filtering out the reads is not easy, as large files are involved; this may have to be made somewhat more efficient should we choose to apply this filter.

Roy and I have joined forces to counter the tower formation. This has become a filtering that combines two properties of the towers in order to discard them. The first is that there is very much overlap between the reads. PCR duplicates lie exactly on top of each other. With very high coverage data it is desirable that reads overlap a lot; there it is not unusual to have an overlap of readlength-1. In our low-coverage data this is precisely not the expectation. The fact that the towers do show readlength-1 overlap is something we can use. The filtering works as follows: if there are x reads that have an overlap of readlength - n, they are deleted. The x and n can be determined by looking at the number of reads that remain when the filtering is applied. We have established that an n of 4 seems a good measure; the x can then be computed with a self-written script.

CNVnator

To test whether a CNV finder without a reference could also work, I ran a tool that works without a reference: CNVnator. This tool makes calls in the two samples for both pathogenic CNVs. The call list is very long, though.
Still, this suggests that a reference could possibly be left out, provided the filtering is done well.

Dilution experiment

I have also carried out the dilution of the high-coverage sample, to one half and one quarter. The quarter dilution comes out at roughly the same coverage as the first two samples I received.

Alternative alignment

Because allowing 1 mismatch anywhere in the read means that reads with a sequencing error can also be placed, we have started thinking about creating a different way of aligning that works in a more controlled manner. Here we make extra reference sequences with the alternative SNP incorporated; in one alignment run we can then ensure that the SNPs are mapped as well. Another way is to replace the SNPs in the reference by Ns, but this cannot be processed by BWA. Bowtie can, so this is also a good way to try whether this manner of aligning works.

CNV-seq

Today I made the input files for CNV-seq. These input files are pileups of the current BAM files and can be read directly into CNV-seq. Unfortunately I can only run the tools CNV-seq, CNVnator and RDXplorer on my Mac, which makes everything take longer than expected. I also made a script that processes the output of RDXplorer with the three scoring methods. For this I had to adapt the DWAC-seq script a little. The other tools can probably also use this adapted script; it is now more general and less specifically tailored to DWAC-seq.

RDXplorer

This weekend I made all output files of all samples (19, 20, KG, KGdi50%, both normal and with mismatch) for RDXplorer. This took quite a few hours. I also made a start on the output of CNVnator. DWAC-seq has also been restarted for the 1 mismatch samples, because the results were based on a quality score of 20, and in the end that was not desired.

Results DWAC-seq/RDXplorer/CNVnator

This past week I mainly analysed old data and looked at different ways to visualise my data. The output of all tools can easily be printed below a log2 ratio (without further analysis) of both the reference and the sample data. This way I have been able to look closely at why some calls are found and others are not. Today I started making the same kind of overview for the microarray data, but somehow I cannot quite figure out which data exactly is being used. I applied a thresholding neatly, but do not see the outliers in the probes that should be there. I will check my code carefully to see whether this is my doing, or whether something apparently goes wrong in the analysis. Meanwhile I have already made quite a few figures for illustration; a small selection is described below.
First, the scoring. I have devised three methods to score the calls I find with the tools; each method treats a call differently. From top to bottom in the figure above this means the following. The first method, 'Nucleotide based scoring', looks at every nucleotide in the sequence and determines the class this nucleotide belongs to (TP, FP, TN, FN). The number of 'samples' taken and classified this way is therefore equal to the total number of nucleotides in the entire sequence. The second method, 'Region based scoring', again looks at every nucleotide to produce a score, but follows a different set of rules. Since the microarrays are based on probes that cannot indicate the location exactly, in this case we take the region from the first start site we find to the last stop site. The regions thereby become larger, and small NGS regions within large NGS calls are counted as one. This is the most accurate way to score reality. The third method is 'counting' the calls: for every TP a 1 is added to the total number of TPs. The TPs are small here, the TNs very large. This measure is mainly useful to see how many of the calls are actually called by the NGS.

These three scoring measures were applied to DWAC-seq output, varying the threshold at which the tool makes a call. The ROC curve and the chi-squared value were plotted for every threshold value. Keep in mind that the data points are plotted as follows: 0.5, 0.4, 0.3, 0.25, 0.2, 0.1. The threshold is centred around the zero point; 0.25 thus means that everything from 0.75 to 1.25 is not considered for calling, while everything outside this range is. A bootstrapping is applied to see whether a call is really truthful.
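The first, nucleotide-based scoring method can be sketched as follows (calls given as (start, stop) intervals with exclusive stops; the microarray calls act as truth, the NGS calls as predictions):

```python
def nucleotide_scores(length, truth_calls, predicted_calls):
    """Classify every nucleotide position as TP/FP/TN/FN, with the
    microarray calls as truth and the NGS calls as predictions.
    Returns the counts plus the derived rates TP/(TP+FN) and
    FP/(FP+TN) used for the ROC curves."""
    truth = [False] * length
    pred = [False] * length
    for start, stop in truth_calls:
        for i in range(start, stop):
            truth[i] = True
    for start, stop in predicted_calls:
        for i in range(start, stop):
            pred[i] = True
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum(p and not t for t, p in zip(truth, pred))
    fn = sum(t and not p for t, p in zip(truth, pred))
    tn = length - tp - fp - fn
    tpr = tp / float(tp + fn) if tp + fn else 0.0
    fpr = fp / float(fp + tn) if fp + tn else 0.0
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "TPR": tpr, "FPR": fpr}
```

The region-based variant differs only in how the interval boundaries are widened before this classification; per-position lists are used here for clarity rather than efficiency.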
The resulting ROC curves clearly indicate that an optimal threshold can be found. The third method clearly shows that merely counting the calls is not sufficient to find an optimal threshold. Looking at the figures, a choice has to be made as to which scoring we will use to determine the optimal threshold. Fortunately this is not very difficult: with both the first and the second scoring, a threshold of 0.25 is a good choice.

First scoring:
• False positive rate never above 0.034. (FP/(FP+TN))
• True positive rate never below 0.74. (TP/(TP+FN))

Second scoring:
• False positive rate never above 0.034.
• True positive rate never below 0.95.

The false positive rates lie close together; the difference between the scorings is mainly in how the true positives are measured (the TP regions are also much smaller than the FPs, which is why there is such a difference between the scorings).
Table with the scores at a T of 0.25:

Sample        FP-rate scoring 1  TP-rate scoring 1  FP-rate scoring 2  TP-rate scoring 2
19
20
KG            0.0161             0.7816             0.0166             0.9565
KGDI50        0.0301             0.8215             0.0317             0.9640
KGDI25        0.0296             0.7652             0.0311             0.9574
Daphne1
19_1MM        0.0338             0.7723             0.0337             0.9937
20_1MM        0.0319             0.7427             0.0331             0.9892
KG_1MM        0.01191            0.8593             0.0122             0.9914
KGDI50_1MM    0.0291             0.8798             0.0278             0.9916
Daphne1_1MM
The same experiment was performed varying the window size. Here it is less clear that a perfect window can be selected; the optimal window size depends strongly on the coverage of the sample and on the location of the CNV. It should be kept in mind that too small a window size yields a high number of false positives, which one wants to avoid. An ideal window lies between 1000 and 2000 reads per window. Optimal window sizes:
Sample       Optimal window size  FP scoring 1  TP scoring 1  FP scoring 2  TP scoring 2
19           2000                 0.0061        0.6685        0.0052        0.9287
20           1000                 0.0109        0.5742        0.0087        0.9855
KG           1000                 0.0157        0.8248        0.0138        0.9736
KGDI50       1000                 0.0178        0.7969        0.0172        0.09433
KGDI25       between 1000-2000    0.0253        0.7074        0.0248        0.8967
KG_1MM       2000                 0.0125        0.7604        0.0116        0.8680
KGDI50_1MM   1000                 0.0111        0.7905        0.0109        0.9271
19_1MM       1000                 0.0085        0.6883        0.0071        0.9375
20_1MM       1000                 0.0193        0.5716        0.1888        0.9885
This table clearly shows that a window size of 1000 often comes out best. No relation between window size and coverage can be found: 19, 20 and KGDI25 all have the same coverage, but one of these samples prefers 1000, one 2000 and one something in between. Another sample with a very high coverage (KG_1MM) also prefers 2000 as a window size. For further experiments I will use a window size of 100, as this seems to be the most representative when looking at all the samples. The number of false positives is a little higher than when using 1000, but the number of true positives when using 1000 is not very high either. In this case we choose a slightly higher number of true positives and, as a result, accept a higher number of false positives. A fourth scoring method was developed that is similar to the first method, but only counts nucleotides in genic regions; the refSeq gene track is used to define these regions.
Visualisation of the data

The figures below give an impression of what the calls look like. I have given examples of true positives, false positives and false negatives to highlight the differences. These figures make it possible to examine how 'wrong' calls can be prevented.

True positives

First I start with the calls that were found; the description of each figure is directly below it.
Previous page, left: This figure is the triplication on chromosome 5 of sample 3. Plotted from top to bottom:
• Chromosome and the location of the CNV. Centromere in red, location of the CNV in blue.
• Log2 ratio of the sample and the reference. Bins that were empty in one of the samples have been removed.
• Sample read depth (normalised).
• Reference read depth (normalised).
• RDXplorer output (normalised depth of coverage).
• DWAC-seq output (ratios corrected by the median ratio).
• CNVnator output (black = read depth corrected for GC content, red = applied segmentation, cyan = after merging regions).
The red vertical lines indicate the location that the microarray gives as start and stop site of the CNV.

As can clearly be seen in the figure above, a much longer CNV is recovered in the NGS data. In the MA data other calls are given after this CNV which actually belong to this one call. In the NGS data this is certainly the case as well: the call is made in small pieces. The log2 ratio nicely shows that much variation in the signal is removed by using a reference.

This is another call of sample 3, chromosome 15. A clear difference between reference and sample, which also shows in the log2 ratio. DWAC-seq and CNVnator find this call, and RDXplorer recovers it as well.

Previous page, right: Sample 1, chr 22. This sample has a considerably lower coverage. Still, this call is clearly visible in the data. The length of the call is 50 kb. The window size can be increased up to around 400; beyond that this call is no longer found. The call is found by CNVnator and DWAC-seq; RDXplorer gives no call at this location.

Right: This is sample 2, chr 3. Again a low-coverage sample (although somewhat higher than the previous one). A 48 kb CNV. The call is found with both DWAC-seq and CNVnator; RDXplorer unfortunately finds nothing here.

RDXplorer seems to cope much less well with the low-coverage data than the other two tools.
False negatives

The next plots we can look at are a number of false negatives. We again start with figures of sample 3 (higher coverage).

Left: This is the most important false negative. This is one of the CNVs (a gain) that I absolutely must find, because it presumably has something to do with the phenotype of the patient. As you can see, there is a clear reason why this region is not called as a CNV: both the reference and the sample have a low coverage at this location, and the log2 ratio looks very noisy. DWAC-seq does not call this CNV because a triplication with a much higher signal lies on the same chromosome. I also think it is good that this call is not made by the NGS caller; it does not look like a standard CNV. This is a good example of a call we can filter by looking at the coverage of both sample and reference compared with the mean over the whole chromosome. The variation of the log2 ratio can be a measure to give the CNV a certain score.
33#
Left: Sample 3 (higher coverage), chr 6. Here a loss should be detected; none of the tools can find it. This is not illogical when looking at the log2 ratio and the profiles of the sample and the reference. The coverage seems slightly lower than to the right of the region. This can be explained by the region lying at the start of a chromosome; the coverage is probably low due to repetitive behaviour (which occurs regularly near telomeres and centromeres).

Right: Still sample 3, chromosome 8. This gain is not clearly visible in the log2 ratio. However, when looking at the sample and the reference, the sample really is covered more highly here. Because the sample already has a higher baseline coverage, this difference is not picked up well by DWAC-seq. CNVnator does find this call, because it does not look at the reference. Furthermore, the coverage around this region is very low; it may be that the segmentation algorithm only finds this region because the surrounding coverage deviates so much. Still, the MA proves otherwise: the call is found there as well.

Right: Sample 3 again, chromosome 5. This should be a gain. The same situation as the figure discussed before: the call is made on the microarray, but in the NGS data there is really nothing to see. Here even the variation in the log2 ratio cannot be observed; the coverage at this position is too low.

Here again a low coverage in both sample and reference (a loss should be found). The variation in the log2 ratio could again be used to score the call. The strange thing is that DWAC-seq gives a higher score just left of the region, while in the log2 ratio the signal there is actually smoother than in the call itself. DWAC-seq probably does not pick up the variable log2 ratio as a call, but because the log2 ratio also goes up left of the call, something is found there instead (DWAC-seq gives a call at 44165696-44294599).

Reducing false negatives is therefore, in most cases, not strictly necessary. Many of these calls are simply not present in the NGS data. The obvious question is whether they are called correctly in the microarray data at all; maybe these calls are just wrong in the microarray data. Of course, the pool data was still used as reference here; the microarray reference could bring relief. This is something that will be tested.
False positives

I have fished a number of false positives out of my data for closer inspection.

Left: Sample 3 (higher coverage). Here DWAC-seq finds a deletion. When comparing the sample and reference data this can indeed be seen: the signal of the sample lies slightly below the reference. The reference signal is not very stable here, however, which could mean that there is nothing at this location after all and it is rightly a false positive.

Right: The above figure displays another false positive. There seems to be not enough data at this location, and a gap covers most of the call. This call should not be made and we may be able to filter it out.
Another false positive. This seems to me a correct call; the coverage at this location is lower than in the reference.

The selection of a tool for further studying the calls is based on the number of calls and the number of true positives. CNVnator and RDXplorer are based only on the sample and do not take the reference into account. Apart from this fact, the number of calls they make is very large (RDXplorer more than 100,000; CNVnator 4,800) compared to the number of calls from DWAC-seq (around 50). The number of true positives is lower using RDXplorer than using DWAC-seq, while the number of DWAC-seq true positives is similar to the number of CNVnator true positives. The fact that DWAC-seq uses a dynamic window is another advantage of the method over CNVnator.

When allowing a mismatch, a comparison can be made with the previous data.
Comparing DWAC-seq calls with array CGH plots

The comparisons are made and described below. I have taken a couple of examples to give a general overview of the situations that occur. The figures used for comparison may also be discussed in the chapter above. The left plot is made with NGS data, the right with array CGH.

True positive: Actual CNV

Left picture (top to bottom): log2 ratio of the read depth, sample read depth, reference read depth, DWAC-seq output, CNVnator output. Right picture: microarray log2 ratio of red/green, p-value of this log2 ratio, median of the green signal, median of the red signal.

This call is a true positive. The first noticeable thing is that the log2 ratio is flipped, because the ratio is calculated as reference/sample instead of sample/reference. Both graphs agree on the location of the CNV.
True positive: Not a CNV

In the left figure, the empty bins are deleted from the dataset, so no clear overview can be given for this particular call. This is the reason that I included the empty bins in this plot; see below. The log2 ratio can only be set if both the reference and the sample have data in the particular bin: if there is no reference data in the bin, the log2 ratio cannot be calculated (division by zero), and if there is no sample data in the bin, the log2 ratio diverges to infinity, which is also not preferable. The plots of the read depth can still give an overview of the situation.

This figure clearly displays that a deletion is to be found at this location (no sample read depth, while there is enough reference read depth). In the array CGH data a deletion is also found (flipped, again).
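The empty-bin problem described above can be sketched in Python. This is a minimal sketch, not the thesis code; the function name and the use of None to mark undefined bins are my own choices.

```python
import math

def log2_ratios(sample, reference):
    """Per-bin log2(sample/reference) of read-depth counts.

    Bins where either track is empty yield None: with an empty
    reference bin the ratio is undefined (division by zero), and with
    an empty sample bin log2 diverges, so neither case gets a value.
    """
    ratios = []
    for s, r in zip(sample, reference):
        if s > 0 and r > 0:
            ratios.append(math.log2(s / r))
        else:
            ratios.append(None)  # empty bin: no usable log2 ratio
    return ratios
```

For example, `log2_ratios([4, 0, 8], [4, 2, 0])` gives `[0.0, None, None]`: only the first bin has data in both tracks.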
False positive: Actual CNV found by NGS and 1-probe array CGH

There are multiple examples of false positives that display a CNV occurrence, but some of the FPs contain a lot of empty bins and low-coverage data. This indicates that some of the FPs are actually CNVs that are not found by the array CGH but should be, while other FPs are really falsely called. First an example of a 'correct' false positive.

The left figure shows an increase in coverage in the sample. In the array CGH data this is also detected, but a call is only made when three subsequent probes indicate an increase/decrease: a call made by one probe can be a false positive, hence the choice of 3 probes. This is not the case in the NGS data, though, so this call should be considered a true positive.
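The three-consecutive-probes rule can be sketched as below. This is an illustration only; the threshold of 0.3 and the function name are my own assumptions, not the settings of the clinical aCGH pipeline.

```python
def acgh_calls(probe_log2, threshold=0.3, min_probes=3):
    """Call a gain/loss only when at least `min_probes` consecutive
    probes deviate beyond `threshold` in the same direction."""
    calls = []
    i, n = 0, len(probe_log2)
    while i < n:
        v = probe_log2[i]
        sign = 1 if v > threshold else -1 if v < -threshold else 0
        if sign == 0:
            i += 1
            continue
        # extend the run of same-direction deviating probes
        j = i
        while j < n and (probe_log2[j] > threshold if sign == 1
                         else probe_log2[j] < -threshold):
            j += 1
        if j - i >= min_probes:
            calls.append((i, j - 1, 'gain' if sign == 1 else 'loss'))
        i = j
    return calls
```

A single deviating probe is silently dropped: `acgh_calls([0, 0.5, 0.6, 0.4, 0, -0.5, 0])` returns only `[(1, 3, 'gain')]`, discarding the lone negative probe exactly as the 3-probe rule intends.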
False positive: Not a CNV

This is an example of a call that is made using the NGS data but is not accurate. Clearly this call contains empty bins in both the sample and the reference dataset, and the log2 ratio is also neutral. The array CGH data also shows a neutral log2 ratio, and the signals themselves do not indicate an increase or decrease. This call should remain a false positive.
False positive: Actual CNV not found by array CGH

This picture clearly displays an increase of the sample read depth over the reference read depth. The array CGH data does not display a decrease (a flipped increase, again). This is remarkable.
False positive: Actual call, but is it because of an increase in sample RD or a decrease in reference RD?

This figure shows an increase in log2 ratio, indicating a duplication of this region in the sample. Looking at the read-depth data, the read depth of the reference is low. This can indicate the following: 1) there is indeed a duplication of this region in the sample, or 2) there is no duplication of this region in the sample, but a deletion in the reference.
A deletion in the reference is not very likely, as this reference is a pool of 100 individuals: any CNV that occurs in one individual is smoothed out by the others, resulting in a neutral signal. As most individuals sequenced at the VUMC are of Dutch origin, another explanation is possible. If the individuals of the pool are all of American descent, it could be that this specific deletion occurs regularly in Americans but not in Dutchmen/Europeans. This can introduce false positives that are actual CNVs, but in the reference instead of the sample.
In the array CGH data this CNV is also found when looking at the 1-probe calls (in this particular sample there is one other case with a significant decrease in reference read depth and a neutral sample read depth; there the CNV is not called in the array CGH data).
False negative: CNV is not found by NGS because of (almost) no coverage

The above figures display a CNV found by the array CGH data but not found in the NGS data because of (almost) no coverage in this region. No coverage can indicate the following: reads that can be mapped to this region can also be mapped to other regions of the genome and are therefore removed from the dataset. In the array CGH data there is also a region without probes, which can be the result of a known repeat region. In my opinion there is not enough data to classify this call as a true CNV.
False negative: True CNV that is not called by the NGS caller (thus a true false negative)

This depicts a true false negative. The call is found in the array CGH data but not by the NGS caller, while it is clear that there is indeed a deletion in the data. That this CNV is not found can be explained by a couple of factors. It may be that the window size is too big for this particular call, and that something goes wrong in the segmentation due to this window size. Another reason may be the chosen threshold. In the DWAC-seq data there is a small indication that this region is indeed a CNV, but the region does not deviate enough from the mean to be considered for further analysis.
False positive: Not a real CNV

These images display a CNV found by the array CGH data. The resulting deletion is, in my opinion, not defined enough to be considered a real CNV. The array CGH signal also lies below the calling threshold, so I do not really understand why this is a call at all.
When I look at the above analysis of the different kinds of TPs, FPs and FNs, I think I can divide them into a couple of classes:
• True positive set:
  o Calls that are actual CNVs
  o Calls that are not actual CNVs but are derived from low-coverage data
• False positive set:
  o Actual CNVs that are called by NGS and by 1-probe array CGH data
  o Calls made on the NGS data that are based on low/no-coverage data and are not considered true CNVs
  o Actual CNVs that are called by NGS while no indication of a call is made by the array CGH data
  o CNVs that may be called because of an anomaly in the reference (sometimes found by the aCGH data)
• False negative set:
  o CNVs that are not detected because there is (almost) no coverage in the region; most bins are empty
  o An actual CNV is present in the region, but the NGS caller does not find it (a TRUE false negative)
  o The call is not an actual CNV

The main conclusion from this observation is that we cannot take the array CGH calls as an absolute truth. This set is also compromised by the positioning of the probes, and the decision that 3 consecutive probes should indicate an increase/decrease filters out a lot of false positives but also filters out correctly classified CNVs.

This conclusion can also be drawn from the statistics of the TP, FP and FN sets. Calculating multiple statistics gives no clear overview of variables that can determine whether a false positive is indeed a CNV or should be classified as non-CNV, because there is also so much variation in the true positive set.

This led me to the following action: the calls that are classified as TP, FP and FN are manually annotated to set a new 'gold standard'. This is done as follows:
• The manual annotation is made by looking at the plotted log2 ratio and read depth of the sample and reference of the NGS dataset. Three classes are made: true CNVs ("*"); false CNVs ("N"); doubtful ("T").
• Statistics are calculated to see if a selection criterion can be set for these three classes. My hypothesis is that a selection can be made between true and false CNVs, because true CNVs are made up of normal-coverage data in both the sample and the reference, whereas false CNVs have almost no coverage at the region. The doubtful group has properties that lie between these two classes; for example, a clear call is visible, but there is also very low coverage. I hope that by setting a discriminative rule between * and N, some of the doubtful cases will also clear up.
• The statistics that are calculated are as follows:
  o Percentage of empty bins of the total number of bins in the call
  o Percentage of the reference bins in the call that are empty
  o Percentage of the sample bins in the call that are empty
  o Mean read depth of sample and reference over the whole chromosome and in the call
  o Variance of the read depth of sample and reference over the whole chromosome and in the call
  o Mean of the log2 ratio over the whole chromosome and in the call itself
  o Variance of the log2 ratio over the whole chromosome and in the call itself
  o Percentage of coverage in the call compared to the mean coverage calculated over the whole chromosome, for both sample and reference

Results of the True CNV dataset (*) are as follows:

| Chr | Length | Start | End | Bins | Empty bins | % empty total | % empty sample | % empty ref | % cov vs mean sample | % cov vs mean ref | Mean log2 | Var log2 |
|-----|--------|-------|-----|------|------------|---------------|----------------|-------------|----------------------|-------------------|-----------|----------|
| 4 | 110684 | 7622506 | 7733190 | 111 | 0 | 0 | 0 | 0 | 128.78 | 84.61 | 0.64 | 0.65 |
| 4 | 51732 | 34772407 | 34824139 | 53 | 0 | 0 | 0 | 0 | 66.53 | 85.70 | -0.40 | 0.43 |
| 4 | 109560 | 70117194 | 70226754 | 111 | 0 | 0 | 0 | 0 | 41.92 | 67.76 | -0.77 | 1.08 |
| 5 | 33756463 | 95662879 | 129419342 | 33757 | 121 | 0.36 | 0.32 | 0.04 | 58.74 | 105.41 | -1.01 | 0.96 |
| 5 | 33756463 | 95662879 | 129419342 | 33757 | 121 | 0.36 | 0.32 | 0.04 | 58.74 | 105.41 | -1.01 | 0.96 |
| 6 | 90612 | 32482818 | 32573430 | 91 | 35 | 38.46 | 37.36 | 1.10 | 5.16 | 19.32 | -1.63 | 3.88 |
| 7 | 238682 | 40305088 | 40543770 | 240 | 0 | 0 | 0 | 0 | 55.69 | 107.54 | -1.00 | 1.33 |
| 17 | 52426 | 34435026 | 34487452 | 53 | 0 | 0 | 0 | 0 | 205.67 | 133.58 | 0.66 | 0.48 |
| 23 | 68567 | 53552258 | 53620825 | 70 | 0 | 0 | 0 | 0 | 209.15 | 109.48 | -0.04 | 1.04 |
| 1 | 40138 | 152547657 | 152587795 | 41 | 0 | 0 | 0 | 0 | 126.19 | 61.41 | 1.19 | 1.93 |
| 2 | 24311 | 155732106 | 155756417 | 25 | 0 | 0 | 0 | 0 | 168.91 | 111.63 | 0.60 | 0.52 |
| 3 | 111919 | 162513531 | 162625450 | 112 | 19 | 16.96 | 16.96 | 0 | 39.53 | 63.72 | -0.54 | 0.70 |
| 4 | 152279 | 69342855 | 69495134 | 153 | 1 | 0.65 | 0 | 0.65 | 87.51 | 67.79 | 0.39 | 0.47 |
| 5 | 166212 | 653052 | 819264 | 167 | 7 | 4.19 | 4.19 | 0 | 76.20 | 54.99 | 0.38 | 0.76 |
| 6 | 92212 | 78942796 | 79035008 | 93 | 0 | 0 | 0 | 0 | 114.59 | 86.88 | 0.43 | 0.46 |
| 11 | 18652 | 18944209 | 18962861 | 20 | 0 | 0 | 0 | 0 | 199.26 | 117.40 | 0.71 | 0.84 |
| 11 | 111253 | 55349292 | 55460545 | 113 | 0 | 0 | 0 | 0 | 107.60 | 85.71 | 0.35 | 0.34 |
| 16 | 54500 | 75535971 | 75590471 | 55 | 0 | 0 | 0 | 0 | 149.14 | 104.16 | 0.57 | 0.58 |
| 17 | 107523 | 16655128 | 16762651 | 109 | 2 | 1.83 | 0.92 | 0.92 | 70.45 | 48.08 | 0.54 | 0.62 |
| 23 | 331682 | 154928737 | 155260419 | 332 | 8 | 2.41 | 2.11 | 0.30 | 187.26 | 93.44 | 0.02 | 1.24 |
| 4 | 57029 | 64639 | 121668 | 58 | 0 | 0 | 0 | 0 | 180.03 | 133.15 | 0.51 | 0.36 |
| 6 | 133718 | 248968 | 382686 | 135 | 0 | 0 | 0 | 0 | 129.41 | 164.01 | -0.34 | 0.28 |
| 22 | 40968 | 39353217 | 39394185 | 42 | 0 | 0 | 0 | 0 | 88.29 | 116.99 | -0.39 | 0.55 |

As can be seen in the table, the percentages of empty bins in the data are low. There is one high number, 38%. After looking at the data, the log2 ratio in this region is quite variable, which makes the call doubtful. I did see it as a call, though; this high percentage may be filtered out by looking at the variance of the log2 ratio, as this call is the only one with a variance of more than 2. It can be stated that in this dataset, if a call has a total percentage of empty bins below 20%, it can be considered true.
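The per-call statistics listed above can be computed as in the sketch below. The function and dictionary keys are my own naming; read depths are taken as one count per bin, and the log2-ratio statistics are computed over the non-empty bins only, as described earlier.

```python
import math

def call_statistics(sample_bins, ref_bins):
    """Per-call statistics mirroring the table columns: percentage of
    empty bins (either track / sample / reference) and the mean and
    variance of the per-bin log2 ratio over bins with data in both."""
    n = len(sample_bins)
    empty_s = sum(1 for s in sample_bins if s == 0)
    empty_r = sum(1 for r in ref_bins if r == 0)
    empty_any = sum(1 for s, r in zip(sample_bins, ref_bins)
                    if s == 0 or r == 0)
    ratios = [math.log2(s / r) for s, r in zip(sample_bins, ref_bins)
              if s > 0 and r > 0]
    mean = sum(ratios) / len(ratios) if ratios else 0.0
    var = (sum((x - mean) ** 2 for x in ratios) / len(ratios)
           if ratios else 0.0)
    return {
        "pct_empty_total": 100.0 * empty_any / n,
        "pct_empty_sample": 100.0 * empty_s / n,
        "pct_empty_ref": 100.0 * empty_r / n,
        "mean_log2": mean,
        "var_log2": var,
    }
```

For instance, with one empty sample bin and one empty reference bin out of four, `pct_empty_total` is 50% while the per-track percentages are 25% each, matching how the table distinguishes the three empty-bin columns.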
Results of the False CNV dataset ("N") are as follows:

| Chr | Length | Start | End | Bins | Empty bins | % empty total | % empty sample | % empty ref | % cov vs mean sample | % cov vs mean ref | Mean log2 | Var log2 |
|-----|--------|-------|-----|------|------------|---------------|----------------|-------------|----------------------|-------------------|-----------|----------|
| 7 | 144881 | 143918004 | 144062885 | 146 | 53 | 36.30 | 32.88 | 3.42 | 10.67 | 19.09 | -0.56 | 1.82 |
| 8 | 805653 | 7099503 | 7905156 | 806 | 483 | 59.93 | 54.22 | 5.71 | 8.71 | 12.59 | -0.51 | 1.47 |
| 8 | 159305 | 39229932 | 39389237 | 160 | 148 | 92.50 | 92.50 | 0 | 3.53 | 70.46 | -1.88 | 7.47 |
| 15 | 2143843 | 20181574 | 22325417 | 2144 | 1119 | 52.19 | 49.77 | 2.43 | 36.46 | 44.42 | -0.30 | 0.90 |
| 16 | 1322257 | 32461673 | 33783930 | 1323 | 409 | 30.91 | 28.80 | 2.12 | 55.27 | 75.03 | -0.29 | 0.65 |
| 23 | 70331 | 85637584 | 85707915 | 71 | 62 | 87.32 | 87.32 | 0 | 9.32 | 103.45 | -2.26 | 4.58 |
| 1 | 329610 | 17009417 | 17339027 | 331 | 62 | 18.73 | 18.43 | 0.30 | 94.48 | 74.41 | 0.29 | 0.47 |
| 9 | 570733 | 69216989 | 69787722 | 572 | 345 | 60.31 | 55.94 | 4.37 | 7.61 | 11.24 | -0.53 | 1.42 |
| 17 | 157815 | 36261055 | 36418870 | 159 | 35 | 22.01 | 18.87 | 3.14 | 119.22 | 91.35 | 0.42 | 0.49 |
| 23 | 403618 | 60046 | 463664 | 405 | 129 | 31.85 | 30.62 | 1.23 | 89.81 | 39.40 | 0.25 | 1.97 |
| 23 | 1113902 | 459655 | 1573557 | 1115 | 210 | 18.83 | 17.76 | 1.08 | 118.49 | 58.36 | 0.08 | 1.62 |
| 23 | 1018975 | 1682606 | 2701581 | 1020 | 62 | 6.08 | 6.08 | 0 | 157.43 | 78.54 | 0.03 | 1.30 |
| 23 | 4180 | 2695395 | 2699575 | 6 | 0 | 0 | 0 | 0 | 164.86 | 94.64 | -0.22 | 0.89 |
| 23 | 17585 | 154919455 | 154937040 | 19 | 1 | 5.26 | 5.26 | 0 | 139.34 | 86.43 | -0.45 | 0.82 |
| 1 | 58639 | 72759756 | 72818395 | 59 | 44 | 74.58 | 74.58 | 0 | 26.65 | 55.48 | -0.09 | 0.50 |
| 1 | 141725 | 104160240 | 104301965 | 143 | 128 | 89.51 | 89.51 | 0 | 2.34 | 5.25 | -1.29 | 2.97 |
| 5 | 791799 | 69625288 | 70417087 | 793 | 674 | 84.99 | 80.83 | 4.16 | 3.48 | 3.52 | -0.04 | 1.04 |
| 22 | 52333 | 24346708 | 24399041 | 53 | 46 | 86.79 | 86.79 | 0 | 0.69 | 66.70 | -2.79 | 9.16 |

Looking at this data, the percentage of bins in the region that are empty is much higher. On the last page a threshold of about 20% was mentioned; if this is a real threshold, there are 4 calls here that are wrongly classified. Noticeable is that this occurs on chromosome X. These calls are mostly made on the boundary of a larger CNV, which is why I classified them as false CNVs. I should look into this by enlarging these regions, and will report back with those results. Furthermore, the boundary of 20% seems an acceptable guess.
Results of the doubt dataset ("T") are as follows (the last column is my initial annotation):

| Chr | Length | Start | End | Bins | Empty bins | % empty total | % empty sample | % empty ref | % cov vs mean sample | % cov vs mean ref | Mean log2 | Var log2 | Class |
|-----|--------|-------|-----|------|------------|---------------|----------------|-------------|----------------------|-------------------|-----------|----------|-------|
| 8 | 79833 | 11960354 | 12040187 | 81 | 36 | 44.44 | 40.74 | 3.70 | 10.01 | 27.87 | -1.27 | 2.77 | N |
| 12 | 86849 | 9628450 | 9715299 | 88 | 4 | 4.55 | 1.14 | 3.41 | 62.65 | 30.28 | 1.17 | 1.96 | * |
| 14 | 60519 | 106869902 | 106930421 | 61 | 0 | 0 | 0 | 0 | 64.87 | 102.86 | -0.69 | 0.84 | * |
| 1 | 105350 | 25565413 | 25670763 | 107 | 5 | 4.67 | 3.74 | 0.93 | 74.73 | 56.31 | 0.45 | 0.60 | N |
| 7 | 31054 | 115910398 | 115941452 | 32 | 9 | 28.13 | 28.13 | 0 | 72.99 | 122.10 | -0.28 | 0.27 | N |
| 9 | 2402869 | 40762481 | 43165350 | 2404 | 1851 | 77.00 | 69.55 | 7.45 | 6.35 | 4.77 | 0.33 | 1.17 | N |
| 16 | 287996 | 2022575 | 2310571 | 289 | 1 | 0.35 | 0.35 | 0 | 95.14 | 77.05 | 0.36 | 0.41 | N |
| 17 | 113092 | 18347753 | 18460845 | 114 | 88 | 77.19 | 76.32 | 0.88 | 1.72 | 24.91 | -1.71 | 4.90 | N |
| 18 | 56587 | 13306 | 69893 | 58 | 8 | 13.79 | 10.34 | 3.45 | 26.54 | 43.59 | -0.70 | 1.31 | N |
| 23 | 129462 | 1564267 | 1693729 | 131 | 0 | 0 | 0 | 0 | 144.46 | 81.29 | -0.15 | 1.20 | N |

Here the differences are more noticeable. I am not really sure where to classify the calls in this list; I did write down what I initially thought, in the last column of the table. Looking at the percentages, the two starred calls do actually have a low percentage of empty bins. Looking at the other percentages, there are more calls with percentages lower than 20; the corresponding variances of these calls are also below 2.

It is important to state that the lower the coverage of the sample, the harder it is to determine whether a call is an actual CNV or not. This results in a large DOUBT set (and small * and N sets) when analysing the calls of samples 19 and 20. This will also be the case for the dilution experiments, but there we can check my observations on the low-coverage data against the observations in the high-coverage data.

To determine the correct cut-offs, this analysis is also repeated with the other samples. The KG sample has more coverage than the above samples; the results are similar. One thing is interesting, though: a lot of calls in the True CNV dataset consist of a high number of empty bins in either the sample or the reference. In those cases, the track with the most zero bins has a deletion at that location. By simply filtering out the high empty-bin percentages, you would lose these CNVs. The following is proposed:
• If both sample and reference contain no empty bins: keep as call.
• If both sample and reference contain empty bins: if the total percentage of empty bins (sample + ref) is higher than 20% (this value is not very robust yet; it relies on two datasets: 17%), the call is considered non-CNV. Lower = CNV.
• If the sample OR the reference contains empty bins while the other has a percentage of 0%: if 'percentage of coverage compared to mean ref' > 50% AND 'percentage of coverage compared to mean sample' > 10%, keep as call.
• Question regarding the last criterion: should this also hold the other way around? In that way errors in the reference are seen as calls in the sample. Yes/no?

This proposition is put into a script and run to see its results.
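The proposed filter can be sketched as follows. The dictionary keys are my own naming; the 20%, 50% and 10% thresholds are the values stated in the proposal, which the text itself notes are not yet robust.

```python
def keep_call(stats):
    """Decide whether a DWAC-seq call survives the empty-bin filter.

    `stats` is a dict with the per-call percentages:
    pct_empty_sample, pct_empty_ref, pct_empty_total,
    pct_cov_sample, pct_cov_ref (coverage vs chromosome mean).
    """
    # Rule 1: no empty bins in either track -> keep as call.
    if stats["pct_empty_sample"] == 0 and stats["pct_empty_ref"] == 0:
        return True
    # Rule 2: empty bins in both tracks -> apply the 20% rule.
    if stats["pct_empty_sample"] > 0 and stats["pct_empty_ref"] > 0:
        return stats["pct_empty_total"] <= 20.0
    # Rule 3: empty bins in only one track -> coverage thresholds.
    return stats["pct_cov_ref"] > 50.0 and stats["pct_cov_sample"] > 10.0
```

Note that rule 3 is deliberately asymmetric (reference coverage above 50%, sample above 10%); the open question above is whether a mirrored version should be added so that reference anomalies also surface.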
Calls:

| Chr | Length | Start | End | Bins | Empty bins | % empty total | % empty sample | % empty ref | % cov vs mean sample | % cov vs mean ref | Mean log2 | Var log2 | Class |
|-----|--------|-------|-----|------|------------|---------------|----------------|-------------|----------------------|-------------------|-----------|----------|-------|
| 4 | 110684 | 7622506 | 7733190 | 111 | 0 | 0 | 0 | 0 | 128.78 | 84.61 | 0.64 | 0.65 | * |
| 4 | 51732 | 34772407 | 34824139 | 53 | 0 | 0 | 0 | 0 | 66.53 | 85.70 | -0.40 | 0.43 | * |
| 4 | 109560 | 70117194 | 70226754 | 111 | 0 | 0 | 0 | 0 | 41.92 | 67.76 | -0.77 | 1.08 | * |
| 5 | 33756463 | 95662879 | 129419342 | 33757 | 121 | 0.36 | 0.32 | 0.04 | 58.74 | 105.41 | -1.01 | 0.96 | * |
| 5 | 33756463 | 95662879 | 129419342 | 33757 | 121 | 0.36 | 0.32 | 0.04 | 58.74 | 105.41 | -1.01 | 0.96 | * |
| 7 | 238682 | 40305088 | 40543770 | 240 | 0 | 0 | 0 | 0 | 55.69 | 107.54 | -1.00 | 1.33 | * |
| 12 | 86849 | 9628450 | 9715299 | 88 | 4 | 4.55 | 1.14 | 3.41 | 62.65 | 30.28 | 1.17 | 1.96 | T |
| 14 | 60519 | 106869902 | 106930421 | 61 | 0 | 0 | 0 | 0 | 64.87 | 102.86 | -0.69 | 0.84 | T |
| 17 | 52426 | 34435026 | 34487452 | 53 | 0 | 0 | 0 | 0 | 205.67 | 133.58 | 0.66 | 0.48 | * |
| 23 | 68567 | 53552258 | 53620825 | 70 | 0 | 0 | 0 | 0 | 209.15 | 109.48 | -0.04 | 1.04 | * |
| 1 | 105350 | 25565413 | 25670763 | 107 | 5 | 4.67 | 3.74 | 0.93 | 74.73 | 56.31 | 0.45 | 0.60 | T |
| 1 | 40138 | 152547657 | 152587795 | 41 | 0 | 0 | 0 | 0 | 126.19 | 61.41 | 1.19 | 1.93 | * |
| 2 | 24311 | 155732106 | 155756417 | 25 | 0 | 0 | 0 | 0 | 168.91 | 111.63 | 0.60 | 0.52 | * |
| 3 | 111919 | 162513531 | 162625450 | 112 | 19 | 16.96 | 16.96 | 0 | 39.53 | 63.72 | -0.54 | 0.70 | * |
| 4 | 152279 | 69342855 | 69495134 | 153 | 1 | 0.65 | 0 | 0.65 | 87.51 | 67.79 | 0.39 | 0.47 | * |
| 5 | 166212 | 653052 | 819264 | 167 | 7 | 4.19 | 4.19 | 0 | 76.20 | 54.99 | 0.38 | 0.76 | * |
| 6 | 92212 | 78942796 | 79035008 | 93 | 0 | 0 | 0 | 0 | 114.59 | 86.88 | 0.43 | 0.46 | * |
| 7 | 31054 | 115910398 | 115941452 | 32 | 9 | 28.13 | 28.13 | 0 | 72.99 | 122.10 | -0.28 | 0.27 | T |
| 11 | 18652 | 18944209 | 18962861 | 20 | 0 | 0 | 0 | 0 | 199.26 | 117.40 | 0.71 | 0.84 | * |
| 11 | 111253 | 55349292 | 55460545 | 113 | 0 | 0 | 0 | 0 | 107.60 | 85.71 | 0.35 | 0.34 | * |
| 16 | 287996 | 2022575 | 2310571 | 289 | 1 | 0.35 | 0.35 | 0 | 95.14 | 77.05 | 0.36 | 0.41 | T |
| 16 | 54500 | 75535971 | 75590471 | 55 | 0 | 0 | 0 | 0 | 149.14 | 104.16 | 0.57 | 0.58 | * |
| 17 | 107523 | 16655128 | 16762651 | 109 | 2 | 1.83 | 0.92 | 0.92 | 70.45 | 48.08 | 0.54 | 0.62 | * |
| 18 | 56587 | 13306 | 69893 | 58 | 8 | 13.79 | 10.34 | 3.45 | 26.54 | 43.59 | -0.70 | 1.31 | T |
| 23 | 129462 | 1564267 | 1693729 | 131 | 0 | 0 | 0 | 0 | 144.46 | 81.29 | -0.15 | 1.20 | T |
| 23 | 1018975 | 1682606 | 2701581 | 1020 | 62 | 6.08 | 6.08 | 0 | 157.43 | 78.54 | 0.03 | 1.30 | N |
| 23 | 4180 | 2695395 | 2699575 | 6 | 0 | 0 | 0 | 0 | 164.86 | 94.64 | -0.22 | 0.89 | N |
| 23 | 17585 | 154919455 | 154937040 | 19 | 1 | 5.26 | 5.26 | 0 | 139.34 | 86.43 | -0.45 | 0.82 | N |
| 23 | 331682 | 154928737 | 155260419 | 332 | 8 | 2.41 | 2.11 | 0.30 | 187.26 | 93.44 | 0.02 | 1.24 | * |
| 1 | 58639 | 72759756 | 72818395 | 59 | 44 | 74.58 | 74.58 | 0 | 26.65 | 55.48 | -0.09 | 0.50 | N |
| 4 | 57029 | 64639 | 121668 | 58 | 0 | 0 | 0 | 0 | 180.03 | 133.15 | 0.51 | 0.36 | * |
| 6 | 133718 | 248968 | 382686 | 135 | 0 | 0 | 0 | 0 | 129.41 | 164.01 | -0.34 | 0.28 | * |
| 22 | 40968 | 39353217 | 39394185 | 42 | 0 | 0 | 0 | 0 | 88.29 | 116.99 | -0.39 | 0.55 | * |

Hits that are filtered out:

| Chr | Length | Start | End | Bins | Empty bins | % empty total | % empty sample | % empty ref | % cov vs mean sample | % cov vs mean ref | Mean log2 | Var log2 | Class |
|-----|--------|-------|-----|------|------------|---------------|----------------|-------------|----------------------|-------------------|-----------|----------|-------|
| 6 | 90612 | 32482818 | 32573430 | 91 | 35 | 38.46 | 37.36 | 1.10 | 5.16 | 19.32 | -1.63 | 3.88 | * |
| 7 | 144881 | 143918004 | 144062885 | 146 | 53 | 36.30 | 32.88 | 3.42 | 10.67 | 19.09 | -0.56 | 1.82 | N |
| 8 | 805653 | 7099503 | 7905156 | 806 | 483 | 59.93 | 54.22 | 5.71 | 8.71 | 12.59 | -0.51 | 1.47 | N |
| 8 | 79833 | 11960354 | 12040187 | 81 | 36 | 44.44 | 40.74 | 3.70 | 10.01 | 27.87 | -1.27 | 2.77 | T |
| 8 | 159305 | 39229932 | 39389237 | 160 | 148 | 92.50 | 92.50 | 0 | 3.53 | 70.46 | -1.88 | 7.47 | N |
| 15 | 2143843 | 20181574 | 22325417 | 2144 | 1119 | 52.19 | 49.77 | 2.43 | 36.46 | 44.42 | -0.30 | 0.90 | N |
| 16 | 1322257 | 32461673 | 33783930 | 1323 | 409 | 30.91 | 28.80 | 2.12 | 55.27 | 75.03 | -0.29 | 0.65 | N |
| 23 | 70331 | 85637584 | 85707915 | 71 | 62 | 87.32 | 87.32 | 0 | 9.32 | 103.45 | -2.26 | 4.58 | N |
| 1 | 329610 | 17009417 | 17339027 | 331 | 62 | 18.73 | 18.43 | 0.30 | 94.48 | 74.41 | 0.29 | 0.47 | N |
| 9 | 2402869 | 40762481 | 43165350 | 2404 | 1851 | 77.00 | 69.55 | 7.45 | 6.35 | 4.77 | 0.33 | 1.17 | T |
| 9 | 570733 | 69216989 | 69787722 | 572 | 345 | 60.31 | 55.94 | 4.37 | 7.61 | 11.24 | -0.53 | 1.42 | N |
| 17 | 113092 | 18347753 | 18460845 | 114 | 88 | 77.19 | 76.32 | 0.88 | 1.72 | 24.91 | -1.71 | 4.90 | T |
| 17 | 157815 | 36261055 | 36418870 | 159 | 35 | 22.01 | 18.87 | 3.14 | 119.22 | 91.35 | 0.42 | 0.49 | N |
| 23 | 403618 | 60046 | 463664 | 405 | 129 | 31.85 | 30.62 | 1.23 | 89.81 | 39.40 | 0.25 | 1.97 | N |
| 23 | 1113902 | 459655 | 1573557 | 1115 | 210 | 18.83 | 17.76 | 1.08 | 118.49 | 58.36 | 0.08 | 1.62 | N |
| 1 | 141725 | 104160240 | 104301965 | 143 | 128 | 89.51 | 89.51 | 0 | 2.34 | 5.25 | -1.29 | 2.97 | N |
| 5 | 791799 | 69625288 | 70417087 | 793 | 674 | 84.99 | 80.83 | 4.16 | 3.48 | 3.52 | -0.04 | 1.04 | N |
| 22 | 52333 | 24346708 | 24399041 | 53 | 46 | 86.79 | 86.79 | 0 | 0.69 | 66.70 | -2.79 | 9.16 | N |
Discussing the above two tables results in the following observations: most of the starred dataset turns up in the 'call' set (only one is missed), and most of the negatives are not called. The negatives in the call list can be explained by the fact that these regions lie on the borders of a true CNV. Some of these calls should not have been classified as a negative, I can see that now. For example, one of the FNs I classified as negative is definitely a case of low coverage for the reference and no coverage for the sample. This shall be run again.

This is also done for the 1MM dataset to see the differences in that case. The resulting tables look similar: most stars are found in the 'call' set (5 are in the wrong set and shall be rechecked to see if they are indeed CNVs), all the negative calls are in the 'filtered' set, and T (doubt) is divided over the two sets.
The five CNVs that are wrongly classified in the 'filtered' set were analyzed again:
• Three calls are definite low/no-coverage-for-reference/sample calls. All bins in the reference contain some signal; almost all bins of the sample are empty.
• One call is a definite low/no-coverage-for-reference/sample call. Almost all bins in the reference contain some signal; almost all bins of the sample are empty.
• One call contains a dip in the reference signal and not in the sample signal; the coverage of the reference dips below 50% and is therefore filtered.
In this case three of these calls can be recovered by dropping the requirement that the coverage of the sample should be above 10%. For the other samples this will mean that some of the negative calls surface in the 'call' file. In my opinion it is more important to include most positives at the expense of extra negatives in the positive set than the other way around.
Allowing a mismatch – comparison

Below are the observed differences with the data described above. I've concentrated on the two high coverage samples for the time being; the low coverage samples will be done later.

      Daphne1   Daphne1_1MM   Difference 0MM/1MM   KG    KG_1MM   Difference 0MM/1MM
TP    18        21            +2 / +2              44    55       +0 / +2
FP    26        29            +4 / +7              28    50       +4 / +23
FN    7         6             +3 / +2              10    7        +3 / +0

As can be seen in the above table, there is a difference between the two settings. Allowing a mismatch results in a higher number of false positives and, in one of the samples, also a higher number of true positives (this can be concluded from the 'difference' columns, where I looked at the microarray calls and checked which calls were not found by the other sample). The higher number of TPs for Daphne1_1MM can be explained by the fact that multiple NGS calls overlap one array CGH call.

The main question is: what is the structure of the calls that are found by one method and not by the other?

Starting with the TPs, I looked at the two calls that were 'extra' in both sets. In the 0MM dataset one call had low/no coverage issues. The other call can be seen, but does not stand out much. The calls found in 1MM are more distinct. One shows a slight increase in sample signal and none in the reference, and the other call has no coverage for the sample but does have coverage in the reference. In this particular case the choice for 1MM would be better. KG shows the same result: two extra calls are found. One call is not clear, but the other shows coverage/no coverage for reference/sample.

Looking at the FPs, a large number of FPs is not useful, but the precise nature of these FPs is, so I looked at the calls that are different. In the Daphne1 sample four extra FPs are given (3 not distinguishable, 1 distinguishable, but not clearly). In 1MM there are 7 extra calls (1 not distinguishable, 3 call, 3 doubtful). The 1MM dataset gives half of the extra FPs as reliable calls. In this case, 1MM is the better choice. The KG sample shows similar results. The 0MM calls: 1 low/no coverage for reference/sample, 2 call, 1 doubtful. The 1MM KG sample: 2 are not CNVs, 5 doubtful, 2 on boundaries of big calls, 1 low/no coverage for reference/sample, 6 CNVs and 7 calls contain mostly empty bins (which seems to occur especially often in this particular sample). In this case, 1MM is also the better choice.

The false negatives give a similar view. One extra is found when allowing no mismatch. This is not a better score, thus one would choose 1MM. Looking at the calls themselves: one of the calls missed by 0MM contains mainly empty bins, one shows a slight increase in sample signal (a clear CNV) and one has low/no coverage for sample/reference. The calls missed by 1MM: a low coverage call (doubtful) and a slight dip in signal coverage (also doubtful). Again the choice is for 1MM. The same holds for KG: 3 extra FNs compared to 1MM. Two are not CNVs, but the third is a case of low/no coverage for reference/sample.

O = not found by 0MM
I = not found by 1MM
Sample 19 is not in agreement with the above statements, thus it is displayed here differently.

Sample has no data / reference does: OOO
Low coverage in both sets: OIOIOO 19I
No visible CNV: O
Dip called as CNV (should still be found): IOO 19O 19I
Deletion in the reference: 19I
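The TP/FP/FN bookkeeping used above, where several NGS calls overlapping one aCGH call still count as a single true positive, can be sketched as follows. The interval lists are made-up examples, not the actual call sets.

```python
# Sketch of the call comparison: an aCGH call counts as one TP as soon
# as at least one NGS call overlaps it, even if several NGS calls cover
# it; an NGS call overlapping no aCGH call counts as an FP.

def overlaps(a, b):
    """Two half-open (start, end) intervals overlap if neither ends first."""
    return a[0] < b[1] and b[0] < a[1]

def compare_calls(acgh_calls, ngs_calls):
    tp = sum(1 for a in acgh_calls if any(overlaps(a, n) for n in ngs_calls))
    fn = len(acgh_calls) - tp
    fp = sum(1 for n in ngs_calls if not any(overlaps(n, a) for a in acgh_calls))
    return tp, fp, fn

acgh = [(100, 200), (500, 600)]
ngs = [(120, 150), (160, 190), (800, 900)]  # two NGS calls hit one aCGH call
print(compare_calls(acgh, ngs))  # -> (1, 1, 1)
```

This also explains why the 'difference' columns do not add up with the raw TP counts: one aCGH call covered twice is still one TP.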
Empty bins

One of the phenomena that I observed when looking at the calls is that DWAC-seq has difficulty calling CNVs that have some reference coverage but no sample coverage. Sometimes these calls are found, and sometimes not. I find this very interesting. DWAC-seq itself does not deal in any way with empty bins, because it uses a number of reads per window. Low or no coverage is in this way skipped. It could be that in this way some regions are missed. The following plots were made to show the density of the empty bins in a couple of cases.

The above are examples of how the empty bin density can be displayed. The left figure displays a clear call; there are no empty bins in this CNV. The right call displays a deletion in the sample. There is almost no coverage in the sample (lower than 10%). By the filtering that was applied before, this call was filtered out. Because there is clearly a call visible, the filtering is adjusted so that very low percentages of sample coverage are allowed. This call will turn up in the correct 'call' file in the future. The darker the red area, the more empty bins are situated there (simply said: what is plotted in this 'EB Sample' graph is a 0 if the bin contains data and a 1 if it does not; plotting is done using dark red, and a gradient shows the lower density).
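The 'EB Sample' track described above (a 0 for a bin with data, a 1 for an empty bin, drawn as a dark-red gradient) can be sketched as below; the bin counts and the smoothing window are illustrative only.

```python
# Minimal sketch of the empty-bin track: mark each bin with no reads,
# then take a sliding-window mean of the indicator to get the local
# empty-bin density that drives the red gradient.

def empty_bin_track(read_counts):
    return [1 if c == 0 else 0 for c in read_counts]

def density(track, window=3):
    """Mean of the indicator in a centered window -> gradient intensity."""
    half = window // 2
    out = []
    for i in range(len(track)):
        seg = track[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out

counts = [5, 0, 0, 0, 7, 2, 0, 4]   # made-up reads per bin
track = empty_bin_track(counts)     # [0, 1, 1, 1, 0, 0, 1, 0]
print(density(track))
```

A run of empty bins (such as a sample deletion or a centromere) shows up as a density near 1, while an isolated empty bin is damped by the smoothing.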
Two other calls that were filtered remain in the 'filtered' file after removing the filter of sample coverage > 10%. The left figure shows a dip in the reference signal. The coverage is below 50% of the coverage that is expected, and the call is thus filtered out. In this case, because it seems that the CNV is situated in the reference and not in the sample, it is not very important that this call is lost (this depends on how we want to deal with cases like this).

The right figure shows another call; this call is filtered because both sample and reference contain empty bins and the total percentage of empty bins is above 20%. The question remains whether this is a valid CNV and not some noise in the data.

The filtering applied is a very loose way to get rid of a couple of easily discernible false positives. This reduces the number of calls that need to be checked by hand/eye to an acceptable amount: between 28 and 50, which we determined as a 'perfect' range and which is in accordance with (Feuk, Carson et al. 2006).
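A minimal sketch of this rule-filter, using only the two thresholds mentioned in the text (reference coverage below 50% of the expected value, and more than 20% empty bins overall). The call representation is hypothetical, and the real filter contains more rules than these two.

```python
# Two of the filtering rules described above, as predicates on a call.
# Fractions are relative to the expected coverage / total bins.

def passes_filter(ref_cov_fraction, empty_bin_fraction):
    """Return True when a call survives these two rules."""
    if ref_cov_fraction < 0.5:    # dip in the reference signal
        return False
    if empty_bin_fraction > 0.2:  # too many empty bins in total
        return False
    return True

print(passes_filter(0.9, 0.05))  # clear call -> True
print(passes_filter(0.4, 0.05))  # reference dip -> False
print(passes_filter(0.9, 0.35))  # mostly empty bins -> False
```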
Probe density

Same as described above for the empty bins, I also made plots for the probe density in all calls. The results can be found below. A bar is drawn for each bin with the height of the number of probes in that region. Each dot has one bar. As can be seen, on average the probes are evenly distributed over the genome. Most plots look like the one below on the left. There are exceptions of course, one of them shown below on the right. Here some regions do not have any probes. The design of the probe locations was made keeping in mind regions of the genome that are repetitive, as these result in unreliable hybridization of DNA to the array. This decision seems sensible, but it is very unlikely that the design is perfectly aligned with these regions. NGS does not have this problem, as it is not dependent on probe positioning beforehand.
Table 3. Results of the rule-filter applied on the detected regions from the NGS data, when comparing them to the manually annotated set of calls. The manual annotation split the combined calls based on the NGS and aCGH data into three classes: a positive set (actual CNVs), a negative set (no CNVs, or too low coverage), and an undetermined set (hard to determine the class). The numbers in the columns 'Pos.' (positive), 'Neg.' (negative) and 'Und.' (undetermined) are the detected regions that are assigned to the corresponding set. All calls that are generated by aCGH and NGS are assigned to a class. The 'Total' row displays the total number of calls, and how these calls are manually annotated. The 'Passed' row shows how many calls of the three sets pass the filter for this sample, the 'Deleted' row the calls that do not pass the filter. The total number of calls after rule-filtering and merging is given in the first column.

Sample                   Class     Count         Pos.  Neg.  Und.
Sample 1 (Merged: 30)    Total     60            23    12    25
                         Passed    43 (71.7%)    23     3    17
                         Deleted   17 (28.3%)     0     9     8
Sample 2 (Merged: 23)    Total     51            17    10    26
                         Passed    30 (58.8%)    12     1    17
                         Deleted   21 (41.2%)     3     9     9
Sample 3 (Merged: 43)    Total     85            51    11    23
                         Passed    71 (83.5%)    50     4    17
                         Deleted   14 (16.5%)     1     7     6
Sample 4R1 (Merged: 40)  Total     57            27     9    21
                         Passed    43 (75.4%)    25     2    16
                         Deleted   14 (24.6%)     2     7     5

2.9 Determining the minimal coverage

To determine the minimal coverage for the NGS data to still reliably detect aberrated regions, we set up a simulation in which the coverage is artificially reduced before applying the detection tools. This is done using samples 3 and 4, of which sample 4 is sequenced in two different runs (sample 4R1 and sample 4R2). The original coverages of 3.6, 1.2 and 1.0 for these samples, respectively, are reduced, by uniform sampling, in steps of two, to a coverage between 0.1 and 0.45. The resulting calls are analyzed and observations can be found in table 4. The lower coverages show an increase in the total number of calls. This turning point is determined by looking at the number of calls after merging; this number is kept below 50. Concluding from this reduction experiment, the lowest coverage that can be used for the detection of CNVs with reduced NGS data lies at 0.23x.

2.10 The coverage for the control sample should be close to the coverage of the test sample

Sample 4 and the control sample are sequenced in multiple runs (for more details see figure 8), which allows us to compare the results of two independent runs. Table 5, first row, shows the results, and one can notice that the performance for both runs of sample 4 is more or less the same. Even when the data of the two runs is combined, the performance stays the same. We also analyzed the runs by using all measurements done with the combined control sample, AR1,2,3 (which results in an increase of the control sample coverage to 4.5x). The result is shown in the second row of table 5, which shows that the false positives increase strongly. This is probably due to the fact that the size of the window is incorrect, because the window size is expressed in the number of reads aligned per window and the coverage of the control sample drastically increased. Therefore we tried two alternative settings for the window size, 350 and 700. It can be observed that the number of false positives does drop, but at the same time the number of true positives also drops, so that the performance eventually is not improved when using the combined data for the control sample. Also the number of merged calls (last column in table 5) shows that combining data for the control sample does not help, as the number of merged calls when using the same coverage for the control sample as for the test sample is closest to the expected number of calls (somewhere between 30 and 50, see section 2.4).

2.11 Multiplexing test and control sample

As multiple factors can influence the process of sequencing, we expect that the best results will be possible when the test sample and the control sample are sequenced in the same lane (multiplexing, for details see figure 8). The settings in which the corresponding test and control sample are multiplexed are colored red in table 6. We expect that multiplexing reduces biases that can occur due to differences in sequencing conditions.

Array CGH data plots

As could be seen in some of the above figures with the aCGH data, some of the calls that are made do not lie above/under the designated thresholds that are used for calling CNVs. This is probably due to the fact that Nexus does some additional analysis on this data before displaying the calls for visual inspection. What this particular method entails still needs to be figured out. I will spend an afternoon/morning on one of the computers at the VUmc to load some of my files into Nexus. In this way I can also look at how my calls are displayed in Nexus. A specific input format is required; I'll see if this is applicable for my data.

This issue is fixed: the log-ratio that was used in Feature Extraction is not used by Nexus. Feature Extraction calculates a log10-ratio, while Nexus employs a log2-ratio. After fixing this, the plots look like they do in Nexus. The calls that are made are also the same as the calls I received some time ago.

Filtering done on low coverage data

As stated before, it is hard to determine with accuracy which of the calls of the low coverage datasets are 'true' and 'false'. The 'doubt' dataset is much larger than with the high coverage dataset.

One of the other remarkable observations is that in these low coverage samples, sometimes a call is made on the centromere region. This is not the case with the high coverage samples. This centromere-calling can be observed in both 0MM and 1MM samples. These calls will be filtered from the dataset due to the large number of empty bins, but it is an interesting observation. One explanation could be that the ndup filtering applied is not strict enough, and that the small number of reads remaining in the tower still has some disturbing influence on the statistics. The reason that these calls span the whole chromosome is due to the DWAC algorithm. The windows contain a number of reads, not a set size. This results in an overlapping window between the last reads upstream of the centromere and the first reads downstream. As these calls are filtered out, I don't want to spend a lot of time on how to resolve them further.

The results of the filtering are shown below:
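The Feature Extraction versus Nexus log-ratio fix mentioned above amounts to a simple change of logarithm base; a quick check, assuming nothing beyond the two definitions:

```python
import math

# Feature Extraction reports log10(sample/reference), while Nexus plots
# log2(sample/reference); dividing by log10(2) converts one into the other.

def log10_to_log2(log10_ratio):
    return log10_ratio / math.log10(2)

# A single-copy gain in a diploid genome gives a ratio of 3/2:
ratio = 1.5
print(log10_to_log2(math.log10(ratio)))  # equals math.log2(ratio), ~0.585
```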
D.M. van Beek et al
Dilution experiments

The dilution experiments were started to find the lowest coverage that is needed for detection of CNVs. The assumption is that when dividing a dataset repeatedly in two parts, the new datasets resemble a sample sequenced at low coverage. As seen before, there seems to be a turning point when reaching 0.1–0.3 coverage, so dilution takes place until we reach this level of coverage. We have two samples that can be used for this experiment (KG and Daphne1) because they have high enough coverage. Daphne1 is sequenced in two different runs, so we can even study the effects of taking samples from different runs.

The following setups were made:

Sample KG – aligned: 4.097x
  Split 1: 2 folds of 2.05x
  Split 2: 4 folds of 1.025x
  Split 3: 8 folds of 0.512x / 0.454x (f)   Set 11 and Set 12 and …
  Split 4: 16 folds of 0.256x / 0.227x (f)  SetG1 and SetG2 and …
  Split 5: 32 folds of 0.128x / 0.114x (f)  SetG11 and SetG12 and …

Sample Daphne1 run 1 – aligned: 1.392x
  Split 1: 2 folds of 0.7x
  Split 2: 4 folds of 0.346x (f)   Set11 and Set12 and …
  Split 3: 8 folds of 0.174x (f)   SetG1 and SetG2 and …
  Split 4: 16 folds of 0.087x (f)  SetG11 and SetG12 and …

Sample Daphne1 run 2 – aligned: 1.190x
  Split 1: 2 folds of 0.6x
  Split 2: 4 folds of 0.264x (f)   Set11 and Set12 and …
  Split 3: 8 folds of 0.132x (f)   SetG1 and SetG2 and …
  Split 4: 16 folds of 0.066x (f)  SetG11 and SetG12 and …

The analysis is run using the optimal settings: a window size of 100 and a threshold of 0.25. The results are displayed below.
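The splitting scheme above can be mimicked by randomly halving the set of aligned reads at each split level; integer read IDs stand in for real alignments here.

```python
import random

# Each split shuffles the reads and cuts the set into two halves, so a
# fold at split level n holds roughly 1/2^n of the original coverage.

def split_in_two(reads, seed=0):
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

reads = range(1000)                   # stand-in for a set of aligned reads
fold_a, fold_b = split_in_two(reads)  # two folds at half the coverage each
print(len(fold_a), len(fold_b))       # -> 500 500
```

Applying `split_in_two` recursively to each fold produces the split-2, split-3, … levels listed above.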
As can be seen in the table, the turning point is clearly visible. A very high jump in false positives is seen when reaching a low coverage. For reference purposes samples 1 and 2 are also added in the figure. Here we can see that for sample 3 a jump is visible between 0.11 and 0.23. Note that in sample 3 the number of passed calls is already higher than we normally expect (71 as opposed to 39, 36, 42 and 29). This is mainly due to the very large triplication. Around a coverage of 0.45 the results seem good. In sample 3 you see a dip in the number of calls at this coverage and then again a rise. Sample 4 shows in both runs something else: the 'no dilution' class shows the least number of calls, and the calls keep on rising after that.

One of the conclusions that we can draw is that it is not probable that the dilutions that we perform capture the whole process of sequencing. We lose some data that is there when sequencing a sample at low coverage.

A large number of false positives are called on the X-chromosome. This is due to the fact that a female reference is used, while both samples are male. This is very important to notice; when using a male reference, it is probable that we would not find most of these calls.

When we look at the filtering step, we also see a changing point. In sample KG we see a dip from 83% passed to 75% and back to 83%. This is not seen in the other sample.

In absolute terms we can state that detection of a CNV of 48 kb is possible with a coverage of 0.43x, as this is the lowest coverage sample that we use in this total project. The dilution experiments show that lower coverage is possible (we still find a large number of TPs) at 0.1x coverage. The number of false positives is very high though.

The largest jump is made between 0.11–0.23, 0.09–0.17 and 0.13–0.26. We can therefore conclude from these experiments that a coverage of 0.26 is still very reliable. This is a lot lower than 0.43x.

Passed   Filtered   Total   Perc. passed   Perc. filtered
36       18         54      66.7           33.3
61       19         80      76.3           23.8
61       20         81      75.3           24.7
142      27         169     84.0           16.0
131      25         156     84.0           16.0
182      54         236     77.1           22.9
161      44         205     78.5           21.5
71       14         85      83.5           16.5
61       20         81      75.3           24.7
69       23         92      75.0           25.0
71       19         90      78.9           21.1
77       22         99      77.8           22.2
127      27         154     82.5           17.5
128      25         153     83.7           16.3
39       18         57      68.4           31.6
57       21         78      73.1           26.9
59       21         80      73.8           26.3
78       22         100     78.0           22.0
82       20         102     80.4           19.6
208      54         262     79.4           20.6
194      40         234     82.9           17.1

Table 4. Comparison of the different dilution sets for samples 3 and 4. The microarray pool is used as reference (R1, coverage of 1.3x). Window size is 100 reads per window, threshold of 0.25. The 'X:' states the number of false positives that are called on the X-chromosome. There is a difference when adding up the TP, FP and FN and the passed and filtered calls. This difference is explained by the fact that a TP is counted only once, even if more than one NGS call overlaps an aCGH call. Samples 1 and 2 are added for comparative reasons.

Sample 3       Coverage   TP   FP            FN   Passed   Filtered   Merged
No dilution    3.63 x     24   40 (X: 8)     11   71       14         43
Dilution A     0.45 x     24   38 (X: 7)     11   61       20         36
Dilution B     0.45 x     27   38 (X: 9)      8   69       23         35
Dilution C     0.23 x     23   44 (X: 11)    12   72       18         46
Dilution D     0.23 x     24   56 (X: 14)    11   78       21         54
Dilution E     0.11 x     23   107 (X: 45)   12   127      27         100
Dilution F     0.11 x     24   105 (X: 45)   11   130      23         100

Sample 4 (r1)  Coverage   TP   FP            FN   Passed   Filtered   Merged
No dilution    1.24 x     17   29 (X: 7)      7   43       14         40
Dilution A     0.35 x     12   49 (X: 11)    12   61       17         46
Dilution B     0.35 x     12   50 (X: 13)    12   59       21         43
Dilution C     0.17 x     13   62 (X: 16)    11   79       21         61
Dilution D     0.17 x     11   68 (X: 18)    13   82       20         64
Dilution E     0.09 x     14   209 (X: 39)   10   237      26         214
Dilution F     0.09 x     15   194 (X: 42)    9   207      27         186

Sample 4 (r2)  Coverage   TP   FP            FN   Passed   Filtered   Merged
No dilution    1.05 x     17   26 (X: 7)      7   37       17         28
Dilution A     0.26 x     16   52 (X: 22)     8   62       18         47
Dilution B     0.26 x     13   54 (X: 28)    11   61       20         48
Dilution C     0.13 x     12   143 (X: 32)   12   142      27         129
Dilution D     0.13 x     13   127 (X: 26)   11   132      23         119
Dilution E     0.07 x     14   187 (X: 53)   10   207      29         182
Dilution F     0.07 x     12   165 (X: 42)   12   173      32         153

Sample 1       0.43 x     15   25 (X: 1)     12   43       17         30
Sample 2       0.51 x     16   23 (X: 0)     11   30       21         23

There does not seem to be a large decrease in quality of the output (the number of false negatives) for a coverage between 0.1 and 0.35, but a large increase of false positives is observed. Note that a large part of the false positives in the above figure are calls on the X-chromosome (which could be due to the use of a female reference while both samples 3 and 4 are male). The 'CNVs of interest', classified in the clinic as Type II to IV, that were present in these samples are found in all dilution levels. As we can not focus on these specific CNVs, because we do not know the type in advance, it still is desirable to find them.

2.10 Changes in coverage

As sample 4 is sequenced in two runs, as well as the microarray reference, these samples can be used to compare the effect of changes in coverage. It can be observed that the number of false positives does drop to a lower level, but not as low as with the optimal settings. The filtering deletes a larger part of the calls than with the optimal settings, of which we can determine that in the false positive set more calls are unreliable. As stated before, the average number of CNVs found in one individual is 12 [Feuk et al., 2006] and the average number of aCGH calls per sample is 28. The best way to see if the tool is optimized for the data that is used, is to compare these numbers to the 'Merged' column in Table 5. Using the microarray Pool R1 gives us results that lie in the range between 12 and 50. Using the high coverage microarray pool (HC), the window size of 350 gives results in the upper part of this range (less desirable) and 700 reads per window results in very low numbers, thus a low number of false positives.
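The turning-point rule used above (keep the number of merged calls below 50) can also be read off programmatically; the dictionary below holds the sample 3 'Merged' values for one replicate per coverage level.

```python
# Lowest coverage at which the merged-call count still stays within
# the expected range (at most ~50 calls per sample).

merged_by_coverage = {3.63: 43, 0.45: 36, 0.23: 46, 0.11: 100}

def lowest_reliable_coverage(results, max_merged=50):
    ok = [cov for cov, merged in results.items() if merged <= max_merged]
    return min(ok) if ok else None

print(lowest_reliable_coverage(merged_by_coverage))  # -> 0.23
```

For these values the rule yields 0.23x, matching the turning point reported for the reduction experiment.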
Changes in coverage of test and control sample

As I see very much variance in window sizes and in the coverage related to these window sizes, I experimented with data that is the same but varies in coverage. As we saw in the dilution experiments, dilution (probably) does not capture all the effects of sequencing. I used the data that is sequenced on different runs. Sample 4/Daphne1 is sequenced (multiplexed with a control sample) in two different runs, which gives us two different runs of control data as well. We also have an additional control sample lane that can be used as totally independent of both sample runs. These sets are put against each other using DWAC-seq 100/0.25.

The results can be found below. Because it immediately became clear that the use of a high coverage control has an influence on the optimal window size, I tried to simulate this by simply multiplying the window size by 3.5 (as the combined microarray reference coverage was 3.5 times as high as that of the first run of the microarray pool). This resulted in results similar to those of the microarray reference, but still a higher number of false positives could be observed.

Table 5. The effect of the change in coverage of sample and reference. As Sample 4 is sequenced in two runs and the reference is sequenced in three, the data could be used separately and merged to see the effects of the changing coverage on the results. Combinations are made using the two runs from Sample 4 and the microarray pool reference and the microarray pool reference HC (6.1). For the HC reference two different window sizes are used.

Reference                      Run     TP   FP   FN   Passed        Filtered      Merged
Microarray Ref. R1             Run 1   17   29    7   43 (75.4%)    14 (24.6%)    31
                               Run 2   17   26    7   37 (68.5%)    17 (31.5%)    28
                               R1+R2   18   34    6   45 (73.8%)    16 (26.2%)    34
Microarray Ref. HC WS: 100 bp  Run 1   22  240    2   211 (75.1%)   70 (24.9%)   161
                               Run 2   22  271    2   240 (75.2%)   79 (24.8%)   171
                               R1+R2   22  190    2   177 (76.6%)   54 (23.4%)   131
Microarray Ref. HC WS: 350 bp  Run 1   18   63    6   60 (67.0%)    28 (33.0%)    47
                               Run 2   18   61    6   51 (59.3%)    35 (40.7%)    36
                               R1+R2   18   54    6   51 (65.4%)    27 (34.6%)    40
Microarray Ref. HC WS: 700 bp  Run 1   12   30   12   35 (64.8%)    19 (35.2%)    27
                               Run 2   10   27   14   29 (56.9%)    22 (43.1%)    19
                               R1+R2   13   24   11   30 (62.5%)    18 (37.5%)    22

Multiplexing of the control sample with the test sample – important or not?

One of the most important questions for the clinic is whether a reference should be sequenced in multiplex fashion with every sample. As I showed that a slightly higher coverage reference (1.3x is tested in my case) generates better results than a 0.2x coverage reference, this would mean that the costs increase. Ideally a reference should be stored and reused over and over again. This means that we need to understand the differences between samples on different lanes of the sequencer.

The last sample that I got was sequenced in two runs, both multiplexed with a control sample. I could easily combine the data sets to see if there are differences in output. The experiment can be seen in the table below.

Table 6. True positives, false positives, false negatives, and the calls that passed (P) and failed (F) the rule-filtering for all combinations of the two runs from Sample 4 with the 3 runs of the microarray (MA) pool reference sample. The merged (M) column displays the calls that passed the filter after the merging step. The data generated with 'MA Combined' is produced using a 350 reads/window setting. 'MA Pool R3' has a coverage that is about twice as high as runs 1 and 2, so the window size was set at 200 reads per window. The combinations marked with an asterisk (colored red in the original table) are the data of which we expect the best results, due to multiplexing of this specific combination of test and reference sample.

                 Sample 4 Run 1                  Sample 4 Run 2
Reference        TP   FP   FN   P    F    M      TP   FP   FN   P    F    M
MA Pool R1       17   29    7   43   14   31*    17   26    7   37   17   28
MA Pool R2       14   21   10   31   14   22     12   16   12   28   14   17*
MA Pool R3       17   35    7   44   17   36     18   61    6   43   24   29
MA Combined      18   63    6   58   30   47     18   61    6   52   34   36

As multiple factors can influence the process of sequencing, we expect that the best results will be possible when the test sample and the control sample are sequenced in the same lane. The settings in which the corresponding test and control sample are multiplexed are marked in the above table. We expect that multiplexing reduces biases that can occur due to differences in sequencing conditions, such as differences in temperature or chemical composition of the genetic material. Other bias can result from GC-content: GC-rich regions are known to influence the number of reads that are sequenced in a region (Benjamini and Speed 2012). In the table above, for sample 4R2 the exact opposite of what we expect can be observed: a higher number of false negatives is observed than when using a control sample from another sequencing run (so no multiplexing). Sample 4R1 does show the results we expect: a low number of false negatives and a relatively low number of false positives for the case where the control sample is multiplexed with the test sample. As stated before, the numbers of TP, FP and FN are based on the aCGH analysis and are not very reliable. When looking at the results of the rule-filtering, for both samples the least data is deleted when using control sample AR2, and the lowest number of calls is passed. After merging, the number of calls for sample 4R2 in combination with control sample AR2 is very low, even lower than the number of aCGH calls that are made.

We conclude that the advantages of multiplexing the test and control samples can not be directly derived from the table. It seems that the use of a non-multiplexed control sample can perform as well as a multiplexed sample. As the extent of the bias resulting from sequencing on different runs is not yet known in detail, more data should be generated to validate the necessity to multiplex test and reference sample.

Control sample versus control sample

Ideally no CNVs should be detected when running the reference versus the reference. DWAC-seq run with reference versus reference should give no results, and this is true! No results are returned.

DISCUSSION

3.1 Reference

The microarray reference used in the analyses is of commercial origin and is suitable because it contains a pool of 100 test subjects, which reduces the influence of individual CNVs that may occur in one of the subjects. As seen in section 2.5, the reference contains inconsistencies, but the inconsistencies can be overcome using the read-depth data plots as well as the log2 ratio. Calls that originate from the reference sample will be easily recognizable. More ideal is the construction of a new reference. As the heritage of the test subjects in the reference pool is not known, it could be possible that the test subjects chosen were not diverse enough, resulting in population stratification. In a newly constructed reference a variety of people should be considered.

3.2 Alignment

Allowing mismatches to occur when aligning reads to the reference genome will result in less data loss than aligning in a more strict fashion. As our goal is to apply shallow sequencing, we want to lose as little data as possible. As only a few samples were available for analysis, the decision to allow one mismatch was based largely on the fact that a mismatch allows for SNPs to occur. The analysis itself showed a small preference for this setting, but having more samples will make this conclusion more stable.

3.3 Read-tower removal

After alignment, one of the filtering steps to remove bias was focused on the reduction of the influence of large stacks of reads. When the coverage of the test sample deviates largely from that of the reference sample, the performance of the read-tower filtering is less reliable. There are calls made on the centromeres and telomeres of the chromosomes, because of the large difference in read-depth of the test and reference samples. This difference is amplified by the read-tower filtering step. As the calls made on a centromere or telomere will mostly consist of empty bins, these calls will be eliminated in the post-processing step described in section 2.8.
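The window-size adjustment tried above (DWAC-seq windows hold a fixed number of reads, so the reads-per-window setting is scaled by the ratio of the control coverages) can be sketched as follows; the 4.5x and 1.3x figures come from the text, the function itself is illustrative.

```python
# Scale the reads-per-window setting by the ratio between the combined
# control coverage and the single-run control coverage.

def scaled_window(base_reads_per_window, combined_cov, single_run_cov):
    return round(base_reads_per_window * combined_cov / single_run_cov)

print(scaled_window(100, 4.5, 1.3))  # -> 346, close to the 350 that was used
```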
References

(2010). User Guide Megapool Reference DNA EA-100M (male) & EA-100F (female). Vlierweg 20, 1032 LG Amsterdam.
1000 Genomes Project (2009, 10 October). "Sanger FTP 1000 genomes Reference." Retrieved 22 February 2012.
Abyzov, A., A. E. Urban, et al. (2011). "CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing." Genome Research 21: 974-984.
Agilent Technologies. "Agilent Scan Control 8.5.1."
Agilent Technologies. "Feature Extraction 11.0."
Alkan, C., B. P. Coe, et al. (2011). "Genome structural variation discovery and genotyping." Nature Reviews Genetics 12: 363-376.
Bailey, J. A., Z. Gu, et al. (2002). "Recent segmental duplications in the human genome." Science 297(5583): 1003-1007.
Benjamini, Y. and T. P. Speed (2012). "Summarizing and correcting the GC content bias in high-throughput sequencing." Nucleic Acids Research.
BioDiscovery, Inc. (2011). "Nexus Copy Number Users Manual."
Campbell, P. J., P. J. Stephens, et al. (2008). "Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing." Nature Genetics 40(6): 722-729.
CERN ROOT package.
Chiang, D. Y., G. Getz, et al. (2009). "High-resolution mapping of copy-number alterations with massively parallel sequencing." Nature Methods 6(1): 99-103.
Darvishi, K. (2010). "Application of Nexus Copy Number software for CNV detection and analysis." Current Protocols in Human Genetics.
Ensembl (2012, January). "Ensembl Human Genome Sequence FTP."
ENZO Life Sciences, Inc. Product Data Sheet Enz-42671 CGH labeling kit for oligo arrays.
Feuk, L., A. R. Carson, et al. (2006). "Structural variation in the human genome." Nature Reviews Genetics 7: 85-97.
Gusnanto, A., H. M. Wood, et al. (2012). "Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data." Bioinformatics 28(1): 40-47.
Holstege, H. (2010). "Gebruik van array in prenatale diagnostiek in Nederland 2010."
Hunter, J., D. Dale, et al. Matplotlib.
Illumina, Inc. (2010). De novo assembly using Illumina reads.
Illumina, Inc. (2011). TruSeq RNA and DNA Sample Preparation Kits v2.
Klambauer, G., K. Schwarzbauer, et al. (2012). "cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data." Nucleic Acids Research 40.
Korbel, J. O., A. E. Urban, et al. (2007). "Paired-end mapping reveals extensive structural variation in the human genome." Science 318: 420-426.
Koval, S. and V. Guryev (2011, 10 February). "DWAC-seq: Dynamic Window Approach for CNV detection using sequencing tag density." Retrieved 17 April 2012.
Lam, H. Y. K., M. J. Clark, et al. (2012). "Performance comparison of whole-genome sequencing platforms." Nature Biotechnology 30: 78-82.
Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25(14): 1754-1760.
Li, H., B. Handsaker, et al. (2009). "The Sequence Alignment/Map (SAM) format and SAMtools." Bioinformatics 25: 2078-2079.
Li, R., Y. Li, et al. (2008). "SOAP: short oligonucleotide alignment program." Bioinformatics 24(5): 713-714.
Li, R., C. Yu, et al. (2009). "SOAP2: an improved ultrafast tool for short read alignment." Bioinformatics 25(15): 1966-1967.
Lo, Y. M. D., N. Corbetta, et al. (1997). "Presence of fetal DNA in maternal plasma and serum." The Lancet 350(9076): 485-487.
McCarroll, S. A. and D. M. Altshuler (2007). "Copy-number variation and association studies of human disease." Nature Genetics Supplement 39: S37-S42.
Medvedev, P., M. Stanciu, et al. (2009). "Computational methods for discovering structural variation with next-generation sequencing." Nature Methods Supplement 6(11s): S13-S20.
Pinkel, D. and D. G. Albertson (2005). "Comparative genomic hybridization." Annual Review of Genomics and Human Genetics 6: 331-354.
Pinkel, D., R. Segraves, et al. (1998). "High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays." Nature Genetics 20: 207-211.
Robinson, J. T., H. Thorvaldsdóttir, et al. (2011). "Integrative Genomics Viewer." Nature Biotechnology 29: 24-26.
Smeets, S. J., U. Harjes, et al. (2011). "To DNA or not to DNA? That is the question, when it comes to molecular subtyping for the clinic!" Clinical Cancer Research 17: 4959-4964.
Vissers, L. E. L. M., B. B. A. de Vries, et al. (2009). "Genomic microarrays in mental retardation: from copy number variation to gene, from research to diagnosis." Journal of Medical Genetics 47: 289-297.
Wood, H. M., O. Belvedere, et al. (2010). "Using next-generation sequencing for high resolution multiplex analysis of copy number variation from nanogram quantities of DNA from formalin-fixed paraffin-embedded specimens." Nucleic Acids Research: 1-11.
Xi, R., A. G. Hadjipanayis, et al. (2011). "Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion." Proceedings of the National Academy of Sciences 108(46): E1128-E1136.
Xie, C. and M. T. Tammi (2009). "CNV-seq, a new method to detect copy number variation using high-throughput sequencing." BMC Bioinformatics 10(80).
Yoon, S., Z. Xuan, et al. (2009). "Sensitive and accurate detection of copy number variants using read depth of coverage." Genome Research 19: 1586-1592.
Straver, R. (2012). Literature Study: Detection Of Fetal Genetic Disorders Using Maternal Blood And Next Generation Sequencing. Straver, R., H. Holstege, et al. (2012). "Detection of Fetal Copy Number Aberrations by Shallow Sequencing of Maternal Blood Samples." Unpublished. Tucker, T., A. Montpetit, et al. (2011). "Comparison of genome-‐wide array genomic hybridization platforms for the detection of copy number variants in idiopathic mental retardation." BMC Bioinformatics 4(25). van de Wiel, M. A., R. Brosens, et al. (2009). "Smoothing waves in array CGH tumor profiles." Bioinformatics 25(9): 1099-‐1104.
63