Sequence dependent UV damage of complete pools of oligonucleotides

Understanding the sequence-dependent DNA damage formation requires probing a complete pool of sequences over a wide dose range of the damage-causing exposure. We used high throughput sequencing to simultaneously obtain the dose dependence and quantum yields for oligonucleotide damages for all possible 4096 DNA sequences with hexamer length. We exposed the DNA to ultraviolet radiation at 266 nm and doses of up to 500 absorbed photons per base. At the dimer level, our results confirm existing literature values of photodamage, whereas we now quantified the susceptibility of sequence motifs to UV irradiation up to previously inaccessible polymer lengths. This revealed the protective effect of the sequence context in preventing the formation of UV-lesions. For example, the rate to form dipyrimidine lesions is strongly reduced by nearby guanine bases. Our results provide a complete picture of the sensitivity of oligonucleotides to UV irradiation and allow us to predict their abundance in high-UV environments.


Supplementary Information
SI-1: Methods and Materials
SI-2: Normalization of the raw data
SI-3: Improving statistics by averaging over tail sequences and positions
SI-4: Exponential length dependence of the survival probability
SI-5: Validation for sequence-averaging over all positions
SI-6: Sequence symmetry of survival probabilities
SI-7: Absorbed Dose
SI-8: Molecular Model for Dose-Dependence
SI-9: Molecular Model for the Damage Coefficient of Oligomers
SI-10: Fitting of damage rates
SI-11: Error propagation
SI-12: Example for data handling
SI-13: Detection of outliers
SI-14: Analysis for the predominant occurrence of single-stranded DNA
SI-15: Determination of the damage yield for GG-dimers by ultraviolet spectroscopy
SI-16: Assessment of the sensitivity of the method for detection of 8-oxoguanine
SI-17: Characteristic data on the dose dependences for tetramers and hexamers
SI-t 1: Values of extinction coefficients used for the calculation of the absorbed dose
SI-t 2: Absolute frequencies for the raw data set
SI-t 3: Absolute frequencies of the central hexamer
SI-t 4: Absolute frequencies of the dimers
SI-t 5: Relative frequencies of the dimers
SI-t 6: Relative frequencies of the dimers
SI-t 7: Relative frequencies of the dimers after correction
SI-t 8: Decay parameters for tetramers
SI-t 9: Hexamer sequences with largest and smallest survival probabilities

of the enzymes used to detect the desired type of damage should be studied. After the discrimination step, only undamaged DNA should be successfully indexed for sequencing.
Step 4: Sequencing, determination of the oligomer frequency (2nd dimension). Only completely prepared DNA, i.e. strands with fully ligated index adapters, can be detected quantitatively by NGS. More frequent sequences generate more readout events, which allows the determination of the relative number of all sequences within a sample. Since the technique relies on the number of surviving oligomers, the sequencing procedure should be performed such that it yields a sufficiently large number of counts for the different sequences in the pool.
Step 5: Analysis and quantification. The different steps in sample preparation, amplification and analysis may be subject to fluctuations in relevant parameters, such as pipetting errors or the sequence dependence of the enzymes used. These effects alter the total and relative number of oligomers with certain sequences. In addition, the synthesis, the library preparation and the sequencing itself can be sequence dependent. These effects have to be considered before the dose dependence of the survival of oligomers can be quantified. An example is given in Figure SI-f1, where the average number of oligomers is plotted for the unirradiated sample, i.e. D = 0, without any corrections. The observed frequencies vary by several orders of magnitude, with a preference for oligomers containing mainly adenosine. To correct for these variations, the frequencies of the sequences from the irradiated sample should be normalized with the corresponding frequencies of the sequences from the unirradiated sample (see next chapter). Sequence-independent deviations between different samples can additionally be corrected by further normalization to a supposedly known frequency of a certain "lighthouse sequence". This could be, for example, a sequence that is not damaged by UV light at all, or one whose damage rate could be determined by an independent measurement using e.g. UV spectroscopy. Details of the correction process and the underlying equations are presented in the next chapter.

Materials and techniques
DNA randomers of the sequence ACACNNNNNNNNACAC and 8-oxoG modified sequences (see SI-16) were synthesized by a commercial provider (Biomers, Germany), whereby all four canonical nucleotides were used in equal amounts in the synthesis steps of the central 8-mer. Upon receipt, the samples were dissolved in 1x PBS (137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4, and 1.8 mM KH2PO4), diluted to a base concentration of 1 mM, vortexed, and centrifuged.
The final DNA concentration was checked by Nanodrop (Thermo Fisher, US) and UV-vis measurement (Shimadzu UV1800). Due to the low absolute concentration per sequence of 1 nM and the fact that Watson-Crick base pairing can occur mainly in the comparably short central 8-mer, less than 0.1 % DNA duplexes are expected at room temperature. Therefore, the expected effect of reducing the susceptibility to damage in the double strand compared to a single strand is negligible (see SI-13). Oligonucleotides GGT=TGG, AGT=TGA, GAT=TAG and ACT=TCA containing the CPD damage T=T were purchased from IBA GmbH, Germany. These oligonucleotides (and the corresponding intact oligomers at the same concentrations) are directly introduced in Step 3 of the procedure, and the absolute frequencies of the intact sequences are recorded after Step 4. For the UV exposure, 3.45 ml of the sample were filled into a fused silica cuvette (Type 117100F-10-40, Hellma, Germany) with a pathlength of 10 mm, which was then placed in a temperature-controlled socket. Magnetic stirring was used to ensure a homogeneous sample at a temperature of 22°C. The exposure itself was carried out with a Nd-based laser system (AOT-YVO-25QSP/MOPA) from Advanced Optical Technology, UK (repetition rate of 6.5 kHz, average power ca. 20 mW, wavelength 266 nm, beam diameter in the sample ca. 1 mm). The laser power before and after the sample cuvette was monitored by power meters (Ophir, Israel) to determine the power absorbed in the sample (for details of the determination of the absorbed dose see SI-7). After a desired dose was reached, the exposure was stopped briefly and 50 µl of sample was taken from the cuvette, frozen and stored at -80°C. Illumination and sample collection were repeated until a maximum average dose of ca. 500 photons per base was reached.
The samples were then prepared for sequencing according to the protocol of the Swift Accel-NGS 1S DNA Library Kit (Swift Bioscience, US), taking care to process the samples immediately after thawing. The sequencing was performed using a HiSeq sequencer (Illumina, US) and the data was analyzed using the methods given in the main text and in the Supplementary Information SI-1 to SI-14. The determination of the dose, i.e. of the number of absorbed photons per base (see SI-7), requires knowledge of the number of oligomers in the sample cell and of the extinction coefficients of the different bases in the oligomers at the irradiation wavelength of 266 nm. For this wavelength the extinction coefficients of the four bases A, C, G and T are similar. For simplicity of the data evaluation we assume that the extinction coefficients are the same. This assumption, together with the uncertainty of the determination of the absolute extinction coefficients, leads to a relative error in the determination of the irradiation dose of ΔD/D = 0.16. When decay coefficients of the fit curves or quantum yields Φ are determined, the total error is due to the error of the fit procedures and the error in the determination of the dose values.
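As an illustration of how the two error sources named above might be combined, the following minimal sketch adds the relative fit error and the relative dose error in quadrature. This assumes that the two errors are independent, which is an assumption made here and not a statement of the procedure used in this work (see SI-11 for the error propagation). The function name and the 5 % fit error are hypothetical.

```python
import math

# Relative dose uncertainty stated in the text (Delta D / D = 0.16).
REL_DOSE_ERROR = 0.16


def total_relative_error(rel_fit_error, rel_dose_error=REL_DOSE_ERROR):
    """Combine fit and dose uncertainties in quadrature.

    Assumption (not from the source): both errors are independent.
    A decay coefficient k obtained from exp(-k * D) scales inversely
    with the dose axis, so to first order the relative dose error
    carries over directly to k.
    """
    return math.hypot(rel_fit_error, rel_dose_error)


# Hypothetical example: a decay coefficient fitted with 5 % relative error.
example_total = total_relative_error(0.05)
print(f"total relative error: {example_total:.3f}")
```

For fit errors well below 16 %, the dose uncertainty dominates the total error in this picture.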

SI-2. Normalization of the raw data
The technique presented in the main document and in SI-1 records the action of an external exposure (in the present case UV irradiation) on oligonucleotides in a highly parallel way. The parallel nature is obtained by combining oligomers of defined length but arbitrary sequence with second-generation sequencing, which leads to a 2D data set. The steps of the data handling procedure are as follows (see also main text and SI-1): (1) DNA oligomers of a defined length with tails (sequence ACAC, in some cases GTTG) at the 5' and the 3' end contain in the center eight bases with arbitrary sequences, i.e. the DNA synthesis is performed with equal concentrations of the four nucleotides A, C, G and T. Possible sequence biases during synthesis are addressed below. The tails are used to identify the intact oligomers after the sequencing process. In addition, they may serve to avoid double strands in the oligomer solutions. The sequence ACAC is chosen for the tails since it contains no damage-prone bipyrimidine step and should survive UV-irradiation doses of considerable strength. The complete 16-mer solution is split after the irradiation into 18 individual samples, labelled by the index $j$. These samples have the same volume and oligomer concentration. For an arbitrary sequence $i$, within the precision of the synthesis and the splitting process, they should contain the same number $H^{(0)}(i)$ of oligomers with central sequence $i$.
(2) In the second step of the experiment the different aliquots are exposed to specific amounts of UV irradiation. In this process a fraction of the oligomers undergoes radiation damage, depending on the absorbed dose (in our case the dose is defined as absorbed photons per base). The number of intact oligomers with sequence $i$ in sample $j$ decreases due to absorption of the dose $D_j$ to $H^{(0)}(i) \cdot S(i, D_j)$ (SI-e 1). Here $S(i, D_j)$ is the survival probability of oligomer $i$ exposed to dose $D_j$. This quantity contains the information on the dose dependence of the oligomers with sequence $i$.
(3) In the subsequent treatment steps (library preparation leading to discrimination), intact oligomers are transformed into structures which enter the subsequent amplification and analysis steps (PCR and bridge amplification) of the Illumina process. The damaged structures are assumed to be rejected in these processes. The parameters of rejection and amplification may vary from sample to sample, leading to an efficiency factor $\alpha_j$, which is assumed to be independent of the specific sequence. Simultaneously the amplification steps may also depend on the sequence, which is considered by the sequence-dependent efficiency factor $\beta_i$, which is the same for all samples. Thus, the final number $H^{(j)}(i, D_j)$ of intact oligomers after amplification should become: $H^{(j)}(i, D_j) = \alpha_j \cdot \beta_i \cdot H^{(0)}(i) \cdot S(i, D_j)$ (SI-e 2). It should be noted that $H^{(j)}(i, D_j)$ refers here to the complete 16-mer with eight bases in the tails and the central randomer. The sought-after quantity of the investigation is the survival probability as a function of the sequence and the dose, $S(i, D_j)$. However, the efficiency factors $\alpha_j$ and $\beta_i$, which may exhibit considerable variations with $j$ and $i$, also enter Equation (SI-e 2). They may hide the relation between the original oligomer frequencies $H^{(0)}(i)$ and $S(i, D_j)$. Thus $S(i, D_j)$ can only be obtained after elimination of $\alpha_j$, $\beta_i$ and the sequence dependence of the original oligomer frequencies $H^{(0)}(i)$ by appropriate normalization procedures. The pronounced variations with $i$ and $j$ are evident from the plot of the recorded data $H^{(j)}(i, D_j)$ versus the index $i$ representing the sequence of the central octamer (Figure SI-f 1 or Table SI-t 2). Samples subject to remarkably high irradiation doses show the expected strong reduction in oligomer numbers (Figure SI-f 2). This damage can be caused by the central, randomized bases as well as by the bases in the tails. In order to demonstrate the dose dependence, we plot in Figure SI-f 3 the absolute frequencies averaged over all sequences as a function of the applied dose.
The data points at the higher dose values (red squares, lower scale) clearly show a strong decay with increasing dose. The overall decay is not mono-exponential and occurs on a dose range of 50 photons/base. The points at smaller dose values (violet open squares, top scale) show that, in addition to the general trend (decay with increasing dose), there are fluctuations, which may result from variations in the preparation/analysis conditions for the different samples, represented by the sample-dependent efficiency factor $\alpha_j$. These results indicate that normalization procedures are required to obtain the correct survival probability $S(i, D_j)$ from the recorded data $H^{(j)}(i, D_j)$. For the elimination of the general sequence-dependent factor (the product $H^{(0)}(i) \cdot \beta_i$ of the original oligomer frequency and the sequence-dependent efficiency factor) we divide the recorded oligomer numbers $H^{(j)}(i, D_j)$ by the oligomer numbers obtained at dose $D_0 = 0$, where the survival probability is, by definition, set to $S(i, 0) = 1$: $H^{(j)}(i, D_j) / H^{(0)}(i, D_0) = (\alpha_j / \alpha_0) \cdot S(i, D_j)$ (SI-e 3). These normalized data directly reflect the sequence dependence of the quantity $S(i, D_j)$. After this division, the remaining preparation/analysis dependent factor $\alpha_j / \alpha_0$ may be eliminated when additional information is available for the dose dependence of a specific sequence $i_0$ or of a set of sequences. In the presented case of UV irradiation damage one knows from the literature that strong damage occurs for sequences containing dipyrimidine steps (especially TT), while all-purine sequences should show weak dose dependences. An inspection of the data shows that poly-G sequences are among the sequences with the slowest decay (for additional information see chapter SI-15). When we assume for a certain sequence $i_0$ (e.g. for poly-G in the central octamer) a defined survival probability $S(i_0, D_j) = S_0(D_j)$, we can eliminate $\alpha_j / \alpha_0$ in SI-e 3 and obtain: $S(i, D_j) = S_0(D_j) \cdot \frac{H^{(j)}(i, D_j) / H^{(0)}(i, D_0)}{H^{(j)}(i_0, D_j) / H^{(0)}(i_0, D_0)}$ (SI-e 4). Until now we have treated the absolute and relative frequencies of the complete 16-mers.
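The two normalization steps described above (division by the unirradiated sample, then division by the corresponding ratio of a reference sequence) can be sketched as follows. This is a minimal illustration, not the analysis code used in this work; the function name, the data layout and the toy counts are hypothetical.

```python
def normalized_survival(counts, reference_seq, reference_survival=None):
    """Estimate survival probabilities from raw read counts.

    counts: list of dicts, counts[j][seq] = reads of `seq` in sample j;
            counts[0] must be the unirradiated sample (dose 0).
    reference_seq: sequence whose survival is assumed to be known
            (e.g. a sequence that is hardly damaged by UV light).
    reference_survival: assumed survival of the reference in each sample;
            defaults to 1.0 for every sample (an assumption).
    """
    if reference_survival is None:
        reference_survival = [1.0] * len(counts)
    unirradiated = counts[0]
    result = []
    for j, sample in enumerate(counts):
        # This ratio carries the sample-dependent efficiency factor.
        ref_ratio = sample[reference_seq] / unirradiated[reference_seq]
        survival = {}
        for seq, n0 in unirradiated.items():
            if n0 == 0:
                continue  # not observed at dose 0: cannot be normalized
            # Division by the dose-0 count removes sequence-dependent factors,
            # division by ref_ratio removes the sample-dependent factor.
            survival[seq] = reference_survival[j] * (sample.get(seq, 0) / n0) / ref_ratio
        result.append(survival)
    return result


# Hypothetical toy counts (not measured data): sample 0 is unirradiated.
toy_counts = [
    {"GGGGGG": 1000, "ATTGCA": 500},
    {"GGGGGG": 800, "ATTGCA": 200},
]
toy_survival = normalized_survival(toy_counts, "GGGGGG")
```

In the toy data the reference sequence loses 20 % of its reads in sample 1 purely through the sample-dependent factor, so the apparent survival of the second sequence (200/500) is rescaled accordingly.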
For a better understanding of sequence-selective radiation damage, we want to consider shorter oligomers where all possible sequences are present in a chain with random contacts. In addition, we want to improve the quality of the data by averaging over the remaining parts. Therefore we split the investigated 16-mer sequence $i$ into three parts: the left and the right regions (sequences $i_L$ and $i_R$) and the central region $i_C$.
The central part which will be considered in the analysis shall only contain up to 6 random bases N, which are not adjacent to the nucleotides of the ACAC tails. Thus, the ACAC tails should have negligible influence on the central part.
For the elimination of parts of sequences (e.g. the left or right part) which should not be considered in the analysis, we use a factorization of the survival probability of the 16-mer sequence:
$S(i, D) = S(i_L, D) \cdot S(i_C, D) \cdot S(i_R, D)$ (SI-e 6). Such a factorization is correct in situations where the survival probabilities of the different parts are independent. Within the molecular model of damage formation with monomeric and dimeric damage (see SI-9) factorization is also possible. We apply (SI-e 6) also to the reference sequence $i_0$, which is chosen such that $i$ and $i_0$ differ only in the central parts $i_C$ and $i_{C,0}$, but have the same sequences in the tails, $i_L = i_{L,0}$ and $i_R = i_{R,0}$. When these relations are used in equation SI-e 4, the survival probabilities of the tail parts, which do not depend on the sequence of the central part, cancel, and a relation for the survival probability of the central part results: $S(i_C, D_j) = S(i_{C,0}, D_j) \cdot H^{(\mathrm{norm})}(i, j) / H^{(\mathrm{norm})}(i_0, j)$ (SI-e 7), where $H^{(\mathrm{norm})}(i, j) = H^{(j)}(i, D_j) / H^{(0)}(i, D_0)$ (SI-e 8) is the recorded frequency normalized to the unirradiated sample. Assuming that UV damage involves only monomer and dimer lesions (see chapter SI-9), the central sequence remains intact only if all possible dimers involving bases from the central sequence remain undamaged. Therefore, the survival probability $S(i_C, D)$ does not only refer to dimers that only involve bases from the central part, but also to dimers that may be formed with the random base directly adjacent to the central sequence (see above).
In practical applications the recorded frequencies $H^{(j)}(i, D_j)$ are rather small, which leads to considerable uncertainties in the normalized values $H^{(\mathrm{norm})}(i, j)$ (the recorded frequencies divided by those of the unirradiated sample). To improve the statistics, equation SI-e 8 can be averaged over the sequences of the left and right tail. For this reason, we sum just $H^{(j)}(i, D_j)$ over the tails or over all possible positions of the sequence (see next chapter) before computing $H^{(\mathrm{norm})}(i, j)$. This procedure strongly reduces the uncertainties but leads to an increased influence of tail sequences with large frequencies. However, since the tails are chosen to yield survival probabilities independent of the central sequence, the procedure should not lead to systematic changes of $S(i_C, D)$. The procedure described in this chapter directly provides the survival probability as a function of the applied dose for a sequence embedded in random bases. When the damage quantum yield $\Phi_{XY}$ for an isolated dimer XY is required, the fit of the survival probabilities $S(i_C, D)$ of the different sequences can be used to determine all required values $\Phi_{XY}$ (see chapter SI-9, equation SI-e 47). In the next chapters we present an alternative method to determine the survival probabilities for isolated oligomers. One will find that both methods lead to the same quantum yields for dimeric damage.

SI-3 Improving statistics by averaging over tail sequences and positions
As already mentioned in the last chapter, the statistical error that arises in the case of small absolute frequencies can be reduced by averaging. In this chapter we consider two possible approaches. If we look at a certain center sequence $i_C$ of interest of length $k$ at position $p$ within our total strand, we can on the one hand average over all sequences that have this sequence at position $p$. On the other hand, it is worth considering extending this by averaging over different positions if it can be assumed that the absolute position of the sequence within our overall strand is irrelevant. We start with averaging over all possible sequences $i_L$ and $i_R$ in the tails but keep the central sequence $i_C$ constant at a position $p$. The values labelled with an asterisk indicate this frame-averaging procedure: $H^{(j,*)}(i_C, p) = \sum_{i_L, i_R} H^{(j)}(i_L i_C i_R)$, where $j$ denotes the sample id. $H^{(j)}$ was already defined in a previous chapter and indicates the experimentally found absolute frequencies for sequence $i$ in sample $j$.
Analogous to the correction for the sequence and sample dependencies already considered above, the frame-averaged absolute frequency $H^{(j,*)}$ can be corrected by a sample-dependent factor and a sequence-dependent factor for the respective systematic errors, thus obtaining the corrected frame-averaged absolute frequency $H^{(1,*)}$. $H^{(0,*)}$ is the corresponding frequency for the unexposed reference sample $j = 0$. It should also be mentioned that the sequence-dependent factor used here is not trivially related to the factor discussed in the last chapter for the correction of sequence dependencies within a sample. The justification for factoring out this factor results from the experimental observation that within a sample the relative frequencies of frame-averaged sequences are reproducibly in a fixed relationship to each other.

Spatial averaging
To further improve the statistics, an averaging over the different positions $p$ can be performed. This averaging is justified if the respective position of $i_C$ in the sequence is irrelevant, which is shown experimentally (see chapter SI-5). The resulting survival probability $\bar{s}^*$ is still too specifically defined and is not yet suitable for the general determination of survival or damage probabilities for arbitrary sequences. In SI-e 11 the sequence-dependent factor cancels, but the dependence on the respective sample (the sample bias) has not yet been corrected. This will be done in the following section.
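The frame averaging and the spatial averaging described in this chapter can be sketched as follows. This is a minimal illustration that sums read counts, as stated in chapter SI-2; the function names and the dictionary layout (counts keyed by the sequence of the randomized region) are hypothetical.

```python
from collections import defaultdict


def frame_summed_counts(counts, k, position):
    """Sum the counts of all strands that share the same k-mer at `position`.

    counts: dict mapping the randomized region of a strand to its read count.
    position: 0-based start of the k-mer within the randomized region.
    """
    summed = defaultdict(int)
    for seq, n in counts.items():
        if position + k > len(seq):
            raise ValueError("k-mer window exceeds the randomized region")
        summed[seq[position:position + k]] += n
    return dict(summed)


def position_summed_counts(counts, k, positions):
    """Additionally sum over several positions of the k-mer window.

    Only meaningful if the position of the k-mer in the strand is irrelevant.
    """
    total = defaultdict(int)
    for p in positions:
        for kmer, n in frame_summed_counts(counts, k, p).items():
            total[kmer] += n
    return dict(total)


# Hypothetical toy counts (not measured data) for a 4-base randomized region.
toy = {"AAGT": 3, "CAGT": 2, "AGTA": 5}
dimers_at_1 = frame_summed_counts(toy, k=2, position=1)
dimers_all = position_summed_counts(toy, k=2, positions=[0, 1])
```

The summed counts of an irradiated sample would then be divided by the summed counts of the unirradiated sample, as in the normalization of chapter SI-2.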

Calculation of sub-mer damage
In order to obtain the survival probability of a certain isolated sequence $i_C$ and not only the rather special survival probability of DNA strands of length 16 with the edge sequences ACAC that contain $i_C$, we can use the factorization approach again. We divide $\bar{s}^*$ into three sub-probabilities (see Figure SI-f 4). The key quantity is the survival probability $\bar{s}^*(i_C, D)$ of the isolated sequence $i_C$ under the dose $D$ in the arbitrary central part of the oligonucleotides. The survival probability of the first and last arbitrary base N is biased due to the vicinity of the frame sequences ACAC at the 5' and 3' end of the sequence. Therefore, the survival probabilities of the frame subsequences ACAC and one adjacent N (arbitrary base) left and right of the sequence $i_C$ are combined to $\bar{s}_L^*(i_C, p, D)$ and $\bar{s}_R^*(i_C, p, D)$, respectively. Analogous to the previous chapter, the probabilities are assumed to be independent, so that the survival probability of the complete strand is the product $\bar{s}_L^* \cdot \bar{s}^*(i_C, D) \cdot \bar{s}_R^*$ (SI-e 12). Note that the quantities $\bar{s}^*$, $\bar{s}_L^*$ and $\bar{s}_R^*$ are not themselves averages over the frame sequences and positions; the asterisk and the bar here indicate that they were derived from a corresponding average value, to differentiate them from the corresponding quantities in chapter SI-2. Since we are entering the range of several hundred photons/base exposure dose in our experiments, this approximation only allows the determination of the initial damage rates, as shown later by corresponding fits. Repair rates, which can only be determined via the further course and the equilibrium survival probabilities for $D \to \infty$, can therefore not be determined with this method.
To obtain the isolated survival probability of $i_C$, it is necessary to determine $\bar{s}_L^*$ and $\bar{s}_R^*$ beforehand. This can be achieved by treating the special case where $i_C$ corresponds to only one single base $b_1$. Thus the first and the last base of the central sequence are identical and equal to $b_1$. The left sub-strand is assumed to be $k$ bases longer than the right sub-strand. The position of $b_1$ does not influence $\bar{s}^*$ as already discussed, but it is included in $\bar{s}_L^*$ and $\bar{s}_R^*$ with respect to $k$. One obtains from equation SI-e 12 the relation SI-e 13, which can be reduced to SI-e 14, where the averaged survival probability $\bar{s}_{L,R}^*$ is defined by SI-e 15 and $\epsilon$ is half the difference between $\bar{s}_L^*$ and $\bar{s}_R^*$: $2\epsilon := |\bar{s}_L^* - \bar{s}_R^*|$. To obtain equation SI-e 14, we assume that the survival probability of the sequence-averaged edge parts decreases exponentially with their length. This is plausible in so far as every nucleotide added brings an additional possibility of damage. The validity of the assumption is also shown experimentally in chapter SI-4 from the available sequence data. If the decrease in the probability of survival from an n-mer to an (n+1)-mer is designated $\Delta s$, the lengths of the edge parts, which initially differ by $k$ as assumed, can be brought to identical length by multiplication with $\Delta s^{-k}$. In this way one obtains the survival probability for an edge strand of identical length, which starts or ends with $b_1$.
For sufficiently small $\epsilon$, therefore, SI-e 13 follows directly from SI-e 14. $\epsilon$ also depends on the sequence or base $b_1$, but can be estimated by considering an upper limit value, by comparing the CPD quantum yields for the dimers TC and CT as shown in ref. 2. Here, a minimal base sequence "C" has a thymidine nucleotide either at the 3' or 5' end and was irradiated with UVB light, showing a difference in quantum yield of approximately $0.4 \cdot 10^{-2}$ per photon and base. This corresponds to a maximum error $\epsilon^2 \sim 2 \cdot 10^{-5}$, which accordingly can be neglected in further considerations. In this work, the framing sequence has a length between 4 and 7, which reduces the numerical value of $\epsilon$ even more. In equation SI-e 14 we can set the survival probability of the isolated monomer $\bar{s}^*(b_1, D) = 1$ without loss of generality, because we assume that only dimer damage leads to the termination of the polymerase and thus to a detectable signal (see chapters SI-1 and SI-2). Thus, equation SI-e 16 follows from equations SI-e 12 and SI-e 14.
The quantities $\bar{s}^*$ on the right side of the equation can all be determined via equation SI-e 11, whereby the sample bias cancels out and only $\Delta s$ needs to be determined (SI-e 17). The sequence-independent value of $\Delta s$ can be found by considering a presumably undamaged sequence, for which the survival probability is by definition equal to 1. In our analysis, we select the least damaged dimer sequence found by downstream processing of the sequencing data, which reveals the oligomers with the lowest drop in sequence frequency upon irradiation. For this sequence we obtain SI-e 18. Together with SI-e 17, SI-e 18 determines the normalized isolated survival probability for a sequence $i_C$. This isolated survival probability can be calculated from the sequence frequencies measured by next generation sequencing, as it only includes values of $H^{(j)}$, see Figure SI-f 5. It should be mentioned here that the sequence used in SI-e 17 should not contain the edge sequences ACAC and the respective adjacent N (arbitrary base), since otherwise a strong sequence dependence would be induced, which would falsify the calculated damage rates.

SI-4 Exponential length dependence of the survival probability
To experimentally validate Equations SI-e 16/17 and SI-e 14, and therefore the assumption of an exponential dependence of the survival probability on the sequence length, we average over all possible dimer sequences that do not contain parts of the edge sequences ACAC and the respective adjacent N (arbitrary base) (SI-e 19). The left side averages over the survival probabilities of all possible dimer sequences and hence can be described as the mean dimer survival probability. If our assumption is correct and the change in survival probability due to an additional dimer is described on average by an additional factor $\Delta s$, so that the overall survival probability of the strand decreases exponentially with length, $\Delta s$ would have to correspond to the average survival probability of a dimer, $\Delta s = \frac{1}{16} \sum_{i_C} \bar{s}^*(i_C, D)$, where the sum runs over the 16 dimer sequences. The experimental values of the sum on the right-hand side of Equation SI-e 19a should therefore be close to unity (SI-e 20). Figure SI-f 6 shows that the left-hand side of SI-e 20 is indeed close to unity within 3% for all samples, which validates the assumption of an exponential length dependence of the survival probability reasonably well.

SI-5 Validation for sequence-averaging over all positions
Long exposure times in particular, and therefore high doses, decrease the number of intact strands and the sequence frequencies, which leads to low signal-to-noise ratios. A comparison of the survival probabilities obtained at the different positions shows only a small difference of 1.1% and 0.2% for run 1 and 2, respectively. Since the deviation is well below the statistical error of ~6%, averaging of the survival probabilities over all positions is performed to increase the signal-to-noise ratio without masking position-dependent effects on the damage rates.

SI-6 Sequence symmetry of survival probabilities
The assumption of sequence-symmetric damage can be made if the relative survival probability $\bar{s}^*(i_C, D)$ of a sequence $i_C$ is equivalent to the survival probability of its reversed sequence $i_C^{\mathrm{rev}}$. Analysis of this correlation for the obtained sequencing data sets shows a clear dependency $\bar{s}^*(i_C) \sim \bar{s}^*(i_C^{\mathrm{rev}})$, as the mean square displacement over the $N$ different sequences, $\frac{1}{N} \sum_{i_C} (\bar{s}^*(i_C) - \bar{s}^*(i_C^{\mathrm{rev}}))^2$, is below 0.2% for hexamers (see Figure
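The symmetry check described above can be sketched as a mean squared difference between the survival probability of each sequence and that of its reversed sequence. This is a minimal illustration with a hypothetical function name and hypothetical toy values; "reversed" here means the base order read backwards, not the reverse complement.

```python
def reversal_mean_square_difference(survival):
    """Mean squared difference between s(seq) and s(reversed seq).

    survival: dict mapping a sequence to its survival probability
              at one fixed dose. Sequences whose reversed partner is
              missing from the dict are skipped.
    """
    diffs = [
        (s - survival[seq[::-1]]) ** 2
        for seq, s in survival.items()
        if seq[::-1] in survival
    ]
    if not diffs:
        raise ValueError("no sequence/reversed-sequence pairs found")
    return sum(diffs) / len(diffs)


# Hypothetical toy values (not measured data).
toy_survival = {"AT": 0.9, "TA": 0.8, "AA": 0.95}
toy_msd = reversal_mean_square_difference(toy_survival)
```

Palindromic sequences such as AA contribute zero by construction, so they lower the mean without testing the symmetry; one might exclude them for a stricter check.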

SI-7 Absorbed Dose
The average number of absorbed photons per base, which we call the dose D, is determined as the ratio of the number of absorbed photons Nabs and the number of bases Nbase in the sample. We use this number to characterize the excitation of the sample. In the irradiation experiment the power of the UV radiation impinging on the sample and transmitted through the sample is measured by calibrated power meters. Together with the known coefficients of reflection at the sample windows, these values allow us to calculate the power Wabs absorbed in the sample. The number of absorbed photons Nabs is obtained by integration over the irradiation time and division by the photon energy hν. For the irradiated 16-mer ACACNNNNNNNNACAC the randomized bases N are considered by using parameters obtained by appropriately averaging the coefficients given in Table 2. The extinction coefficient of the oligomer can be calculated via the nearest neighbor model with the parameters determined at a wavelength of 266 nm. These parameters may be obtained from the parameters given at 260 nm combined with the absorption properties of all possible DNA dimers. An evaluation of the dose DOlig for a series of tetramers indicated that the dose DOlig of adenine-rich oligomers is up to ca. 15% above the average dose D, while adenine-poor oligomers have dose values ca. 5 % below the average dose D. These numbers indicate that the average dose D yields a good measure for the qualitative determination of dose dependences of different oligomers.
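The definition of the average dose D given above can be illustrated by a short calculation. This is a minimal sketch that assumes a constant absorbed power, so that the time integration reduces to a product, and a constant sample volume; the function names and the numbers in the example (10 mW absorbed for one hour) are hypothetical, whereas the base concentration of 1 mM and the volume of 3.45 ml are taken from SI-1.

```python
PLANCK = 6.62607015e-34       # Planck constant in J s
LIGHT_SPEED = 2.99792458e8    # speed of light in m/s
AVOGADRO = 6.02214076e23      # Avogadro constant in 1/mol


def photon_energy(wavelength_m):
    """Energy of one photon, h * nu = h * c / lambda, in joules."""
    return PLANCK * LIGHT_SPEED / wavelength_m


def absorbed_dose(absorbed_power_w, time_s, base_conc_molar, volume_l,
                  wavelength_m=266e-9):
    """Average number of absorbed photons per base.

    Assumptions (simplifications made here): the absorbed power is
    constant over the irradiation time and the sample volume does not
    change.
    """
    n_abs = absorbed_power_w * time_s / photon_energy(wavelength_m)
    n_base = base_conc_molar * volume_l * AVOGADRO
    return n_abs / n_base


# Hypothetical example: 10 mW absorbed for 1 h in 3.45 ml of 1 mM bases.
example_dose = absorbed_dose(0.010, 3600.0, 1.0e-3, 3.45e-3)
print(f"dose: {example_dose:.1f} photons per base")
```

In the experiment itself, aliquots are removed between exposures, so the number of bases in the cuvette would have to be updated for each irradiation interval.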

SI-8 Molecular Model for Dose-Dependence
Here, we consider a sample containing many oligomers of different types, which is exposed to UV radiation. The oligomers in the sample have the same length L. In addition, we assume that all oligomers have the same extinction coefficient (when the difference in extinction coefficients should be considered explicitly, the individual dose values of the different oligomers from Section SI-7, equations SI-e 27, 28, have to be used below instead of the average value D·L). We assume that the absorption of the sample does not change during the irradiation experiment. This is justified for the irradiation doses used in the experiment: the damage of a single dimer in the 16-mer is sufficient to destroy the integrity of the 16-mer but changes only the absorption of a small fraction of the absorbing bases. Indeed, even at the highest irradiation doses used in this paper, the absorption of the sample changes by less than 20%. In the following we consider one specific type of oligomer with sequence $i$ in the sample containing many other oligomers. The frequency of intact oligomers of type $i$ in the sample after irradiation with a dose $D$ is denominated $N_i(D)$. Prior to irradiation we have $N_i(0) = H^{(0)}(i)$ (see Equation (SI-e 1)). Here, we use the denomination $N_i(D)$ to differentiate from the various measured and normalized quantities H used in section SI-2. When a small dose $\Delta D$, $\Delta D \ll 1$ photon/base, is applied to the sample, a fraction of the oligomers will be excited. $\Delta N_i^*$, the number of oligomers of type $i$ excited by the dose $\Delta D$, is proportional to the number of intact oligomers present at the time of irradiation times the number of photons absorbed by the oligomer ($L \cdot \Delta D$). The resulting decay coefficient gives a qualitative measure for the dose resistance of a certain oligomer $i$. Quite often the recorded data show dependences that deviate from a single exponential form (see Figure 3, main text). This may be due to a more complex reaction scheme.
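As a sketch of how a single-exponential dose dependence follows from the proportionality stated above, the steps may be written as below. The symbols $N_i$ (number of intact oligomers of one type), $\Phi_i$ (probability that an excited oligomer is damaged) and $\sigma_i$ (decay coefficient) are chosen here for illustration and may differ from the notation of the original equations.

```latex
\begin{align}
  \Delta N_i^{*} &= N_i(D)\, L\, \Delta D
    && \text{oligomers excited by the small dose } \Delta D \\
  \Delta N_i &= -\Phi_i\, \Delta N_i^{*} = -\Phi_i\, L\, N_i(D)\, \Delta D
    && \text{a fraction } \Phi_i \text{ of them is damaged} \\
  \frac{\mathrm{d}N_i}{\mathrm{d}D} &= -\sigma_i\, N_i(D),
    \qquad \sigma_i = \Phi_i\, L
    && \text{limit } \Delta D \to 0 \\
  N_i(D) &= N_i(0)\, e^{-\sigma_i D}
    && \text{single-exponential decay}
\end{align}
```

In this picture a small $\sigma_i$ corresponds to a dose-resistant oligomer, and any back-reaction or additional damage channel leads to deviations from the single exponential.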
Equation (SI-e 32) represents the reaction out of a defined state, describing the intact oligomer. The UV radiation damage converts the intact oligomer into the damaged one. There is no further reaction involved; in particular, there is no back-reaction (see reaction scheme in SI-f 9a). In pyrimidine dimers, however, several types of photolesions are possible. An example of a reaction scheme describing these processes is given in SI-f 9b, with the corresponding rate equation system (SI-e 34). For the CPD lesion we know that appropriate UV irradiation can reestablish the undamaged, intact oligomer (damage Type 2, frequency $N_2$). For other lesions, e.g. the (6-4) lesion, a direct back-conversion to the intact oligomer does not occur (damage Type 1). The faster decay constant is visible at low doses, the slower one at high dose levels. When the system starts initially from intact oligomers with frequency $N(0)$, one obtains further relations: $A_1 + A_2 = 1$ and $A_2 = 1 - A_1$. A fit of the experimental data with a bi-exponential function should yield the amplitudes $A_1$, $A_2$ and the two decay coefficients. Often, the initial slope of the dose dependence of the frequency of intact oligomers can be obtained without elaborate fitting. It gives information on the total damaging, i.e. on the sum of the damaging coefficients of the two damage types. The other damage or repair coefficients may be obtained from the fitting results using the following relations.
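The bi-exponential behaviour described above can be illustrated with a minimal three-state sketch: intact oligomers are converted into an irreversible lesion (Type 1) and into a lesion that can revert to the intact oligomer (Type 2). The scheme, the names k1, k2 and k_rep, and the numerical values are assumptions made here for illustration; the actual reaction scheme is the one given in SI-f 9b and SI-e 34.

```python
import math


def biexponential_parameters(k1, k2, k_rep):
    """Amplitudes and decay constants of the intact fraction.

    Assumed scheme (per unit dose): intact -> Type 1 with k1 (irreversible),
    intact -> Type 2 with k2, Type 2 -> intact with k_rep.
    Requires k2 > 0 so that the two decay constants differ.
    Returns (a_fast, a_slow, g_fast, g_slow).
    """
    total = k1 + k2 + k_rep
    root = math.sqrt(total * total - 4.0 * k1 * k_rep)
    g_fast = 0.5 * (total + root)
    g_slow = 0.5 * (total - root)
    # Amplitudes follow from N(0) = 1 and the initial slope -(k1 + k2).
    a_fast = (k1 + k2 - g_slow) / (g_fast - g_slow)
    return a_fast, 1.0 - a_fast, g_fast, g_slow


def intact_fraction(dose, k1, k2, k_rep):
    """Fraction of intact oligomers after the given dose."""
    a_fast, a_slow, g_fast, g_slow = biexponential_parameters(k1, k2, k_rep)
    return a_fast * math.exp(-g_fast * dose) + a_slow * math.exp(-g_slow * dose)


# Hypothetical coefficients (per photon/base), not fitted values.
K1, K2, K_REP = 0.01, 0.02, 0.05
A_FAST, A_SLOW, G_FAST, G_SLOW = biexponential_parameters(K1, K2, K_REP)
```

In this sketch the amplitudes add up to one and the initial slope equals the negative sum of the two damaging coefficients, in line with the relations stated in the text.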

SI-9 A Molecular Model for the Damage Coefficient of Oligomers
In the single-stranded chains used in the present experiments, the most prominent damage processes are the formation of di-nucleotide lesions such as the cyclobutane pyrimidine dimer (CPD) lesion, or single-nucleotide damage. A CPD is formed by bridging two neighboring pyrimidines (thymine or cytosine) by a cyclobutane ring. These damages have typical quantum efficiencies in the 10^−2 to 10^−3 range. Quantitative information on the underlying molecular processes, and a comparison with the quantum yields given in the literature, may be obtained by combining the observations with a molecular model for damage formation. Here, we use a simple model which is based on dimeric damage processes and nearest-neighbor interactions and addresses only the initial step of damage formation (see scheme in Figure 5, inset). We assume (i) the same excitation of all bases in the strand, (ii) that the excitation of one base X in a WXY step is equally shared with the two neighboring bases W and Y, leading to the excited dimer states (WX)* and (XY)* with the same probability of 0.5 each, and (iii) that these excited states cause damage of the respective dimers with the quantum yields Φ_WX and Φ_XY. As a simplifying assumption we use (iv) symmetric quantum efficiencies, Φ_XY = Φ_YX. This model connects the experimentally observed damaging coefficient of the oligomer with the quantum yields Φ_XY of the dimers of the oligomer. After the absorption of one photon in base X, the dimeric damage (X=Y) is formed (assumptions (ii) and (iii)) with the yield 0.5 · Φ_XY.
Here, Φ_XY is the damage formation quantum yield from the excited state (XY)*. Simultaneously, the absorption in base X causes the formation of the damage (W=X) with the yield 0.5 · Φ_WX.
Above, we assumed that illumination with UV radiation at 266 nm leads to the same excitation for all bases, independent of their type (assumption (i)). When the exact absorption cross-sections are to be considered, correction factors must be applied to obtain exact quantum efficiencies. For the wavelength of 266 nm used in the experiments, this correction should be less than 20% (see SI-7). The damage coefficient of a complete oligomer may be calculated by summing the damaging yields of its individual dimers for the applied dose D. This is justified since the complete oligomer remains intact only when all dimers composing the oligomer are undamaged. In other words, each damage of a dimer in the oligomer adds to the damage of the complete oligomer. Correspondingly, the survival probability of the complete oligomer is obtained by multiplying the survival probabilities of all its individual dimers.
(SI-e 43)   damage coefficient of the oligomer VWXYZ = Φ_VW + Φ_WX + Φ_XY + Φ_YZ + Δ_left + Δ_right

Here, special care must be taken at the ends of the oligomer (correction terms Δ_left + Δ_right). When the oligomer is isolated, no excitation transfer can occur to the outside. Thus, the damaging coefficient of the outermost dimers is increased by Δ_left = 0.5 · Φ_VW and Δ_right = 0.5 · Φ_YZ. When the oligomer is embedded in randomized bases, the corresponding excitation leads to an increase given by the damaging coefficient of the dimer composed of V and the preceding nucleotide, or of Z and the subsequent nucleotide. Since these nucleotides are chosen at random, the correction terms are obtained by averaging over the possible neighboring bases.

This factorization of the survival probabilities was used above in Sections SI-2 and SI-4. The survival probability of the whole oligomer is the product of the survival probabilities of its parts if and only if the survival chances of the parts are pairwise independent of each other. In the damage model presented above, this independence originates from the transfer of the excitation of base W to the excited dimers (VW)* and (WX)* with equal efficiency. When the model is refined, e.g. by considering processes due to charge transfer states, there are deviations from the pure dimeric damage model and the survival probabilities of the different parts are no longer independent. However, for weak excitation the dimeric model may be taken as a first-order approximation.
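The summation over dimers and the factorization of the survival probability can be sketched in a few lines of Python. The example below covers only the isolated oligomer, where each outermost dimer receives the additional weight 0.5 described above, and it assumes a mono-exponential dose dependence. The function names and the numerical quantum yields in PHI are hypothetical placeholders, not measured values.

```python
import math

# Hypothetical placeholder quantum yields for the ten symmetric dimers
# (illustrative numbers only, NOT results of the experiment).
PHI = {
    "AA": 0.001, "AC": 0.001, "AG": 0.0005, "AT": 0.002, "CC": 0.005,
    "CG": 0.0005, "CT": 0.008, "GG": 0.0005, "GT": 0.001, "TT": 0.02,
}


def dimer_weights(seq):
    """List of (symmetric dimer key, weight) for an isolated oligomer.

    Every dimer has weight 1; the first and the last dimer each get an
    extra 0.5 because the terminal base cannot share its excitation.
    """
    last = len(seq) - 2
    weights = []
    for i in range(len(seq) - 1):
        key = "".join(sorted(seq[i:i + 2]))  # symmetry: CT and TC share a key
        weight = 1.0 + 0.5 * (i == 0) + 0.5 * (i == last)
        weights.append((key, weight))
    return weights


def damage_coefficient(seq):
    """Sum of the weighted dimer quantum yields of an isolated oligomer."""
    return sum(weight * PHI[key] for key, weight in dimer_weights(seq))


def survival(seq, dose):
    """Survival probability as a product over the individual dimers."""
    prob = 1.0
    for key, weight in dimer_weights(seq):
        prob *= math.exp(-weight * PHI[key] * dose)
    return prob


if __name__ == "__main__":
    print(damage_coefficient("CATT"), survival("CATT", 10.0))
```

For an isolated dimer both bases pass their full excitation to the same dimer, so the sketch returns twice the dimer quantum yield, which follows directly from assumptions (ii) and (iii).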

Symmetry of the damage coefficients.
In the presented evaluation procedure, the damage coefficient of a specific oligomer is calculated as a sum over the dimeric damage coefficients Φ_XY. Thus, knowledge of the 16 coefficients Φ_XY (X, Y ∈ {A, C, G, T}) is required to estimate the damage of arbitrary oligomers. The analysis of the experimental data has revealed that pairs of oligomer sequences with reversed order relative to the 5' or 3' ends show similar dose dependences (see SI-6). Thus, the dimer damage coefficients can be assumed to be symmetric, Φ_XY = Φ_YX (assumption (iv)). For example, one may assume the same dimer damage coefficient for the CT dimer as for the TC dimer, Φ_CT = Φ_TC. Under these conditions there are ten independent values Φ_XY: Φ_AA, Φ_AC = Φ_CA, Φ_AG = Φ_GA, Φ_AT = Φ_TA, Φ_CC, Φ_CG = Φ_GC, Φ_CT = Φ_TC, Φ_GG, Φ_GT = Φ_TG, Φ_TT. An example for the tetramer CATT with randomly chosen neighbors is given here.
(SI-e 46) Here the expressions within the parentheses are due to the left and right borders.

Determination of the dimeric quantum yield Φ from the experimental data.
It should be noted that, in general, the damage coefficients are linear combinations of the dimeric quantum efficiencies Φ_XY. For the general case of oligomers of length L, one may write the corresponding relation in vector form. The coefficients of the matrix can be determined directly from the base sequence of the respective oligomer together with the information on the terminal parts, as shown above.
In the experiment, the damage coefficients are obtained from a fit of the experimental data, i.e. from the survival probabilities of the 4^L different intact oligomers recorded as a function of the illumination dose D. The solution of the linear system of Equations SI-e 47 with the known matrix and the experimental results can be used to determine the unknown molecular quantum efficiencies Φ_XY 2 . These numbers should be compared with the dimeric quantum efficiencies recorded by other techniques (see main part, Figure 5).
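How such a linear system can be set up and solved is sketched below in Python. This is a minimal illustration, not the procedure used for the published values: it assumes isolated oligomers (extra weight 0.5 on each outermost dimer), the ten symmetric dimer coefficients as unknowns, and an unconstrained least-squares solution via the normal equations. All function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# Ten symmetric dimer keys: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
KEYS = [a + b for i, a in enumerate(BASES) for b in BASES[i:]]


def all_oligomers(length):
    """All 4**length sequences of the given length."""
    return ["".join(p) for p in product(BASES, repeat=length)]


def design_row(seq):
    """Matrix row: weight of each symmetric dimer in an isolated oligomer."""
    row = [0.0] * len(KEYS)
    last = len(seq) - 2
    for i in range(len(seq) - 1):
        key = "".join(sorted(seq[i:i + 2]))
        row[KEYS.index(key)] += 1.0 + 0.5 * (i == 0) + 0.5 * (i == last)
    return row


def solve_least_squares(rows, values):
    """Solve the normal equations (A^T A) x = A^T b by Gauss-Jordan elimination."""
    n = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * v for r, v in zip(rows, values))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_dimer_yields(coefficients):
    """coefficients: {sequence: fitted damage coefficient} -> {dimer: yield}."""
    seqs = sorted(coefficients)
    rows = [design_row(s) for s in seqs]
    solution = solve_least_squares(rows, [coefficients[s] for s in seqs])
    return dict(zip(KEYS, solution))
```

Because the solution is unconstrained, individual dimer yields may come out negative when the data are not fully described by the purely dimeric model.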
During the data analysis, we recognized that the dimeric quantum efficiency Φ has the tendency to become negative, which would be physically unreasonable. Therefore, one may argue that one of the assumptions of the modeling procedure given above is not justified. A negative value of Φ could point to repair (unreasonable for an intact dimer) or to the possibility that the dimer would dump the excitation of an initially excited very efficiently (violation of assumption (i) to (iii)). The dumping could occur via the charge transfer state + − which is known to be efficiently formed in steps 4 . We discuss this point here in the context of a sequence ... .. cannot be formed via + − . Thus, the presence of a step can reduce the damage yield via dumping the excitation of the thymine . Similar processes could occur also for , and dimers. However, the redox potentials of the different bases indicate, that the excitation dumping should be most efficient in and steps. The dumping process can be incorporated in Equation (SI-e 47) by suitable modifications of the matrix .

SI-10 Fitting of damage rates
The isolated survival probabilities shown in Figure SI-f 5 show a significant drop in the abundance of undamaged dimers over the course of the irradiation experiment. To quantify the effect, we fitted a three-state model to the experimental data, featuring one reversible and one irreversible state connected by a ground state (SI-f 9b). Less complex models (e.g. two-state) were not able to fit the experimental data sufficiently well.
As shown before, this model is implemented by a bi-exponential function of the absorbed dose D. The fitting of the dose dependences of the dimer, tetramer and hexamer data shown in the main document (Figures 3 and 4) was performed using the routine CurveFit with the function dblexp_XOffset from Igor Pro (WaveMetrics, USA). The main difficulty of the data fitting originates from the fact that the simplest mathematical model leads to a mono-exponential dose dependence of the relative frequencies, while a more complicated model (which is supported by the experimental dose dependences of some sequences) requires a bi-exponential fit function. Thus, the data fitting has to decide between mono-exponential and bi-exponential behavior and has to determine the amplitudes and the decay constants for the different oligomer sequences. To accomplish the data analysis we developed a multistep decision process:
1. For each sequence, the complete dose dependence of the relative frequency was fitted by the double-exponential fitting tool with a few constraints on the two amplitudes and the two decay constants (e.g. no rising components, sufficiently large amplitude of the fast component as compared to the data noise, clearly different decay constants). When this fit converged and supplied amplitudes and decay constants with sufficiently small errors, the results were used directly. Otherwise:
2. Two subsets of the data were selected, for small dose values (D < 35 photon/base) and large dose values (D > 60 photon/base), and each subset was fitted by a mono-exponential function. When both fits converged and yielded consistent results (no rising components, faster decay for the small dose values, sufficiently large amplitude of the fast component as compared to the data noise, clearly different decay constants), the results were combined to a bi-exponential function and the corresponding two amplitudes and two decay constants were stored. Otherwise:
3. The complete data set was fitted by a mono-exponential function. The amplitude and the decay constant were stored.
4. Finally, the results of the multistep fitting procedure were used to calculate the initial damaging constant of each sequence i, i.e. the sum of its two damaging coefficients, using Equation (SI-e 37).
When comparing the experimental data points with the modeling functions obtained by the presented procedure, we find a reasonably good fit for nearly all sequences.

Experimental errors in the determination of the quantum yields.
Since the quantum yields for the dimeric damage shown in Figure 5 were obtained by solving the linear system of equations (SI-e 47), the statistical uncertainties of the experimental damage coefficients of the individual sequences i can be used only indirectly to determine the errors of the quantum yields. These errors are estimated as follows: the linear system of equations (SI-e 47) is solved repeatedly (1000 times), with the damage coefficient of each sequence i drawn from a Gaussian distribution whose width is given by the fitting error of that coefficient. The total error is obtained by combining these statistical errors with the uncertainty of the determination of the dose (16% relative error).
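The resampling procedure can be sketched in Python as follows. The sketch makes two assumptions that are not stated in the text: the spread of the solutions is summarized by their standard deviation, and the statistical error is combined with the relative dose uncertainty in quadrature. The function monte_carlo_errors, the solver argument and the toy 2×2 system are hypothetical and stand in for the full system (SI-e 47).

```python
import math
import random
import statistics


def monte_carlo_errors(solve, values, sigmas, n_draws=1000,
                       dose_rel_error=0.16, seed=0):
    """Propagate fit errors through a linear solve by Gaussian resampling.

    solve  : callable mapping a list of damage coefficients to quantum yields
    values : fitted damage coefficients, sigmas: their fitting errors
    Returns (best estimate, statistical error, total error) per yield.
    """
    rng = random.Random(seed)
    best = solve(values)
    draws = [
        solve([rng.gauss(v, s) for v, s in zip(values, sigmas)])
        for _ in range(n_draws)
    ]
    stat = [statistics.pstdev(column) for column in zip(*draws)]
    # Assumption: independent error sources are added in quadrature.
    total = [math.hypot(s, dose_rel_error * abs(b)) for s, b in zip(stat, best)]
    return best, stat, total


def toy_solve(values):
    """Hypothetical 2x2 system: c0 = 2*x and c1 = x + 2*y."""
    x = values[0] / 2.0
    y = (values[1] - x) / 2.0
    return [x, y]


if __name__ == "__main__":
    print(monte_carlo_errors(toy_solve, [0.04, 0.03], [0.002, 0.001]))
```

Passing the solver as an argument keeps the resampling independent of how the linear system is solved.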

SI-11 Error propagation
The statistical error from sequencing is estimated using the standard deviation of a Bernoulli distribution. The propagated error is then used to calculate the weights for fitting the state models as shown above. The resulting errors of the rate parameters are calculated by error propagation through the conversion functions SI-e 41-43. The errors of the fit parameters and their cross-correlations are given by the fitting algorithm.

SI-12 Example for data handling
In this section we present a typical example of the computation of the dose dependences from the original Illumina data. The experiment yields the absolute frequencies of the oligomers with a specific sequence in the random part for the different samples. The samples are numbered according to the handling number in the analysis procedure. The raw data for the frequency of the central octamer (only oligomers with intact ACAC tails are counted) are given in Table SI-t 2. Striking is the absence of counts for sample 1. Apparently, a fatal error occurred in the preparation/analysis process of this sample. This is confirmed by the analysis in Section SI-13, where we demonstrate that samples 1 and 6 are outliers of the experiment and should be disregarded. The data in Table SI-t 2 show that this is trivial for sample 1. For sample 6, non-vanishing but small frequencies occur. Even smaller absolute frequencies are found in the raw data of samples 9 to 11; however, these small values may be explained by the high illumination doses (see upper part). For the central octamer, the defined ACAC tails lead to defined neighbors at the 5' (C) and 3' (A) ends of the octamer part. Since important DNA lesions such as the CPD photolesion involve adjacent bases, the tail sequences do not allow the octamer data to be used as a prototype for DNA damage in random sequences. This is achieved only when the octamer data are averaged or summed over the outermost bases at positions 0 and 7. The resulting hexamer data (see Table SI-t 3) always have a random base outside the hexamer. Due to the summation over the terminal bases of the octamer, the related absolute frequencies are larger than for the octamer (by approximately one order of magnitude). Again, the outliers and the small absolute frequencies for the samples with very high doses are clearly visible.
Another striking feature is the strong variation of the absolute frequencies with sequence for constant sample number (or dose). For a given sequence there is the trend that the absolute frequencies decay with increasing dose. The absolute frequencies may be increased further by appropriate summation of hexamer sequences, leading to shorter oligomers of defined length (from pentamers to dimers). An example is given in Table SI-t 4, where the absolute frequencies for dimers are presented. The frequency values are now in the range of 10^6. The sequence dependence is also visible in the dimer data and may be due to accidental sequence dependences in the synthesis and in the analysis/preparation procedure. It is visible for all dimer sequences. According to Equation SI-e 3, this sequence dependence may be eliminated by normalizing the absolute frequencies of a given sample to those of a reference sample. We chose the sample with vanishing dose (sample 0), since the sequence dependence induced by illumination should be absent for this sample. By this normalization procedure we obtain the relative frequencies displayed in Table SI-t 5. Here, the sequence dependence is weak for the small dose values. At high doses a pronounced sequence dependence is visible, which points to a sequence dependence induced by illumination. So far, the variations in the sample preparation/analysis steps, i.e. during library preparation and Illumina analysis, have not been corrected for. They may influence the determined relative frequencies and may differ from sample to sample. In a first-order approximation they should be independent of the sequence. An additional, strong and systematic sample dependence visible in Table SI-t 5 is related to the illumination-induced damage, which leads to very small relative frequencies at high doses.
The illumination dependence of the relative frequencies is due to the ACAC tails, the random parts outside the considered dimer, and the dimer itself (within a surrounding of random bases). According to Equation SI-e 6, the survival probabilities of these three parts must be multiplied to obtain the survival probability of the measured 16-mer.
According to Equations SI-e 4 and 5, we may eliminate the influences of the tails and the random parts by normalizing the relative frequencies by the relative frequencies of one reference sequence for which we expect only weak damage of the dimer part. The resulting normalized relative frequencies (see Table SI-t 6) are obtained by using the dimer GG as this reference. These relative frequencies represent the true relative frequencies multiplied by the inverse of the decay of the reference dimer part (see Equation SI-e 5). Thus, the data for the dimer GG are equal to 1. To obtain the true dose dependence, one needs knowledge of the decay of the dimer GG. It is well known that this decay is rather weak. An upper limit of the decay constant may be obtained from the fact that radiation-induced processes of intact oligomers never lead to an increase of the frequencies.
This limit is reached when the multiplication of the relative frequencies of Table SI-t 6 with the exponential decay factor of the reference dimer yields only decaying curves. Further reduction of the decay constant may be obtained when additional information is available on the damage constants of specific oligomers. Another approach is used in Section SI-3, where the assumption of a defined dimer damaging coefficient is used for normalization. For the present data set we assume an inverse decay constant of 1890 photon/base and obtain the relative frequencies given in Table SI
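The chain of normalization steps described in this section can be summarized in a short Python sketch. It assumes that the counts are available as a mapping from sequence to a list of counts per sample, that the reference sample has zero dose, and that the reference sequence decays mono-exponentially with a known characteristic dose. The function name normalize and its parameter names are hypothetical.

```python
import math


def normalize(counts, doses, ref_sample, ref_seq, ref_inverse_decay):
    """Convert absolute counts into corrected relative frequencies.

    counts            : {sequence: [count for each sample]}
    doses             : dose of each sample in photon/base
    ref_sample        : index of the zero-dose reference sample
    ref_seq           : weakly damaged reference sequence
    ref_inverse_decay : assumed characteristic dose (photon/base) of ref_seq
    """
    reference = counts[ref_seq]
    corrected = {}
    for seq, row in counts.items():
        values = []
        for sample, dose in enumerate(doses):
            # Step 1: divide by the zero-dose sample (removes sequence bias).
            rel = row[sample] / row[ref_sample]
            rel_ref = reference[sample] / reference[ref_sample]
            # Step 2: divide by the reference sequence (removes tails,
            # random parts and sample-specific preparation factors).
            ratio = rel / rel_ref
            # Step 3: restore the assumed weak decay of the reference.
            values.append(ratio * math.exp(-dose / ref_inverse_decay))
        corrected[seq] = values
    return corrected
```

Samples identified as outliers would have to be removed before this step, since a reference count of zero leads to a division by zero.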

SI-13 Detection of outliers
According to Equation SI-e 1, the data of an experiment are represented by a two-dimensional set of numbers, the count numbers, obtained as a function of sequence number and sample number. In general, the illumination dose is varied systematically between the samples, each sample having a specific dose. However, the preparation steps in the handling of the individual samples, e.g. the library preparation and the PCR amplification, may be subject to fluctuations which lead to variations in the count numbers. In a first-order approximation one may assume that these variations are independent of the sequence and can be corrected in the data handling procedure.
The size of the sample- or preparation-specific fluctuations has been investigated in a reference experiment in which two types of DNA sequences are compared. Type 1: the standard DNA oligomers with ACAC tails and randomized octamers in the center. This solution is split into several samples (here 18), which are illuminated with different doses. Type 2: DNA oligomers with GTTG tails and randomized octamers in the center. This solution is again split into 18 samples but is not illuminated. Finally, the DNA samples of both types are mixed (1 part of type 2 with 3 parts of type 1, in order to compensate for the reduction in oligomer numbers upon illumination), leading to 18 samples which are subsequently treated by the Illumina process and analyzed. For each sample we obtain the count numbers for both types of DNA oligomers, which experienced the same treatment apart from the illumination. Thus, the GTTG-tailed, non-illuminated oligomers may serve as reference molecules for the illuminated ACAC-tailed oligomers of the respective sample. The GTTG-tailed oligomers also give an impression of the fluctuations imposed by the preparation of the samples.
In Figure SI-f 11 we plot the count numbers of the GTTG-tailed oligomers, averaged over all oligomer sequences (blue), for the sample numbers 0 to 17. The graph shows that strong variations of the average count numbers, by more than an order of magnitude, occur for the identically prepared GTTG-tailed DNA samples. The average count numbers vary between 0 for sample 1 and 20.38 for sample 0. Another small average, 0.072, is found for sample 6. Samples 1 and 6 are true outliers in the preparation and may be omitted in the data evaluation. The rest of the samples have average count numbers between 2.2 and 18.2. The role of samples 1 and 6 as outliers is also supported by an analysis in which the Pearson correlation coefficient relative to sample 0 (zero dose) is calculated (red points in Figure SI-f 11a and c). The plots show that the Pearson correlation coefficient may also be used to identify outliers in experiments. This observation is important for experiments without reference sequences.
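An outlier test based on the Pearson correlation coefficient can be sketched in Python as shown below. The count vector of each sample (one entry per sequence) is correlated with that of the reference sample, and samples with a low coefficient are flagged. The threshold of 0.9, the handling of samples without any variation, and the function names are assumptions made for this illustration; they are not taken from the analysis described above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if a vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # e.g. a sample without any counts
    return cov / math.sqrt(var_x * var_y)


def flag_outliers(counts_by_sample, ref=0, threshold=0.9):
    """Indices of samples whose correlation with the reference is too low.

    counts_by_sample: list of count vectors, one vector per sample.
    threshold: hypothetical cut-off, to be adapted to the data.
    """
    reference = counts_by_sample[ref]
    return [
        index
        for index, sample in enumerate(counts_by_sample)
        if index != ref and pearson(reference, sample) < threshold
    ]


if __name__ == "__main__":
    samples = [
        [10, 20, 30, 40, 50],  # reference sample
        [0, 0, 0, 0, 0],       # no counts at all
        [21, 39, 62, 80, 99],  # similar profile, different total
        [1, 0, 0, 1, 0],       # very few counts, profile lost
    ]
    print(flag_outliers(samples))
```

Because the coefficient is insensitive to an overall scaling of the counts, a sample with a different total yield but an intact sequence profile is not flagged.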
In order to gain information on the additional influence of the illumination, the average count numbers of the ACAC-tailed oligomers (blue, filled dots) are plotted in Figure SI-f 11b versus sample number. Due to the difference in concentration between oligomers of type 1 and type 2 (see above), the average count numbers for ACAC-tailed oligomers are often found at higher values. To compare both types of oligomers, the average count numbers of the GTTG-tailed oligomers were scaled by a factor of 4.84 (coinciding points for sample 0, black open dots). Again, there is a wide spread of the average count values, reflecting the trend found for the non-illuminated GTTG-tailed oligomers (black). However, at some sample numbers (see e.g. samples 7 to 11) there are significantly reduced values. This can be understood by considering the illumination-induced damage, which leads to a strong decrease of the average count numbers at high illumination dose (see violet points). When plotting the average count numbers versus dose (see Figure SI-f 3), one finds a strong decrease with dose together with superimposed fluctuations due to the preparation of the samples.

SI-14 Analysis for the predominant occurrence of single-stranded DNA
For the calculations above, it was assumed that the DNA is mostly in single-stranded form. Since the DNA pool contains a large number of different sequences and thus of possible binding states, and since the number of duplex formations will be quite small due to the short length of the DNA, a direct measurement of the fraction of bases involved in Watson-Crick (WC) pairing via UV absorption or intercalating dyes will provide only a rough estimate. Therefore, for a better understanding of the situation, we numerically calculated the binding probabilities for a large number of different sequences from the naïve pool. For this purpose, a local installation of the Nucleic Acid Package 7-9 (NUPACK) was used and controlled in an automated way by means of a LabVIEW interface, in order to calculate and evaluate the mutual interaction of a large number of sequences. NUPACK determines the possible binding configurations and their free energies for a group of sequences with a given concentration (1 nM per sequence) under predefined buffer conditions (0.135 M NaCl, no MgCl2, 30 °C). Due to the complexity of the problem, it is not possible to directly test the mutual interaction of all 4^8 = 65536 sequences against each other. Instead, 50 sequences were randomly selected from the complete pool, and all relevant binding permutations were calculated for them. This procedure was repeated 200 times to improve the statistics. From the calculations we obtained binding energies for 70 · 10^3 different duplex configurations (SI-f 12a), of which approximately 25% showed a non-vanishing concentration, averaging 0.5% of the strand concentration per sequence (SI-f 12b). NUPACK also allows determining the number of bases involved in Watson-Crick pairing per duplex formed, which in our example is on average 4 bases per 16-mer strand. Thus, only about 0.1% of all bases are involved in WC binding, so that our assumption of mainly single-stranded DNA is plausible.

SI-15 Determination of the damage yield for GG-dimers by ultraviolet spectroscopy
Information on the damage yield of the reference dimer GG was obtained from an experiment in which the change of the UV spectrum of a GG-dimer sample was recorded for different illumination doses.
The sample: The dimer sample was obtained from Biomers, Germany. The sample was dissolved in phosphate buffer (10 mM) to yield an absorbance of A ≈ 1 OD in the sample cell with an optical path length of 1 cm. The corresponding concentration C_GG was estimated using the extinction coefficients given in Table SI-t 1: C_GG = 47 µM.
The experimental system: Illumination of the sample was done with a Nd-based laser system (AOT-YVO-25QSP/MOPA) from Advanced Optical Technology, UK, operated at a repetition rate of 6.5 kHz. The UV radiation (4th harmonic of the laser) had an average power of ca. 20 mW with a beam diameter at the sample of ca. 1 mm. A double pass through the sample cell ensured that the essential part of the incoming light was absorbed. The residual transmission of the UV radiation was taken into account when calculating the absorbed dose. For a set of illumination doses, the absorption spectrum of the sample (1 cm path length) was recorded using a Shimadzu UV1900 spectrophotometer.
Results: In Figure SI-f 13 we plot the UV absorption spectra of the GG sample before illumination (red) and after illumination with increasing dose values. The spectra recorded at dose values of up to several thousand photon/base show a remarkably similar shape. This indicates that there is only one process in this dose range, namely the decay of the initial absorption band towards a state with a weaker absorption and a wing extending to ca. 350 nm. It requires more than 5000 photon/base to reduce the original absorption (e.g. at 266 nm, Figure SI-f 13) by 50%. The spectrum recorded at a dose of 650 photon/base shows that, at the dose levels used for the illumination of the oligomers in the main paper, the concentration of the GG dimer is essentially constant. This observation supports the assumption that poly-G-containing oligomers are well suited to serve as reference oligomers with smooth and very slowly decaying frequencies.

SI-16 Assessment of the sensitivity of the method for detection of 8-oxoguanine
To evaluate the sensitivity of the method, and of the polymerases used, to other known types of structural DNA damage, we performed measurements with DNA strands containing an 8-oxoguanine instead of a guanine at a central site. Since this damage is monomeric, i.e. it involves only one nucleotide, it should be significantly less detectable by the polymerases used. Two samples were prepared for the measurement. The first sample contains the unmodified target sequence ACAC AGAGAG ACAC and a reference sequence ACAC ACAGAGAC ACAC, each at a concentration of 5 µM in 1x PBS. The second sample contains the reference sequence at the same concentration and the modified target sequence ACAC AGA(8-oxo)GAGAG ACAC, which has an 8-oxoguanine instead of the second guanine. Both samples were prepared for sequencing in the same way as for the measurements shown throughout this work and were finally sequenced. For the evaluation, the sequence frequencies were each normalized by the frequency of the reference sequence contained in the respective sample, so that both samples could be related to each other. Finally, all relative frequencies obtained in this way were normalized by the frequency of the unmodified target sequence. Figure SI-f 14 shows on the left how the relative frequency of the target sequence drops from 1 in the sample without 8-oxo modification to 0.026 in the sample containing the 8-oxo-modified target sequence. This detection amplitude is ca. 30 times larger than that of the CPD damage detection in the case of a T=T CPD lesion (main text, Figure 2b). Moreover, the frequency of another sequence (ACAC AGATAGAG ACAC) simultaneously increases from 10^−6 (i.e. a single readout event) in the sample without 8-oxoguanine in the sequence to 0.7 in the sample containing 8-oxoguanine. Therefore, it can be assumed that the polymerase used in the preparation kit does not terminate at the 8-oxoguanine modification but causes an erroneous readout (T instead of G).
In the detection of CPD lesions performed in the main text, no increase of certain sequences with increasing exposure dose was found. Thus, the sequence preparation actually resulted in a termination of the polymerase activity and not in an erroneous readout.
Figure SI-f 14. Change in the relative frequency of a target sequence ACAC AGAGAG ACAC upon insertion of an 8-oxoguanine at position 8. While this sequence was detected less frequently by a factor of 0.027 (left), the modification led to the appearance of an erroneous sequence in which a thymine was detected at position 8 instead of a guanine (right). The relative frequency of this altered sequence, 0.7, corresponds approximately to the decrease in the frequency of the correct target sequence. Apparently, the polymerase used predominantly causes the replacement of 8-oxoguanine by a thymine instead of a termination.

SI-17 Characteristic data on the dose dependences for tetramers and hexamers
Table SI-t 8. Decay parameters for tetramers. Total initial decay coefficients of the oligomers, and 50% dose values calculated from the initial decay constant (D50%, mono) and from the fit curves (D50%, fit). D50%, fit > D50%, mono points to a pronounced bi-exponentiality of the dose dependence. The stated error of the decay coefficient contains only the statistical part arising from the precision of the data points and the fitting procedure. It does not include the systematic uncertainty (±20%) in the determination of the irradiation dose.
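The two 50% dose values can be related to each other with a short Python sketch. It assumes a bi-exponential fit curve with positive amplitudes that sum to 1 and positive decay constants; D50%, mono then follows from the initial decay coefficient alone, while D50%, fit is found numerically on the full curve. The function names and the bisection solver are choices made for this illustration, not part of the original analysis.

```python
import math


def initial_coefficient(a_fast, k_fast, a_slow, k_slow):
    """Initial decay coefficient (negative initial slope) of the fit curve."""
    return a_fast * k_fast + a_slow * k_slow


def d50_mono(coefficient):
    """50% dose of a mono-exponential decay with the given coefficient."""
    return math.log(2.0) / coefficient


def d50_fit(a_fast, k_fast, a_slow, k_slow):
    """Dose at which the bi-exponential curve has dropped to 0.5 (bisection)."""
    def excess(dose):
        return (a_fast * math.exp(-k_fast * dose)
                + a_slow * math.exp(-k_slow * dose) - 0.5)

    low, high = 0.0, 1.0
    while excess(high) > 0.0:  # expand the bracket until the curve is below 0.5
        high *= 2.0
    for _ in range(200):
        mid = 0.5 * (low + high)
        if excess(mid) > 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


if __name__ == "__main__":
    # Hypothetical fit parameters (decay constants in base/photon).
    params = (0.5, 0.1, 0.5, 0.01)
    print(d50_mono(initial_coefficient(*params)), d50_fit(*params))
```

Since a weighted sum of decaying exponentials never lies below the single exponential with the same initial slope, D50%, fit cannot be smaller than D50%, mono under these assumptions; the two values coincide for a purely mono-exponential curve.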