Automatic detection of methane emissions in multispectral satellite imagery using a vision transformer

Curbing methane emissions is among the most effective actions that can be taken to slow down global warming. However, monitoring emissions remains challenging, as detection methods have a limited quantification completeness due to trade-offs that have to be made between coverage, resolution, and detection accuracy. Here we show that deep learning can overcome the trade-off in terms of spectral resolution that comes with multi-spectral satellite data, resulting in a methane detection tool with global coverage and high temporal and spatial resolution. We compare our detections with airborne methane measurement campaigns, which suggests that our method can detect methane point sources in Sentinel-2 data down to plumes of 0.01 km2, corresponding to 200 to 300 kg CH4 h−1 sources. Our model shows an order of magnitude improvement over the state-of-the-art, providing a significant step towards the automated, high resolution detection of methane emissions at a global scale, every few days.

Reviewer #1 (Remarks to the Author): The authors present a novel vision transformer model to automatically detect methane plumes in Sentinel-2 satellite imagery. They report an order of magnitude improvement in detection limit relative to the current state of the art, down to 200 kg per hour. This is potentially a very important contribution, as the volume of methane-relevant satellite data has recently ballooned and now far surpasses the ability of human analysts to parse it. As exciting as the work is, however, I believe substantial revisions are needed before it can be accepted for publication. I have three main concerns: First, I do not feel the results at present are sufficiently verified. The analysis of Figure 3 does not require overlapping passes between aircraft and satellite, which makes it of limited use for verification (more on this below). It is not clear to me whether the same is true for Figure 4. If the example plumes in Figure 4 can be directly compared with Carbon Mapper plumes on the same days, then the authors should include the side-by-side comparisons in the supplement. That would give some confidence in their detections. What I would really like to see, however, is a systematic comparison with the controlled releases performed by Sherwin et al. (2022). Several of those releases were timed with Sentinel-2 passes, and the emission rates ranged from ~1000-6000 kg per hour. Those plumes should be easily detectable by the authors' transformer model, and there were just a handful of them, so this is not asking a lot. A side-by-side comparison of the detected plumes with the Sherwin et al. plume images and ground-truth emission rates could provide strong support for the proposed model.
Second, Figure 3 raises several questions: 1. If the model is being applied to more than 2500 Carbon Mapper point sources, then why are there so few black points? Is some binning of sources being done? 2. How can there be 950 applications in three months prior to each Carbon Mapper detection when Sentinel-2 passes at most once every 2 days? 3. The transformations applied to the x-axis complicate interpretation of the results. If I understood correctly, the independent variable is not actually the catalogued Carbon Mapper plume rate but rather a prediction of the plume rate from the catalogued plume extent. But looking at supplemental Figure 2, this prediction is quite uncertain. The figure tries to say something about source strength, but it may say more about plume size, which is itself a function of source strength but also wind speed, topography, etc. I would suggest showing the comparison with the plume rates catalogued by Carbon Mapper somewhere in the manuscript. 4. Perhaps most importantly, the two caveats pointed out in the text (L.183-192) are major ones. The Carbon Mapper surveys are predominantly in oil and gas fields, where sources are highly intermittent and clustered. Cusworth et al.
(2021) reported a very wide range of persistence rates for Permian point sources, from ~0.1-0.9. The probability of the automatic detector marking "at least two adjacent pixels within 500 m of the leak" - i.e., within a 1-km disk - as positive for methane could be relatively high in the Permian basin, especially over a 3-month period. It is certainly not the same thing as the probability of the detector spotting the plume from a particular source. This concern could be addressed by applying the detector to random locations in the Permian - not merely to a location in the testing set, which is comprised of areas without known methane emitting infrastructure. How would that compare to the blue "false positive rate" point? Third, the authors should attempt to explain why this new approach works so much better than previous ones - not just thresholded MBMP retrievals, but also U-Net detectors for hyperspectral instruments like PRISMA (Joyce et al., 2023) and GHGSat (Bruno et al., 2023), which one would expect to provide better methane detection than the spectrally much coarser Sentinel-2 but which often fail to detect larger plumes. Is self-attention the silver bullet here? Did the loss function or optimizer play a role? What is the input feature importance? I wonder whether the inclusion of reference imagery from a previous time t-1 might be a crucial element. More discussion of why the proposed approach is so successful would be appreciated. Some more specific questions/comments: 1. Can the vision transformer approach be used to quantify methane plumes? I was left wondering what one would do to quantify emissions after using the proposed model to detect the plumes.
Perhaps this is future work, but it could use some discussion. 2. L. 68 and L. 230: The claim about 200 kg/h and "the vast majority" of methane from point sources seems to refer to Figure 7 from Jacob et al. (2022), but that was for US aerial surveys, mostly of oil and gas basins. One of the surveys in that plot, for California, shows >200 kg/h accounting for less than half of the point source budget - and we don't know much about the source distribution in other countries/industries. Some caution is needed here. 3. L. 97 and L. 283: Please say more about this auto-correlated noise for reproducibility. 4. L. 97-98: "further emulate atmospheric turbulence" - This should be rephrased for clarity. The Gaussian plume model doesn't emulate turbulence at all, so adding auto-correlated noise does not "further" do that. Same issue on L. 283 with "further mimicking". 5. Fig. 2

Reviewer #2 (Remarks to the Author): The authors present a new deep learning approach to detect methane plumes in Sentinel-2 multispectral satellite imagery. They construct a large training dataset by downloading Sentinel-2 data and synthesizing methane plumes to serve as training labels, then evaluate the approach on a set of real plume data. They find the model can detect many of the real plumes without identifying a substantial amount of false positives, suggesting the potential for the approach to serve as a cheap, efficient, and accurate alternative to other methane plume detection approaches.
While I deeply appreciate the mission of the work and believe that the approach has potential, I see several major issues with both the methodology and conclusions made from the results, so I cannot recommend the paper for publication. I've provided my major and minor comments about the work below.

Major Comments:
1. There are several claims made in the paper which are poorly supported by the results presented. 1(a). Importantly, I am not convinced from the results that the major conclusion of the paper (the only result highlighted in the Abstract, lines 20-23), that the model can "reliably detect" the point sources down to 200 kg CH4 h-1, is true. 1(a)(i). Figure 3 shows that the model detected ~80% of the expected visible plumes between 200 and 350 kg/h, then about 90-95% between 400 and 1000. Missing 20% of plumes seems substantial. Without a baseline or reference method to compare against, claiming this is reliable detection is subjective. Furthermore, determining these aggregate numbers of how many plumes were detected above some plume rate, which are key to the main conclusions of the paper, was not easy to do from the figure. Those should be presented numerically or at least shown more clearly in a revised or additional figure. 1(a)(ii). The false positive rate was approximated and therefore it is not clear how reliably the model will detect plumes in practice. Furthermore, even if the estimated false positive rate of 3-4% is true, that corresponds to at least 2,400 false positives in 80,000 images. The authors state a few times that future work will focus on reducing these, but the limitation should be acknowledged more explicitly, and the major conclusion should be tempered accordingly. 1(b). Lines 69-70 state their approach unlocks fully automated monitoring of emissions, but the approach produces a nontrivial amount of false-negative and false-positive detections which may prohibit practical use. This claim needs to be changed or better supported by the analysis in the paper. 1(c). The authors make the claim on lines 82-84 that the database contains areas that do not encompass known potential methane sources, but they do not thoroughly describe how they ensured this while sampling, they do not supply an exhaustive list of potential methane sources, and they state they were only
measured far from oil and gas activity on line 160, which I believe puts into question whether the claim on lines 82-84 is true. The sampling process should be described in more detail, and if the authors believe there could be known potential sources in the negatives, this claim should be modified and conclusions made from results should be adjusted accordingly if needed.
2. Several methodological decisions were made without sufficient motivation and explanation, some of which may affect the results and conclusions made from the paper. 2(a). Why use a single pair of consecutive times? Why only predict a single plume image? Does the plume correspond to one of the times, the difference, or something else? 2(b). Why are all input bands sampled to 20 m? How is the downsampling for 10 m bands and upsampling for 60 m bands done? 2(c). What operation is used to perform the upsampling in the UNet architecture? 2(d). Why did the authors design and use the proposed deep learning architecture rather than other well-established architectures? ViTs are known to struggle with dense prediction tasks, so comparing to well-established CNN-based segmentation architectures and ViT-based segmentation architectures that have been modified for dense prediction would much better support the use of this architecture, should it outperform those approaches. Furthermore, several implementation details were left out, like the optimization procedure (optimizer, learning rate, batch size, etc.)
for example. 2(e). Why split datasets by region if the data is synthetically generated? Wouldn't one want to include all regions in the training set to improve the generalization, rather than excluding some regions? 2(f). Regarding the argument made on lines 178-181, even though training on synthetic data was not centered around the plumes, doing so when testing the model is biased. If the goal is to understand how well the model works for detecting real methane plumes in practice, the locations of the methane plumes and often even the potential sources will not be known a priori and therefore the images will not be centered on the plumes. The evaluation needs to be redone without centering it in order to identify whether the model is robust to perturbations in the locations of the real methane plumes. Introduction 1. The background and motivation for the work is solid. However, the claims made in the last paragraph are exaggerated. Lines 67-70 again claim that the models robustly detect emissions down to 200 kg/h, and that it unlocks fully automated monitoring, which I don't think the current results sufficiently support as I've explained above.

Results
1. It'd be helpful to plot the locations of the training / validation / test tiles on a map, at least as a supplementary figure, so that readers can gain a better understanding of the geographic distribution of the splits. 2. References to Fig. 2B and 2C are swapped in the text from lines 139-157 (or the figure/caption labels need to be swapped). 3. The model is not a transformer. It is a semantic segmentation model with a vision transformer encoder. I'd suggest modifying language around this throughout the paper (e.g. lines 102, 110), potentially including the title. 4. Line 114 and the caption state that the model is an auto-encoder but it is not. An auto-encoder attempts to reconstruct the input. This is an encoder-decoder segmentation architecture. 5. Lines 158-160 claim the FPR is reliably evaluated, but I think the authors are trying to say that their estimated FPR is likely an upper bound of the true FPR, given some of the detected plumes may actually be real plumes, suggesting it is not reliably evaluated. Additionally, for the exact reason the FPR is likely overestimated, the FNR may be underestimated, which could be inflating the AUC. 6. Why switch between kg/h and tons/h on line 167? Better to stay consistent with one (or report both). Discussion 1. I think the discussion does a better job of making more supportable conclusions than other parts of the paper. However, the claim on lines 233-236 again seems too strong as the authors do not present results which sufficiently compare to these constellations. I'd suggest softening this claim or including these results. Methods 1. How were the Sentinel-2 images downloaded? 2. How was cloud cover determined?
We are very grateful for the time and effort spent by the reviewers on our paper, and for their comments that helped us to drastically improve the manuscript. Notably, by following the input of the reviewers, we have further trained our model and applied it more conservatively such that our false positive rate has been reduced. The supplementary materials have been expanded to assess the behavior of our model in a variety of circumstances. Last but not least, following the advice from reviewer 1, we have applied our model to a recent controlled release experiment organized by Stanford.
In the following, the reviewers' comments are in black and our answers are in light blue, and the edited text in the manuscript and its supplementary is in dark blue.

Reviewer #1 (Remarks to the Author):
The authors present a novel vision transformer model to automatically detect methane plumes in Sentinel-2 satellite imagery. They report an order of magnitude improvement in detection limit relative to the current state of the art, down to 200 kg per hour. This is potentially a very important contribution, as the volume of methane-relevant satellite data has recently ballooned and now far surpasses the ability of human analysts to parse it.
We thank the reviewer for these kind and encouraging words! As exciting as the work is, however, I believe substantial revisions are needed before it can be accepted for publication. I have three main concerns: We tried to answer these main concerns as well as possible; explanations and additional information, as well as individual changes to the manuscript and supplementary materials, are described in what follows.
First, I do not feel the results at present are sufficiently verified. The analysis of Figure 3 does not require overlapping passes between aircraft and satellite, which makes it of limited use for verification (more on this below). It is not clear to me whether the same is true for Figure 4. If the example plumes in Figure 4 can be directly compared with Carbon Mapper plumes on the same days, then the authors should include the side-by-side comparisons in the supplement. That would give some confidence in their detections. What I would really like to see, however, is a systematic comparison with the controlled releases performed by Sherwin et al. (2022). Several of those releases were timed with Sentinel-2 passes, and the emission rates ranged from ~1000-6000 kg per hour. Those plumes should be easily detectable by the authors' transformer model, and there were just a handful of them, so this is not asking a lot. A side-by-side comparison of the detected plumes with the Sherwin et al. plume images and ground-truth emission rates could provide strong support for the proposed model. Thank you for your suggestion. Indeed, because the satellite passes and the Carbon Mapper airborne campaigns are not synchronized, it is impossible for us to directly compare the plumes detected by our algorithm to the plume masks reported by Carbon Mapper (this is true for both Figures 3 and 4). We have changed the text of the main manuscript to make it clearer that our argument is statistical, and we have toned down the claim that our model is able to detect every leak above 200 to 300 kg/h. Statistically, if our model is able to detect the Carbon Mapper plumes, its detection rate over the catalogue should be around the average persistence of the plumes (defined here and in the literature as the probability of a leak being active during a random measurement). Because our detection rate has an asymptote towards this mean persistence for leaks above 200 to 300 kg/h, while our detection rate sharply drops below that, we argue that our model is able to detect methane leaks above 200 to 300
kg/h. We also modified Figure 3 to only consider Sentinel-2 images within 7 days of the airborne plumes' detections, to strengthen the comparison. We have added the following summary in the main text to try to make it clearer: (l.283) "Here we showed direct evidence on real Sentinel-2 data embedded with synthetic plumes that deep learning models are about an order of magnitude more robust to background noise (Fig. 2). In comparing the deep learning detections with airborne detections of real methane leaks in the U.S., we showed evidence, albeit indirect due to the time difference between satellite and airborne acquisitions, that this one order of magnitude improvement carries over to an operational setting (Figs. 3 and 4). Finally, on controlled releases of real methane plumes, we directly showed that deep learning can be used to automatically and blindly detect large methane emitters (Figs. S3 and S4)." We added two new figures in the Supplementary Materials (Figures S3 and S4) analyzing the application of our model to the latest controlled methane releases performed by Sherwin et al. [1]. Figure S3 shows that our model is able to successfully detect all four controlled releases timed with Sentinel-2 overpasses, and compares our detections with the masked detections made by the groups that participated in the blind test. Interestingly, while most groups missed the smallest plume (1.1 ton/h, the smallest leak ever detected in Sentinel-2 according to the literature), our algorithm is able to successfully detect it. Note that here again, there is no proper ground truth in terms of the plumes' masks, but the overall shape and direction of our retrieved plumes are very similar to those computed by the participating groups, and are coherent with the release location and the local wind direction (information that is not provided to our algorithm).
Figure S4 compares our detections with the unmasked methane retrieval fields computed by the groups participating in the blind test, and illustrates the amount of false positives typically present in state-of-the-art analyses. The detection of methane plumes in these methane retrieval fields typically involves a human operator, while our approach detects plumes automatically. We also note that for these masked detections, all groups used information on the location of the release as well as the wind direction, while our model detected the plumes completely blindly.
Although our Carbon Mapper analysis cannot definitively prove automatic detection capabilities down to 200 kg/h (it only provides a statistical argument), we argue that this further analysis of the Sherwin et al. dataset proves that our algorithm can automatically and blindly detect leaks down to 1100 kg/h. We are very thankful to reviewer 1 for suggesting this additional analysis, which greatly strengthens our manuscript, and we have changed the main text to argue for this distinction between a proven automatic detection capability down to 1100 kg/h and a likely capability down to 200 kg/h, which will require further testing against future controlled releases.
Second, Figure 3 raises several questions: 1. If the model is being applied to more than 2500 Carbon Mapper point sources, then why are there so few black points? Is some binning of sources being done? Thank you for raising this question. Indeed, binning is performed: each of the bins corresponds to the application of our model to a number of pairs of Sentinel-2 images. In the new version of Figure 3, the second image of the pair is taken in the seven days before the leak, and the first image in the 3 months prior to the leak. To further analyze the behavior of our approach, we have separated campaigns conducted with two distinct instruments (AVIRIS-NG and GAO), and we find that our results are consistent for both instrument types. Each bin for AVIRIS-NG includes 332 applications to pairs of Sentinel-2 images, and each bin for GAO 442. Each point shows the mean and the 95% confidence interval of the bin.
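As a rough illustration of the binning statistics described in this answer (equal-count bins, per-bin mean detection rate, and a normal-approximation 95% confidence interval), here is a minimal numpy sketch; the function and variable names are ours, not the authors':

```python
import numpy as np

def binned_detection_rate(plume_extent, detected, n_bins=10):
    """Sort model applications by plume extent, split them into n_bins
    equal-count bins, and return (bin center, detection rate, 95% CI
    half-width) per bin. `detected` holds 0/1 outcomes per application."""
    order = np.argsort(plume_extent)
    extent = np.asarray(plume_extent, dtype=float)[order]
    hits = np.asarray(detected, dtype=float)[order]
    stats = []
    for idx in np.array_split(np.arange(len(extent)), n_bins):
        p = hits[idx].mean()                          # detection rate in bin
        half_width = 1.96 * np.sqrt(p * (1 - p) / len(idx))  # 95% CI
        stats.append((extent[idx].mean(), p, half_width))
    return stats
```

Plotting the bin centers against the rates, with the half-widths as error bars, reproduces the style of points shown in Figure 3.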
2. How can there be 950 applications in three months prior to each Carbon Mapper detection when Sentinel-2 passes at most once every 2 days? This is because the model was applied on pairs of Sentinel-2 images, wherein each individual image is captured on a different day, with the reference date in the 3 months prior to the airborne detection, and the second detection date within 7 days of the airborne detection. Combining two dates enables us to generate a much larger number of data points and to tighten confidence intervals, explaining the 7000 applications of our model described in the former Figure's caption. In the previous version of the manuscript, both dates were within 3 months before the leak, but we have tightened this to 7 days so that the leak is more likely to be there.
We have added the following explanations to try to make it clearer: (l.199) "This pair of Sentinel-2 images consists of a reference date in the 3 months prior to the leak, and a detection date in the 7 days prior to the leak. We further restrict our analysis to cloudless days (less than 0.5% cloud cover), which yields a total of 7724 possible pairs for the 2526 leaks of the catalogue." (l.218) "The Carbon Mapper catalogue regroups data from campaigns performed with two instruments (AVIRIS-NG and GAO). Our detection results are summarized separately for the AVIRIS-NG (black) and GAO (grey) campaigns, in 10 bins, with each bin showing the average detection rate of our deep learning model, the average leak size and rate in the bin, for 332 and 442 Sentinel-2 pairs per bin, respectively." 3. The transformations applied to the x-axis complicate interpretation of the results. If I understood correctly, the independent variable is not actually the catalogued Carbon Mapper plume rate but rather a prediction of the plume rate from the catalogued plume extent. But looking at supplemental Figure 2, this prediction is quite uncertain. The figure tries to say something about source strength, but it may say more about plume size, which is itself a function of source strength but also wind speed, topography, etc. I would suggest showing the comparison with the plume rates catalogued by Carbon Mapper somewhere in the manuscript.
Thank you for your comment. Yes, this is exactly right; we find empirically that our model's performance is mostly sensitive to the plume's extent, rather than the leak rate. There is indeed a rather uncertain relation between plume rate and plume extent, which also depends on wind and topography; moreover, catalogued plume rate inversions themselves come with large uncertainties. To make it clearer, we have changed the axes of Figure 3 to show the plume extent as the main driver of our model's performance; we show our rescaled leak rate as a guide, and indicate so in the text: (l.238) "Fig. 3 shows that our model's performance is sensitive to the extent of the plume (as derived from the airborne detection) more so than the catalogued plume rate inversion (see Supplementary Fig. 8), with a clear breaking point at 10,000 square meters. The plume rate shown in the figure gives an estimate of the corresponding leak rate, and stems from a simple regression of leak rate versus plume extent in the Carbon Mapper catalog (see Supplementary Figure 2)." We also added a new figure in the Supplementary Material (Fig. S8) illustrating the relationship between our model's results and the plume rates reported by Carbon Mapper, with no rescaling performed, along with the following explanation: (Suppl. Fig. 8) "Fig. 8 shows the fraction of catalogued Carbon Mapper leaks that were detected by our deep learning model, as a function of leak rate. Compared with Fig. 3 of the main text, we can see that the main factor explaining detection is the plume's extent, rather than the plume's leak rate. However, plume extent and plume rate are closely correlated, with variations mostly due to wind conditions, and the plume's extent can be rescaled as an expected plume rate (using the fit from Fig. 1 of the supplementary), as shown in the secondary axis of Fig. 3 of the main text." 4.
Perhaps most importantly, the two caveats pointed out in the text (L.183-192) are major ones. The Carbon Mapper surveys are predominantly in oil and gas fields, where sources are highly intermittent and clustered. Cusworth et al. (2021) reported a very wide range of persistence rates for Permian point sources, from ~0.1-0.9. The probability of the automatic detector marking "at least two adjacent pixels within 500 m of the leak" - i.e., within a 1-km disk - as positive for methane could be relatively high in the Permian basin, especially over a 3-month period. It is certainly not the same thing as the probability of the detector spotting the plume from a particular source. This concern could be addressed by applying the detector to random locations in the Permian - not merely to a location in the testing set, which is comprised of areas without known methane emitting infrastructure. How would that compare to the blue "false positive rate" point? We have added such a false positive rate to Fig. 3 by applying the detector to random locations in the Permian and the other locations of the catalogue, which summarizes the odds of making a random detection, instead of making a detection at the time and place of a known leak.
We note that when our neural network is fed a pair of Sentinel-2 images, it is only tasked with finding methane in the second image, with the first image being used only as a reference. Therefore, the odds of detecting a leak other than the one being assessed should be independent of the detection date and the time difference between the reference date and the detection date. We have also added another test away from the Permian Basin, but in a similar environment, near Las Cruces in southern New Mexico. This gives us a false positive rate in environments similar to that of the Permian, which can be directly compared with the detection rate of other leaks at random in the Permian and other areas surveyed in the Carbon Mapper catalogue.
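The "at least two adjacent pixels" detection criterion quoted by the reviewer, and the empirical image-level false positive rate obtained by running the detector on plume-free pairs, can be sketched as follows (a simplified illustration with hypothetical names, not the authors' implementation):

```python
import numpy as np

def has_detection(mask):
    """True if the binary output mask contains at least two adjacent
    positive pixels (4-connectivity), mirroring the quoted criterion."""
    m = np.asarray(mask, dtype=bool)
    horiz = m[:, :-1] & m[:, 1:]   # horizontally adjacent positives
    vert = m[:-1, :] & m[1:, :]    # vertically adjacent positives
    return bool(horiz.any() or vert.any())

def false_positive_rate(masks):
    """Empirical image-level FPR: fraction of model output masks, computed
    on image pairs known to contain no plume, that trigger a detection."""
    return float(np.mean([has_detection(m) for m in masks]))
```

Running this over many plume-free pairs yields the kind of image-level rates (0.7-0.9%) quoted below, which are much higher than the pixel-wise rate because each image contains many pixels.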
We have added the following text to describe these tests: (l.224) "Fig. 3 also summarizes our detection rate when applying the exact same methodology in the absence of known leaks, for three different tests. i) The blue bin shows our average detection rate when applying our model to pairs of Sentinel-2 images over southern New Mexico (but away from the Permian Basin), and provides an estimated false positive rate of 0.7% in conditions similar to those of the Permian Basin. ii) The green bin shows our false detection rate using the exact same methodology as elsewhere in this figure, but for pairs of Sentinel-2 images from our test set with no plume embedded. This yields a false positive rate of 0.9%, which is in agreement with a pixel-wise false positive rate estimated below 0.03% (as shown in Fig. 2).
iii) The red bin shows our deep learning model's detection rate over the regions surveyed in the Carbon Mapper catalogue, but at random locations and times (instead of centering the model's input on known leaks). This detection rate essentially shows the chances of detecting a different leak (and/or making a false detection) at random when assessing the detection of a particular leak of interest in an area containing methane sources".

Third, the authors should attempt to explain why this new approach works so much better than previous ones - not just thresholded MBMP retrievals, but also U-Net detectors for hyperspectral instruments like PRISMA (Joyce et al., 2023) and GHGSat (Bruno et al., 2023), which one would expect to provide better methane detection than the spectrally much coarser Sentinel-2 but which often fail to detect larger plumes. Is self-attention the silver bullet here? Did the loss function or optimizer play a role? What is the input feature importance? I wonder whether the inclusion of reference imagery from a previous time t-1 might be a crucial element. More discussion of why the proposed approach is so successful would be appreciated.
Thank you for raising this question, which we indeed did not address enough in the manuscript. We believe that the performance of our algorithm is due primarily to two factors:
- First, the reliance on simple Gaussian models to generate synthetic methane plumes to train the model. To our knowledge, all of the deep learning approaches currently published with the goal of recovering methane signatures in spectral data rely either on real plumes or on WRF-LES simulation schemes to train the models. This is a major issue, because the resulting training datasets are far too small to train a robust model, typically on the order of a few thousand examples. There are currently not enough real methane plumes in existing databases (and using real plumes introduces the issue of segmenting them first).
- Another critical explanation for the success of our method is indeed the use of two dates for the input of our model, with a reference date and a detection date, such that the deep learning model is free to learn an optimal methane retrieval method. The conceptual idea is very similar to the MBMP method. With respect to methane, there are many sources of noise in Sentinel-2 or other spectral data that can be drastically lowered when comparing the same location at two different times. In particular, surfaces that absorb in SWIR bands will have similar reflectance at both reference and detection dates (which explains why the comparison of band ratios at different times in MBMP enables the removal of a large portion of the noise). In contrast, the presence of methane at the detection date only will reduce reflectance in band 12 but not in band 11, and only at the second date. Feeding two images to the algorithm enables it to distinguish persistent absorption in SWIR at a given location (most likely noise w.r.t. methane signals) from non-persistent absorption in band 12 only (possible methane absorption signals). We initially built models using a single time, but these performed a lot more poorly.
Relying on transformer models instead of e.g. CNNs does improve performance, but this improvement is limited compared with the impact of the two points listed above.
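To illustrate the first factor, a simple synthetic plume of the kind described above could be generated along these lines (a toy parameterization chosen for illustration only; the authors' actual Gaussian plume model and their auto-correlated noise are not reproduced here):

```python
import numpy as np

def gaussian_plume(shape=(64, 64), source=(32, 5), rate=1.0,
                   spread=0.15, decay=0.02):
    """Illustrative 2D Gaussian-plume concentration field (arbitrary
    units). The plume starts at `source`, drifts along the x axis (the
    assumed wind direction), and widens linearly downwind; `rate` scales
    the overall amplitude. Such a field can then be converted to a
    reflectance perturbation and embedded in a real Sentinel-2 crop."""
    ny, nx = shape
    y, x = np.mgrid[0:ny, 0:nx]
    downwind = (x - source[1]).astype(float)    # distance along the wind
    crosswind = (y - source[0]).astype(float)   # distance across the wind
    field = np.zeros(shape)
    valid = downwind > 0                        # plume exists only downwind
    sigma = spread * downwind[valid] + 1.0      # cross-wind spread grows
    field[valid] = (rate / sigma
                    * np.exp(-0.5 * (crosswind[valid] / sigma) ** 2)
                    * np.exp(-decay * downwind[valid]))
    return field
```

Because each synthetic plume costs only a few array operations, millions of labeled training examples can be generated, which is the scale argument made above.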
We have added the following text in the discussion part of the paper to try to explain the performance of our approach: (l.142) "We explain the strong performance of our model by three main factors: i) relying on Gaussian plumes randomly embedded in real Sentinel-2 data (instead of real plumes or computer-intensive WRF-LES simulations) enables us to generate a training dataset that is orders of magnitude larger than the ones typically used in previous attempts at developing deep learning models for methane detection, thereby enabling us to fully train large deep learning models. ii) The use of two time-steps as input (conceptually similar to the MBMP approach) enables the model to use the first image as a reference image, to which the second image is compared in order to identify transient signals in methane-absorbing band 12, while false positives in band 12 can be discriminated using the other bands and their evolution over the two time-steps. This comparison is crucial to distinguish signal from noise and correctly detect methane plumes, in particular the smaller ones. And iii) the use of transformers instead of convolutional neural networks (CNNs) enables our model to capture the long-range nature of a plume."

Some more specific questions/comments: 2. L. 68 and L. 230: The claim about 200 kg/h and "the vast majority" of methane from point sources seems to refer to Figure 7 from Jacob et al. (2022), but that was for US aerial surveys, mostly of oil and gas basins. One of the surveys in that plot, for California, shows >200 kg/h accounting for less than half of the point source budget - and we don't know much about the source distribution in other countries/industries. Some caution is needed here. Thank you for raising this issue. We indeed used this number because it corresponds to the median of the various campaigns analyzed in Jacob et al. (2022), but it is true that the emission distribution can vary drastically from one region of interest to another, and our current description may be confusing in that respect. In particular, for one airborne campaign analyzed in California, the fraction reported in Jacob et al.
(2022) would only be 30% (the number remains higher than 70% for the other Californian campaigns). We have also recently learned from presentations at AGU that the leak size distribution in Alberta, Canada, is on average much lower than in the U.S., and the aforementioned statement would not apply to this region.
To clarify our results, we have modified both paragraphs in the main text:

(l.69) "Our results suggest that our model detects most methane emissions down to plumes of 0.01 km2, corresponding to methane leak rates of 200 to 300 kg/h (with variations depending on wind conditions). Leaks of this size account for the vast majority of the estimated methane budget coming from point sources for most airborne campaigns in California, Colorado, and the Permian Basin analyzed in a recent survey."

(l.297) "In particular, such emissions account for the vast majority of U.S. methane point-sources in volume from methane-emitting areas analyzed in a recent survey of airborne campaigns in several States."

In order to produce Fig. 2, 40 bins of equal numbers of samples along the SNR axis are built, and the median F1 score in each bin is shown as the thick lines. We have added the following parenthesis to clarify this: (caption of Figure 2) "(of equal number of samples along the SNR axis)"

6. L. 274: What is this "data-friendly" format?

Data is saved in the HDF5 format. We have changed the text to be more specific, and removed the subjective "data-friendly": (l.345) "These tiles are then sub-divided into smaller windows of 2.5 x 2.5 km2, and saved into HDF5 files for faster sampling during training."

Typos:
- L.28 in combination to -> in combination with
- L.67 models -> model
- Fig. 1
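The equal-count binning used for the thick lines of Fig. 2, as described in the response above, can be sketched as follows (illustrative only; `binned_median` and the dummy F1 curve are our own names, not from the paper):

```python
import numpy as np

def binned_median(x, y, n_bins=40):
    """Median of y within n_bins bins of (approximately) equal sample
    count along x, i.e. quantile-style binning rather than equal-width."""
    order = np.argsort(x)
    chunks = np.array_split(order, n_bins)  # consecutive chunks of sorted x
    centers = np.array([np.median(x[c]) for c in chunks])
    medians = np.array([np.median(y[c]) for c in chunks])
    return centers, medians

snr = np.linspace(0.0, 10.0, 400)
f1 = snr / (snr + 1.0)                      # dummy monotone F1-vs-SNR curve
centers, medians = binned_median(snr, f1)
```

Equal-count bins avoid the sparsely populated high-SNR bins that equal-width binning would produce.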

Reviewer #2 (Remarks to the Author):
The authors present a new deep learning approach to detect methane plumes in Sentinel-2 multispectral satellite imagery. They construct a large training dataset by downloading Sentinel-2 data and synthesizing methane plumes to serve as training labels, then evaluate the approach on a set of real plume data. They find the model can detect many of the real plumes without identifying a substantial amount of false positives, suggesting the potential for the approach to serve as a cheap, efficient, and accurate alternative to other methane plume detection approaches.
While I deeply appreciate the mission of the work and believe that the approach has potential, I see several major issues with both the methodology and conclusions made from the results, so I cannot recommend the paper for publication. I've provided my major and minor comments about the work below.
Thank you very much for the time spent reviewing our manuscript and for your comments and suggestions, which helped us to markedly improve our results. We agree that the mission of automating methane detection is critical, and we have done a lot of further work since the initial submission to try to address the issues raised by Reviewer 2. In particular, we have retrained the model on 3 times more data, including more challenging data (more vegetated areas, and cloudier data), which greatly improved the performance of our model by dramatically reducing its false positive rate.
We have also toned down our claims and conclusions, as indeed the comparison with the leaks catalogued by Carbon Mapper only gives us an indirect indication of our detection performance (because the aerial and satellite passes are not at the same time). We have also applied our method to 4 controlled methane releases that were done at the same time as Sentinel 2 satellite passes, which gives a solid validation that our approach does automate methane detection in the 1+ ton/h range.
Our point-by-point response follows.
Major Comments:

1. There are several claims made in the paper which are poorly supported by the results presented.

1(a). Importantly, I am not convinced from the results that the major conclusion of the paper (the only result highlighted in the Abstract, lines 20-23), that the model can "reliably detect" point sources down to 200 kg CH4 h-1, is true.

1(a)(i). Figure 3 shows that the model detected ~80% of the expected visible plumes between 200 and 350 kg/h, then about 90-95% between 400 and 1000. Missing 20% of plumes seems substantial. Without a baseline or reference method to compare against, claiming this is reliable detection is subjective. Furthermore, determining these aggregate numbers of how many plumes were detected above some plume rate, which are key to the main conclusions of the paper, was not easy to do from the figure. Those should be presented numerically or at least shown more clearly in a revised or additional figure.
Thank you very much for your comments and suggestions. We did our best to modify the main claims of the paper and align them with our results. As mentioned above, we also spent a lot of time building a larger dataset and further training the model, and we were able to lower our false positive rate on the test set from 0.5% to 0.03% pixel-wise. This leads to much lower false positives as reported in Figure 3.
Methane detection in Sentinel 2 data is currently limited to 1 ton/h+ leaks (and often struggles to detect plumes below 8-10 tons/h, including in critical areas such as the Permian basin [3,4]). Current approaches consist of the manual identification of plumes in very noisy methane retrieval fields, such as those produced by the MBMP method, which we compare our method against (there is currently no method that automatically identifies methane plumes in Sentinel 2). Importantly in the context of this review, there are also no previous examples of methane leak detections in Sentinel 2 data below 1 ton/hour, which makes the validation of our approach on small leaks particularly challenging.
To answer the comment more specifically, we only have a rough idea of the persistence of leaks in the catalog (the 20-26% estimate indicated in the figure), and it is not clear how persistence varies in time and with leak rate. The persistence of leaks and their variation in time is an active area of research (see for example Cusworth et al., Environ. Sci. Technol. Lett. 2021, 8, 7 [2]), and beyond the scope of our paper.
In order to try and make the comparison more robust, we have modified our figure to only consider satellite detections within 7 days of the airborne detection (compared with 3 months in the previous version of the manuscript). This enables us to detect a larger fraction of the leaks while at the same time having a more conservative detection threshold. However, because our model is still applied to data up to 7 days apart from the airborne detection, we cannot guarantee that any particular leak is still there. The goal of this exercise, and that of Fig. 3, is to show a clear statistical relationship between our satellite detections and the airborne detections for leaks below 1 ton per hour. We have toned down our wording however, as indeed our comparison with airborne detections only gives us an indirect proof and is not enough to claim "reliable detection".

Furthermore, we added an analysis of a controlled methane release experiment in the Supplementary Materials. In this recent experiment conducted by Sherwin et al., four 1 ton/h+ methane releases were timed with Sentinel 2 passes. One of these plumes (the 1.1 ton/h release) corresponds to the smallest plume detected so far in Sentinel 2 data in the literature.
Figures S3 and S4 show the application of our model to these controlled methane releases, with our model once again having no information on the location of the plume nor the local wind (which was given to the participants of the test). Figure S4 also shows methane retrieval fields computed by the groups participating in the blind test, and illustrates the amount of false positives present in state-of-the-art analyses. Masked detections from these retrieval fields typically involve analysis by a human operator.

1(a)(ii). The false positive rate was approximated and therefore it is not clear how reliably the model will detect plumes in practice. Furthermore, even if the estimated false positive rate of 3-4% is true, that corresponds to at least 2,400 false positives across 80,000 images. The authors state a few times that future work will focus on reducing these, but the limitation should be acknowledged more explicitly, and the major conclusion should be tempered accordingly.
Following this comment, we have been working hard for the last 3 months at reducing the false positive rate of our approach. Notably, we have trained the model on 3 times more samples (1.6 million samples compared to the previous 550 thousand) and we have included more difficult training data, in wetter areas (including Canada) and with more cloud cover (up to 25%, as opposed to the previous 5%).
All in all, our pixel-wise false positive rate on our test set is now down to less than 0.03% (shown in Figure 2), compared with our previous 0.5%, which translates to a less than 1% false leak detection rate (shown in Figure 3, blue and green dots) with the criteria used for detecting catalogued leaks. Our main objective here was to show the quality of our model's output compared to state-of-the-art methane retrieval fields. Other standard masking techniques, such as water and landcover masks, would lower our false positive rate further; this is the topic of future work (along with further increasing the size of the training set).
We have added the following paragraph to the main text, which explains further experiments we have done to investigate the false positive rate of our approach: (l.224) "Fig. 3 also summarizes our detection rate when applying the exact same methodology in the absence of known leaks, for three different tests. i) The blue bin shows our average detection rate when applying our model to pairs of Sentinel 2 images over southern New Mexico (but away from the Permian Basin), and provides an estimated false positive rate of 0.7% in conditions similar to those of the Permian Basin. ii) The green bin shows our false detection rate using the exact same methodology as elsewhere in this figure, but for pairs of Sentinel 2 images from our test set with no plume embedded. This yields a false positive rate of 0.9%, which is in agreement with a pixel-wise false positive rate estimated below 0.03% (as shown in Fig. 2)."

1(b). Lines 69-70 state their approach unlocks fully automated monitoring of emissions, but the approach produces a nontrivial amount of false negative and false positive detections which may prohibit practical use. This claim needs to be changed or better supported by the analysis in the paper.
Thank you for your comment. As described above, we were able to significantly reduce false positive rates. We also toned down some of the claims in the paper.
1(c). The authors make the claim on lines 82-84 that the database contains areas that do not encompass known potential methane sources, but they do not thoroughly describe how they ensured this while sampling, they do not supply an exhaustive list of potential methane sources, and they state they were only sampled far from oil and gas activity on line 160, which I believe puts into question whether the claim on lines 82-84 is true. The sampling process should be described in more detail, and if the authors believe there could be known potential sources in the negatives, this claim should be modified and conclusions made from the results should be adjusted accordingly if needed.
Thank you for your suggestion. When sampling the data, we performed a visual and landcover analysis of the areas covering the extent of each selected tile to try to ensure that there were no oil and gas facilities, landfills, natural gas plants, or other obvious sources of potential methane point sources. This is of course not entirely foolproof, but it is the best approach we could think of in order to build a dataset relying on real Sentinel-2 data.
We have added the following paragraph to the Methods section: (l.327) "We performed a visual inspection of the optical and landcover extent of the selected tiles to avoid, as much as possible, potential sources of point-source methane emissions".

In order to evaluate whether our training, validation and testing data are really methane free, we compare our false positive rate on our database (green and blue points in Fig. 3) to our false positive rate when applying our model in oil and gas basins at random (red point in Fig. 3). The dramatically enhanced random detection rate near oil and gas activity (4 to 5 times higher, from 0.7-0.9% to close to 4%) shows that methane sources are at least much rarer in our database. Because there is no ground truth with respect to methane emissions, we argue that this fits the description that there are "tentatively" no methane sources in our database.

We have added the following description of this new experiment: (l.232) "The red bin shows our deep learning model's detection rate over the regions surveyed in the Carbon Mapper catalogue, but at random locations and times (instead of centering the model's input on known leaks). This detection rate essentially shows the chances of detecting a different leak (and/or making a false detection) at random when assessing the detection of a particular leak of interest in an area containing methane sources."

2. Several methodological decisions were made without sufficient motivation and explanation, some of which may affect the results and conclusions made from the paper.
Thank you for your comment and suggestions. We have tried to add more details about the motivation of the approach in the main text, and to add explanations regarding individual operations. These changes are described for each listed point below:

2(a). Why use a single pair of consecutive times? Why only predict a single plume image? Does the plume correspond to one of the times, the difference, or something else?
The reliance on two time-steps was chosen as an analogy to the gold-standard method used to detect methane in Sentinel-2 data, the multi-band multi-pass (MBMP) approach, which we compare against in several figures of the paper. The MBMP method focuses on the analysis of band ratios (using bands 11 and 12) at two time-steps. This is because many of the noise sources in Sentinel-2 data are due to surface properties and land cover, and are more or less independent of the time at which they are captured. The use of two time-steps allows the model to discriminate between persistent SWIR absorption at a given location (likely due to noise) and transient SWIR absorption at the same location (which may be due to methane).
The output of the model corresponds to the plume detected in the second image. This is because (similarly to MBMP) the first image in time should be understood as a 'reference image', to which the second is compared to identify local absorption patterns in SWIR that are not persistent in time and space.
To improve the clarity of the description, and to emphasize its parallel with the current state-of-the-art (MBMP), we have added the following paragraph to the main text: (l.147) "The use of two time-steps as input (conceptually similar to the MBMP approach) enables the model to use the first image as a reference image, to which the second image is compared in order to identify transient signals in the methane-absorbing band 12, while false positives in band 12 can be discriminated using the other bands and their evolution over the two time-steps. This comparison is crucial to distinguish signal from noise and correctly detect methane plumes, in particular the smaller ones."

2(b). Why are all input bands sampled to 20m? How is the downsampling for 10m bands and upsampling for 60m bands done?
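The MBMP idea referred to above (comparing band-ratio anomalies between a reference pass and a detection pass) can be sketched on a toy scene as follows. This is a deliberately simplified version; the exact normalisation in the published MBMP method differs:

```python
import numpy as np

def mbmp(b11_ref, b12_ref, b11_det, b12_det):
    """Toy multi-band multi-pass (MBMP) fractional signal.

    Persistent SWIR absorbers appear in both passes and cancel out;
    methane present only at the detection date survives as a negative
    band-12 anomaly."""
    def mbsp(b12, b11):
        ratio = b12 / b11
        return ratio / np.median(ratio) - 1.0   # single-pass band-ratio anomaly
    return mbsp(b12_det, b11_det) - mbsp(b12_ref, b11_ref)

# Toy scene: identical surface at both dates; methane darkens B12 at detection
b11 = np.full((8, 8), 0.4)
b12 = np.full((8, 8), 0.3)
b12_det = b12.copy()
b12_det[2:4, 2:4] *= 0.9                        # 10% B12 absorption by a plume
signal = mbmp(b11, b12, b11, b12_det)
```

The plume pixels come out as a ~-10% anomaly while the unchanged background stays at zero, which is the behaviour the two-date model input is designed to exploit.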
The input bands are sampled to 20m because this is the resolution of the SWIR bands in Sentinel-2. The particular band typically used to detect methane, B12, has a 20m resolution. Keeping this resolution leaves the most important bands of the data with respect to methane detection (bands 11 and 12) untouched. The resampling of the bands with a resolution other than 20m is done by nearest neighbor resampling, and we have added these details to the Methods section: (l.334) "All spectral bands are re-sampled to the 20m resolution of band 12, using nearest neighbor resampling. We keep the resolution of the SWIR bands unchanged".
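A minimal sketch of the nearest-neighbour resampling described above (in practice a raster library such as rasterio or GDAL would perform this step; `resample_nearest` is our illustrative helper, not the authors' code):

```python
import numpy as np

def resample_nearest(band, target_hw):
    """Nearest-neighbour resampling of one band to a target grid:
    each output pixel copies the value of the closest source pixel."""
    h, w = band.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th     # source row for each output row
    cols = np.arange(tw) * w // tw     # source column for each output column
    return band[np.ix_(rows, cols)]

b02_10m = np.arange(16.0).reshape(4, 4)   # dummy 10 m band
b01_60m = np.arange(4.0).reshape(2, 2)    # dummy 60 m band
down = resample_nearest(b02_10m, (2, 2))  # 10 m -> 20 m (downsample)
up = resample_nearest(b01_60m, (6, 6))    # 60 m -> 20 m (upsample)
```

Nearest-neighbour resampling introduces no new values, so bands 11 and 12, already at 20 m, pass through unchanged.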
We have added details describing our deep learning model, including details on upsampling, which is done using deconvolutional layers (PyTorch ConvTranspose layers). We added this information to the caption of Figure 3.

2(d).
Why did the authors design and use the proposed deep learning architecture rather than other well-established architectures? ViTs are known to struggle with dense prediction tasks, so comparing to well-established CNN-based segmentation architectures and ViT-based segmentation architectures that have been modified for dense prediction would much better support the use of this architecture, should it outperform those approaches. Furthermore, several implementation details were left out, like the optimization procedure (optimizer, learning rate, batch size, etc.), for example.
We have added details on the optimizer, learning rate, batch size, etc. to the Methods section: (l.371) "Our deep learning model is trained on batches of 64 samples of pairs of 128x128 Sentinel 2 tiles, in which random Gaussian plumes are embedded. The model is trained for 10 epochs using the Adam variation of stochastic gradient descent, with a learning rate that starts at 10^-3 and is progressively reduced by 0.1% when there is no improvement in validation performance after 10 batches. The model that has the best performance on the validation set is kept."

We have found that the main difference between our transformer auto-encoder and pure CNN auto-encoder architectures is that the transformer architecture is able to determine that methane signals downwind from a plume it has detected are more likely to be real methane signals. In other words, the transformer-based model is able to recognize long-range and anisotropic dependencies in the signal we are looking for. This is a subtle difference that is not obvious in raw performance metrics, but still noticeable, and we have added performance figures similar to Fig. 2, comparing a purely convolutional UNet-ResNet architecture to the transformer-UNet architecture, in Fig. S6.
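The stated schedule (start at 10^-3, multiply by 0.999 after 10 steps without validation improvement) behaves roughly like PyTorch's ReduceLROnPlateau; a plain-Python sketch of that logic with made-up validation losses:

```python
def plateau_lr_schedule(val_losses, lr0=1e-3, factor=0.999, patience=10):
    """Start at lr0 and multiply the learning rate by `factor` (a 0.1%
    reduction) whenever the validation loss has not improved for
    `patience` consecutive steps. Sketch of the stated schedule only."""
    lr, best, stale = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0      # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                lr *= factor           # 0.1% reduction after a plateau
                stale = 0
        history.append(lr)
    return history

# One improvement, then 20 stale steps -> two 0.1% reductions
hist = plateau_lr_schedule([1.0] + [2.0] * 20)
```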
2(e). Why split datasets by region if the data is synthetically generated? Wouldn't one want to include all regions in the training set to improve the generalization, rather than excluding some regions?
Thank you for your comment. We could indeed do this, but our goal here was to demonstrate that the model generalizes well. In particular, showing its performance on a test set sampled from different regions shows that it didn't simply overfit the training regions. This is a better demonstration of the robustness of the model. We note that the training data is only partially synthetic, with the background Sentinel 2 being real data in which we embed synthetic plumes, and generalizing to different Sentinel 2 backgrounds is therefore critical.
Fine-tuning the model to areas of application could be helpful when used in practice. We added the following paragraph to the main text to emphasize this point: (l.252) "Fine-tuning on specific regions of interest could also be performed in order to improve performance and limit false positives when using the model in practice".

2(f). Regarding the argument made on lines 178-181, even though training on synthetic data was not centered around the plumes, doing so when testing the model is biased. If the goal is to understand how well the model works for detecting real methane plumes in practice, the locations of the methane plumes and often even the potential sources will not be known a priori, and therefore the images will not be centered on the plumes. The evaluation needs to be redone without centering in order to identify whether the model is robust to perturbations in the locations of the real methane plumes.
The evaluation on the test set (Figure 2) is not done using centered plumes. The plumes are only centered for the application to real airborne detections, for visualization purposes. We have added a new figure in the Supplementary Materials (Figure S5) analyzing the distribution of our model's detections on test set data, which illustrates that the model is not positively biased towards detections in the center of the image.
3. No limitations are stated even though there are several, including but likely not limited to (1) the use of randomly sampled negatives which may have methane emissions, which is likely reducing the sensitivity and leads to a rough approximation of the false positive rate, (2) the approximated false positive rate is nontrivial, (3) the uncertainty of real-world detection accuracy arising from the intermittency of plumes.
In the 3 months we have been working on the revised version of this paper, we have tried to address these limitations to some extent, by: i) reducing our false positive rate by further training the model on more data, increasing the amount of wet and cloudy training samples while avoiding cloudy samples when assessing against the Carbon Mapper methane leaks; ii) comparing our false leak detection rate away from oil and gas activity (0.7 to 0.9%) and near oil and gas activity (around 4%); and iii) demonstrating the performance of our model on controlled releases, which our model correctly and blindly detects.
There remains substantial uncertainty for real-world detections below 1 ton/hour (due to the absence of previously detected leaks in Sentinel 2 data below this threshold, preventing any direct comparison), and we have added caveats and discussions in this regard throughout the manuscript. Notably, we now insist on the caveat that our model's performance depends on plume size rather than on source rate.
Abstract (l.20) "We compare our detections with airborne methane measurement campaigns and show that our method is able to detect most methane point sources in Sentinel 2 multi-spectral satellite data down to plumes of 0.01 km2, corresponding to 200 to 300 kg CH4 h-1 sources (with variations depending on wind conditions). The assessment of our model on synthetic data as well as on real methane plumes results in an order of magnitude improvement over the state-of-the-art, providing a significant step towards the automated, high resolution detection of methane emissions at a global scale, every few days."

Introduction (l.69) "Our results suggest that our model detects most methane emissions down to plumes of 0.01 km2, corresponding to methane leak rates of 200 to 300 kg/h (with variations depending on wind conditions), which accounts for the vast majority of the estimated methane budget coming from point sources for most airborne campaigns in California, Colorado, and the Permian Basin analyzed in a recent survey."

Thank you for this comment. We have modified the manuscript to ensure consistent spelling.
3. The Figures (specifically 1 and 3) could be substantially improved for readability.

3(a). Figure 1 could more clearly show how the images are being divided into patches, which is an important design decision. It could also be more aesthetically designed overall, including using visual representations of the operations that are more standard (and, in my opinion, more clear).
Thank you for your suggestion. We have modified Figure 1 for greater clarity.
3(b). In addition to the changes I suggested in the major comments, Figure 3 could include horizontal gridlines to improve readability.
Thank you for your suggestion, we modified Figure 3 accordingly.

4. The Data availability and code availability statements suggest neither the curated dataset nor the model training code will be made publicly available. However, the dataset and code would be fantastic resources for other researchers to be able to use and, in my opinion, releasing them would lead to the work making a much larger impact.

Thank you for your comment. We have modified the main text accordingly.

4. Line 114 and the caption state that the model is an auto-encoder, but it is not. An auto-encoder attempts to reconstruct the input. This is an encoder-decoder segmentation architecture.
We have corrected the main text accordingly.

5. Lines 158-160 claim the FPR is reliably evaluated, but I think the authors are trying to say that their estimated FPR is likely an upper bound of the true FPR, given that some of the detected plumes may actually be real plumes, suggesting it is not reliably evaluated. Additionally, for the exact reason the FPR is likely overestimated, the FNR may be underestimated, which could be inflating the AUC.
Methane leaks away from oil and gas activity (or other potential sources not spotted during the visual inspection of the tiles) that are detectable in Sentinel 2 data are likely to be extremely rare events, which we believe we can neglect. We have removed the claim of a "reliable" estimation of the FPR, and we simply insist on the fact that the negative samples are not synthetic: (l.176) "Importantly, we note that the false positive rate of our model is evaluated here on real and unaltered Sentinel 2 data. Only the positive samples of our databases have a synthetic element, while the negative samples are original Sentinel 2 data."

6. Why switch between kg/h and tons/h on line 167? Better to stay consistent with one (or report both).

We modified the text for greater consistency.

Discussion

1. I think the discussion does a better job of making more supportable conclusions than other parts of the paper. However, the claim on lines 233-236 again seems too strong, as the authors do not present results which sufficiently compare to these constellations. I'd suggest softening this claim or including these results.
Thank you for your comment. We modified the claims as follows: (l.294) "A timely global monitoring system for methane, able to detect emissions down to a few hundreds of kilos per hour, would be a fundamental stepping stone towards an automated, complementary and holistic monitoring system for building inventories of anthropogenic emissions at scale. In particular, such emissions account for the vast majority of U.S. methane point-sources in volume from methane-emitting areas analyzed in a recent survey of airborne campaigns in several States. By drastically lowering detection capabilities in multi-spectral data, our results suggest that public, general-purpose multi-spectral satellites can be turned into powerful methane monitoring tools capable of reaching detection performances approaching those of hyper-spectral constellations built specifically for methane detection, with the potential of generating global methane inventories at fine spatial and temporal scales."

Methods

1. How were the Sentinel-2 images downloaded?

The data was downloaded using the PEPS API (a mirror of ESA's repository). We added this information to the Methods section.

How was cloud cover determined?
The cloud cover was determined using the metadata provided by ESA for each Sentinel-2 image. We added this information to the Methods section.
caption: as two different times -> at two different times
- L.194 leaks -> leak
- L.196 leat -> least
- Fig. 4 caption: showed -> show

Thank you for catching these typos! We modified the text.

1. The syntax used throughout the paper is good. However, paragraph organization and structuring could use work. For example, why are lines 110-118 three separate paragraphs? Why are lines 218-224 separate paragraphs?

Thank you, we modified the structure of the text.

2. The authors are not consistent with spelling "multi-spectral", sometimes excluding the hyphen. It would be good to make this consistent and look for other potential inconsistencies throughout the paper.

Introduction

1. The background and motivation for the work is solid. However, the claims made in the last paragraph are exaggerated. Lines 67-70 again claim that the model robustly detects emissions down to 200 kg/h, and that it unlocks fully automated monitoring, which I don't think the current results sufficiently support, as I've explained above.

Thank you for your comment. We have modified the claims accordingly: (l.73) "Our approach represents a significant step towards the automated monitoring of persistent methane emissions, anywhere on Earth, every few days, and would require few modifications to enhance detection capabilities in other multi- and hyper-spectral constellations."

Results

1. It'd be helpful to plot the locations of the training / validation / test tiles on a map, at least as a supplementary figure, so that readers can gain a better understanding of the geographic distribution of the splits.

We added a new figure to the supplementary materials (Figure S7) to show this distribution.

2. References to Fig. 2B and 2C are swapped in the text from lines 139-157 (or the figure/caption labels need to be swapped).

Thank you for catching this! We modified the text.

3. The model is not a transformer. It is a semantic segmentation model with a vision transformer encoder. I'd suggest modifying language around this throughout the paper (e.g. lines 102, 110), potentially including the title.
The WRF-LES simulations are too slow to generate a large database for training (the same simulated plume can be captured at various times to increase the number of examples, but this still creates issues in terms of the diversity of the signals). In contrast, the simple Gaussian models take a fraction of a second to run, which enables us to generate large numbers of diverse signals, thereby leading to a larger and more varied training dataset that enables us to train a more robust model. In particular, Joyce et al., 2023 use a training/testing dataset of 9,700 images, compared to our 550,000 image pairs in the original submission and more than 1,500,000 image pairs in the revised submission. We argue that less than a few hundred thousand samples is not enough data to properly train a deep learning model. Similarly, Bruno et al. 2023 train their deep learning model on a database of 6,870 samples.
1. Can the vision transformer approach be used to quantify methane plumes? I was left wondering what one would do to quantify emissions after using the proposed model to detect the plumes. Perhaps this is future work, but it could use some discussion.

Yes, this is possible, and some members of our team are currently working on this (but at this point we haven't finished developing it). This has also been done in a couple of papers in the literature. Another approach is to use a plume mask, built on top of our model's output, in combination with standard quantification methods such as IME. We are currently favoring the second approach, as using deep learning for methane mask identification followed by a physical method for rate inversion somewhat avoids the pitfalls of the black-box nature of deep learning models. We have added the following text to discuss this in the paper: (l.262) "The task of our deep learning model is only to detect the location of methane plumes in Sentinel 2 data, and the inversion for plume rate could be done as an additional step, using standard inversion methods such as the IME method or by fitting a Gaussian plume."

2. L. 68 and L. 230: The claim about 200 kg/h and "the vast majority" of methane from point sources seems to refer to Figure 7 from Jacob et al. (2022), but that was for US aerial surveys, mostly of oil and gas basins. One of the surveys in that plot, for California, shows >200 kg/h accounting for less than half of the point source budget, and we don't know much about the source distribution in other countries/industries. Some caution is needed here.
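For reference, the IME-based inversion mentioned in the response above typically computes Q = U_eff * IME / L (Varon et al. 2018, with several conventions for the plume length scale L). A toy sketch with made-up numbers; the function name and the sqrt-of-area convention for L are our illustrative choices:

```python
import numpy as np

def ime_rate(enhancement_kg_m2, pixel_area_m2, u_eff_m_s):
    """Integrated Mass Enhancement (IME) source-rate sketch: Q = U_eff * IME / L,
    with L taken as sqrt(plume-mask area), one common convention.

    Inputs: per-pixel methane column enhancement (kg/m^2) inside the plume
    mask, the pixel area (m^2), and an effective wind speed (m/s)."""
    ime_kg = enhancement_kg_m2.sum() * pixel_area_m2   # total excess methane mass
    area_m2 = enhancement_kg_m2.size * pixel_area_m2
    length_m = np.sqrt(area_m2)                        # plume length scale L
    q_kg_s = u_eff_m_s * ime_kg / length_m
    return q_kg_s * 3600.0                             # kg/h

# Toy plume: 25 pixels of 20 m x 20 m, uniform 1e-4 kg/m^2 enhancement, 3 m/s wind
q = ime_rate(np.full((5, 5), 1e-4), 400.0, 3.0)
```

With these toy inputs the retrieved rate lands around 108 kg/h, illustrating how a plume mask from the detection model plus a wind estimate yields a source rate without any further learned component.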