Supervised dimensionality reduction for big data

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-Rank Projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
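For readers who want a concrete picture of the simplest version, below is a minimal, illustrative numpy sketch of a LOL-style embedding (class-conditional mean differences concatenated with the top principal directions of the class-centered data). The function name and the optional orthonormalization step are illustrative choices made here for clarity; this is not the authors' reference implementation.

```python
import numpy as np

def lol_project(X, y, d):
    """Illustrative LOL-style embedding: class-mean differences + top principal directions."""
    classes = np.unique(y)
    mu = np.vstack([X[y == c].mean(axis=0) for c in classes])   # class-conditional means
    deltas = (mu[1:] - mu[0]).T                                  # (p, C-1) mean-difference directions

    # class-conditional mean centering before the decomposition
    Xc = X - mu[np.searchsorted(classes, y)]                     # subtract each sample's class mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)            # right singular vectors = principal directions
    A = np.hstack([deltas, Vt[: d - deltas.shape[1]].T])         # (p, d) projection matrix

    # Optional orthonormalization of the columns (a presentation choice in this sketch).
    A, _ = np.linalg.qr(A)
    return A, X @ A
```

Dropping the delta columns leaves only the class-centered principal directions, while keeping only the delta columns uses the class-conditional means alone; LOL combines the two.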

the effectiveness of LOL for classification using LDA compared to PCA and random projections. The paper further develops a finite sample difference between LOL+LDA and PCA+LDA, as well as the difference in Chernoff information captured between each method for 2-class classification.
Summary of review: I believe that this paper should be published in this venue; however, I would prefer to see some edits made to the paper before publication. I find the paper to be novel and insightful. I believe the LOL dimensionality reduction method is useful for improving 2-class classification performance in a low-sample, high-dimensional setting, while maintaining sound theoretical improvement guarantees over PCA. I am unconvinced of the applicability of LOL beyond the 2-class classification setting, and I would have preferred to see a deeper investigation of other dimensionality reduction methods.
Theorem 1 suggests that the difference in Chernoff information between LOL and PCA is inversely proportional to the size of the d'th eigenvalue, implying (as stated in the main body of the paper) that the degree of benefit afforded by LOL is based on how much information is gained due to replacing the eigenvector of smallest explained variance with the mean difference vector. In general for C-class classification, this bound will become inversely proportional to the [d, d -1, ..., d -(C -2)] eigenvalues. Can the theory be extended to include the general C-class setting? How restrictive do the conditions for LOL improving over PCA get as C grows larger? In the 2-class setting, Theorem 1 appears to state that LOL always improves over PCA because the difference in Chernoff information is always greater than zero. Does this hold for C > 2?
The empirical section only investigates 2-class classification except for Figure 3a, which demonstrates that PCA outperforms LOL for a high number of embedded dimensions (which I'll call d). As C >> 2, then how does this impact the value of d at which PCA improves over LOL? My presumption is that this value of d will decrease as C increases because LOL replaces the bottom eigenvectors with C-1 mean difference vectors. I would like to see this further investigated to know whether LOL is robust to classification settings with larger values of C (either empirically or by extending the theory). This paper also discounts many dimensionality reduction methods and investigates only PCA and random projection methods. While I agree with the paper that many dimensionality reduction methods do not meet the goals stated by this paper, I do believe that some further exploration would be useful to more appropriately place this paper within the literature. For instance, I am unconvinced that supervised autoencoders should be entirely discounted. While I agree that deep non-linear SAEs are uninterpretable black-box methods with too many tunable parameters, I wonder how a fully linear SAE with only a single hidden layer would perform compared to PCA/LOL. In fact, such an AE (without the supervised component) has been shown to be the same function class as PCA, so how does the corresponding SAE compare to LOL? I would argue that the interpretability of this SAE is no worse (or better) than the interpretability of LOL.
Similar to the discussion on supervised autoencoders, I would have liked to see further discussion on supervised dictionary learning (the seminal SDL paper being cited by this paper, but not discussed). Extending SDL to out-of-sample prediction has been widely investigated, and the goal of SDL is to directly parameterize the "projection" matrix that this paper seeks to hand-design. While the theory in this paper suggests that the hand-designed projection matrix is better than PCA, I wonder if it is likewise better than a direct parameterization/optimization using SDL (note this relationship can likewise be shown with SAEs, though it is obscured).
Paper describing SAEs for out-of-sample supervised classification (for transparency, note I am an author on this paper): Le, Lei, Andrew Patterson, and Martha White. "Supervised autoencoders: Improving generalization performance with unsupervised regularizers." Advances in Neural Information Processing Systems. 2018.
Minor comments/questions: LOL can be seen as making two modifications to PCA. It uses the class-conditional covariance matrix to choose the d highest-variance eigenvectors, and it also replaces C-1 of those d eigenvectors with the differences between class-conditional means. How does either of these modifications perform in isolation? Based on my understanding of Theorem 1, using the class-conditional means is the most important novelty, but does that hold in practice as well?
In Figure 3, why does only LOL have a "robust" version? The performance gain of LOL in Figure 3b is negligible and much worse than many competitors, but "robust" LOL does quite well. Would using a robust PCA have the same effect? Or can only LOL be made robust in this way? The robust LOL method is never described anywhere that I could find. My assumption is that you replaced the class-conditional mean with a class-conditional median. Likewise in Figure 3c, why not compare QOQ with QDA as well as QDA+PCA? Would this significantly outperform QOQ? Given that LDA and PCA both outperform LOL, I would expect that the quadratic versions would likewise outperform the quadratic version of LOL. If this is true, then that means LOL only outperformed its baseline in one of three simulated settings and only for small values of d in that one setting.
In Figure 4, why can't PCA use the randomized approximate algorithm for eigendecomposition? Would PCA not also be able to benefit greatly in terms of computational performance, especially considering that LOL is simply PCA with a class-conditional mean centering preprocessing step (e.g. LOL and PCA rely on the exact same eigendecomposition code, LOL just has a subtraction step first).
What is Figure 5 showing for the LDA columns? Since RR-LDA is equivalent to LDA composed with PCA, it can't be showing the reduced-rank form. I'm not sure that LDA can be composed with LDA; that doesn't make sense to me (thus the label is wrong, I believe), so what is the LDA column in Figure 5a and row in Figure 5b? Is it LDA applied to the raw dataset without dimensionality reduction?
In Figure 6, you show LOL used for a regression problem. This is quite interesting, and in general I would love to see more exploration of LOL used with other classification algorithms (other than LDA). However, how can LOL be used for regression? How do you compute the class-conditional covariance and means for a regression problem? Without these, I would expect LOL to revert back to PCA.
I believe appendix B has a few typos:
-B.II: g^d_{LDA} is defined twice. I believe the first definition should be labelled g_{LDA}.
-B.III: \delta is defined as a matrix of size (C-2) x p instead of (C-1) x p. Should the last index in this array be \delta_C instead of \delta_{C-1}?
-B.III: A^D_{LOL} is defined twice and in two different ways. The first definition assumes two classes and the second definition is more general for any number of classes C.

Andrew Patterson
Reviewer #3 (Remarks to the Author): Summary: The paper discusses the use of linear projection (PCA) per class followed by LDA to preserve the discriminatory power of features. The LOL method ensures that the projected data is useful for the subsequent linear classification problem.

Comments:
Unfortunately, the idea of appending the mean difference vector to features and class-centering to convert unsupervised manifold learning to a supervised manifold learning setting is not a substantial enough contribution to ML methodology. The claims of scalability are not interesting either. The draft has not been proofread properly and has elementary mistakes even in the background/preliminary sections such as A.I and A.II.
>> Existing linear and nonlinear dimensionality reduction methods either are not supervised, scale poorly to operate in big data regimes, lack theoretical guarantees, or are "black-box" methods unsuitable for many applications. This is an exaggerated mis-characterization of work in this area. There are multiple papers that address scalability, address theory of such dimensionality reduction, as well as are quite interpretable and transparent.
>> LOL achieves state-of-the-art classification accuracy This seems to be an exaggerated claim, partly because restricting oneself to linear classification settings already limits the model representation power. What does state of the art even mean here?
Additional Comments: -The first expression for g_LDA(x) in pg 19 looks wrong. The first term should also be a function of y.
-Why did the narrative change to two classes instead of the general setting in A.II pg 19, even before calling out the two-class setting (i.e., why \pi_0 = \pi_1 before talking about the two-class setting)?
-Lemma 1 is trivial/not new.
-What is the difference between L_A^d and L_A? Please be consistent in notation.
-In Eq (3), are \Sigma and \delta (which determine \Sigma_A and \delta_A) already known? This should have been clarified, as this is very important and makes the optimization problem difficult/easy.
->> because we do not know how to evaluate the integral analytically
Why is this a deal breaker? There are many problems where the population optimal estimation or model is not computable. Also, can you comment on what type of problem this is? Non-convex? For what additional assumptions could it become convex?
-What is \mathcal{A}^d and \mathcal{A}? If they are the same, then how can \delta^T\Sigma^{-1} belong to this set. The dimensions don't match, so Lemma 2 is incorrect.

-Lemma 2 proof: what is \delta_B?
The point-by-point response is organized as follows. Comments from reviewers are enumerated in black text. Our responses are provided in blue text. Relevant text changes which specifically address the reviewer's comment are indicated by the red text within the screenshot (and the corresponding manuscript).
Reviewer #1 (Remarks to the Author): This paper proposes a supervised dimension reduction approach called "linear optimal low-rank (LOL)" projection that combines both class-conditional means and principal component directions to construct projection matrices. Theoretically, the authors show that the LOL projection preserves more Chernoff information than those by PCA and LDA, implying higher discriminative capabilities. Simulation study illustrates the superiority of LOL (or its variants) over PCA, LDA and ROAD in terms of misclassification errors under various embedding dimensions. Real data analysis on high-dimensional brain imaging and genomic datasets lends further support to the statistical superiority and computational efficiency of LOL.
This work studies a very important problem arising from modern data challenges. The key idea of the proposed algorithm is to incorporate class-conditional means into dimension reduction. This paper has solid theory and sufficient numerical study. Below I list my comments on the paper:
1. From my perspective, the major limitation of LOL is that it incorporates only class-conditional means into the dimension reduction. What if the classes share (nearly) the same means? What will be the consequences if one incorporates other class-specific statistics, such as elementwise medians, into the dimension reduction procedure? I understand that incorporating the mean structure is a natural starting point and easy to analyze under Gaussian setups, but I expect more justification of this choice, particularly in the real data contexts.
Thank you for your feedback. We have updated our wording in the abstract, introduction, discussion of Figure 3(C), and discussion to better emphasize our contribution as XOX, a procedure for incorporating (in the general case) class-conditional moment estimates into a supervised embedding, of which LOL, QOQ, and Robust LOL (RLOL) are special cases. Our goal was simply to use LOL as an example of such a framework, since it allows us to make strong theoretical guarantees while maintaining strong empirical performance despite the simplicity of the modification. If the classes share nearly the same means, variants of XOX incorporating different class-conditional moments may be sensible, such as QOQ, as highlighted in Figure 3(C). XOX could incorporate any class-conditional moment estimates, such as the class-conditional medians you mention; we simply chose not to here, for simplicity of illustration.
For RLOL, we did use the class-conditional medians as the "robust estimate of the first moment" (as well as a robust estimate of the second moment); we have clarified this in the text.
Upon further review of all experiments, we also realized that, in standard numerical packages (R::robust), a robust estimate of the second moment restricts us to problems where the dimensionality is at most the sample size, whereas a robust estimate of the first moment has no such requirement. We have therefore replaced all instances of the LOL simulations with a variant featuring a robust estimate of the first moment and a truncated estimate of the second moment via truncated SVD. This had no impact on any results contained herein other than the robust simulation, and we believe it is a simpler story to tell than having separate embedding procedures for what is otherwise a linear classification problem in that particular figure.
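To make the modification concrete, here is a hedged sketch of the robust variant as just described (class-conditional medians for the first moment, a truncated SVD for the second); the names are illustrative and the released code may differ in detail.

```python
import numpy as np
from scipy.sparse.linalg import svds

def rlol_project(X, y, d):
    """Sketch of RLOL: robust first moment (class medians) + truncated second moment."""
    classes = np.unique(y)
    med = np.vstack([np.median(X[y == c], axis=0) for c in classes])  # robust class centers
    deltas = (med[1:] - med[0]).T                                      # (p, C-1) robust differences

    Xc = X - med[np.searchsorted(classes, y)]    # center each sample by its class median
    k = d - deltas.shape[1]
    _, _, Vt = svds(Xc, k=k)                     # truncated SVD: no p x p covariance is ever formed,
    Vt = Vt[::-1]                                # so p > n poses no problem; reorder to descending values
    return np.hstack([deltas, Vt.T])             # (p, d) projection matrix
```

Swapping np.median back to the mean recovers the plain LOL sketch given earlier; in this sense the two are instances of the same recipe.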
Figure 3(C) shows a benchmark simulation in which the simulations share the same means (but have orthogonal covariances).

2. In Figure 3, it is interesting to compare all the methods under different embedding dimensions $d$, but I am more interested in a comparison between all the methods with $d$ tuned by cross-validation as in the real data analysis. I believe the misclassification error with tuned $d$ is a more convincing criterion to look at.

For each simulation, we have added a point corresponding to the optimal number of embedding dimensions, with the number of embedding dimensions tuned as in the real data analysis. The misclassification error is seen to be approximately the lowest using LOL or one of its variants in all three settings.

3. Finally, in Figure 3, I see two variants of LOL called QOQ and RLOL that handle heterogeneous covariance matrices across classes and outliers, respectively. They look interesting, and I suggest the authors elucidate these methods more. Particularly, QOQ also incorporates class-conditional covariance to construct projection matrices. This makes me wonder whether the authors can unify LOL, QOQ and RLOL in a general framework that uses class-conditional statistics to perform dimension reduction.
Thank you for this extremely insightful feedback. As mentioned in point 1, we have unified the use of class-conditional moments for embedding into a single framework, XOX, which allows one to fairly seamlessly generalize LOL to situations in which other class-conditional moments may be of use. We believe this readily harmonizes LOL, QOQ, RLOL, and the other approaches discussed within the manuscript under a single framework.
Typos:

Reviewer #2 (Remarks to the Author):
Summary of paper: This paper introduces a novel, supervised, linear dimensionality reduction method coined LOL (Linear optimal low-rank projection). LOL extends the popular, unsupervised PCA method by (a) utilizing the class-conditional covariance and (b) appending class-conditional mean differences as features in the low-rank projection. The paper empirically demonstrates the effectiveness of LOL for classification using LDA compared to PCA and random projections. The paper further develops a finite sample difference between LOL+LDA and PCA+LDA, as well as the difference in Chernoff information captured between each method for 2-class classification.
Summary of review: I believe that this paper should be published in this venue; however, I would prefer to see some edits made to the paper before publication. I find the paper to be novel and insightful. I believe the LOL dimensionality reduction method is useful for improving 2-class classification performance in a low-sample, high-dimensional setting, while maintaining sound theoretical improvement guarantees over PCA. I am unconvinced of the applicability of LOL beyond the 2-class classification setting, and I would have preferred to see a deeper investigation of other dimensionality reduction methods.
Theorem 1 suggests that the difference in Chernoff information between LOL and PCA is inversely proportional to the size of the d'th eigenvalue, implying (as stated in the main body of the paper) that the degree of benefit afforded by LOL is based on how much information is gained due to replacing the eigenvector of smallest explained variance with the mean difference vector. In general for C-class classification, this bound will become inversely proportional to the [d, d -1, ..., d -(C -2)] eigenvalues. Can the theory be extended to include the general C-class setting? How restrictive do the conditions for LOL improving over PCA get as C grows larger? In the 2-class setting, Theorem 1 appears to state that LOL always improves over PCA because the difference in Chernoff information is always greater than zero. Does this hold for C > 2?
We appreciate this comment. In our attempt to extend a comparison of LOL with PCA and rrLDA under the C > 2 setting, we find that whereas PCA and rrLDA only depend on the covariance matrix, LOL also depends on the subspace spanned by the difference of the means. Therefore, a theoretical comparison of embeddings becomes far more complicated, which is a problem we think may warrant further exploration in a theoretical paper of its own. We have added a remark which frames the C > 2 class problem to serve as a stepping stone for future work.
The empirical section only investigates 2-class classification except for Figure 3a, which demonstrates that PCA outperforms LOL for a high number of embedded dimensions (which I'll call d). As C >> 2, then how does this impact the value of d at which PCA improves over LOL? My presumption is that this value of d will decrease as C increases because LOL replaces the bottom eigenvectors with C-1 mean difference vectors. I would like to see this further investigated to know whether LOL is robust to classification settings with larger values of C (either empirically or by extending the theory).
Thank you for your point; we have added Figure 6 to the supplement to explore performance as the number of embedding dimensions is increased, for increasing numbers of classes. We believe this figure indicates that LOL is robust to larger values of C under the given empirical setup.
This paper also discounts many dimensionality reduction methods and investigates only PCA and random projection methods. While I agree with the paper that many dimensionality reduction methods do not meet the goals stated by this paper, I do believe that some further exploration would be useful to more appropriately place this paper within the literature. For instance, I am unconvinced that supervised autoencoders should be entirely discounted. While I agree that deep non-linear SAEs are uninterpretable black-box methods with too many tunable parameters, I wonder how a fully linear SAE with only a single hidden layer would perform compared to PCA/LOL. In fact, such an AE (without the supervised component) has been shown to be the same function class as PCA, so how does the corresponding SAE compare to LOL? I would argue that the interpretability of this SAE is no worse (or better) than the interpretability of LOL.
Similar to the discussion on supervised autoencoders, I would have liked to see further discussion on supervised dictionary learning (the seminal SDL paper being cited by this paper, but not discussed). Extending SDL to out-of-sample prediction has been widely investigated, and the goal of SDL is to directly parameterize the "projection" matrix that this paper seeks to hand-design. While the theory in this paper suggests that the hand-designed projection matrix is better than PCA, I wonder if it is likewise better than a direct parameterization/optimization using SDL (note this relationship can likewise be shown with SAEs, though it is obscured).
Paper describing SAEs for out-of-sample supervised classification (for transparency, note I am an author on this paper): Le, Lei, Andrew Patterson, and Martha White. "Supervised autoencoders: Improving generalization performance with unsupervised regularizers." Advances in Neural Information Processing Systems. 2018.
Thank you for your detailed clarification regarding SAEs and SDLs. We agree that a unified treatment of direct estimation (e.g., PCA, LOL) versus optimization-based techniques (e.g., SDLs, SAEs) presents an excellent future direction for the work, and have revised our discussion paragraph to highlight the promise these strategies might provide.
Minor comments/questions: LOL can be seen as making two modifications to PCA. It uses the class-conditional covariance matrix to choose the d highest-variance eigenvectors, and it also replaces C-1 of those d eigenvectors with the differences between class-conditional means. How does either of these modifications perform in isolation? Based on my understanding of Theorem 1, using the class-conditional means is the most important novelty, but does that hold in practice as well?
Thank you for this comment. We have updated our text regarding the simulations in Figure 3 to clarify that the first C-1 dimensions of an LOL projection are equivalent to using the class-conditional means in isolation. Further, the embedding technique we term "LDA" is really reduced-rank LDA as noted in "The Geometric Intuition of LOL". We have augmented our figures to be more specific about when we mean reduced-rank LDA and not the LDA classifier by renaming the LDA embedding "rrLDA". Indeed, rrLDA is equivalent to projection using only the class-conditional covariance matrix. Together, these two facts indicate the performance of these augmentations to PCA in isolation.
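As a concrete illustration of the two modifications in isolation, following the description in this response (hypothetical code in the same style as the earlier sketch):

```python
import numpy as np

def embed_means_only(X, y):
    """Only the class-conditional mean differences: the first C-1 LOL dimensions."""
    classes = np.unique(y)
    mu = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return X @ (mu[1:] - mu[0]).T            # n x (C-1) embedding

def embed_rrlda(X, y, d):
    """Only the class-conditional second moment: top-d directions of the class-centered data."""
    classes = np.unique(y)
    mu = np.vstack([X[y == c].mean(axis=0) for c in classes])
    Xc = X - mu[np.searchsorted(classes, y)]
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return X @ Vt[:d].T                      # n x d embedding
```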
In Figure 3, why does only LOL have a "robust" version? The performance gain of LOL in Figure 3b is negligible and much worse than many competitors, but "robust" LOL does quite well. Would using a robust PCA have the same effect? Or can only LOL be made robust in this way? The robust LOL method is never described anywhere that I could find. My assumption is that you replaced the class-conditional mean with a class-conditional median. Likewise in Figure 3c, why not compare QOQ with QDA as well as QDA+PCA? Would this significantly outperform QOQ? Given that LDA and PCA both outperform LOL, I would expect that the quadratic versions would likewise outperform the quadratic version of LOL.
If this is true, then that means LOL only outperformed its baseline in one of three simulated settings and only for small values of d in that one setting.
We appreciate this feedback. The purpose of this figure was not to show simulations in which LOL itself always works; rather, it was to show that a generalization within the XOX framework readily applies depending on how the data present, based on relationships that exist within the data (such as whether there are outliers, column 2, or whether the covariance differs per class, column 3). Class-conditional moments allow one to capture these increasingly complex relationships rather elegantly within a single unified projection technique. We have clarified this fact in the text.
In Figure 4, why can't PCA use the randomized approximate algorithm for eigendecomposition? Would PCA not also be able to benefit greatly in terms of computational performance, especially considering that LOL is simply PCA with a class-conditional mean centering preprocessing step (e.g. LOL and PCA rely on the exact same eigendecomposition code, LOL just has a subtraction step first).
Thank you for this point. Our purpose here was not to show that LOL with a randomized eigendecomposition (LFL) was superior to PCA in terms of performance and efficiency; it was merely to highlight that LFL provides an efficiency improvement over LOL, while maintaining the enhancement in performance over PCA. We have better highlighted the aims of this figure in the caption and discussion of Figure 4 in the main text.
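As an aside, since the decomposition step is the shared bottleneck, a randomized solver can be sketched as a drop-in replacement; the snippet below uses scikit-learn's randomized_svd purely for illustration, and the actual LFL implementation may differ.

```python
from sklearn.utils.extmath import randomized_svd

def fast_top_directions(Xc, k, seed=0):
    """Approximate top-k right singular vectors of the (class- or globally-) centered data."""
    _, _, Vt = randomized_svd(Xc, n_components=k, random_state=seed)
    return Vt  # (k, p), ordered by decreasing singular value
```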
What is Figure 5 showing for the LDA columns? Since RR-LDA is equivalent to LDA composed with PCA, it can't be showing the reduced-rank form. I'm not sure that LDA can be composed with LDA; that doesn't make sense to me (thus the label is wrong, I believe), so what is the LDA column in Figure 5a and row in Figure 5b? Is it LDA applied to the raw dataset without dimensionality reduction?
We appreciate this point, and it appears our slight overloading of the term "LDA" has caused confusion. We have remedied this by replacing the term "LDA", when referring to the embedding technique, with "rrLDA" throughout. This distinguishes the embedding technique, which uses the top d eigenvectors of the class-conditional covariance matrix (rrLDA), from the full classification technique (LDA, which can be thought of as projecting the data using the class-conditional covariance matrix and then classifying).
In Figure 6, you show LOL used for a regression problem. This is quite interesting, and in general I would love to see more exploration of LOL used with other classification algorithms (other than LDA). However, how can LOL be used for regression? How do you compute the class-conditional covariance and means for a regression problem? Without these, I would expect LOL to revert back to PCA.
Thank you for this point. We have clarified Section F to explicitly separate regression and hypothesis testing. Further, as stated in the description in F.II, regression is performed by partitioning the data into K partitions based on the percentile of the target variable (a sketch of this binning step appears after the following response); we have clarified that we selected K=10 arbitrarily.

I believe appendix B has a few typos:
-B.II: g^d_{LDA} is defined twice. I believe the first definition should be labelled g_{LDA}.
-B.III: \delta is defined as a matrix of size (C-2) x p instead of (C-1) x p. Should the last index in this array be \delta_C instead of \delta_{C-1}?
-B.III: A^D_{LOL} is defined twice and in two different ways. The first definition assumes two classes and the second definition is more general for any number of classes C.
Thank you for pointing out these typos. We have corrected them accordingly.
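The percentile-based partitioning mentioned in the regression response above can be sketched as follows; the helper name and the use of numpy are illustrative assumptions, with K=10 producing decile-based pseudo-classes that LOL then treats as labels.

```python
import numpy as np

def bin_target_for_lol(t, K=10):
    """Turn a continuous target t into K pseudo-classes by percentile."""
    edges = np.quantile(t, np.linspace(0, 1, K + 1)[1:-1])  # K-1 interior percentile boundaries
    return np.digitize(t, edges)                             # labels in {0, ..., K-1}
```

The class-conditional means and covariances are then computed over these bins, which is why the procedure does not simply revert to PCA.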

Andrew Patterson
Reviewer #3 (Remarks to the Author): Summary: The paper discusses the use of linear projection (PCA) per class followed by LDA to preserve the discriminatory power of features. The LOL method ensures that the projected data is useful for the subsequent linear classification problem.

Comments:
Unfortunately, the idea of appending the mean difference vector to features and class-centering to convert unsupervised manifold learning to a supervised manifold learning setting is not a substantial enough contribution to ML methodology. The claims of scalability are not interesting either.
While whether an advance is interesting is subjective, we find the scalability to be interesting; in particular, we have been unable to find other tools that run on datasets with billions of features. If there are papers demonstrating similar scalability, we would be very interested to try them out.
The draft has not been proofread properly and has elementary mistakes even in the background/preliminary sections such as A.I and A.II.
We appreciate you pointing out the typos in the appendix; we have gone through the entire appendix section and attempted to eliminate the ones noted (as well as additional typos that were found).
>> Existing linear and nonlinear dimensionality reduction methods either are not supervised, scale poorly to operate in big data regimes, lack theoretical guarantees, or are "black-box" methods unsuitable for many applications. This is an exaggerated mis-characterization of work in this area. There are multiple papers that address scalability, address theory of such dimensionality reduction, as well as are quite interpretable and transparent.
We agree our initial claim was too strong, and have amended our wording to reflect that this tends to be characteristic of work in the area, and is not necessarily the case for all existing linear and nonlinear dimensionality reduction techniques. Further, we have explicitly specified a focus on deep learning methods in the wording of this paragraph, which is what we were trying to focus on initially.
>> LOL achieves state-of-the-art classification accuracy This seems to be an exaggerated claim, partly because restricting oneself to linear classification settings already limits the model representation power. What does state of the art even mean here?
We agree this claim was exaggerated as well, and have modified the wording in the abstract to be more specific regarding LOL outperforming readily scalable, linear dimensionality reduction techniques, which are the primary focus of most of our comparisons.
Additional Comments: -The first expression for g_LDA(x) in pg 19 looks wrong. The first term should also be a function of y.
-Why did the narrative change to two classes instead of the general setting in A.II pg 19, even before calling out the two-class setting (i.e., why \pi_0 = \pi_1 before talking about the two-class setting)?
Thank you for the above two comments; we have rewritten this section to be more clear about what is happening, and fixed a typo regarding g_{LDA}(x).
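For reference, the standard discriminant under the shared-covariance Gaussian model (written here in generic notation, which may differ slightly from the appendix) makes the reviewer's point explicit, since the leading term depends on y through \mu_y:

```latex
g_{\mathrm{LDA}}(x; y) \;=\; x^{\top}\Sigma^{-1}\mu_{y} \;-\; \tfrac{1}{2}\,\mu_{y}^{\top}\Sigma^{-1}\mu_{y} \;+\; \log \pi_{y},
\qquad
\hat{y}(x) \;=\; \operatorname*{argmax}_{y}\, g_{\mathrm{LDA}}(x; y).
```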
-Lemma 1 is trivial/not new.
Thank you for this comment; we agree. We were merely providing it for reference for interested readers who might be unfamiliar with it. We have since deleted the one-line proof and replaced it with a reference.
-What is the difference between L_A^d and L_A? Please be consistent in notation.
Thank you for this comment; we had previously omitted the superscript d inside the integral, and have made the notation consistent.
-In Eq (3), are \Sigma and \delta (which determine \Sigma_A and \delta_A) already known? This should have been clarified, as this is very important and makes the optimization problem difficult/easy.
->> because we do not know how to evaluate the integral analytically
Why is this a deal breaker? There are many problems where the population optimal estimation or model is not computable. Also, can you comment on what type of problem this is? Non-convex? For what additional assumptions could it become convex?
We appreciate this feedback, and have adjusted the text regarding this statement to clarify that we are specifically referring to the case where (\Sigma, \delta) are not known a priori. The problem becomes difficult due to its non-convexity. Further, we cannot evaluate the integral directly. One could solve a convex approximation through various sets of assumptions (such as assuming that \Sigma or \delta are known); we do not.
-What is \mathcal{A}^d and \mathcal{A}? If they are the same, then how can \delta^T\Sigma^{-1} belong to this set. The dimensions don't match, so Lemma 2 is incorrect.
Thank you for this comment; this was a typo; \mathcal{A} is the set of possible embeddings into \leq p dimensions, whereas \mathcal{A}^d is the set of possible embeddings into d dimensions.
We have clarified the definition accordingly, by plugging in B from the definition given in the preceding paragraph: