Transfer Learning with Kernel Methods

Transfer learning refers to the process of adapting a model trained on a source task to a target task. While kernel methods are conceptually and computationally simple models that are competitive on a variety of tasks, it has been unclear how to develop scalable kernel-based transfer learning methods across general source and target tasks with possibly differing label dimensions. In this work, we propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. For both applications, we identify simple scaling laws that characterize the performance of transfer-learned kernels as a function of the number of target examples. We explain this phenomenon in a simplified linear setting, where we are able to derive the exact scaling laws.


Introduction
Transfer learning refers to the machine learning problem of utilizing knowledge from a source task to improve performance on a target task. Recent approaches to transfer learning have achieved tremendous empirical success in many applications, including computer vision [17,45], natural language processing [16,40,43], and the biomedical field [15,19]. Since transfer learning approaches generally rely on complex deep neural networks, it can be difficult to characterize when and why they work [44]. Kernel methods [46] are conceptually and computationally simple machine learning models that have been found to be competitive with neural networks on a variety of tasks, including image classification [3,29,42] and drug screening [42]. Their simplicity stems from the fact that training a kernel method involves performing linear regression after transforming the data. There has been renewed interest in kernels due to a recently established equivalence between wide neural networks and kernel methods [2,25], which has led to the development of modern neural tangent kernels (NTKs) that are competitive with neural networks. Given their simplicity and effectiveness, kernel methods could provide a powerful approach for transfer learning and also help characterize when transfer learning between a source and a target task is beneficial. However, developing an algorithm for transfer learning with kernel methods for general source and target tasks has been an open problem. In particular, while there is a standard transfer learning approach for neural networks that involves replacing and re-training the last layer of a pre-trained network, there is no known corresponding operation for kernels. The limited prior work on transfer learning with kernels focuses on applications in which the source and target tasks have the same label sets [14,30,37]. Examples include predicting stock returns for a given sector based on returns available for other sectors [30] or predicting electricity consumption for certain zones of the United States based on the consumption in other zones [37]. These methods are not applicable to general source and target tasks with differing label dimensions, which include classical transfer learning applications such as using a model trained to classify between thousands of objects to subsequently classify new objects.

Figure 1: Our framework for transfer learning with kernel methods for supervised learning tasks. After training a kernel method on a source task, we transfer the source model to the target task via a combination of projection and translation operations. (a) Projection involves training a second kernel method on the predictions of the source model on the target data, as is shown for image classification between natural images and house numbers. (b) Projection is effective when the predictions of the source model on target examples provide useful information about target labels; e.g., a model trained to classify natural images may be able to distinguish images of zeros from ones by using the similarity of zeros to balls and ones to poles. (c) Translation involves adding a correction term to the source model, as is shown for predicting the effect of a drug on a cell line. (d) Translation is effective when the predictions of the source model can be additively corrected to match labels in the target data; e.g., the predictions of a model trained to predict the effect of drugs on one cell line may be additively adjustable to predict the effect on new cell lines.
In this work, we present a general framework for performing transfer learning with kernel methods. Unlike prior work, our framework enables transfer learning for kernels regardless of whether the source and target tasks have the same or differing label sets. Furthermore, as with transfer learning methodology for neural networks, our framework allows transferring to a variety of target tasks after training a kernel method only once on a source task. To provide some intuition for our proposed framework: instead of replacing and re-training the last layer of a neural network, as is standard for transfer learning with neural networks, our approach for transfer learning with kernels corresponds to adding a new layer to the end of a neural network.
Our transfer learning framework proceeds as follows: train a kernel method on a source dataset and then apply one or both of the following operations to transfer the model to the target task.
• Projection. We apply the trained source kernel to each sample in the target dataset and then train a secondary model on these source predictions to solve the target task; see Fig. 1a.
• Translation. When the source and target tasks have the same label sets, we train a correction term that is added to the source model to adapt it to the target task; see Fig. 1c.
Projection is effective when the source model predictions contain information regarding the target labels. We will demonstrate that this is the case in image classification tasks, in which the predictions of a classifier trained to distinguish between a thousand objects in ImageNet32 [11] provide information regarding the labels of images in other datasets such as street view house numbers (SVHN) [33]; see Fig. 1b. In particular, we will show across 23 different source and target task combinations that kernels transferred using our approach achieve up to a 10% increase in accuracy over kernels trained on the target tasks directly.
On the other hand, translation is effective when the predictions of the source model can be corrected to match the labels of the target task via an additive term. We will show that this is the case in virtual drug screening, in which a model trained to predict the effect of a drug on one cell line can be adjusted to capture the effect on a new cell line; see Fig. 1d. In particular, we will show that our transfer learning approach improves on prior kernel method predictors [42] even when transferring to cell lines and drugs not present in the source task.
Interestingly, we observe that for both applications, image classification and virtual drug screening, transfer-learned kernel methods follow simple scaling laws; i.e., how the number of available target samples affects the performance on the target task can be accurately modelled. As a consequence, our work provides a simple method for estimating the impact of collecting more target samples on the performance of the transfer-learned kernel predictors. In the simplified setting of transfer learning with linear kernel methods, we are able to mathematically derive the scaling laws, thereby providing a mathematical basis for the empirical observations. Overall, our work shows that transfer learning with kernel methods between general source and target tasks is possible and demonstrates the simplicity and effectiveness of the proposed method in a variety of important applications.

Results
In the following, we present our framework for transfer learning with kernel methods more formally. Since kernel methods are fundamental to this work, we start with a brief review.
Given training examples X = [x^(1), ..., x^(n)] ∈ R^{d×n} and corresponding labels y = [y^(1), ..., y^(n)] ∈ R^{1×n}, a standard nonlinear approach to fitting the training data is to train a kernel machine [46]. This approach involves first transforming the data, {x^(i)}_{i=1}^n, with a feature map, ψ, and then performing linear regression. To avoid defining and working with feature maps explicitly, kernel machines rely on a kernel function, K : R^d × R^d → R, which corresponds to taking inner products of the transformed data, i.e., K(x^(i), x^(j)) = ⟨ψ(x^(i)), ψ(x^(j))⟩. The trained kernel machine predictor uses the kernel instead of the feature map and is given by f(x) = αK(X, x), where α = arg min_w ‖y − wK(X, X)‖², K(X, X) denotes the n × n matrix of pairwise kernel evaluations, and K(X, x) denotes the vector [K(x^(1), x), ..., K(x^(n), x)]^T. Note that for datasets with over 10^5 samples, computing the exact minimizer α is computationally prohibitive, and we instead use fast, approximate iterative solvers such as EigenPro [31]. For a more detailed description of kernel methods see Appendix A.
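As a minimal sketch of this training procedure (the toy data, kernel bandwidth, and small ridge term below are our own illustrative choices, not the paper's setup), the following fits a kernel machine with a Laplace kernel by solving the linear system directly:

```python
import numpy as np

def laplace_kernel(A, B, L=1.0):
    """Pairwise Laplace kernel K(a, b) = exp(-L * ||a - b||_2)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-L * dists)

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))      # rows are training examples
y = np.sin(3 * X[:, 0])          # an arbitrary nonlinear target

# Train: solve alpha @ K(X, X) = y; a tiny ridge term aids conditioning.
K = laplace_kernel(X, X)
alpha = np.linalg.solve(K + 1e-8 * np.eye(n), y)

# Predict: f(x) = alpha @ K(X, x), a weighted sum of kernel evaluations.
X_new = rng.normal(size=(3, d))
preds = laplace_kernel(X_new, X) @ alpha
```

For datasets of the scale used in this work, the exact solve above would be replaced by an approximate iterative solver such as EigenPro.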
For the experiments in this work, we utilize a variety of kernel functions. In particular, we consider the classical Laplace kernel, given by K(x, z) = exp(−L‖x − z‖_2), which is a standard benchmark kernel that has been widely used for image classification and speech recognition [31]. In addition, we consider recently discovered kernels that correspond to infinitely wide neural networks: while there is an emerging understanding that increasingly wide neural networks generalize better [5,32], such models are generally computationally difficult to train. Remarkably, recent work identified conditions under which neural networks in the limit of infinite width implement kernel machines; the corresponding kernel is known as the Neural Tangent Kernel (NTK) [25]. In the following, we use the NTK corresponding to training an infinitely wide ReLU fully connected network [25] and also the convolutional NTK (CNTK) corresponding to training an infinitely wide ReLU convolutional network [2].

Unlike the usual supervised learning setting, where we train a predictor on a single domain, we consider the following transfer learning setting from [50], which involves two domains: (1) a source with domain X_s and data distribution P_s; and (2) a target with domain X_t and data distribution P_t. The goal is to learn a model for a target task f_t : X_s → Y_t by making use of a model trained on a source task f_s : X_s → Y_s. We let c_s and c_t denote the dimensionality of Y_s and Y_t, respectively; for image classification, these denote the number of classes in the source and target tasks. Lastly, we let (X_s, y_s) ∈ X_s^{n_s} × Y_s^{n_s} and (X_t, y_t) ∈ X_t^{n_t} × Y_t^{n_t} denote the source and target datasets, respectively. Throughout this work, we assume that the source and target domains are equal (X_s = X_t) but that the data distributions differ (P_s ≠ P_t).
Our work is concerned with the recovery of f_t by transferring a model, f̂_s, that is learned by training a kernel machine on the source dataset. To enable transfer learning with kernels, we propose the use of two methods: projection and translation. We first describe these methods individually and demonstrate their performance on transfer learning for image classification using kernel methods. For each method, we empirically establish scaling laws relating the quantities n_s, n_t, c_s, c_t to the performance boost given by transfer learning, and we also derive explicit scaling laws when f_t, f_s are linear maps. We then utilize a combination of the two methods to perform transfer learning in an application to virtual drug screening. Code and hardware details are available in Appendix L.

Transfer learning via projection
Projection involves learning a map from source model predictions to target labels and is thus particularly suited for situations where the number of labels in the source task c s is much larger than the number of labels in the target task c t .
Definition 1. Given a source dataset (X_s, y_s) and a target dataset (X_t, y_t), the projected predictor, f̂_t, is given by f̂_t(x) = f̂_p(f̂_s(x)), where

f̂_p := arg min_{f : Y_s → Y_t} ‖y_t − f(f̂_s(X_t))‖²

and f̂_s is a predictor trained on the source dataset.

While Definition 1 is applicable to any machine learning method, we focus on predictors f̂_s and f̂_p parameterized by kernel machines given their conceptual and computational simplicity. As illustrated in Fig. 1a and b, projection is effective when the predictions of the source model already provide useful information for the target task.
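A minimal sketch of a projected predictor with both the source model and the projection parameterized by kernel machines (the Laplace kernel, toy random data, dimensions, and ridge term below are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

def laplace_kernel(A, B, L=1.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-L * d)

def fit(K, y, ridge=1e-8):
    """Kernel regression coefficients alpha for kernel matrix K."""
    return np.linalg.solve(K + ridge * np.eye(len(K)), y)

rng = np.random.default_rng(0)
d, c_s, c_t = 10, 20, 2             # feature dim, source classes, target classes
Xs, Xt = rng.normal(size=(200, d)), rng.normal(size=(40, d))
ys = rng.normal(size=(200, c_s))    # one-hot labels in practice; random here
yt = rng.normal(size=(40, c_t))

# 1) Source model f_s: a kernel machine trained on the source task.
alpha_s = fit(laplace_kernel(Xs, Xs), ys)
f_s = lambda X: laplace_kernel(X, Xs) @ alpha_s               # -> (m, c_s)

# 2) Projection f_p: a second kernel machine trained on source predictions.
Zt = f_s(Xt)                        # source predictions on the target data
alpha_p = fit(laplace_kernel(Zt, Zt), yt)
f_projected = lambda X: laplace_kernel(f_s(X), Zt) @ alpha_p  # -> (m, c_t)
```

Note that the second kernel machine operates on c_s-dimensional prediction vectors rather than on the raw d-dimensional inputs.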
Improving kernel-based image classifier performance with projection. We now demonstrate the effectiveness of projected kernel predictors for image classification. In particular, we first train kernels to classify among 1000 objects across 1.28 million images in ImageNet32 and then transfer these models to 4 different target image classification datasets: CIFAR10 [28], Oxford 102 Flowers [35], the Describable Textures Dataset (DTD) [12], and SVHN [33]. We selected these datasets since they cover a variety of transfer learning settings; i.e., all of the CIFAR10 classes are in ImageNet32, ImageNet32 contains only 2 flower classes, and none of the DTD and SVHN classes are in ImageNet32. A full description of the datasets is provided in Appendix B.
For all datasets, we compare the performance of 3 kernels (the Laplace kernel, NTK, and CNTK) when trained just on the target task, i.e., the baseline predictor, and when transferred via projection from ImageNet32. Training details for all kernels are provided in Appendix C. In Fig. 2a, we showcase the improvement of projected kernel predictors over baseline predictors across all datasets and kernels. We observe that projection yields a sizeable increase in accuracy (up to 10%) on the target tasks, thereby highlighting the effectiveness of this method. It is remarkable that this performance increase is observed even when transferring to Oxford 102 Flowers or DTD, datasets that have little to no overlap with images in ImageNet32.
In Appendix Fig. 5a, we compare our results with those of a finite-width neural network analog of the (infinite-width) CNTK, where all layers of the source network are fine-tuned on the target task using the standard cross-entropy loss [20] and the Adam optimizer [27]. We observe that the performance gap between transfer-learned finite-width neural networks and the projected CNTK is largely influenced by the performance gap between these models on ImageNet32. In fact, in Appendix Fig. 5a, we show that finite-width neural networks trained to the same test accuracy on ImageNet32 as the (infinite-width) CNTK yield lower performance than the CNTK when transferred to target image classification tasks.
The computational simplicity of kernel methods allows us to compute scaling laws for the projected predictors. In Fig. 2b, we analyze how the performance of projected kernel methods varies as a function of the number of target examples, n_t, for CIFAR10 and Oxford 102 Flowers. The results for DTD and SVHN are presented in Appendix Fig. 6a and b. For all target datasets, we observe that the accuracy of the projected predictors follows a simple logarithmic trend given by the curve a log n_t + b for constants a, b (R² values on all datasets are above 0.95). By fitting this curve on the accuracy corresponding to just the smallest five values of n_t, we are able to predict the accuracy of the projected predictors to within 2% of the reported accuracy for large values of n_t (see Appendix D and Appendix Fig. 8). The robustness of this fit across many target tasks illustrates the practicality of the transferred kernel methods for estimating the number of target examples needed to achieve a given accuracy. Additional results on the scaling laws upon varying the number of source examples per class are presented in Appendix Fig. 7 for transferring between ImageNet32 and CIFAR10.
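The curve-fitting procedure can be sketched as follows; the accuracy values below are hypothetical placeholders, not numbers from the paper:

```python
import numpy as np

# Hypothetical measured accuracies of a projected predictor at increasing
# target-set sizes n_t (illustrative values only).
n_t = np.array([125, 250, 500, 1000, 2000, 4000, 8000])
acc = np.array([0.42, 0.47, 0.52, 0.56, 0.61, 0.65, 0.70])

# Fit acc ~ a * log(n_t) + b using only the five smallest values of n_t.
a, b = np.polyfit(np.log(n_t[:5]), acc[:5], deg=1)

# Extrapolate the fitted curve to the larger target-set sizes.
predicted = a * np.log(n_t[5:]) + b
errors = np.abs(predicted - acc[5:])
```

For data that genuinely follows a logarithmic trend, the extrapolated accuracies land within a couple of percentage points of the held-out values, mirroring the procedure described above.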
In general, we observe that the performance increases as the number of source training examples per class increases, which is expected given the similarity of source and target tasks.
Lastly, we analyze the impact of increasing the number of classes while keeping the total number of source training examples fixed at 40k. Fig. 2c shows that having many classes with few samples per class can be worse than having few classes with many samples per class. This may be expected for datasets such as CIFAR10, where the classes overlap with the ImageNet32 classes: having few classes with more examples that overlap with CIFAR10 should be better than having many classes with fewer examples per class and less overlap with CIFAR10. A similar trend can be observed for DTD, but interestingly, the trend differs for SVHN, indicating that SVHN images can be better classified by projecting from a variety of ImageNet32 classes (see Appendix Fig. 6).

Transfer learning via translation
While projection involves composing a map with the source model, the second component of our framework, translation, involves adding a map to the source model as follows.
Definition 2. Given a source dataset (X_s, y_s) and a target dataset (X_t, y_t), the translated predictor, f̂_t, is given by f̂_t(x) = f̂_s(x) + f̂_c(x), where

f̂_c := arg min_{f : X_t → Y_t} ‖y_t − f̂_s(X_t) − f(X_t)‖²

and f̂_s is a predictor trained on the source dataset.

Translated predictors correspond to first utilizing the trained source model directly on the target task and then applying a correction, f̂_c, which is learned by training a model on the corrected labels, y_t − f̂_s(X_t). As with the projected predictors, translated predictors can be implemented using any machine learning model, including kernel methods. When the predictors f̂_s and f̂_c are parameterized by linear models, the translated predictor corresponds to training a target predictor with weights initialized at those of the trained source predictor (proof in Appendix J). We note that training translated predictors is also a new form of boosting [9] between the source and target datasets, since the correction term accounts for the error of the source model on the target task. Lastly, we note that while the formulation in Definition 2 requires the source and target tasks to have the same label dimension, projection and translation can be naturally combined to overcome this restriction.
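A minimal sketch of the translated predictor with kernel machines, in which the correction term is a second kernel machine fit to the residuals of the source model (toy data, dimensions, and hyperparameters are our own illustrative choices):

```python
import numpy as np

def laplace_kernel(A, B, L=1.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-L * d)

def fit(K, y, ridge=1e-8):
    return np.linalg.solve(K + ridge * np.eye(len(K)), y)

rng = np.random.default_rng(0)
d, c = 10, 3                        # input dim and shared label dimension c
Xs, Xt = rng.normal(size=(200, d)), rng.normal(size=(40, d))
ys = rng.normal(size=(200, c))
yt = rng.normal(size=(40, c))

# Source model f_s trained on the source dataset.
alpha_s = fit(laplace_kernel(Xs, Xs), ys)
f_s = lambda X: laplace_kernel(X, Xs) @ alpha_s

# Correction f_c trained on the corrected labels y_t - f_s(X_t).
residual = yt - f_s(Xt)
alpha_c = fit(laplace_kernel(Xt, Xt), residual)
f_c = lambda X: laplace_kernel(X, Xt) @ alpha_c

# Translated predictor: source model plus correction.
f_translated = lambda X: f_s(X) + f_c(X)
```

Because the correction interpolates the residuals, the translated predictor fits the target training data while retaining the source model's behavior away from it.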
Improving kernel-based image classifier performance with translation. We now demonstrate that translated predictors are particularly well-suited for correcting kernel methods to handle distribution shifts in images. Namely, we consider the task of transferring a source model trained on CIFAR10 to corrupted CIFAR10 images in CIFAR10-C [22]. CIFAR10-C consists of the test images in CIFAR10, but the images are corrupted by one of 19 different perturbations, such as adjusting image contrast or introducing natural artifacts such as snow or frost. In our experiments, we select the 10k images of CIFAR10-C with the highest level of perturbation, and we reserve 9k images of each perturbation for training and 1k images for testing. In Appendix Fig. 9, we additionally analyze translating kernels from subsets of ImageNet32 to CIFAR10.
Again, we compare the performance of the three kernel methods considered for projection, but along with the accuracy of the translated predictor and baseline predictor, we also report the accuracy of the source predictor, which is given by using the source model directly on the target task. In Fig. 3a and Appendix Fig. 10, we show that the translated predictors outperform the baseline and source predictors on all 19 perturbations. Interestingly, even for corruptions such as contrast and fog, where the source predictor is worse than the baseline predictor, the translated predictor outperforms all other kernel predictors by up to 11%. In Appendix Fig. 10, we show that for these corruptions, the translated kernel predictors also outperform the projected kernel predictors trained on CIFAR10. In Appendix Fig. 5b, we additionally compare with the performance of a finite-width analog of the CNTK by fine-tuning all layers on the target task with cross-entropy loss and the Adam optimizer. We observe that the translated kernel methods outperform the corresponding neural networks. Remarkably, kernels translated from CIFAR10 can even outperform fine-tuning a neural network pre-trained on ImageNet32 for several perturbations (see Appendix Fig. 5c).

Figure 3: Test accuracy of the baseline kernel method (red), of the source predictor given by directly applying the kernel trained on CIFAR10 to CIFAR10-C (gray), and of the transferred kernel method (green). The transferred kernel method outperforms the other models on all 19 corruptions and even improves on the baseline kernel method when the source predictor exhibits a decrease in performance. Additional results are presented in the Appendix.
Analogously to our analysis of the projected predictors, we visualize how the accuracy of the translated predictors is affected by the number of target examples, n_t, for the subset of corruptions shown in Fig. 3b. We observe that the performance of the translated predictors is heavily influenced by the performance of the source predictor. For example, as shown in Fig. 3b for the brightness perturbation, where the source predictor already achieves an accuracy of 60.80%, the translated predictors achieve an accuracy above 60% when trained on only 10 target training samples. For the contrast and fog corruptions, Fig. 3b also shows that very few target examples allow the translated predictors to outperform the source predictors (e.g., by up to 5% with only 200 target examples). Overall, our results showcase that translation is effective at adapting kernel methods to distribution shifts in image classification.

Transfer learning via projection and translation in virtual drug screening
We now demonstrate the effectiveness of projection and translation for the use of kernel methods in virtual drug screening. A common problem in drug screening is that experimentally measuring many different drug and cell line combinations is both costly and time-consuming. The goal of virtual drug screening approaches is to computationally identify promising candidates for experimental validation. Such approaches involve training models on existing experimental data to then impute the effect of drugs on cell lines for which there was no experimental data.
The CMAP dataset [47] is a large-scale, publicly available drug screen containing measurements of 978 landmark genes for 116,228 combinations of 20,336 drugs (molecular compounds) and 70 cell lines. This dataset has been an important resource for drug screening [7,41]. Prior work demonstrated the effectiveness of low-rank tensor completion and nearest neighbor predictors for imputing the effect of unseen drug and cell line combinations in CMAP [23]. However, these methods crucially rely on the assumption that for each drug there is at least one measurement for every cell line, which is not the case when considering new chemical compounds. To overcome this issue, recent work [42] introduced kernel methods for drug screening, using the NTK to predict gene expression vectors from drug and cell line embeddings, which capture the similarity between drugs and cell lines.
In the following, we demonstrate that the NTK predictor can be transferred to improve gene expression imputation for drug and cell line combinations, even in cases where neither the particular drug nor the particular cell line was available when training the source model. To utilize the framework of [42], we use the control gene expression vector as the cell line embedding and the 1024-bit circular fingerprints from [1] as the drug embedding. All pre-processing of the CMAP gene expression vectors is described in Appendix E. For the source task, we train the NTK to predict gene expression for the 54,444 drug and cell line combinations corresponding to the 65 cell lines with the least drug availability in CMAP. We then impute the gene expression for each of the 5 cell lines (A375, A549, MCF7, PC3, VCAP) with the most drug availability. We chose these data splits in order to have sufficient target samples to analyze model performance as a function of the number of target samples. In our analysis of the transferred NTK, we always consider transfer to a new cell line, and we stratify by whether a drug in the target task was already available in the source task.
For this application we combine projection and translation into one predictor as follows.
Definition 3. Given a source dataset (X_s, y_s) and a target dataset (X_t, y_t), the projected and translated predictor, f̂_pt, is given by f̂_pt(x) = f̂_p(f̂_s(x) | x), where

f̂_p := arg min_{f : Y_s × X_t → Y_t} ‖y_t − f(f̂_s(X_t) | X_t)‖²,

f̂_s is a predictor trained on the source dataset, and f̂_s(x) | x ∈ Y_s × X_t denotes the concatenation of f̂_s(x) and x.
Note that if we omit x and X_t in the concatenation above, we recover the projected predictor, and if we omit f̂_s in the concatenation above, we recover the translated predictor. Generally, f̂_s(x) and x can correspond to different modalities (e.g., class label vectors and images), but in the case of drug screening, both correspond to gene expression vectors of the same dimension. Thus, combining projection and translation is natural in this context.
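A sketch of the combined predictor of Definition 3, where the second kernel machine is trained on the concatenation [f̂_s(x) | x] (the dimensions and random data below are toy placeholders, not the CMAP setup):

```python
import numpy as np

def laplace_kernel(A, B, L=1.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-L * d)

def fit(K, y, ridge=1e-8):
    return np.linalg.solve(K + ridge * np.eye(len(K)), y)

rng = np.random.default_rng(0)
d, c = 8, 4                         # input dim, label dim (e.g. gene expression)
Xs, Xt = rng.normal(size=(300, d)), rng.normal(size=(50, d))
ys = rng.normal(size=(300, c))
yt = rng.normal(size=(50, c))

# Source model trained on the source dataset.
alpha_s = fit(laplace_kernel(Xs, Xs), ys)
f_s = lambda X: laplace_kernel(X, Xs) @ alpha_s

# Concatenate source predictions with the raw features: [f_s(x) | x].
concat = lambda X: np.hstack([f_s(X), X])

# Second kernel machine trained on the concatenated representation.
Zt = concat(Xt)
alpha_pt = fit(laplace_kernel(Zt, Zt), yt)
f_pt = lambda X: laplace_kernel(concat(X), Zt) @ alpha_pt
```

Dropping the `X` columns from `concat` recovers the projected predictor of Definition 1, while a predictor built on `X` alone with residual labels recovers the translated predictor of Definition 2.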
Fig. 4a and b show that the transferred kernel predictors outperform both the baseline model from [42] and imputation by the mean (over each cell line) gene expression across three different metrics (R², cosine similarity, and Pearson r value) on both tasks (i.e., transferring to drugs that were seen in the source task as well as to completely new drugs). All metrics considered are described in Appendix F, and all training details are presented in Appendix C. Interestingly, the transferred kernel methods provide a boost over the baseline kernel methods even when transferring to new cell lines and new drugs. As expected, however, the increase in performance is greater when transferring to drug and cell line combinations for which the drug was available in the source task. Fig. 4c and d show that the transferred kernels again follow simple logarithmic scaling laws (fitting a logarithmic model to the red and green curves yields R² > 0.9). We note that the transferred NTKs have better scaling coefficients than the baseline models, thereby implying that the performance gap between the transferred NTK and the baseline NTK grows as more target examples are collected. In Fig. 4e and f, we visualize the performance of the transferred NTK in relation to the top 2 principal components of gene expression for drug and cell line combinations. We generally observe that the performance of the NTK is lower for cell and drug combinations that are further from the control, i.e., the unperturbed state. Plots for the other 3 cell lines are presented in Appendix Fig. 11. In Appendix G and Appendix Fig. 12, we show that this approach can also be used for other transfer learning tasks related to virtual drug screening. In particular, we show that the imputed gene expression vectors can be transferred to predict the viability of a drug and cell line combination in the large-scale, publicly available Cancer Dependency Map (DepMap) dataset [13].

Theoretical analysis of projection and translation in the linear setting
In the following, we provide explicit scaling laws for the performance of projected and translated kernel methods in the linear setting, thereby providing a mathematical basis for the empirical observations in the previous sections.
Derivation of the scaling law for the projected predictor in the linear setting. We assume that the source and target labels are generated by linear maps, i.e., f_s(x) = ω_s x and f_t(x) = ω_t x for ω_s ∈ R^{c_s × d} and ω_t ∈ R^{c_t × d}. The following results provide a theoretical foundation for the empirical observations regarding the role of the number of source classes and the number of source samples for transfer learning shown in Fig. 2 as well as in [24]. In particular, we will derive scaling laws for the risk, or expected test error, of the projected predictor as a function of the number of source examples, n_s, the number of target examples, n_t, and the number of source classes, c_s. We note that the risk of a predictor is a standard object of study for understanding generalization in statistical learning theory [48] and is defined as follows.
Definition 4. Let P be a probability density on R^d and let x, x^(i) i.i.d. ∼ P for i = 1, 2, ..., n. Let X = [x^(1), ..., x^(n)] ∈ R^{d×n} and y = [w* x^(1), ..., w* x^(n)] ∈ R^{c×n} for w* ∈ R^{c×d}. The risk of a predictor ŵ trained on the samples (X, y) is given by R(ŵ) = E_x[‖ŵx − w*x‖²].

By understanding how the risk scales with the number of source examples, target examples, and source classes, we can characterize the settings in which transfer learning is beneficial. As is standard in analyses of the risk of over-parameterized linear regression [4,6,18,21], we consider the risk of the minimum norm solution, given by ŵ = arg min_{w : y = wX} ‖w‖_F = yX†, where X† is the Moore-Penrose inverse of X. Theorem 1 establishes a closed form for the risk of the projected predictor ω̂_p ω̂_s, thereby giving a closed form for the scaling law for transfer learning in the linear setting; the proof is given in Appendix H.
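The minimum norm solution and a Monte Carlo estimate of the risk of Definition 4 can be sketched as follows (the dimensions, the noiseless labels, and the test-set size are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, n = 100, 5, 40               # over-parameterized regime: n < d
w_star = rng.normal(size=(c, d))   # ground-truth linear map

X = rng.normal(size=(d, n))        # isotropic samples as columns
y = w_star @ X                     # noiseless labels

# Minimum norm interpolating solution via the Moore-Penrose inverse.
w_hat = y @ np.linalg.pinv(X)

# Monte Carlo estimate of the risk R(w_hat) = E_x ||w_hat x - w_star x||^2.
X_test = rng.normal(size=(d, 20000))
risk = np.mean(np.sum(((w_hat - w_star) @ X_test) ** 2, axis=0))
```

For isotropic test distributions the risk reduces to the squared Frobenius distance ‖ŵ − w*‖_F², which the Monte Carlo estimate approximates.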
Theorem 1. Let ω̂_s = y_s X_s† and ω̂_p = y_t (ω̂_s X_t)†. Assuming that P_s and P_t are independent, isotropic distributions on R^d, the risk R(ω̂_p ω̂_s) admits a closed form in terms of n_s, n_t, c_s, d, and a task-similarity term ε; the expression is given in Appendix H.
The ε term in Theorem 1 quantifies the similarity between the source and target tasks. For example, if there exists a linear map ω_p such that ω_p ω_s = ω_t, then ε = 0. In the context of classification, this can occur if the target classes are a strict subset of the source classes. Since transfer learning is typically performed between source and target tasks that are similar, we expect ε to be small. To gain more insight into the behavior of transfer learning using the projected predictor, the following corollary considers the setting where d → ∞ in Theorem 1; the proof is given in Appendix I.
Remarks. Corollary 1 not only formalizes several intuitions regarding transfer learning, but also theoretically corroborates surprising dependencies on the number of source examples, target examples, and source classes that were empirically observed in Fig. 2 for kernels and in [24] for convolutional networks. First, Corollary 1a implies that increasing the number of source examples is always beneficial for transfer learning when the source and target tasks are related (ε ≈ 0), which matches intuition. Next, Corollary 1b implies that increasing the number of source classes while leaving the number of source examples fixed can decrease performance (i.e., if 2S − 1 − ST > 0), even for similar source and target tasks satisfying ε ≈ 0. This matches the experiments in Fig. 2c, where we observed that increasing the number of source classes while keeping the number of source examples fixed can be detrimental to performance. This is intuitive for transferring from ImageNet32 to CIFAR10, since we would be adding classes that are not as useful for predicting objects in CIFAR10. Corollary 1c implies that when the source and target tasks are similar and the number of source classes is less than the data dimension, transfer learning with the projected predictor is always better than training only on the target task. Moreover, if the number of source classes is finite (C = 0), Corollary 1c implies that the risk of the projected predictor decreases an order of magnitude faster than that of the baseline predictor. In particular, the risk of the baseline predictor is given by (1 − T)‖ω_t‖², while that of the projected predictor is given by (1 − T)²‖ω_t‖². Note also that when the number of target samples is small relative to the dimension, Corollary 1c implies that decreasing the number of source classes has minimal effect on the risk. Lastly, Corollary 1d implies that when T and C are small, the risk of the projected predictor is roughly that of a baseline predictor trained on twice the number of samples.
Derivation of the scaling law for the translated predictor in the linear setting. Analogously to the case of projection, we analyze the risk of the translated predictor when ω̂_s is the minimum norm solution to ‖y_s − ωX_s‖²_F and ω̂_c is the minimum norm solution to ‖(y_t − ω̂_s X_t) − ωX_t‖²_F.

Theorem 2. Let ω̂_t = ω̂_s + ω̂_c, where ω̂_s = y_s X_s† and ω̂_c = (y_t − ω̂_s X_t)X_t†. Assuming that P_s and P_t are independent, isotropic distributions on R^d, the risk R(ω̂_t) is a weighted average of the risk of the baseline predictor ω̂_b = y_t X_t† and a term governed by the distance between ω_s and ω_t; the exact expression and the proof are given in Appendix K.

Theorem 2 formalizes several intuitions regarding when translation is beneficial. In particular, we first observe that if the source model ω_s is recovered exactly (i.e., n_s = d), then the risk of the translated predictor is governed by the distance between the oracle source model and the target model, i.e., ‖ω_s − ω_t‖. Hence, the translated predictor generalizes better than the baseline predictor if the source and target models are similar. In particular, by flattening the matrices ω_s and ω_t into vectors and assuming ‖ω_s‖ = ‖ω_t‖, the translated predictor outperforms the baseline predictor if the angle between the flattened ω_s and ω_t is less than π/4. On the other hand, when there are no source samples, the translated predictor is exactly the baseline predictor and the corresponding risks are equivalent. In general, we observe that the risk of the translated predictor is simply a weighted average of the baseline risk and the risk in the case where the source model is recovered exactly.
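The behavior described above can be checked in a small numerical simulation (the dimensions and the 0.1 perturbation scale are our own choices): when n_s = d the source map is recovered exactly, and the translated predictor then outperforms the baseline whenever ω_s and ω_t are close.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, n_s, n_t = 100, 3, 100, 30   # n_s = d, so the source model is recovered exactly

w_s = rng.normal(size=(c, d))
w_t = w_s + 0.1 * rng.normal(size=(c, d))   # target map close to the source map

Xs = rng.normal(size=(d, n_s)); ys = w_s @ Xs
Xt = rng.normal(size=(d, n_t)); yt = w_t @ Xt

# Translated predictor: source solution plus minimum norm correction.
w_hat_s = ys @ np.linalg.pinv(Xs)                   # equals w_s since n_s = d
w_hat_c = (yt - w_hat_s @ Xt) @ np.linalg.pinv(Xt)
w_translated = w_hat_s + w_hat_c

# Baseline predictor trained only on the target samples.
w_baseline = yt @ np.linalg.pinv(Xt)

# Risk under an isotropic target distribution: squared Frobenius distance.
risk = lambda w: np.linalg.norm(w - w_t, 'fro') ** 2
```

In this regime the translated predictor's error is proportional to ‖ω_s − ω_t‖²_F (small here), whereas the baseline's error is proportional to ‖ω_t‖²_F scaled by the unexplored fraction of directions, so translation wins by a wide margin.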
Comparing Theorem 2 to Theorem 1, we note that the projected and translated predictors generalize based on different quantities. In particular, when n_s = d, the risk of the translated predictor is a constant multiple of the baseline risk, while the risk of the projected predictor is a multiple of the baseline risk that decreases with n_t. Hence, depending on the distance between ω_s and ω_t, the translated predictor can outperform the projected predictor or vice versa. As a simple example, consider the setting where ω_s = ω_t, n_s = d, and n_t, c_s < d; then the translated predictor achieves zero risk while the projected predictor achieves non-zero risk. When Y_s = X_t, we suggest combining the projected and translated predictors, as we did in the case of virtual drug screening. Otherwise, our results suggest using the translated predictor for transfer learning problems involving distribution shift in the features but no difference in the label sets, and the projected predictor otherwise.
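The baseline, projected, and translated predictors in this linear setting can be sketched in a few lines of NumPy. The dimensions and the perturbation level below are illustrative, not taken from our experiments; we take n_s = d so that the source model is recovered exactly, the regime discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_s, n_t, c = 50, 50, 10, 3  # n_s = d: source model recovered exactly

omega_src = rng.standard_normal((c, d))                    # oracle source model
omega_tgt = omega_src + 0.1 * rng.standard_normal((c, d))  # nearby target model
X_s = rng.standard_normal((d, n_s)); y_s = omega_src @ X_s
X_t = rng.standard_normal((d, n_t)); y_t = omega_tgt @ X_t

pinv = np.linalg.pinv
omega_hat_s = y_s @ pinv(X_s)                            # minimum-norm source fit
omega_b = y_t @ pinv(X_t)                                # baseline: target data only
omega_p = (y_t @ pinv(omega_hat_s @ X_t)) @ omega_hat_s  # projected predictor
omega_tr = omega_hat_s + (y_t - omega_hat_s @ X_t) @ pinv(X_t)  # translated predictor

def risk(w):
    # Squared distance to the target model: the risk under an isotropic test distribution.
    return np.linalg.norm(w - omega_tgt, "fro") ** 2
```

Because ω_s and ω_t are close here, the translated predictor interpolates the target data and attains much lower risk than the baseline, matching the discussion above.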

Discussion
In this work, we developed a framework that enables transfer learning with kernel methods. In particular, we introduced the projection and translation operations to adjust the predictions of a source model to a specific target task: while projection applies a map directly to the predictions given by the source model, translation adds a correction map to the predictions of the source model. We demonstrated the effectiveness of the transfer-learned kernels on image classification and virtual drug screening tasks. Namely, we showed that transfer learning increased the performance of kernel-based image classifiers by up to 10% over training such models directly on the target task. Interestingly, we found that transfer-learned convolutional kernels performed comparably to transfer learning with the corresponding finite-width convolutional networks. In virtual drug screening, we demonstrated that the transferred kernel methods provided an improvement over prior work [42], even in settings where none of the target drugs and cell lines were present in the source task. For both applications, we analyzed the performance of the transferred kernel model as a function of the number of target examples and observed empirically that it followed a simple logarithmic trend, thereby enabling prediction of the benefit of collecting more target examples on model performance. Lastly, we mathematically derived the scaling laws in the linear setting, thereby providing a theoretical foundation for the empirical observations. We end by discussing various consequences as well as future research directions motivated by our work.

Benefit of pre-training kernel methods on large datasets.
A key contribution of our work is enabling kernels trained on large datasets to be transferred to a variety of downstream tasks. As is the case for neural networks, this allows pre-trained kernel models to be saved and shared with downstream users to improve their applications of interest. A key next step to making these models easier to save and share is to reduce their reliance on storing the entire training set, for example by using coresets [49]. We envision that by using such techniques in conjunction with modern advances in kernel methods, the memory and runtime costs could be drastically reduced.

Reducing kernel evaluation time for state-of-the-art convolutional kernels.
In this work, we demonstrated that it is possible to train convolutional kernel methods on datasets with over 1 million images. In order to train such models, we resorted to using the CNTK of convolutional networks with a fully connected last layer. While other architectures, such as the CNTK of convolutional networks with a global average pooling last layer, have been shown to achieve superior performance on CIFAR10 [2], training such kernels on 50k images from CIFAR10 is estimated to take 1200 GPU hours [36], which is more than three orders of magnitude slower than the kernels used in this work. The main computational bottleneck for such improved convolutional kernels is evaluating the kernel function itself. An important open problem is therefore to reduce the kernel evaluation time: this would allow training better convolutional kernels on large-scale image datasets, which could then be transferred using our framework to improve performance on a variety of downstream tasks.
Using kernel methods to adapt to distribution shifts. Our work demonstrates that kernels pre-trained on a source task can adapt to a target task with distribution shift when given even just a few target training samples. This opens novel avenues for applying kernel methods to tackle distribution shift in a variety of domains, including healthcare and genomics, in which models need to be adapted to handle shifts in cell lines, populations, batches, etc. In the context of virtual drug screening, we showed that our transfer learning approach can be used to generalize to new cell lines. The scaling laws described in this work may provide an interesting avenue for understanding how many samples are required in the target domain for more complex domain shifts, such as from a model organism like mouse to humans, a problem of great interest in the pharmaceutical industry.

C Training and Architecture Details
Model descriptions: 1. Laplace Kernel: For samples x, x̃ and bandwidth parameter L, the kernel is of the form k(x, x̃) = exp(−‖x − x̃‖₂ / L). For our experiments, we used a bandwidth of L = 10.
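As a concrete reference, the Laplace kernel above can be computed as follows (a minimal NumPy sketch; rows of X and Z are samples, and the data here are illustrative):

```python
import numpy as np

def laplace_kernel(X, Z, L=10.0):
    """Laplace kernel k(x, z) = exp(-||x - z||_2 / L) between rows of X and Z."""
    # Pairwise squared distances via ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2.
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
    dists = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from round-off
    return np.exp(-dists / L)

X = np.random.default_rng(0).standard_normal((5, 3))
K = laplace_kernel(X, X)  # symmetric, with ones on the diagonal
```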

NTK:
We used the NTK corresponding to an infinite-width ReLU fully connected network with 5 hidden layers. We chose this depth as it gave superior performance on the image classification tasks considered in [34].
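The NTK of such a network admits a closed-form layerwise recursion via the ReLU arc-cosine functions. Below is a NumPy sketch under one common normalization; conventions and constants differ across papers, so this is an illustration of the recursion, not the exact kernel used in our experiments.

```python
import numpy as np

def relu_ntk(X, Z, depth=5):
    """NTK of an infinite-width fully connected ReLU network with `depth`
    hidden layers, via the standard arc-cosine recursion."""
    K = X @ Z.T                # layer-0 covariance Sigma^0(x, z)
    dX = (X * X).sum(1)        # Sigma^0(x, x)
    dZ = (Z * Z).sum(1)
    ntk = K.copy()
    for _ in range(depth):
        norms = np.sqrt(np.outer(dX, dZ))
        cos = np.clip(np.divide(K, norms, out=np.zeros_like(K), where=norms > 0),
                      -1.0, 1.0)
        theta = np.arccos(cos)
        K = norms * (np.sin(theta) + (np.pi - theta) * cos) / np.pi  # Sigma^{l+1}
        ntk = K + ntk * (np.pi - theta) / np.pi                      # Theta^{l+1}
        # Under this normalization Sigma(x, x) is unchanged from layer to layer
        # (cos = 1 on the diagonal), so dX and dZ need no update.
    return ntk
```

A quick sanity check on the diagonal: this recursion gives Θ(x, x) = (depth + 1)·‖x‖².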

CNTK:
We used the CNTK corresponding to an infinite-width ReLU convolutional network with 6 convolutional layers followed by a fully connected layer. All convolutional layers used filters of size 3 × 3, and the first 5 convolutional layers used a stride of 2 to downsample the image representations. All convolutional layers used zero padding. The CNTK was computed using the Neural Tangents library [36].

CNN:
We compare the CNTK to a finite-width CNN of the same architecture that has 16 filters in the first layer, 32 filters in the second layer, 64 filters in the third layer, 128 filters in the fourth layer, and 256 filters in the fifth and sixth layers. In all experiments, the CNN was trained using Adam with a learning rate of 10^-4.

Details for projection experiments.
For all kernels trained on ImageNet32, we used EigenPro [31]. For all models, we trained until the training accuracy was greater than 99%, which took at most 6 epochs of EigenPro. For transfer learning to CIFAR10, Oxford 102 Flowers, DTD, and SVHN, we applied a Laplace kernel to the outputs of the trained source model. For CIFAR10 and Oxford 102 Flowers, we solved the kernel regression exactly using NumPy [38]; for DTD and SVHN, we used ridge regularization with a coefficient of 10^-4 to avoid numerical issues with solving exactly. The CNN was trained for at most 500 epochs on ImageNet32, and the transferred model corresponded to the one with the highest validation accuracy during this time. When transfer learning, we fine-tuned all layers of the CNN for up to 200 epochs (again selecting the model with the highest validation accuracy on the target task).
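The projection step above (a source kernel model, followed by a Laplace kernel fit on its outputs with a small ridge term) can be sketched as follows. The data, sizes, and label dimensions are illustrative stand-ins, not the experimental settings.

```python
import numpy as np

def laplace(X, Z, L=10.0):
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / L)

rng = np.random.default_rng(0)
Xs, ys = rng.standard_normal((200, 20)), rng.standard_normal((200, 5))  # toy source task
Xt, yt = rng.standard_normal((50, 20)), rng.standard_normal((50, 3))    # toy target task

# 1) Train the source kernel model (kernel ridge regression).
alpha_s = np.linalg.solve(laplace(Xs, Xs) + 1e-4 * np.eye(len(Xs)), ys)

# 2) Projection: fit a second Laplace kernel model on the *outputs*
#    of the source model, with a small ridge term for numerical stability.
fs_t = laplace(Xt, Xs) @ alpha_s   # source-model predictions on target inputs
alpha_p = np.linalg.solve(laplace(fs_t, fs_t) + 1e-4 * np.eye(len(Xt)), yt)

def predict(X_new):
    f = laplace(X_new, Xs) @ alpha_s      # source-model outputs
    return laplace(f, fs_t) @ alpha_p     # projected prediction

preds = predict(Xt)
```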

Details for translation experiments.
For transferring kernels from CIFAR10 to CIFAR-C, we simply solved kernel regression exactly (no ridge regularization term). For the corresponding CNNs, we trained the source models on CIFAR10 for 100 epochs and selected the model with the best validation performance.
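Translating a kernel model amounts to fitting a second kernel model on the residuals of the source predictor and adding the two. A minimal NumPy sketch with toy data (sizes and bandwidth illustrative; the target task must share the source label dimension):

```python
import numpy as np

def laplace(X, Z, L=10.0):
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / L)

rng = np.random.default_rng(0)
Xs, ys = rng.standard_normal((100, 10)), rng.standard_normal((100, 4))  # toy source task
Xt, yt = rng.standard_normal((40, 10)), rng.standard_normal((40, 4))    # same label dim.

alpha_s = np.linalg.solve(laplace(Xs, Xs) + 1e-6 * np.eye(len(Xs)), ys)
f_s = lambda X: laplace(X, Xs) @ alpha_s           # source predictor

# Translation: solve kernel regression exactly (no ridge) on the residuals.
alpha_c = np.linalg.solve(laplace(Xt, Xt), yt - f_s(Xt))
f_t = lambda X: f_s(X) + laplace(X, Xt) @ alpha_c  # translated predictor
```

By construction, the translated predictor interpolates the target training data.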
When transferring CNNs to CIFAR-C, we fine-tuned all layers of the CNN for 200 epochs and selected the model with the best validation accuracy. When translating kernels from ImageNet32 to CIFAR10 in Appendix Fig. 9, we used aggregated class indices in ImageNet32 to match the classes in CIFAR10, e.g., plane = {372, 230, 231, 232}.

Details for virtual drug screening.
We used the NTK corresponding to a 1-hidden-layer ReLU fully connected network with an offset term; the same model was used in [42]. We solved kernel ridge regression when training the source models, baseline models, and transferred models. For the source model, we used ridge regularization with a coefficient of 1000; to select this ridge term, we used a grid search over {1, 10, 100, 1000, 10000} on a random subset of 10k samples from the source data. We used a ridge term of 1000 when transferring the source model to the target data and a term of 100 when training the baseline model. We again tuned the ridge parameter for these models over the same set of values, but on a random subset of 1000 examples for one cell line (A549) from the target data. We used 5-fold cross-validation for the target task and reported the metrics computed across all folds.

D Projection Scaling Laws
For the curves showing the performance of the projected predictor as a function of the number of target examples in Fig. 2b and Appendix Fig. 6a, b, we performed a scaling-law analysis. In particular, we used linear regression to fit the coefficients a, b of the function y = a log₂ x + b to the points from each of the curves presented in the figures. Each curve in these figures has 50 evenly spaced points, and all accuracies are averaged over 3 seeds at each point. The R² values for each of the fits are presented in Appendix Fig. 8. Overall, we observe that all R² values are above 0.944, and above 0.99 for CIFAR10 and SVHN, which have more than 2000 target training samples. Moreover, by fitting the same function to only the first 5 points of the CIFAR10 curves, we are able to predict the accuracy at the last point of each curve to within 2% of the reported accuracy.
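The fitting procedure (linear regression of accuracy on log₂ of the number of target examples, an R² check, and a few-point extrapolation) can be reproduced on synthetic data as follows; the coefficients 0.04 and 0.3 and the noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(100, 5000, 50)                                  # target example counts
y = 0.04 * np.log2(x) + 0.3 + 0.002 * rng.standard_normal(50)   # synthetic accuracies

# Fit y = a*log2(x) + b by ordinary linear regression on log2(x).
a, b = np.polyfit(np.log2(x), y, deg=1)
y_hat = a * np.log2(x) + b
r2 = 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Extrapolate: fit on the first 5 points only, predict the last point.
a5, b5 = np.polyfit(np.log2(x[:5]), y[:5], deg=1)
pred_last = a5 * np.log2(x[-1]) + b5
```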

E Pre-processing for CMAP Data
While CMAP contains 978 landmark genes, we removed all genes whose values were identically 1 after log₂(x + 1) scaling of the data. This eliminates 135 genes and removes batch effects identified in [7] for each cell line. Following the methodology of [7], we also removed all perturbations with dose less than 0 and used only the perturbations that had an associated simplified molecular-input line-entry system (SMILES) string, which resulted in a total of 20,336 perturbations. Following [7], for each of the 116,228 observed drug and cell type combinations we then averaged the gene expression over all replicates.
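These preprocessing steps can be sketched with pandas on a toy expression table; the gene count, drug and cell labels, and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.integers(0, 100, size=(6, 8)).astype(float))  # genes x profiles
raw.iloc[2] = 1.0  # a gene whose raw value is 1 everywhere -> 1 after log2(x + 1)

# 1) log2(x + 1) scaling, then drop genes that are identically 1.
logged = np.log2(raw + 1.0)
filtered = logged[~(logged == 1.0).all(axis=1)]

# 2) Average replicates over each observed (drug, cell line) combination.
meta = pd.DataFrame({"drug": ["d1", "d1", "d2", "d2", "d1", "d2", "d1", "d2"],
                     "cell": ["A", "A", "A", "A", "B", "B", "B", "B"]})
profiles = filtered.T                       # profiles x genes
profiles[["drug", "cell"]] = meta
averaged = profiles.groupby(["drug", "cell"]).mean()
```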

F Metrics for Evaluating Virtual Drug Screening
Let ŷ ∈ R^{n×d} denote the predicted gene expression vectors and let y* ∈ R^{n×d} denote the ground truth. Let ŷ_v, y*_v ∈ R^{dn} denote vectorized versions of ŷ and y*. We use the same three metrics as those considered in [23,42]. All evaluation metrics have a maximum value of 1 and are defined below.

Mean Cosine Similarity:
We additionally subtract out the mean over cell type before computing cosine similarity to avoid inflated cosine similarity arising from points far from the origin.
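A sketch of this metric in NumPy, centering before computing per-sample cosine similarities. Here a single overall mean is used as a stand-in for the per-cell-type mean described above.

```python
import numpy as np

def mean_cosine_similarity(y_hat, y_true):
    """Average per-sample cosine similarity after subtracting the mean profile,
    which avoids inflated similarities for points far from the origin."""
    yc_hat = y_hat - y_hat.mean(axis=0)
    yc_true = y_true - y_true.mean(axis=0)
    num = (yc_hat * yc_true).sum(axis=1)
    den = np.linalg.norm(yc_hat, axis=1) * np.linalg.norm(yc_true, axis=1)
    return float((num / den).mean())
```

A perfect prediction attains the maximum value of 1, and a sign-flipped prediction attains −1.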

G DepMap Analysis
To provide another application of our framework in the context of virtual drug screening, we used projection to transfer the kernel methods trained on imputing gene expression vectors in CMAP to predicting the viability of a drug and cell line combination in DepMap [13]. Viability scores in DepMap are real values indicating how lethal a drug is for a given cancer cell line (negative viability indicates cell death). To transfer from CMAP to DepMap, we trained a kernel method to predict the gene expression vectors for 55,462 cell line and drug combinations for the 64 cell lines from CMAP that do not overlap with DepMap. We then used projection to transfer the model to the 6 held-out cell lines present in both CMAP and DepMap, namely PC3, MCF7, A375, A549, HT29, and HEPG2. Analogously to our analysis of CMAP, we stratified the target dataset into drugs that appear in both the source and target tasks (9726 target samples) and drugs that are found only in the target task (2685 target samples). For this application, we found that Mol2Vec [26] embeddings of drugs outperformed 1024-bit circular fingerprints. We again used a 1-hidden-layer ReLU NTK with an offset term for this analysis and solved kernel ridge regression with a ridge coefficient of 100.
Appendix Fig. 12a shows the performance of the projected predictor as a function of the number of target samples when transferring to a target task with drugs that appear in the source task. All results are averaged over 5 folds of cross-validation and across 5 random seeds for the subset of target samples considered in each fold. Performance is greatly improved when there are fewer than 2000 samples, highlighting the benefit of the imputed gene expression vectors in this setting. Interestingly, as in all the previous experiments, we find a clear logarithmic scaling law: fitting the coefficients of the curve y = a log₂ x + b to the 76 points on the graph yields an R² of 0.994, and fitting the curve to the first 10 points lets us predict the value at the last point on the curve to within 0.03. Appendix Fig. 12b shows how the performance on the target task is affected by the number of genes predicted in the source task. Again, performance is averaged over 5-fold cross-validation and across 5 seeds per fold. When transferring to drugs that were available in the source task, performance monotonically increases when predicting more genes. On the other hand, when transferring to drugs that were not available in the source task, performance begins to degrade as the number of predicted genes increases. This is intuitive, since not all genes would be useful for predicting the effect of an unseen drug and could add noise to the prediction problem upon transfer learning.

H Proof of Theorem 1
The proof of Theorem 1 relies on the following lemma.
Lemma 1. Let D, Λ ∈ {0, 1}^{d×d} be two diagonal matrices of rank p and q, respectively, let V ∈ R^{d×d} be an orthogonal matrix, and let W ∈ R^{d×d} be a Haar-distributed random matrix.
Proof. Without loss of generality, assume that Λ = diag(1_q, 0_{d−q}) and D = diag(1_p, 0_{d−p}). Since the Haar distribution is rotation invariant, U = V^T W is Haar distributed. Therefore, the quantity of interest can be written in terms of A = ΛU DU^T. Now the upper-left q × p block of ΛU D is equal to the corresponding block of U, and all other entries of ΛU D are 0. Letting ũ_i = (u_{i1}, . . . , u_{ip}), we have (A^T A)_{i,r} = Σ_{k=1}^{q} ⟨ũ_i, ũ_k⟩⟨ũ_r, ũ_k⟩, and so E_U[A^T A] depends only on the fourth moments of the entries of U. To calculate these moments, we use Lemma 9 from [10]. In particular, if i ≠ r, then E_U[(A^T A)_{i,r}] = 0, while the diagonal entries i = r are determined by the fourth moments. This yields a closed form for the expectation, and since V ΛV^T = Q, the result follows.
We now prove the following simpler version of Theorem 1 for the case when n_s = d (i.e., when ω̂_s = ω_s). Let X^∥ denote the projection associated with X = X_t, let X^⊥ = I_{d×d} − X^∥, and note that ω_s^⊥, ω_s^∥, X^⊥, X^∥ are all projections. Using the cyclic property of the trace, the risk can be expanded accordingly, and the terms simplify further using the idempotent property of projections. Both X^⊥ and ω_s^∥ are projections, and since X follows an isotropic distribution, its right singular vectors (the eigenvectors of X^⊥) are Haar distributed. Now using Lemma 1 with p = d − n_t and q = c_s, we obtain the expectation of the relevant term. Using ω_s^∥ = I_{d×d} − ω_s^⊥ and reordering the terms, we obtain the stated expression. Lastly, we use the standard result for the risk of the minimum-norm baseline predictor, R(ω̂_b) = (1 − n_t/d)‖ω_t‖² (see, e.g., [6]), together with the definition of ε, which completes the proof.
Using Lemma 1 and Theorem 3, we next prove Theorem 1, which is restated below for the reader's convenience.
Theorem 1. Let ω̂_s = y_s X_s^† and ω̂_p = y_t (ω̂_s X_t)^†. Assuming that P_s and P_t are independent, isotropic distributions on R^d, the risk R(ω̂_p ω̂_s) is given by:
Proof. As in the proof of Theorem 3, we let X = X_t to simplify notation. We follow the proof of Theorem 3, but now account for the expectation with respect to X_s. Using the independence of ω̂_s and X^⊥ together with Fubini's theorem, we compute the expectations sequentially. As a consequence, we calculate the two expectations involving X_s by using Lemma 1 with p = n_s and q = c_s. Therefore, R(ω̂_p ω̂_s) is given by the sum of the resulting terms, where for the second equality we applied Theorem 3, which gives rise to K_1 and K_2, thereby completing the proof.

I Proof of Corollary 1
Proof. We restate Corollary 1 below for the reader's convenience.

Corollary. Let
We first derive the limits of the terms C_1, C_2, K_1, K_2 from Theorem 1 as d → ∞. Substituting these values into R(ω̂_p ω̂_s) from Theorem 1, we obtain Eq. (5). Next, we analyze Eq. (5) for S ∈ [0, 1]. For fixed T and C, R(ω̂_p ω̂_s) is a quadratic in S; this quadratic is strictly decreasing, and thus we can conclude that R(ω̂_p ω̂_s) is decreasing in S. We next observe that R(ω̂_p ω̂_s) is linear in C, and thus R(ω̂_p ω̂_s) decreases as C increases if and only if the coefficient of C is negative, i.e., 2S − ST − 1 < 0. Lastly, if S = 1, then Corollary 1c, d follow from the above form of the risk, thus completing the proof.

J Equivalence of Fine-tuned and Translated Linear Models
We now prove that for linear models, transfer learning using the translated predictor from Definition 2 is equivalent to transfer learning via the conventional fine-tuning process. This follows from Proposition 1 below, which implies that, when parameterized by a linear model, the translated predictor is the interpolating solution for the target dataset that is nearest to the source predictor. In the proof of Proposition 1, the constraint can be simplified to ⟨w̃, ψ(X_t)⟩_H = y_t − f̂_s(X_t), which is precisely the constraint for the translated predictor in Definition 2, thereby completing the proof.
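This equivalence is easy to verify numerically in the linear case (ψ the identity map): the translated weights interpolate the target data and are at least as close to the source weights as any other interpolant, such as the from-scratch minimum-norm fit. Dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_t = 30, 8
w_s = rng.standard_normal(d)          # source predictor weights
X_t = rng.standard_normal((d, n_t))   # target inputs (columns)
y_t = rng.standard_normal(n_t)        # target labels

# Translated predictor: source weights plus the minimum-norm correction
# that interpolates the residuals.
w_t = w_s + (y_t - w_s @ X_t) @ np.linalg.pinv(X_t)

# A competing interpolant: the minimum-norm fit from scratch.
w_b = y_t @ np.linalg.pinv(X_t)
```

Both weight vectors satisfy the interpolation constraint, but the translated one is the interpolant nearest to w_s, as Proposition 1 asserts.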

K Proof of Theorem 2
We restate Theorem 2 below for convenience and then provide the proof.
Theorem 2. Let ω̂_t = ω̂_s + ω̂_c, where ω̂_s = y_s X_s^† and ω̂_c = (y_t − ω̂_s X_t) X_t^†. Assuming that P_s and P_t are independent, isotropic distributions on R^d, the risk R(ω̂_t) is given in terms of the baseline predictor ω̂_b = y_t X_t^†.
Proof. We prove the statement by directly simplifying the risk as follows.

In the expansion of R(ω̂_t), the penultimate equality follows from adding and subtracting the term (n_s/d) ω_t.

L Code and Hardware Details
All experiments were run using two servers. One server had 128GB of CPU random access memory (RAM) and 2 NVIDIA Titan XP GPUs, each with 12GB of memory; this server was used for the virtual drug screening experiments and for training the CNTK on ImageNet32. The second server had 128GB of CPU RAM and 4 NVIDIA Titan RTX GPUs, each with 24GB of memory; this server was used for all remaining experiments. All code is available at https://github.com/uhlerlab/kernel_tf.

Figure 8: R² values of fitting y = a log₂ x + b to the curves found empirically for the projected predictor performance as a function of the number of target samples. For all kernels and datasets, the fit yields R² values greater than 0.94, and values higher than 0.99 on datasets with more samples such as CIFAR10 and SVHN.

Figure 12: Predicting more genes in the source task is helpful when transferring to new cell lines for which the considered drugs were available in the source task, and harmful when the considered drugs were not available in the source task.

Figure 2:
Figure 2: Analysis of transfer learning with kernels trained on ImageNet32 to CIFAR10, Oxford 102 Flowers, DTD, and a subset of SVHN. All curves in (b, c) are averaged over 3 random seeds. (a) Comparison of the transferred kernel predictor test accuracy (green) to the test accuracy of the baseline kernel predictors trained directly on the target tasks (red). In all cases, the transferred kernel predictors outperform the baseline predictors, and the difference in performance is as high as 10%. (b) Test accuracy of the transferred and baseline predictors as a function of the number of target examples. These curves, which quantitatively describe the benefit of collecting more target examples, follow simple logarithmic trends (R² > 0.95). (c) Performance of the transferred kernel methods decreases when increasing the number of source classes while keeping the total number of source examples fixed. Corresponding plots for DTD and SVHN are in Appendix Fig. 6.
Transfer learning with kernels from CIFAR-10 to CIFAR-C

Figure 3:
Figure 3: Transferring kernel methods from CIFAR10 to adapt to 19 different corruptions in CIFAR10-C. (a) Test accuracy of the baseline kernel method (red), of the source predictor given by directly applying the kernel trained on CIFAR10 to CIFAR10-C (gray), and of the transferred kernel method (green). The transferred kernel method outperforms the other models on all 19 corruptions and even improves on the baseline kernel method when the source predictor exhibits a decrease in performance. Additional results are presented in Appendix Fig. 10. (b) Performance of the transferred and baseline kernel predictors as a function of the number of target examples. The transferred kernel method can outperform both the source and baseline predictors even when transferred using as few as 200 target examples.


Figure 4:
Figure 4: Transferring the NTK trained to predict gene expression for given drug and cell line combinations in CMAP to new drug and cell line combinations. (a, b) The transfer-learned NTK (green) outperforms imputation by the mean over the cell line (gray) and the previous NTK baseline predictors from [42] across the R², cosine similarity, and Pearson r metrics. All results are averaged over the performance on 5 cell lines and are stratified by whether or not the target data contains drugs that are present in the source data. (c, d) The transferred kernel method's performance follows a logarithmic trend (R² > 0.9) as a function of the number of target examples and exhibits a better scaling coefficient than the baselines. The results are averaged over 5 cell lines. (e, f) Visualization of the performance of the transferred NTK in relation to the top two principal components of gene expression for target drug and cell line combinations. The performance of the NTK is generally lower for cell and drug combinations that are further from the control gene expression for a given cell line. Visualizations for the remaining 3 cell lines are presented in Appendix Fig. 11.

Corollary 1.
Let S = n_s/d, T = n_t/d, C = c_s/d, and assume ‖ω_t‖_F = Θ(1). Under the setting of Theorem 1, if S, T, C < ∞ as d → ∞, then: a) R(ω̂_p ω̂_s) is monotonically decreasing for

3. Oxford 102 Flowers contains 1,020 training images across 102 classes and 6,149 images for validation. Images were resized to 32 × 32 × 3 for the experiments.
4. DTD contains 1,880 training images across 47 classes and 1,880 images for validation. Images were resized to 32 × 32 × 3 for the experiments.
5. SVHN contains 73,257 training images across 10 classes and 26,302 images for validation. All images are of size 32 × 32 × 3. In Fig. 2, we used the same 500-image training subset for all experiments.

where the last equality is given by R(ω̂_b) = (1 − n_t/d)‖ω_t‖²_F, thereby completing the proof.

Figure 5:
Figure 5: Image classification performance of CNNs that are finite-width analogs of the CNTK considered in this work. (a) The accuracy of the CNNs on 4 target tasks when transferred from ImageNet32. All layers of the CNNs are fine-tuned during transfer learning. The CNN in the top row achieves a test accuracy of 16.72% on ImageNet32. The early-stopped CNN in the bottom row achieves an accuracy of 10.692% on ImageNet32, which is comparable with the accuracy of the CNTK (10.64%). (b) Performance of a CNN pre-trained on CIFAR10 when transferred to CIFAR-C. (c) Performance of a CNN pre-trained on ImageNet32 when transferred to CIFAR-C.

Figure 6:
Figure 6: (a, b) Performance of the projected kernel method as a function of the number of target examples when transferred from ImageNet32 to DTD and SVHN. (c, d) Performance of the projected kernel method as a function of the number of source classes when transferred from ImageNet32 to DTD and SVHN. The number of source examples was fixed to 40k, and we ensured that the number of source classes divides 40k.

Figure 7:
Figure 7: (a, b, c) Performance of three different kernels as a function of the number of source examples and target examples when projected from ImageNet32 to CIFAR10. The baseline predictor performance is shown as a dashed black line. Overall, we find that performance improves as the number of source training samples per class increases.

Figure 9:
Figure 9: Accuracy of the translated predictor from ImageNet32 to CIFAR10. The black dashed line corresponds to the baseline predictor, while the dashed colored lines correspond to the source predictors. We observe that the translated predictor outperforms both the projected and baseline predictors when increasing the number of target samples, and the performance of the translated predictor increases as the number of source examples per class increases.

Proposition 1.
Let f̂_s(x) = ⟨w_s, ψ(x)⟩_H, where ψ : R^d → H is a feature map and H is a Hilbert space. Then the translated predictor f̂_t is the solution to the problem of finding the interpolant nearest to the source predictor: arg min_w ‖w − w_s‖_H subject to ⟨w, ψ(X_t)⟩_H = y_t. Proof. Note that any solution w to Problem 6 can be written as w = w_s + w̃. Hence, we can rewrite Problem 6 as follows: arg min_{w̃} ‖w̃‖_H subject to ⟨w_s + w̃, ψ(X_t)⟩_H = y_t,