Contrastive language and vision learning of general fashion concepts

The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from general and transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model adapted for the fashion industry. We demonstrate the effectiveness of the representations learned by FashionCLIP with extensive tests across a variety of tasks, datasets and generalization probes. We argue that adaptations of large pre-trained models such as CLIP offer new perspectives in terms of scalability and sustainability for certain types of players in the industry. Finally, we detail the costs and environmental impact of training, and release the model weights and code as open source contribution to the community.

Here, θ I and θ T are the learnable parameters of the image and text encoder neural networks, and the (·) operator represents the dot product. The first addition operand of Equation 1 is the cross-entropy on the image axis, while the second addition operand is on the text axis.
Recently, industry practitioners have begun to recognize the importance and utility of contrastive pre-training for their target domain, with several works presenting successful downstream applications starting from the CLIP model 30 . In fashion, the multi-modal nature of CLIP has been found helpful in recent discriminative 31,32 models, which have been developed under the standard paradigm of task-specific, supervised models. In the generative setup, CLIP often complements a larger framework: for example, CLIP is used to learn linguistically grounded codebooks in Variational Auto Encoders 33 or to guide image synthesis and manipulation in diffusion generative models. 34,35 While interesting for grounding (see below Fig. 8), the target use case (image generation) and more narrow focus (single task, single dataset) are not readily comparable to FashionCLIP, but instead suggest a possible complementary application in generative use cases. However, no recent CLIP application has been developed to produce industry-wide representations across multiple use cases and datasets. In other words, CLIP has been used only as a pre-trained model, with no attempt to overcome single-task supervised models' operational and conceptual problems.
(2) θ I * , θ T * = arg min θ I ,θ T L (θ I , θ T ) 1. Text to image Product search is one of the main channels of interactions and revenues between a shop and its users, accounting on average for 30% to 60% of the total online revenues 38,39 . Historically, product search has been performed chiefly with textual features by first matching queries and product descriptions in an index [40][41][42] and then re-ranking the candidate results 43 . However, there are good reasons to believe that including visual features can bring significant improvements since images are often the most curated aspect of the catalog. In contrast, text quality varies throughout verticals, languages, and specific product feeds. Our extensive tests show that FashionCLIP learns fashion concepts, and successfully applies them to unseen products and incomplete or ambiguous descriptions. 2. Image to text Product classification is the task of predicting a product category given its meta-data. Classification (even for CLIP-based models, such as CMA-CLIP) is cast as a supervised learning problem where one extracts golden labels from the catalog itself or collects them through crowd-sourcing 7,44 . Generalizing classification to arbitrary labels without constant retraining is again crucial for making ML feasible across numerous players in the fashion industry: transferable concepts help with the interoperability of overlapping, and yet different, fashion taxonomies 45 , a challenge increasingly recognized as central by both practitioners and commentators 46 (this includes the case of catalogs in less represented languages, for which an English classification is still desirable). Our extensive tests show that FashionCLIP zero-shot capabilities, based on learned associations between vision and textual concepts, allow for quick classification of products in target classes of interest, irrespective of the specific labeling schemes of individual suppliers.
We also perform specific probing to understand whether the concepts learned by the model are robust and (somehow) aligned with human semantic intuitions, as opposed to picking up spurious correlations in the dataset 47 : 1. Grounding. We probe FashionCLIP for grounding capabilities through localization maps and apply them to the task of zero-shot semantic segmentation for fashion concepts (e.g., sleeve length, texture patterns) and attributes (e.g., laced shoes, glittered shirts). 2. Compositionality. We propose a novel way to probe whether the model can go from semantic segmentation to inferential abilities by composing such concepts to generate new linguistic expressions. We do that through the device of "improbable object", where a linguistic expression is meant to describe an odd combination of concepts that have never been observed before (e.g., a pair of shoes with handles).  www.nature.com/scientificreports/ We summarize our contributions as follows: 1. while other researchers have independently developed CLIP-based solutions for individual fashion problems, FashionCLIP is the first explicit attempt to produce general multi-modal concepts for industry: the breadth and nature of our testing methodology make FashionCLIP appealing as a general fashion model, applicable to situations where supervised systems are not practical or viable. Our model is trained on over 700k < image, text > pairs from the inventory of Farfetch, one of the largest fashion luxury retailers in the world, and is shown to be useful in important use cases in a vast global market; 2. we evaluate FashionCLIP in various tasks, showing that fine-tuning helps capture domain-specific concepts and generalizes them in zero-shot scenarios; we supplement quantitative tests with qualitative analyses and offer insights into how concepts grounded in a visual space unlock linguistic generalization. These results would not be possible without the flexibility provided by natural language as a supervision signal and the domain-specific accuracy achieved through fine-tuning; 3. we transparently report training time, costs, and emissions. We additionally release to the community, under an open-source license, training code, a demo app, and plug-and-play checkpoints to help leverage our findings while facilitating ROI considerations 48,49 ; the large and unique dataset is also scheduled to be released directly by Farfetch. Taken together, FashionCLIP artifacts (model, demo, data) are a foundational toolkit for practitioners in the space and a template for other verticalized CLIP models (https://github.com/patrickjohncyh/fashion-clip).
We believe that our methods and results are interesting not just for the fashion industry but broadly speaking for the ever-expanding industry of online retail, as our artifacts, use cases and benchmarks might serve as a blueprint for other vertical-specific applications of large multi-modal models. Finally, adding to the industry significance of the work, the evaluation in "Grounding and Compositionality" section is new in the context of CLIP-like models, and we believe it may be of independent interest for future work in NLP. As a matter of fact, showcasing a practical thread connecting generalization and latent space interpretation to industrial scalability may be the most interesting contribution of FashionCLIP.

Results
In this section, we detail the performance of FashionCLIP over a range of tasks, demonstrating the efficacy of domain adaptation and the applicability of CLIP-like models to fashion. Details on the training and on the evaluation are available in the "Methods" Section. We leverage a variety of in-domain and out of domain datasets, with varying degrees of similarity: TEST is the test set from Farfetch containing 20k products; HOUT-C is the dataset containing a category which we excluded from training; HOUT-B is the dataset containing two brands which were excluded from training; STLE is a merchandising dataset from Farfetch; KAGL is a subset of 50 , where each product has a white background image, a caption, and a category; F-MNIST 51 contains 10, 000 gray-scale images from 10 product classes; DEEP 52 contains 4000 product images that are non-standardized (i.e., contain humans) from 50 categories. An overview of image and textual data offered by Farfetch (TEST, HOUT-C, HOUT-B, STLE), KAGL, F-MNIST and DEEP can be found in Fig. 3. Our extensive benchmarks and evaluations answer two research questions quantitatively: can domain-specific knowledge improve CLIP understanding of an industry ( Fig. 5) and, if yes, does that knowledge translate across different use cases and datasets?
Multi-modal retrieval. The Multi-modal Retrieval task is described as follows: given a textual description and a set of images, we ask the model to find the image related to that description. For example, a product retrieval task entails matching a product description (e.g., "a red polo for men") and a photo of it in a catalog. Multi-modal retrieval is possible due to the optimization objective of FashionCLIP which aligns the language and image latent spaces (see Fig. 1). We test FashionCLIP on multi-modal retrieval to assess the benefits of domain-specific fine-tuning on real-world product search.
Our benchmark takes as input a product description from the catalog's test set and asks models to rank product images corresponding to the caption-the gold standard is the image associated with the product. We extract the ranking using embedding similarities: FashionCLIP performs the dot product between the input caption embedding and each image vector embedding obtained via T θ T (·) and Ī θ I (·) respectively and returns a rank based on descending order. We use HITS@5 53 (Hit Rate @ k = 5 ) and MRR 54 (Mean Recirpocal Rank) as our metrics. Table 1 compares FashionCLIP against non-domain specific CLIP on different heldout test sets and shows how fine-tuning significantly improves the understanding of our target domain.
We also perform extensive qualitative tests comparing FashionCLIP with the production search engine presently employed in the catalog. Fig. 4 shows a case of particular interest for product search: in this example, visual concepts do not belong to the fashion domain and are not available in the caption. The first comparison (left) shows that FashionCLIP can recover the concept of tiger when prompted with "t-shirt with tiger"; for the same query, the search engine retrieves items matching the category, unable to interpret tiger based solely on text. The second comparison (right) shows that FashionCLIP can interpret a cat from a stylized, partially occluded drawing. In contrast, the search engine fails to generalize beyond the captions explicitly containing the string "cat". Finally, visualizing the learnt embeddings ( Fig. 5) also helps to build an intuition of FashionCLIP's better conceptual resolution when it comes to the target domain.
Zero-shot classification. We replicate CLIP's original zero-shot classification setup 17 , which allows us to quantitatively assess the transferability of FashionCLIP's fine-tuned representations to different data distribu- www.nature.com/scientificreports/  . Retrieval with non-fashion concepts. Sample results for "t-shirt with tiger" and "t-shirt with cat" from FashionCLIP (green) vs Farfetch production search engine (red). www.nature.com/scientificreports/  www.nature.com/scientificreports/ tions from the same vertical (i.e. Fashion). The model generates one image embedding for the product image, and k text embeddings, one for each of the labels in the classification scheme (e.g., "shoes", "shirt"). The predicted label is the one that is closer (measured via dot product) to the image in the model's vector space. We use weighted macro F1 55 as the performance metric. Table 2 summarizes the results of different SOTA benchmarks. On all the tested benchmarks, FashionCLIP is superior to CLIP, a result which suggests that domain-specific fine-tuning is indeed useful in-domain and that it generalizes to other, completely unseen datasets. Furthermore, we set out to investigate the "cheating hypothesis" on our domain-specific model, i.e., the hypothesis that supervised models do not generalize as well as CLIP because they fit spurious features unique to each dataset. We freeze the image encoder from FashionCLIP and fine-tune a linear classifier, LINEAR, over the embeddings generated on a subset of categories (47) from the validation set from Farfetch. We run benchmarks on TEST S , KAGL S , F-MNIST S and DEEP S , sub-sampled versions of the respective datasets. Where labels are different, we adapt LINEAR to the labels by pooling the scores from relevant classes. We compare this to zero-shot performance, using the original labels to generate the text embeddings. Table 3 reports our findings, which are partially similar to those from CLIP 17 . Given that F-MNIST is very different from TEST-comparable, for example, to CIFAR-100 56 vs. ImageNet 57 -the decrease in performance may be an indication of cheating. However, LINEAR performs well on the other datasets, with the biggest gain for KAGL, whose product image most resembles those in TEST (i.e., high-resolution items on a white background). Compared to the original setting 17 , one may argue that the supervised model has an easier job in our case: much fewer categories ( 10 1 vs. 10 3 ) and relatively homogeneous items, F-MNIST aside.
While we leave the investigation of fashion classification in more ecological settings as future work, our results contain actionable insights for real-world deployments. In particular, supervised classifiers still require a good deal of manual intervention even for similar datasets, and they are utterly unusable on neighboring yet different problems. Table 4 reports performance on STLE divided by Man-and Woman-related items. Products in the dataset still come from Farfetch, but labels are manually assigned by merchandisers and are orthogonal to the taxonomy (classic, streetwear, edgy vs. shoes, hats, bags). The versatility afforded by language supervision allows zero-shot models to tackle the challenge by simple prompt engineering ("an item in classic style"); in contrast, supervised models would require a new training and evaluation pipeline. As emphasized above, learning general fashion concepts is the main motivation behind this work: while specific, supervised pipelines may still be the best choice for specific problems, they are no longer the only viable option in multi-tasks scenarios thanks to the advent of large-scale models such as FashionCLIP. Although no single answer can fit all the use cases, we wish to encourage data-driven decision-making by charting all the options and providing cost and performance assessments.

Grounding and compositionality.
As argued in the Introduction, given that we are interested in establishing a connection between generality and scalability through large multi-modal models, it is important to further evaluate the quality of the learned representations. While the question of whether FashionCLIP learns fashion has been addressed quantitatively above, we are also interested in evaluating the model from a broader theoretical perspective of language understanding, offering a glimpse of the extent of FashionCLIP's "true" generalization capabilities, ala "infinite use of finite means" 59 .
The literature on language compositionality spans centuries: limiting ourselves only to recent work, grounding has been explored in connection with efficient learning 60,61 , and "true understanding" 62,63 . Using combinatorial principles to test generalization abilities is a known strategy in toy world 64,65 : we exploit insights from our target domain to operationalize similar principles on real-world objects.
In this section, we provide evidence of semantic grounding in FashionCLIP and build on that to offer a preliminary investigation of its compositional abilities. Our analysis starts from two lessons from previous research. First, localization maps 66,67 are an effective way to probe the model for referential knowledge 68 (we borrow here the referential/inferential distinction from the classic work by Marconi 69 ) and visually grounded lexical knowledge. Second, from a linguistic point of view most search queries in fashion have the form of Noun Phrases (NPs)-e.g. "armani dress". Therefore, the semantics of NP can be considered a good real-world generalization 9,70 for studying FashionCLIP compositional and inferential abilities.
Grounding. We probe FashionCLIP for evidence of referential knowledge and investigate its grounding capabilities by utilizing localization maps. We further apply localization maps to the task of zero-shot fashion parsing-a crucial open problem in the industry 71 .
Localization maps are obtained by repeatedly occluding different parts of the image. We then encode each occluded version and measure its distance from the target text in the contrastive space. Intuitively, the farther the image is pushed away by the occlusion, the stronger the linkage was between the removed visual concept and the text and, in turn, the higher its score on the map. Fashion parsing is a specific case of semantic segmentation where bounding box annotations contain clothing items. We extract bounding box annotations (as an approximation of fine-grained segmentation) from localization maps by finding the minimum bounding rectangle of highly activated areas.
As shown in Figs. 6 and 8, features such as "high heels", "ankle strap", "long sleeves" are well represented in FashionCLIP; the model also seems to be very aware of brands, in more or less explicit form. FashionCLIP picks up the abstract logo on sneakers (Fig. 6), as well as showing (similar to CLIP) good OCR capabilities, when recognizing a logo as an explicit text string. Fig. 7 shows zero-shot bounding box annotations of some samples in the previously unseen ModaNet 72 dataset. While it is unlikely that zero-shot models could replace specialized segmentation training, we believe that models such as FashionCLIP could provide a cheap way to generate probabilistic labels for weak supervision pipelines. www.nature.com/scientificreports/ Compositionality. Given the preliminary evidence that isolated concepts reliably map onto visual regions, our working hypothesis is that FashionCLIP should exhibit true inferential abilities by composing such concepts to generate new NPs. We build on domain knowledge, previous literature 52 and Farfetch's inventory to probe the model for knowledge of brands (e.g. "nike"), features ("high heels"), and drawings ("keyboard"), manually verifying the text-toregion mapping for each of these concepts via localization maps. Given that these single concepts are grounded Figure 6. Grounded lexical knowledge. Maps are easy-to-use probes into the model fashion knowledge. Left to right: localization map for "long sleeves" on a red polo; sneakers and the map for "Nike", a phone cover and the map for "Palm Angels"; the same phone cover and map, when the logo is written with an out-of-distribution font in a new spot. Figure 7. Item bounding-box detection. Localization maps can be easily extended to provide zero-shot bounding boxes for items of interest.Green bounding boxes show the predicted locations for fashion concepts "Backpack" (left) and "Straw hat" (right). Images above are taken from the publicly available Unsplash Lite Dataset 1.2.0: FashionCLIP was tested extensively on ModaNet -please reach out to authors for links to those images. Figure 8. Grounding and compositionality. Localization maps for a product retrieved with the query "ankle strap sandals with high heels": left-to-right, the product, "ankle strap", "sandals", "high heels"). www.nature.com/scientificreports/ in regions (Fig. 8), we could leverage this knowledge to generate new images and NPs systematically. Crucially, we can assign a defined semantics to a new brand + object NP that describes an "improbable object" that has never been seen before (Fig. 9). Improbable objects vary: they may portray odd combinations of concepts, such as a Nike long dress, a surreal item, sneakers with handles, or an unlikely extension of existing fashion items, such as the keyboard pochette (which generalizes the theme first found in J. Mugatu's keyboard tie). A new NP such as "nike dress" would require the visual region corresponding to the word dress to contain the visual region of the logo corresponding to the word nike. We supplement our analysis by re-purposing our classification and retrieval pipeline: in the classification task, FashionCLIP achieves an accuracy of 0.74 when asked to pick the improbable label out of a set of credible distractors. The following are examples of test cases:

Backpack Straw hat
• target: NIKE DRESS (as seen in Fig. 9), labels: Nike dress, an Armani dress, a shirt, the flag of Italy, a Gucci dress, a Nike t-shirt; • target: BLACK SHOES WITH RED HEEL, labels: black shoes with red heel, black shoes, red shoes with red heel, red shoes with black heel, red shoes, fuchsia shoes, the flag of Italy, sneakers, black sneakers, a bag. • target: RED SHOES WITH BLACK HEEL (as seen in Fig. 9), labels: black shoes with red heel, black shoes, red shoes with red heel, red shoes with black heel, red shoes, fuchsia shoes, the flag of Italy, sneakers, black sneakers, a bag.
For the retrieval task, we add the new images to TEST, and use the NPs as queries: out of 20k products, the model's top choice is correct half the time (HITS@1 = 0.53 ), a percentage that quickly rises to 0.82 with k = 5 (as a comparison, CLIP scored HITS@1 = 0.51 and HITS@5 = 0.73). While a full-fledged investigation of compositional abilities is beyond the scope of this contribution, Fash-ionCLIP inferences on improbable products suggest the presence of some degree of compositionality: important fashion concepts are "identifiable" in the latent space and can be singled out and re-combined into unseen concepts, exhibiting on a small scale the creative generalization we usually associate with symbolic systems 73 . In addition, the ability to distinguish "red shoes with black heel" from "black shoes with red heel" implies knowledge beyond a bag-of-words semantics 74 .
Recent research suggests that CLIP's compositional capabilities are limited. 75 . As shown by our results, restricted domains allow for direct manipulation, without the risk of confounding; indeed, restricted domains may be easier to explore but further investigation is needed to confirm compositional capabilities. Furthermore, as suggested by the use of the MASKClip objective introduced in the ARMANI model 33 , adding explicit visual segmentation may induce better discrimination for certain fashion concepts. While more costly losses are an interesting area at the intersection of grounding and compositionality, given both the narrow generative focus and the magnitude of the improvements in the original paper 33 , their conclusions cannot be readily applied to FashionCLIP. We look forward to performing future research combining insights from generative and discriminative use cases.

Discussion
FashionCLIP is a domain-adaptation of CLIP, motivated by central use cases in fashion 71 : differently from task-specific supervised methods, FashionCLIP does not need a specialized architecture, labeling, and tuning. We extensively verified the flexibility afforded by language supervision, and investigated FashionCLIP's semantic capabilities on new tasks. Our focus on a specific industry allows not just practical gains but also opens up theoretical possibilities by constraining the domain, which is still large, but also easy to manipulate. By providing quantitative and qualitative evidence that contrastive learning, coupled with a large and diverse dataset, can indeed produce general multi-model industry concepts, we connect theoretical virtues with significant practical gains, and open new possibilities for scaling the horizontal deployment of machine learning systems in an effective way.
As a truly general system, FashionCLIP concepts could be used for many more tasks: for example, multimodal representations can be features in downstream systems, or directly used for zero-shot recommendations in item-to-item scenarios 76 ; classification over arbitrary labels could be used as a fast and scalable labeling mechanism, supporting probabilistic labeling 77 or data generation for multi-modal IR models 78 . While leaving Figure 9. Improbable products. By combining fashion features, brands, and items in new ways, we obtain visually realistic products with clear, zero-shot compositional semantics. From left to right: "Nike long dress", "converse with handles", "red shoes with black high heel", "keyboard pochette". www.nature.com/scientificreports/ this (and many other themes) to future iterations, we do believe this work-with its artifacts and methodology-to be a first, rounded assessment of the great potential of general, transferable, multi-modal concepts for digital commerce.
The authors are aware of the risks of multi-modal CLIP-like models in production associated with their limited robustness, as well as general issues with biases in large language models pre-trained at scale. In particular, we acknowledge that the risk of adversarial attacks on multi-modal models is an area of active research 80,81 . To the limits of our knowledge, we have no reason to believe that FashionCLIP introduces any additional risk when compared to the original CLIP. As with the original model, it should be noted that FashionCLIP appears to be susceptible to "typographical attacks" (Fig. 10). No datasets used for training or testing contain PII and/ or other sensitive user data.

Methods
Training dataset. Farfetch made available for the first time an English dataset comprising over 800 k fashion products, with more than 3k brands across dozens of object types. Compared to other large fashion datasets, our dataset is significantly more complete than DeepFashion 52 , which lacks detailed text descriptions, and even larger than CM-Fashion 33 , which has been collected without any direct involvement by Farfetch. Items are organized in hierarchical trees, producing a three-layer taxonomy: for example, trees could be something like Clothing> Dresses> Day Dresses or Clothing> Coats> Parkas, for a total of 800+ trees. As input for the image encoder, we use the standard product image, which is a picture of the item over a white background, with no humans (images follow a specific set of rules regarding the placement of the item, lights of the photo, etc., designed to highlight the item's features); as for the text, Farfetch has two types of text, highlight (e.g., "stripes", "long sleeves", "Armani") and a short description ("80s styled t-shirt"). See Fig. 3 for an example.
We create a training, validation, and test set from the catalog by random sampling products. Our final training and validation sets comprise 700 k and 50 k products respectively from 188 categories.
Training pipeline. We apply fine-tuning starting from the pre-trained CLIP with the following parameters: we use Adam Optimizer with betas in (0.9, 0.98), epsilon of 1e−6 and weight decay equal to 0.2 and three different learning rates [1e−4, 1e−5, 1e−6]. We train the models for 4 epochs, evaluate every 500 steps and select the model with the lowest validation loss for each configuration (Table 5, model selected in bold). In our preliminary tests, the model with the lowest validation loss overall did not generalize the best in the zero-shot setting. This poses an interesting question, left for future work, of how to fine-tune these large pre-trained models without losing in generalization. The pipeline has been implemented with Metaflow 82 , with training executed remotely on cloud GPUs; experiment tracking was provided by Comet 83 .
Testing datasets. We prepare the following datasets for testing purposes and to further gauge the potential impact of the model in production at scale. TEST is the test set from Farfetch containing 20k products; HOUT-C is the dataset containing a category which we excluded from training (Performance Tops), for a total of 1.5k items; HOUT-B is the dataset containing two brands which were excluded from training, for a total of 1.7k items; STLE is a merchandising dataset from Farfetch, completely independent from the catalog, that classifies 7749 items across 6 styles for gender women and 4 styles for gender men; example of styles are Classic and Figure 10. Typographical attack. FashionCLIP correctly identifies the object to the left as an "apple", but misclassifies the one to the right as "nike air", as the text acts as a confounder 79 . www.nature.com/scientificreports/ Streetwear and each item may belong to more than one style; KAGL is a subset of 50 , where each product has a white background image, a caption, and a category, for a total of 9990 items over 62 categories; F-MNIST 51 contains 10, 000 gray-scale images from 10 product classes, with pixel intensity inverted to obtain images with white background (note that these images have a size of 24 × 24 thus showing much less details than the images on which the models have been trained on). DEEP 52 contains 4000 product images that are non-standardized (i.e contains humans) from 50 categories.
Training FashionCLIP. We re-purpose the CLIP main architecture 17 , which we describe briefly in the Introduction for the sake of completeness. In the end, we obtain a multi-modal space where images and texts are jointly projected and learned: if training has been successful, we expect that, for example, the textual embedding for the string "red long dress" is actually similar (as measured by the dot product) to the image embeddings of red dresses. Table 5 shows training time, performance, and costs.

Data availability
The KAGL, F-MNIST, and DEEP datasets are publicly available. The Farfetch dataset is scheduled to be released in the near future. As part of the ongoing mission to help the retail space leverage the latest A.I. techniques and to promote multidisciplinary research in data science across industries, Farfetch is working to finalize the release of the dataset used in this study under a research-friendly license. Please check https:// github. com/ Farfe tch for updates on the data release, and reach out to the authors for preliminary inquiries. www.nature.com/scientificreports/