Cross-modal Search for Fashion Attributes

Katrien Laenen, KU Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium, katrien.laenen@kuleuven.be

Susana Zoghbi, KU Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium, susana.zoghbi@kuleuven.be

Marie-Francine Moens, KU Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium, sien.moens@kuleuven.be

ABSTRACT
In this paper we develop a neural network which learns intermodal representations for fashion attributes to be utilized in a cross-modal search tool. Our neural network learns from organic e-commerce data, which is characterized by clean image material, but noisy and incomplete product descriptions. First, we experiment with techniques to segment e-commerce images and their product descriptions into respectively image and text fragments denoting fashion attributes. Here, we propose a rule-based image segmentation approach which exploits the cleanness of e-commerce images. Next, we design an objective function which encourages our model to induce a common embedding space where a semantically related image fragment and text fragment have a high inner product. This objective function incorporates similarity information of image fragments to obtain better intermodal representations. A key insight is that similar looking image fragments should be described with the same text fragments. We explicitly require this in our objective function, and as such recover information which was lost due to noise and incompleteness in the product descriptions. We evaluate the inferred intermodal representations in cross-modal search. We demonstrate that the neural network model trained with our objective function on image fragments acquired with our rule-based segmentation approach improves the results of image search with textual queries by 198% for recall@1 and by 181% for recall@5 compared to results obtained by a state-of-the-art image search system on the same benchmark dataset.

CCS Concepts
• Information systems → Image search; • Computing methodologies → Image segmentation; Neural networks; Natural language generation; Image representations; Cluster analysis; • Applied computing → Online shopping;


KDD ’17, August 13–17, 2017, Halifax, Nova Scotia, Canada. © 2017 ACM.

Keywords
Cross-modal search, fashion

1. INTRODUCTION
Fashion e-commerce is a booming business. We currently witness a shift from physical to digital retail. Therefore, applications that organize and retrieve fashion items have great economic value.

Imagine a cross-modal search tool that can be learned automatically from the organic and noisy data found in webshops. Such a cross-modal search tool performs both the task of image annotation, i.e., given an image, return suitable textual descriptors, and the task of image search, i.e., given textual descriptors, retrieve images showing the visual characteristics expressed by those descriptors. This would not only alleviate the workload of human annotators, but also promote increased access to relevant products. Currently, products are found by matching key terms in the product descriptions. Often the product description does not mention all attributes, because they are already visible in the accompanying image, which hampers the search for such attributes. The only alternative is then to navigate through the taxonomy of products offered by the e-retailer. Consequently, the cross-modal search tool would provide a more flexible way of searching for products in webshops.

However, building such a cross-modal search tool is definitely not straightforward. On the text side, we have e-commerce product descriptions which are noisy and incomplete, making them challenging to learn from. On the image side, we observe that within a clothing category (e.g. dresses) garments share a high degree of shape similarity. They only differ in certain details, i.e., in their fashion attributes. For instance, the overall shape of two dresses is always the same, but they can have different kinds of necklines, sleeve lengths, colors, and so on. Over different seasons and trends we also notice that it is the fashion attributes which change appearance, while the overall garment shape remains constant. Therefore, a cross-modal search tool should operate on the level of fashion attributes. Based on its knowledge of fashion attributes, e.g., “What does a V-neck look like?” or “What do we call this kind of skirt shape?”, the cross-modal search tool will be able to search for requested attributes and to annotate fashion images.

In this paper, we obtain this knowledge of fashion attributes with a neural network which learns to align fashion attributes in images and texts by embedding them into a common, multimodal space (Figure 1). We focus on a single clothing category: dresses.


Figure 1: Both the image and text are segmented into fragments. These fragments represent the fashion attributes. Corresponding fashion attributes in the image and text are aligned by embedding them into a common embedding space. (Image reference: www.amazon.com)

The inferred intermodal representations are utilized in a cross-modal search tool, which we evaluate on image search and image annotation.

The contributions of our work are:

• We propose a neural network alignment model to find the latent alignment between fashion image regions and phrases by embedding them into a common, multimodal space. The obtained intermodal representations allow cross-modal search of fashion attributes.

• Our proposed model learns intermodal representations from organic and noisy data as found in webshops, and does not rely on manually curated data.

• We illustrate how similarity information of fashion image regions can be used to acquire better intermodal representations.

• We perform cross-modal search of fashion items, i.e., given an image, return suitable textual descriptors, and given textual descriptors, retrieve images exhibiting the characteristics expressed by the textual descriptors. This is realized with the inferred intermodal representations, thus without one modality relying on the other. Unlike previous work [18] that uses intermodal representations of full images and texts, our model works at a finer level and employs intermodal representations of image regions and phrases.

• We substantially outperform the results of image search obtained by a state-of-the-art fashion image search system. Compared to this state-of-the-art system, we achieve an increase of 196% on recall@1 and of 181% on recall@5 on the Amazon Dresses dataset.

The remainder of this paper is structured as follows. In the next section we review existing work related to the subject. In Section 3 we explain our model architecture and training objective in detail. Section 4 presents the experiments conducted in this work. The results of these experiments are given and discussed in Section 5. Finally, we conclude and provide the future direction of this work in Section 6.

2. RELATED WORK
Over the past few years, several techniques have been developed to generate image regions enclosing the objects in an image. Examples are objectness [1] and selective search [15]. However, these techniques are developed for detecting straightforward objects in general image scenes, while we want to detect more fine-grained product attributes in a fashion context. Recently, there has been a lot of research on fine-grained image segmentation. Such techniques work on fine-grained classes (e.g. different bird species) and try to detect critical regions in the images that allow discrimination between the fine-grained classes [7, 17]. Regarding the segmentation of clothing, existing techniques rather focus on generating image regions containing complete fashion items (e.g. a t-shirt) instead of fine-grained fashion attributes (e.g. short sleeves, V-neck) [4].

High-dimensional feature vectors produced by convolutional neural networks (CNNs) are currently the most popular image representations [5, 8, 16, 18]. CNNs have replaced techniques like the scale-invariant feature transform (SIFT) [9, 18] and have become the state-of-the-art image processing technique.

To segment product descriptions into fashion attributes, Zoghbi et al. [18] propose to filter them, either by using a part-of-speech (POS) tagger to only retain adjectives, adverbs, verbs, and nouns, or alternatively by using a domain-specific vocabulary and only retaining phrases present in this vocabulary.

Words and phrases are represented with low-dimensional feature vectors which capture the syntax and semantics of that word or phrase. Recently, word vectors acquired with neural networks have become very popular. In [13, 14] two simple two-layered neural networks are proposed to obtain word representations: the Skip-gram model and the continuous bag-of-words (CBOW) model. The context information captured by these models is rather limited though. In contrast, bidirectional recurrent neural networks (BRNNs) [5] and especially long short-term memories (LSTMs) are able to capture long-term dependencies in full sentences and discourses.

Our attribute alignment task falls into the general category of learning from multimodal data. However, current research on learning from multimodal fashion e-commerce data is still very limited. Techniques have been developed that, given a real-world fashion image as query, return fashion items on e-commerce websites which are similar [4] or identical [6] to those in the provided real-world image. While these techniques result in an image retrieval setting which is only based on visual analysis, we use both fashion images and product descriptions to learn a model which performs cross-modal retrieval of fashion items. In [11, 12] the focus is on annotating images with keywords and on learning from noisy and incomplete product descriptions, which is similar to our work. In contrast with our work, they only use hand-crafted image features (e.g. SIFT), they do not address the task of image search, and their dataset is less than half the size of ours. Most closely related is the work of Zoghbi et al. [18], who learn a cross-modal search tool from multimodal fashion e-commerce data. They experiment with two different models to infer intermodal correspondences: canonical correlation analysis (CCA), which explicitly models the correlations between language and visual data, and bilingual latent Dirichlet allocation (BiLDA), a technique that bridges the two modalities through probabilistic latent topics.


Their BiLDA model constitutes the state of the art for cross-modal search of fashion items. However, while their model is based on the intermodal correspondences of full images and texts, we try to find the intermodal correspondences of image regions and phrases. Additionally, we focus on neural networks to bridge the two modalities. Our objective function is inspired by the one of Karpathy et al. [5], who train a neural network which projects objects in visual indoor and outdoor scenes and their textual descriptions into a common embedding space to discover their latent alignment. This objective function uses local co-occurrence information of image regions and words and global image-text correspondence to infer the unknown alignment at the level of image regions and words, which is what we try to achieve in this work. However, while their model expects clean and complete textual annotations with general content of visual scenes composed of prominent objects, our model works with organic fashion e-commerce data as found in webshops, where fashion products are characterized by a multitude of fashion attributes, and where product descriptions barely have grammatical structure, use a sector-specific vocabulary, and are often noisy and incomplete. Moreover, in order to learn a better alignment, we propose to incorporate similarity information of detected image regions across all images of our collection. In the past, including image similarity information has proved to be effective in visual relationship detection [10] and in label propagation [2].

3. METHODOLOGY
In this paper we propose a model to align image regions and phrases denoting fashion attributes (Figure 2). First, we discuss how we detect fashion attributes in fashion images and product descriptions. Next, we elaborate on the objective function that is used to learn to align fashion attributes across different modalities. Finally, we describe how cross-modal search of fashion images and texts is achieved with the acquired intermodal representations.

3.1 Image Segmentation and Representation

3.1.1 Selective Search Segmentation
First, we use selective search [15] to get the regions of the image showing the fashion attributes. We consider all generated image regions and the full image as the image fragments. Consequently, each image has a different number of image fragments.

3.1.2 Rule-based Segmentation
Images on e-commerce websites show clothing items either on their own or worn by a model in different poses. In either case, the item is shown clearly and fully, usually on a white background. Hence, when the item is worn by a model, we assume that, independently of the pose, the model always faces the camera and stands straight.

We experiment with a rule-based segmentation approach based on the geometry of garments in a clothing category. We start from the insight that the geometry of garments in a certain clothing category (e.g. dresses) gives us information about where to find the fashion attributes. More precisely, when we know that the garment is displayed in a straight position and in frontal view, we know the approximate location of each garment part and thus of each fashion attribute.

Table 1: Expected locations of the parts and fashion attributes of a dress. A location is a rectangle represented as (x, y), w, h, with (x, y) the coordinates of the upper left corner, w the width and h the height of this rectangle. W and H refer to respectively the width and the height of the bounding box surrounding the full dress.

Dress part       | Approximate location      | Expected fashion attributes
top              | (0, 0), W, 0.35H          | neckline, sleeve length, top shape, color/print, accessories
full skirt       | (0, 0.30H), W, 0.70H      | skirt shape, skirt length, color/print, accessories
skirt above knee | (0, 0.25H), W, 0.40H      | skirt shape, skirt length, color/print, accessories
neckline         | (0, 0), W, 0.20H          | kind of neckline, color/print, accessories
left sleeve      | (0, 0), 0.50W, 0.50H      | sleeve length, sleeve form, color/print, accessories
right sleeve     | (0.50W, 0), 0.50W, 0.50H  | sleeve length, sleeve form, color/print, accessories

Therefore, to find the fashion attributes in an image of a dress, we first find a bounding box enclosing the full dress. We do this by transforming the image to the Lab color space and thresholding it to determine which pixels belong to the dress and which to the background. The bounding box is then spanned by the leftmost, topmost, rightmost and bottommost pixels belonging to the dress. If the area of this bounding box is less than one fifth of the area of the complete image, we take the bounding box enclosing the complete image instead, since in this case the thresholding has usually gone wrong. The region inside the bounding box serves as the first image fragment. Next, we use our knowledge about the geometry of a dress to segment the region inside the bounding box into 6 more image fragments containing respectively the top, the full skirt, the part of the skirt above the knee, the neckline, the left sleeve and the right sleeve (Table 1). With this rule-based segmentation approach each image has 7 image fragments corresponding to locations where fashion attributes are likely to be found.
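
As an illustration, a minimal Python sketch of this rule-based segmentation, assuming OpenCV and NumPy, could look as follows. The relative crops follow Table 1; the white-background threshold of 245 is an illustrative choice, not a value reported in this paper.

```python
import cv2          # OpenCV, assumed available
import numpy as np

# Relative crops from Table 1: name -> (x, y, w, h) as fractions of the dress bounding box.
DRESS_PARTS = {
    "top":              (0.00, 0.00, 1.00, 0.35),
    "full_skirt":       (0.00, 0.30, 1.00, 0.70),
    "skirt_above_knee": (0.00, 0.25, 1.00, 0.40),
    "neckline":         (0.00, 0.00, 1.00, 0.20),
    "left_sleeve":      (0.00, 0.00, 0.50, 0.50),
    "right_sleeve":     (0.50, 0.00, 0.50, 0.50),
}

def dress_fragments(image_bgr, white_thresh=245):
    """Return the 7 image fragments: the full dress bounding box plus 6 geometric crops."""
    img_h, img_w = image_bgr.shape[:2]
    # Threshold the L channel of the Lab image to separate garment pixels from the white background.
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    garment = lab[:, :, 0] < white_thresh
    ys, xs = np.nonzero(garment)
    if len(xs) > 0:
        x0, x1 = xs.min(), xs.max() + 1
        y0, y1 = ys.min(), ys.max() + 1
    else:
        x0, y0, x1, y1 = 0, 0, img_w, img_h
    # Fall back to the full image when the box is suspiciously small (thresholding failed).
    if (x1 - x0) * (y1 - y0) * 5 < img_w * img_h:
        x0, y0, x1, y1 = 0, 0, img_w, img_h
    box = image_bgr[y0:y1, x0:x1]
    W, H = x1 - x0, y1 - y0
    fragments = [box]
    for fx, fy, fw, fh in DRESS_PARTS.values():
        x, y = int(fx * W), int(fy * H)
        w, h = int(fw * W), int(fh * H)
        fragments.append(box[y:y + h, x:x + w])
    return fragments
```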

3.1.3 Image Representation
We represent the image fragments with the BVLC CaffeNet CNN model¹ [3]. This CNN is pre-trained on ImageNet and only differs from the AlexNet model [8] in that it is not trained with relighting data augmentation and that the order of the pooling and normalization layers is switched. The image fragment representations are acquired as the activation weights of the last fully connected layer before the softmax layer, which have dimension 4096 in this CNN architecture.
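
For concreteness, the feature extraction step can be sketched with torchvision's ImageNet-pretrained AlexNet as a stand-in for the BVLC CaffeNet model used here; the 224×224 preprocessing values are the standard torchvision ones, not settings reported in this paper.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# AlexNet stand-in: keep the classifier up to (and including) the last hidden fully
# connected layer, so the output is the 4096-d activation vector before the softmax layer.
alexnet = models.alexnet(pretrained=True)
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fragment_features(fragment_pil: Image.Image) -> torch.Tensor:
    """Map one image fragment to its 4096-d CNN representation."""
    with torch.no_grad():
        x = preprocess(fragment_pil).unsqueeze(0)   # 1 x 3 x 224 x 224
        return alexnet(x).squeeze(0)                # 4096-d vector
```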

3.2 Text Segmentation and Representation
We train word embeddings on the product descriptions in the Amazon Dresses dataset using the Skip-gram model [13]. These word embeddings allow us to learn a single word embedding for multiword fashion expressions (e.g. dropped waist) and to better capture the syntax and semantics of fashion-related phrases.

Then, to acquire the text fragments, we first convert all words to lowercase, and remove all non-alphanumeric characters and words which occur fewer than 5 times in the training set.

¹ https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet


Figure 2: Model overview. Left: An image is segmented into regions, which together with the full image function as the image fragments. The image fragments are embedded through a CNN. Right: A product description is filtered with the Zappos glossary. Each Zappos phrase is a text fragment and is represented with a word embedding. Middle: The alignment model learns to project semantically related image and text fragments to vectors in a common embedding space which have a high inner product, as depicted by dark shades of grey. The resulting intermodal representations form the core building blocks for a cross-modal search tool.

Next, we filter the product descriptions to only retain fashion-related phrases. Following the approach of Zoghbi et al. [18] we use the glossary of the online clothing shop Zappos², which contains both single-word (e.g. strapless) and multiword expressions (e.g. little black dress) related to fashion. Although this removes much noise from the product descriptions, they still remain quite noisy. Remaining phrases might still refer to parts of the garment which are not visible in the image (e.g. the back or side) or describe properties of the garment which are not displayed (e.g. all possible colors). Afterwards, we consider each Zappos phrase as a text fragment. Hence, the number of text fragments differs for different product descriptions, and some product descriptions might even have no text fragments.
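
A compact sketch of this text pipeline, assuming gensim (≥ 4.0) for the Skip-gram model and a variable zappos_phrases holding the lowercased glossary entries; the exact tokenization details of our implementation are not reproduced here.

```python
import re
from gensim.models import Word2Vec

def tokenize(description, zappos_phrases):
    """Lowercase, strip non-alphanumerics, and merge multiword Zappos phrases into one token."""
    text = re.sub(r"[^a-z0-9 ]", " ", description.lower())
    for phrase in sorted(zappos_phrases, key=len, reverse=True):
        if " " in phrase and phrase in text:
            text = text.replace(phrase, phrase.replace(" ", "_"))
    return text.split()

def train_embeddings(descriptions, zappos_phrases):
    sentences = [tokenize(d, zappos_phrases) for d in descriptions]
    # Skip-gram (sg=1), 300-d vectors, context window 5, words seen fewer than 5 times dropped.
    return Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1)

def text_fragments(description, zappos_phrases, model):
    """Keep only Zappos phrases as text fragments; each is represented by its word embedding."""
    tokens = tokenize(description, zappos_phrases)
    keep = {p.replace(" ", "_") for p in zappos_phrases}
    return [model.wv[t] for t in tokens if t in keep and t in model.wv]
```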

3.3 Alignment Model
After segmentation, an image-text pair is represented as a set of image fragments and a set of text fragments. We know that some image fragments and text fragments in these sets correspond, but it is unknown which ones. Therefore, we train a neural network to induce a common embedding space which uncovers the intermodal correspondences.

This neural network learns parameters θ = {W_v, b_v, W_s, b_s} to project an image fragment v_i and a text fragment s_j to respectively the vectors

$$v_i = W_v v_i + b_v, \tag{1}$$

$$s_j = f(W_s s_j + b_s), \tag{2}$$

in the common embedding space, which have a high inner product if the corresponding image and text fragment are semantically similar and a low inner product otherwise.

² http://www.zappos.com

Hence, we interpret the inner product of an image fragment and a text fragment in the common embedding space as a measure of their semantic similarity. Here, W_v has dimensions h × 4096 and W_s has dimensions h × dim, where h is the size of the common embedding space and dim is the dimension of the word vectors. Parameters b_v and b_s are bias terms. The activation function f is set to the rectified linear unit (ReLU)³, which computes f(x) = max(0, x).
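
The two projections of Eqs. (1) and (2) amount to two linear layers; a minimal PyTorch sketch is given below (the dimension defaults correspond to the values used later in this paper).

```python
import torch
import torch.nn as nn

class FragmentEmbedder(nn.Module):
    """Projects image fragments (4096-d CNN features) and text fragments (dim-d word vectors)
    into a common h-dimensional space (Eqs. 1 and 2). Only the text side uses a ReLU,
    following footnote 3."""
    def __init__(self, h=1000, img_dim=4096, txt_dim=300):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, h)   # W_v, b_v
        self.txt_proj = nn.Linear(txt_dim, h)   # W_s, b_s

    def forward(self, img_frags, txt_frags):
        v = self.img_proj(img_frags)              # Eq. (1)
        s = torch.relu(self.txt_proj(txt_frags))  # Eq. (2)
        return v, s

    def similarity(self, v, s):
        # Inner products v_i^T s_j serve as semantic similarity scores.
        return v @ s.t()
```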

To find the intermodal correspondences, the neural network is trained with an objective function consisting of three different objectives: the fragment alignment objective [5], the global ranking objective [5] and the image cluster consistency objective.

Following Karpathy et al. [5], we use the fragment alignment objective C_F(θ), which uses local co-occurrence information to infer which image fragment and text fragment should be aligned. This objective is formulated as

$$C_F(\theta) = \min_{y_{ij}} C_0(\theta) \tag{3}$$

$$C_0(\theta) = \sum_i \sum_j \max(0,\, 1 - y_{ij}\, v_i^T s_j) \tag{4}$$

subject to

$$\sum_{i \in p_j} \frac{y_{ij} + 1}{2} \geq 1 \quad \forall j \tag{5}$$

$$y_{ij} = -1 \quad \forall i, j \ \text{such that}\ m_v(i) \neq m_s(j) \tag{6}$$

$$y_{ij} \in \{-1, 1\}. \tag{7}$$

It considers all image fragments v_i and text fragments s_j in the training set. Variable y_ij reflects whether v_i and s_j should be aligned (y_ij = 1) or not (y_ij = -1), and consequently whether their similarity score v_i^T s_j should be encouraged to be more than 1 or less than -1 (Eq. 4).

³ Experiments showed that only using the ReLU activation function on the text side works best.


To decide the value of variable y_ij, the fragment alignment objective uses co-occurrence information of the fragments during training. m_v(i) and m_s(j) return the index (∈ {1, ..., N}) of the image and sentence that the fragments v_i and s_j belong to. When v_i and s_j do not belong to the same image-text pair, they should not be aligned (Eq. 6). For those that do belong to the same image-text pair, the objective tries to find the variables y_ij which minimize Eq. 4 (Eq. 3). Here, the only constraint is that each text fragment should be aligned with at least one image fragment it occurs with, i.e., with at least one image fragment in the positive bag p_j of s_j (Eq. 5). Since this objective benefits from a good initialization of the intermodal representations, it is trained with y_ij = 1 for all v_i and s_j of corresponding image-text pairs during the first 15 epochs. Afterwards, the objective is changed to Eq. 3 to refine the fragment alignments.
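
A possible mini-batch implementation of C_F is sketched below; the selection of y_ij follows our reading of Eqs. (3)-(7) and the heuristic of Karpathy et al. [5], not a released implementation.

```python
import torch

def fragment_alignment_loss(v, s, img_ids, txt_ids, warm_start=False):
    """Fragment alignment objective C_F (Eqs. 3-7).
    v: (n_img_frag, h) and s: (n_txt_frag, h) embedded fragments; img_ids / txt_ids give the
    image-text pair index (m_v(i), m_s(j)) of each fragment. Assumes every text fragment has
    at least one image fragment from the same pair in the batch."""
    scores = v @ s.t()                                        # v_i^T s_j
    same_pair = img_ids.unsqueeze(1) == txt_ids.unsqueeze(0)  # m_v(i) == m_s(j)

    with torch.no_grad():
        # y_ij = -1 whenever fragments come from different image-text pairs (Eq. 6).
        y = -torch.ones_like(scores)
        if warm_start:
            # First epochs: all fragments inside a corresponding image-text pair are aligned.
            y[same_pair] = 1.0
        else:
            # Intra-pair y_ij follows the sign of the current score ...
            y[same_pair & (scores > 0)] = 1.0
            # ... but every text fragment keeps at least one positive image fragment (Eq. 5).
            masked = scores.masked_fill(~same_pair, float("-inf"))
            best_i = masked.argmax(dim=0)
            y[best_i, torch.arange(scores.size(1))] = 1.0

    return torch.clamp(1.0 - y * scores, min=0).sum()         # Eq. 4
```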

The global ranking objective C_G(θ) [5] uses global information about fragments, and enforces that corresponding image-text pairs (k = l) should have a higher similarity score (by a margin ∆) than non-corresponding ones. The global ranking objective is given by the following equation:

$$C_G(\theta) = \sum_k \Bigg[ \underbrace{\sum_l \max(0,\, S_{kl} - S_{kk} + \Delta)}_{\text{rank images}} + \underbrace{\sum_l \max(0,\, S_{lk} - S_{kk} + \Delta)}_{\text{rank texts}} \Bigg], \tag{8}$$

where the similarity score S_kl of an image k and text l is computed based on the similarity scores of their respective fragments f_k and f_l:

$$S_{kl} = \frac{1}{|f_l| + n} \sum_{j \in f_l} \max_{i \in f_k} v_i^T s_j. \tag{9}$$

Here, n is a smoothing term that prevents shorter texts from having an advantage over longer texts.
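
The pair similarity of Eq. (9) and the ranking loss of Eq. (8) can be sketched as follows; the l = k terms, which only contribute a constant ∆, are masked out here, which is one reasonable implementation choice.

```python
import torch

def pair_similarity(v_frags, s_frags, n_smooth=10.0):
    """S_kl (Eq. 9): image fragments v_frags (n_i x h) vs. text fragments s_frags (n_j x h)."""
    if s_frags.numel() == 0:                     # a text may have no Zappos phrases
        return v_frags.new_tensor(0.0)
    scores = v_frags @ s_frags.t()               # all fragment-level inner products
    return scores.max(dim=0).values.sum() / (s_frags.size(0) + n_smooth)

def global_ranking_loss(image_frags, text_frags, margin=40.0):
    """C_G (Eq. 8); image_frags[k] and text_frags[k] form the k-th corresponding pair."""
    N = len(image_frags)
    S = torch.stack([torch.stack([pair_similarity(image_frags[k], text_frags[l])
                                  for l in range(N)]) for k in range(N)])   # S[k, l]
    diag = S.diagonal()
    rank_images = torch.clamp(S - diag.unsqueeze(1) + margin, min=0)   # S_kl vs. S_kk
    rank_texts = torch.clamp(S - diag.unsqueeze(0) + margin, min=0)    # S_lk vs. S_kk
    off_diag = 1.0 - torch.eye(N, device=S.device)                     # drop constant l = k terms
    return ((rank_images + rank_texts) * off_diag).sum()
```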

The image cluster consistency objective C_I(θ) attempts to improve the intermodal representations based on image fragment similarity information. The objective tries to exploit the fact that similar image fragments should be aligned with the same text fragments. We find similar image fragments by clustering the image fragments into C clusters based on cosine distance with k-means clustering. Then, we express the image cluster consistency objective as follows:

$$C_I(\theta) = \sum_{n=1}^{N} \sum_i \sum_j \left(1 - \frac{v_i^T c_i}{\|v_i\|\,\|c_i\|}\right) \left| v_i^T s_j - c_i^T s_j \right|. \tag{11}$$

This objective considers all N image-text pairs in the training set, and for each pair sums over its image fragments v_i and text fragments s_j. It then encourages the difference between the similarity score of image fragment v_i and text fragment s_j and the similarity score of the similar image fragment c_i and that same text fragment to be as small as possible. Here, we take as similar image fragment c_i the centroid of the cluster of v_i⁴. We weight the difference in similarity scores with a factor based on the cosine similarity of the two image fragments.

⁴ We also experimented with the medoid and the nearest neighbour in the same cluster.

With this weighting factor, the objective tries to prevent image fragments in the same cluster which are actually not so similar from being aligned with the same text fragments, and thus guards against introducing errors due to defects in the clustering.
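
A sketch of C_I for one image-text pair is given below; it assumes the cluster centroids have been projected into the common space with Eq. (1), a detail the text above leaves open.

```python
import torch

def cluster_consistency_loss(v, s, centroids, cluster_id):
    """Image cluster consistency objective C_I (Eq. 11) for one image-text pair.
    v: image fragments (n_i x h) in the common space, s: text fragments (n_j x h),
    centroids: cluster centroids in the common space (C x h),
    cluster_id: cluster index of every image fragment (n_i,)."""
    c = centroids[cluster_id]                               # c_i, the centroid of v_i's cluster
    weight = 1.0 - torch.cosine_similarity(v, c, dim=1)     # (1 - cos(v_i, c_i)) weighting factor
    diff = (v @ s.t() - c @ s.t()).abs()                    # |v_i^T s_j - c_i^T s_j|
    return (weight.unsqueeze(1) * diff).sum()
```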

Ultimately, the complete objective function to train our neural network is

$$C(\theta) = C_F(\theta) + \gamma\, C_I(\theta) + \beta\, C_G(\theta) + \alpha \|\theta\|_2^2, \tag{12}$$

where θ refers to the network parameters and α, β and γ are hyperparameters which we set based on a validation set.

3.4 Cross-modal Search
We can use the inferred intermodal representations for image search and image annotation. In image search we retrieve images that display the fashion attributes expressed in a textual query l. We compute the similarity score S_kl of text l with all images k and return the top K images with the highest similarity scores. In image annotation, we annotate an image query k with suitable fashion attributes. We compute the similarity score S_kl of image k with all textual descriptors l and retrieve the top K descriptors with the highest similarity scores.
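
Given fragments that have already been projected into the common space, both search directions reduce to ranking by inner-product scores; a sketch with illustrative variable names of our own:

```python
import torch

def image_search(query_txt_frags, gallery_img_frags, top_k=5, n_smooth=10.0):
    """Rank gallery images for a textual query by S_kl (Eq. 9); returns the top-K gallery indices.
    query_txt_frags: (n_j x h) embedded Zappos phrases of the query,
    gallery_img_frags: list of (n_i x h) embedded fragment matrices, one entry per image."""
    scores = []
    for v in gallery_img_frags:
        s = v @ query_txt_frags.t()                          # fragment-level inner products
        scores.append(s.max(dim=0).values.sum() / (query_txt_frags.size(0) + n_smooth))
    return torch.stack(scores).topk(top_k).indices

def image_annotation(query_img_frags, phrase_vectors, phrases, top_k=5):
    """Return the K textual descriptors (Zappos phrases) that best match an image query.
    phrase_vectors: (n_phrases x h) embedded candidate descriptors."""
    scores = (query_img_frags @ phrase_vectors.t()).max(dim=0).values   # best fragment per phrase
    return [phrases[i] for i in scores.topk(top_k).indices]
```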

4. EXPERIMENTAL SETUP

4.1 Dataset
We train our model on the Amazon Dresses dataset, which was collected from the Amazon webshop by Zoghbi et al. [18] between January and February 2015. This dataset consists of 53 689 images of dresses and their product descriptions. The images show dresses from different categories, such as bridesmaid, casual, mother of the bride, night out and cocktail, special occasion, wear to work, and wedding. The product descriptions consist of the surrounding natural language text in the webshop, like the title, features and editorial content. Hence, this dataset contains natural multimodal e-commerce data, where the product descriptions are noisy, incomplete and can contain misspellings, incorrect grammar and incorrect punctuation.

We use 48 689 image-text pairs for training, 4000 for validation and 1000 for testing. During testing, we evaluate the quality of the inferred intermodal representations in a cross-modal retrieval setting. For image search, the textual queries are the complete product descriptions of the test images. In the absence of a complete ground truth reference collection, we consider as the ground truth for each textual query the corresponding test image. Likewise for image annotation, the visual queries are the test images and the ground truth for a visual query is the complete product description of the test image. As such, we follow exactly the same setup as Zoghbi et al. [18].

4.2 Experiments
First, we identify the fashion attributes in the images with two image segmentation techniques: selective search and rule-based segmentation based on garment geometry. We use k-means clustering on the resulting image fragments to find C groups of similar image fragments. In our experiments C = 500, 2500, 5000, 7500, 10000, 12500, 15000, 17500 and 20000, and we found that C = 10000 works best.


Next, we train 300-dimensional word embeddings with the Skip-gram model, for which code is publicly available on GitHub⁵. We concatenate the product descriptions in the training set of the Amazon Dresses dataset, convert all words to lowercase, and remove non-alphanumeric characters. We train the Skip-gram model on the resulting text, where we treat each fashion phrase as a single word. We consider a context size of 5. Then, we filter the product descriptions with the Zappos glossary and consider the remaining Zappos phrases as our text fragments. Afterwards, we input the image and text fragments into our alignment model, and train it with the fragment alignment objective and global ranking objective to induce a 1000-dimensional common embedding space. We use stochastic gradient descent with mini-batches of 100, a fixed learning rate of 10⁻⁵, a momentum of 0.90, and make 20 epochs through the training data. Here, a smoothing term n in S_kl of 10, a margin ∆ in C_G(θ) of 40, and a factor β in C(θ) of 0.50 were found to work well. Finally, we investigate the influence of image similarity information on the quality of the intermodal representations by including the image cluster consistency objective in the objective function. We achieve the best results with γ set to 0.25. We evaluate the inferred intermodal representations in image search and image annotation, and investigate the performance of the image segmentation techniques and the proposed novel objective.
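
The fragment clustering step described above can be sketched with scikit-learn; L2-normalising the 4096-d features first makes Euclidean (mini-batch) k-means behave approximately like clustering on cosine distance, which is one possible reading of the setup, not necessarily our exact implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import normalize

def cluster_fragments(fragment_features, n_clusters=10000, seed=0):
    """Group the 4096-d image fragment features into C clusters and return labels + centroids."""
    X = normalize(np.asarray(fragment_features))       # unit-length rows ~ cosine geometry
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10000, random_state=seed)
    labels = km.fit_predict(X)
    centroids = normalize(km.cluster_centers_)          # one centroid c_i per cluster
    return labels, centroids
```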

In image search, we retrieve for each textual query the top K most likely test images. We evaluate by computing recall@K for K = 1, 5, 10, 20, 40. Precision@K does not say much about performance, since there is only one relevant image for each textual query. To qualitatively evaluate our results, we construct realistic textual queries ourselves, asking for different colors, prints, shapes, fabrics and occasions. We avoid infrequent attributes, since these might not occur in the test set. For each textual query, we retrieve the top 5 most likely test images. We consider a retrieved image as relevant if it exhibits all requested attributes.

In image annotation, we retrieve for each visual query the top K most likely product descriptions. We evaluate image annotation by computing recall@K for K = 1, 5, 10, 20, 40. Precision@K does not say much about performance here either, since there is only one relevant product description for each visual query. To gain further insight into the annotation capabilities of our model, we also show the top 5 most likely Zappos phrases for some visual queries.

It is important to note that for both tasks, recall computed at a cut-off of K items is a very strict evaluation. Because this evaluation relies on incomplete product descriptions, it might be that we retrieve an image or a textual description which is not present in the current incomplete ground truth reference collection, but which is relevant. Hence, we might retrieve an image which satisfies the textual description given as query, or we might retrieve an annotation which is not (part of) the original product description of an image, but which still accurately describes it. Therefore, the actual evaluation results might be higher than those reported in this paper.
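
The recall@K evaluation with a single relevant item per query is straightforward; a sketch, assuming a precomputed query-by-gallery similarity matrix whose ground-truth item for query q sits at index q:

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10, 20, 40)):
    """Recall@K for cross-modal retrieval with exactly one relevant item per query."""
    ranks = []
    for q, row in enumerate(similarity):
        order = np.argsort(-row)                         # best-scoring gallery items first
        ranks.append(int(np.where(order == q)[0][0]) + 1)  # 1-based rank of the ground truth
    ranks = np.asarray(ranks)
    return {k: float(np.mean(ranks <= k)) for k in ks}
```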

4.3 Comparison with other Models
We compare our alignment model to the CCA model and BiLDA model of Zoghbi et al. [18], the latter of which constitutes the state of the art for cross-modal search of fashion items.

⁵ https://github.com/tensorflow/tensorflow

Table 2: Image search results. R@K is recall@K. The reported results are for C = 10000.

Model                                              | R@1  | R@5   | R@10  | R@20  | R@40
CCA model (Zoghbi et al. [18])                     | 2.00 | 8.10  | 11.70 | 17.70 | 28.00
BiLDA model (Zoghbi et al. [18])                   | 2.50 | 7.80  | 12.80 | 18.00 | 28.90
selective search, C_F(θ) and C_G(θ)                | 6.40 | 18.60 | 26.40 | 36.90 | 50.10
rule-based segmentation, C_F(θ) and C_G(θ)         | 9.10 | 20.90 | 29.20 | 42.70 | 57.70
rule-based segmentation, C_F(θ), C_G(θ) and C_I(θ) | 7.40 | 21.90 | 32.60 | 43.10 | 57.70

For image annotation, our baseline is a linear support vector machine (SVM) trained in Zoghbi et al. [18] with the scikit-learn toolkit⁶. This SVM is trained on the CNN representations of the training images of the Amazon Dresses dataset using a one-vs-rest scheme, i.e., they train one classifier for each Zappos phrase.
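
A sketch of such a one-vs-rest baseline with scikit-learn; X holds the 4096-d CNN features of the full training images and Y a binary image-by-phrase indicator matrix, both placeholders rather than artifacts of [18].

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_annotation_baseline(X, Y):
    """One linear SVM per Zappos phrase, trained on full-image CNN features (one-vs-rest)."""
    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(X, Y)                                     # Y: (n_images, n_phrases) binary matrix
    return clf

def annotate(clf, X_test, phrases, top_k=5):
    """Rank the Zappos phrases for each test image by the SVM decision values."""
    scores = clf.decision_function(X_test)            # (n_test, n_phrases)
    top = np.argsort(-scores, axis=1)[:, :top_k]
    return [[phrases[j] for j in row] for row in top]
```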

5. RESULTS AND DISCUSSION

5.1 Image Search
Our image search results are presented in Table 2. These results show that our naive rule-based segmentation technique outperforms selective search. This indicates that the assumptions we made about e-commerce fashion images are valid. Hence, we can rely on the geometry of the garments in a product category and the cleanness of e-commerce images to locate fashion attributes in these images.

We also observe that including image similarity information in the objective function produces improved intermodal representations. When the image cluster consistency objective is incorporated in the objective function, the alignment model outperforms the one trained with only the fragment alignment and global ranking objectives on all image search metrics except recall@1. However, users usually want to see more than one item, so recall@5 or recall@10 are more relevant metrics in terms of usability.

Hence, our best model is the neural network trained with an objective function composed of the fragment alignment, global ranking and image cluster consistency objectives, on image fragments acquired with rule-based segmentation. Our best model outperforms both the CCA model and the state-of-the-art BiLDA model of Zoghbi et al. [18]. Compared to the state of the art, our best model achieves an increase of 196% on recall@1, of 181% on recall@5, of 155% on recall@10, of 139% on recall@20, and of 100% on recall@40. The BiLDA model uses the topic similarity between the textual query and the target image collection to find the most relevant images. This makes the image retrieval model rather coarse. The CCA model explicitly models the correlations between CNN features and discrete word representations. In contrast, our neural network model exploits the expressiveness of real-valued representations in both the visual and textual domains via semantic embeddings. Our results indicate that the common embedding space induced by our neural network encodes the latent semantic alignment of language and visual data in a more meaningful way than the space induced by either cross-modal topics or correlations.

In the absence of a complete ground truth reference collection, recall@K is a rather strict evaluation.

⁶ http://scikit-learn.org


Figure 3: Image search examples. For each textual query, the top 5 retrieved images are shown. (Image reference: www.amazon.com)

Therefore, we also present qualitative results to assess our model's performance (Figure 3). The query “wedding, dress, short, sweetheart, lace, white” only returns white wedding dresses. Four of them are short, the fifth one is medium length. Three of the dresses have a sweetheart neckline, and two have lace (the third and fifth dress). The query “sweetheart, A-line, dress, bridesmaid, homecoming” retrieves four dresses with a sweetheart neckline. They could all be considered A-line, but the shape of the first and second dress is somewhat between A-line and empire. For the query “dress, sleeveless, V-neck, floral print, orange, sheath” all retrieved dresses have a floral print and are sleeveless. Two of the dresses have a sheath model, and only one has a V-neck. Retrieving dresses in the requested fabric is more difficult, but it seems that our model has some idea of what a certain fabric looks like. When asked for a denim dress, our model is able to find three. Of course, denim is a fabric with clear characteristics. But also when asked for dresses in polyester, it is plausible that the returned dresses are made of this fabric. However, it is hard if not impossible to identify the exact fabric from the images, even for a human. When asked for dresses suitable for a specific occasion, like ‘summer’ or ‘bridesmaid’, the model comes up with reasonable suggestions, although this is rather subjective. Overall, when looking at the query results from left to right and top to bottom, respectively one, four, one, no, three and three dresses have all requested attributes. We conclude that our model is able to retrieve dresses with the requested color, print and shape. Retrieving dresses in the correct fabric and distinguishing between similar fashion attributes (e.g. ‘empire’ and ‘A-line’ or ‘maxi’ and ‘high low’) appear to be the main difficulties. However, it is remarkable that such results can be achieved without matching terms between the textual query and the product descriptions, relying instead on the inferred intermodal representations.

5.2 Image Annotation
On image annotation, our best model exceeds the SVM baseline, but is surpassed by both the CCA and BiLDA models of Zoghbi et al. [18]. Image annotation seems to benefit from a probabilistic topic representation. Given an unseen image, the BiLDA model infers its topic distribution and generates descriptive words via the topics, in a probabilistic fashion. In contrast, our model projects every unseen image onto an algebraic multimodal space and finds the closest text fragments. Our induced common, multimodal space has proved quite useful for image search, as discussed in the previous section; for image annotation, however, algebraically searching for close-by words might not encode enough semantic expressiveness to retrieve the relevant text fragments. In the future, we may explore how to combine these two modelling paradigms, probabilistic topics and algebraic spaces, into one framework that exploits the benefits of both.

We show examples of image annotations generated by our model in Figure 4. We see correct annotations regarding colors, prints, shapes and accessories. As for image search, finding the correct fabric and distinguishing between similar fashion attributes are things our model still struggles with. Even so, these examples demonstrate that even though our image annotation results are below the state of the art, our model is nevertheless capable of generating meaningful annotations.

5.3 Image Segmentation


Figure 4: Image annotation examples. For each visual query, the top 5 annotations are shown. In green: annotations which are part of the original product description. In blue: annotations which are not in the original product description, but which are correct. In black: annotations which are incorrect or unknown based on what is displayed in the image. (Image reference: www.amazon.com)

Figure 5: Image segmentation results. Left: Image segmentations acquired with rule-based segmentation. Right: Image segmentations of the same images when using selective search. (Image reference: www.amazon.com)


Figure 5 shows and compares the image fragments generated with selective search and with our rule-based segmentation technique.

While our rule-based segmentation technique is rather naive, it works very well on e-commerce fashion images. By exploiting our knowledge of garment geometry and the fact that e-commerce images are usually very clean, we are able to acquire more meaningful and complete image segmentations than those produced by the selective search algorithm. Additionally, our rule-based segmentation approach has the benefit of generating the same number of regions for each image, where we approximately know which fashion attributes can be observed in each region. In contrast, selective search [15] segments an image into regions based on color similarity, texture similarity and goodness of fit. The resulting regions can focus on smaller regions of interest than in rule-based segmentation. However, we observe that with selective search multiple fashion attributes are not enclosed in any region (e.g. no fragment of the top or neckline) and that some regions show parts of the image that are irrelevant (e.g. only the head). In addition, selective search also produces many near duplicates. Therefore, we prefer our proposed rule-based approach over selective search for segmenting e-commerce fashion images.

6. CONCLUSION
In this paper, we have proposed a neural network that learns intermodal representations for fashion attributes from organic e-commerce data. We illustrated how we can rely on the cleanness of e-commerce images and the geometry of garments to locate fashion attributes in fashion images. Our proposed rule-based technique for segmenting e-commerce images of dresses into regions outperforms selective search. Additionally, we have demonstrated how similarity information of fashion image regions can be used to acquire better intermodal representations for fashion attributes. We introduced the image cluster consistency objective, which encourages similar image fragments to be aligned with the same text fragments, and report improved results when adding this objective to our objective function. The obtained intermodal representations allow cross-modal search of fashion attributes, and we have shown the quality of these representations in image search and image annotation. In image search, our model substantially outperforms a state-of-the-art fashion image search system. Generally, we showed good search results for colors, prints, shapes, and accessories, but the identification and retrieval of fabrics needs further refinement.

In the future, we want to experiment with other ways to optimize the intermodal representations. While including image similarity information is one way to compensate for errors caused by the noise and incompleteness of product descriptions, we also plan to investigate other ways to deal with this.

7. REFERENCES
[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(11):2189–2202, November 2012.

[2] T. D. Bui, S. Ravi, and V. Ramavajjala. Neural graph machines: Learning neural networks using graphs. CoRR, abs/1703.04818, 2017.

[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678, 2014.

[4] Y. Kalantidis, L. Kennedy, and L.-J. Li. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In Proceedings of the International Conference on Multimedia Retrieval (ICMR), April 2013.

[5] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137, 2015.

[6] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In IEEE International Conference on Computer Vision (ICCV), pages 3343–3351, December 2015.

[7] J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[8] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2), November 2004.

[10] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision (ECCV), 2016.

[11] R. Mason and E. Charniak. Annotation of online shopping images without labeled training examples. North American Chapter of the ACL: Human Language Technologies, 2013.

[12] R. Mason and E. Charniak. Domain-specific image captioning. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL 2014), Baltimore, Maryland, USA, June 26–27, 2014, pages 11–20, 2014.

[13] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 3111–3119, 2013.

[15] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 2013.

[16] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 – 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I, pages 818–833, 2014.


[17] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In Computer Vision – ECCV 2014 – 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I, pages 834–849, 2014.

[18] S. Zoghbi, G. Heyman, J. C. Gomez, and M.-F. Moens. Fashion meets computer vision and NLP at e-commerce search. International Journal of Computer and Electrical Engineering (IJCEE), 8(1):31–43, February 2016.