\documentclass{scrartcl}
\usepackage{url,hyperref}
\usepackage{listings}
\usepackage{color,float,caption}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{longtable}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\small,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
\lstset{style=mystyle,basicstyle=\small\ttfamily}
\setkomafont{disposition}{\normalcolor\bfseries}
\renewcommand{\thesection}{\Roman{section}}
\begin{document}
\title{Machine Learning Engineer Nanodegree}
\author{Ann-Kristin Juschka}
\subtitle{Capstone Project}
\maketitle
\section{Definition}
\subsection*{Project Overview}
In this Nanodegree, \emph{Convolutional Neural Networks (CNNs)} are introduced, which are most commonly used to analyze visual data.
In the deep learning project of the Nanodegree, we train CNNs to detect dogs in images and to classify dog breeds.
In general, object detection and classification are classical tasks in machine learning.
The aim of this project is to use CNNs to detect and classify brand logos in images. For this task, CNNs are usually trained on the datasets \emph{FlickrLogos-27} \cite{flickrlogos27} and \emph{FlickrLogos-32} \cite{flickrlogos32}; see e.g. \cite{BIANCO201723,DBLP:journals/corr/IandolaSGK15}. We are also interested in this task because we later wish to \emph{analyze sentiments} in Twitter tweets whose pictures contain a given company's logo.
Due to time and size limits, we focus in this project on \emph{logo detection and classification}.
\subsection*{Problem Statement}
While most logo datasets like \emph{FlickrLogos-27} \cite{flickrlogos27} and \emph{FlickrLogos-32} \cite{flickrlogos32} contain raw original logo graphics, we want to train our CNN on the in-the-wild logo dataset \emph{Logos in the Wild} \cite{logosinthewild}, whose images contain logos as a natural part of the scene. As this dataset includes in total 11,054 images with 32,850 annotated logo bounding boxes of 871 brands, it should be possible to train a CNN that achieves a high accuracy or mean average precision (mAP).
This is a challenging task as the regions containing logos are small.
The main goal of this project is to use CNNs to \emph{classify the brand and company logos} from the Logos in the Wild dataset with high accuracy and mean average precision (mAP).
For this, we first train our own CNN architecture, as done in the dog breeds project of the Nanodegree.
% with a convolutionary layer, a max-pooling layer, another convolutionary layer, a max-pooling layer, a global-average pooling layer and a final fully connected layer.
Furthermore, to improve loss and accuracy we make use of the standard CNN architectures \emph{VGG19, ResNet50, InceptionV3, or Xception} that are already pre-trained on the \emph{ImageNet} database \cite{kerasapplications}.
Moreover, as the dataset comes with annotations in Pascal VOC style, if time permits we will also train a \emph{Faster Region-based Convolutional Neural Network (Faster R-CNN)} \cite{DBLP:journals/corr/RenHG015} in two stages: first, a \emph{Region Proposal Network (RPN)} for \emph{logo detection}, and second, a classifier like VGG19 for logo classification of the candidate regions.
This Faster R-CNN will be evaluated using the \emph{mean average precision (mAP)} metric.
\subsection*{Metrics}
For pure logo classification, we seek to achieve a high accuracy. \emph{Accuracy} is simply the number of correct predictions divided by the total number of predictions.
For logo detection, we need a different metric: the \emph{mean average precision (mAP)}, which is the mean, over all classes, of the interpolated
\emph{average precision} \cite{everingham2010pascal} of each class. Recall that
\[
\textup{precision}=\frac{\textup{true positives}}{\textup{true positives}+\textup{false positives}},\quad \textup{recall}=\frac{\textup{true positives}}{\textup{true positives}+\textup{false negatives}}.
\]
Consider the precision-recall curve obtained by varying the detection threshold, and define the interpolated precision $\textup{precision}_\textup{interpolated}(\textup{recall}_i)=\max_{\tilde{r}\geq \textup{recall}_i}\textup{precision}(\tilde{r})$ at each of the 11 recall levels $\textup{recall}_i\in\{0,0.1,0.2,\ldots,0.9,1.0\}$. Then
\[
\textup{average precision}=\frac{1}{11}\sum_{i=1}^{11}\textup{precision}_\textup{interpolated}(\textup{recall}_i)\qquad\cite[\S\,4.2]{everingham2010pascal}.
\]
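To make this metric concrete, the following is a minimal sketch of the 11-point interpolated average precision for a single class; the precision and recall arrays are hypothetical values read off a precision-recall curve and not results from our models.
\begin{lstlisting}[language=Python]
import numpy as np

def interpolated_average_precision(precisions, recalls):
    # 11-point interpolated average precision as in Pascal VOC
    precisions, recalls = np.asarray(precisions), np.asarray(recalls)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        # interpolated precision: maximal precision at recall >= r (0 if none)
        ap += (precisions[mask].max() if mask.any() else 0.0) / 11.0
    return ap

# hypothetical precision/recall values for one logo class
print(interpolated_average_precision([1.0, 0.8, 0.6, 0.5], [0.2, 0.4, 0.6, 0.8]))
\end{lstlisting}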
\section{Analysis}
%(approx. 2-4 pages)
\subsection*{Data Exploration}
To train our model, we want to use the recent logo dataset \emph{Logos in the Wild} \cite{logosinthewild}. As mentioned above, the latest version of this dataset (v2.0) contains 11,054 images with 32,850 annotated logo bounding boxes of 871 brands; it was collected by performing Google image searches with well-known brand and company names, either directly or in combination with a predefined set of search terms like ‘advertisement’, ‘building’, ‘poster’ or ‘store’.
The logo annotations are in Pascal-VOC style.
As stated by the creators of the Logos in the Wild dataset, this dataset has 4 to 608 images per searched brand,
238 of the 871 brands occur at least 10 times, and there are up to 118 logos in one image. Unfortunately, the dataset provides only the links to the images, and some of these images have already disappeared.
As we later want to detect logos in arbitrary pictures from Twitter tweets, this large in-the-wild logo dataset still fits our goal best.
Instead of downloading the images ourselves from the individual URLs provided in the Logos in the Wild dataset, we download the \emph{QMUL-OpenLogo Dataset} \cite{DBLP:journals/corr/abs-1807-01964}, which contains Logos in the Wild as a subset and includes all still-available JPEG files.
We run a simple Python script to analyze the downloaded JPEG images in the Logos in the Wild dataset and find that there are in fact 821 brands in total, with the maximal number of logos per brand being 1,928 for Heineken and the minimum being 1. Moreover, the image heineken/img00042.jpg contains the maximum of 118 logos, while every image contains at least one logo.
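The analysis script itself is not reproduced here; the following sketch illustrates the idea, assuming the dataset lies in brand subdirectories of a data directory with one Pascal VOC XML file per JPEG image (the path is a placeholder).
\begin{lstlisting}[language=Python]
import os, glob
import xml.etree.ElementTree as ET

data_dir = 'LogosInTheWild-v2/data'  # placeholder path

logos_per_brand, logos_per_image = {}, {}
for brand in sorted(os.listdir(data_dir)):
    brand_dir = os.path.join(data_dir, brand)
    if not os.path.isdir(brand_dir):
        continue
    count = 0
    for xml_file in glob.glob(os.path.join(brand_dir, '*.xml')):
        # one <object> element per annotated logo bounding box
        n_objects = len(ET.parse(xml_file).findall('object'))
        logos_per_image[xml_file] = n_objects
        count += n_objects
    logos_per_brand[brand] = count

print('number of brands:', len(logos_per_brand))
print('brand with most logos:', max(logos_per_brand, key=logos_per_brand.get))
print('image with most logos:', max(logos_per_image, key=logos_per_image.get))
\end{lstlisting}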
%
\subsection*{Exploratory Visualization}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{samples.jpg}
\caption[Sample images]{Sample images from the Logos in the Wild dataset (source: \url{https://www.iosb.fraunhofer.de/servlet/is/78045/}).}
\label{sampleImage}
\end{figure}
%\vspace*{-3em}%\captionof{figure}{\url{https://www.iosb.fraunhofer.de/servlet/is/78045/}}
While the images on the right contain logos of a single brand, the image on the left shows a sample image that includes bounding boxes for logos of different brands. In particular, an image can contain different types of logos of one brand. The following are the Pascal VOC-style annotations in the XML file corresponding to the image on the left-hand side:
\begin{minipage}{.45\textwidth}
%<object>
% <name>six</name>
% <pose>Unspecified</pose>
% <truncated>0</truncated>
% <difficult>0</difficult>
% <bndbox>
% <xmin>633</xmin>
% <ymin>327</ymin>
% <xmax>671</xmax>
% <ymax>365</ymax>
% </bndbox>
%</object>
\begin{lstlisting}[basicstyle=\fontsize{7.5}{8}\ttfamily,language=xml]
<annotation>
<folder>0samples</folder>
<filename>img000009</filename>
<path>\0samples\img000009.jpg</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>718</width>
<height>535</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>starbuckscoffee</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>466</xmin>
<ymin>133</ymin>
<xmax>709</xmax>
<ymax>287</ymax>
</bndbox>
</object>
<object>
<name>starbucks-symbol</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>193</xmin>
<ymin>270</ymin>
<xmax>261</xmax>
<ymax>360</ymax>
</bndbox>
</object>
\end{lstlisting}
\end{minipage}\hfill
\noindent\begin{minipage}{.45\textwidth}
%<object>
% <name>tchibo</name>
% <pose>Unspecified</pose>
% <truncated>0</truncated>
% <difficult>0</difficult>
% <bndbox>
% <xmin>8</xmin>
% <ymin>312</ymin>
% <xmax>34</xmax>
% <ymax>351</ymax>
% </bndbox>
%</object>
\begin{lstlisting}[basicstyle=\fontsize{7.5}{8}\ttfamily,language=xml,firstnumber=38]
<object>
<name>starbucks-symbol</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>420</xmin>
<ymin>423</ymin>
<xmax>464</xmax>
<ymax>513</ymax>
</bndbox>
</object>
<object>
<name>six</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>633</xmin>
<ymin>327</ymin>
<xmax>671</xmax>
<ymax>365</ymax>
</bndbox>
</object>
<object>
<name>tchibo</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>8</xmin>
<ymin>312</ymin>
<xmax>34</xmax>
<ymax>351</ymax>
</bndbox>
</object>
</annotation>
\end{lstlisting}
\end{minipage}
%
%In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
%
%Have you visualized a relevant characteristic or feature about the dataset or input data?
%Is the visualization thoroughly analyzed and discussed?
%If a plot is provided, are the axes, title, and datum clearly defined?
\subsection*{Algorithms and Techniques}
%Probably we augment the training data to have more images per brand.
As mentioned before, we start by training our own CNN architecture to classify the logos, as done in the dog breeds project of the Nanodegree. For this, 40\% of the dataset forms the test set, and the remaining data is split into 80\% training set and 20\% validation set. Using Keras preprocessing, each image is converted into a 4D tensor with shape $(1,224,224,3)$.
This CNN consists of a convolutional layer, a max-pooling layer, another convolutional layer, a max-pooling layer, a global average pooling layer and a final fully connected layer.
The reasoning behind selecting these layers is that nodes in \emph{convolutional layers} detect patterns in small local regions of the image by "filtering" the image, which is interpreted as a 3D array (height, width and color channels).
In order to reduce overfitting caused by the high dimensionality of the filter stack produced by the convolutional layers, we further use the following two types of \emph{pooling layers}:
Nodes in a \emph{max-pooling layer} contain the maximum value of the corresponding region in the filter stack of the previous layer.
A \emph{global average pooling layer} reduces dimensionality more drastically by taking the average over each 2D feature map in the filter stack, producing a vector with one entry per filter.
As gradient descent optimizer we choose "RMSprop", as loss function for our classification problem "categorical cross-entropy", and as metric "accuracy".
Using Keras ModelCheckpoint, we save the model with the best validation loss.
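A minimal Keras sketch of this architecture is shown below; the kernel sizes and the 109 logo classes match the model summary in \autoref{table:modelFromScratch}, while the ReLU activations are assumptions on our part.
\begin{lstlisting}[language=Python]
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model_from_scratch = Sequential([
    # two convolution + max-pooling blocks to detect local logo patterns
    Conv2D(32, kernel_size=2, activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D(pool_size=2),
    Conv2D(64, kernel_size=2, activation='relu'),
    MaxPooling2D(pool_size=2),
    # global average pooling: one value per filter map
    GlobalAveragePooling2D(),
    # one output node per logo class
    Dense(109, activation='softmax'),
])
model_from_scratch.compile(optimizer='rmsprop',
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])
\end{lstlisting}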
As a next step, we train a popular CNN like VGG19 whose weights are pre-trained on ImageNet. We proceed similarly to the first setting.
Having tried CNNs for logo classification, we finally proceed to Faster R-CNNs for logo detection and classification.
We first train a \emph{Region Proposal Network (RPN)} that proposes regions with logos in images. The predicted region proposals are then reshaped using a \emph{Region of Interest (RoI)} pooling layer. This layer is next used to classify the image within the proposed region and predict the offset values for the bounding boxes. For the latter task, we again train a CNN like VGG19 that is pre-trained on ImageNet.
As explained above, here we use mean average precision as metric.
\subsection*{Benchmark}
The recent \emph{Logos in the Wild} dataset has not been studied much yet. When it was introduced in \cite{logosinthewild}, the focus was on open-set logo retrieval, where only one sample image of a logo is available.
Instead, we focus on a closed-world assumption, where we train and test on the Logos in the Wild dataset, which has multiple images per brand.
Therefore, we can only compare our results with those of other models that were trained and tested on the popular closed dataset \emph{FlickrLogos-32} \cite{flickrlogos32}. As cited in \cite{logosinthewild}, the state-of-the-art mean average precision (mAP) is 0.811, achieved by a Faster R-CNN \cite{DBLP:journals/corr/SuZG16} whose training set is expanded with synthetically generated training images, and 0.842, achieved by
Fast-M \cite{Bao:2016:RCL:3007669.3007728}, a multi-scale Fast R-CNN-based approach.
\section{Methodology}
%(approx. 3-5 pages)
\subsection*{Data Preprocessing}
\label{dataPreprocessing}
Having gained access, we download the Logos in the Wild dataset \cite{logosinthewild} that contains XML files with the bounding boxes, the URLs to the JPEG images, samples and a script to obtain a clean dataset.
As stated above, we obtain the JPEG images of the Logos in the Wild dataset by downloading the superset QMUL-OpenLogo Dataset \cite{DBLP:journals/corr/abs-1807-01964}.
After setting the variable "oldpath" to the absolute path of our \path{LogosInTheWild-v2/data} directory and the variable "new path" to the absolute path of our \path{openlogo/JPEGImages} directory in the script move\_JPEG\_images.sh from the Logo Capstone Project, we execute this script, which moves the JPEG files from the \path{openlogo/JPEGImages} directory into the corresponding brand subdirectory of the \path{LogosInTheWild-v2/data} directory.
Next, we remove the \path{0samples} folder from the \path{LogosInTheWild-v2/data} directory and, in a separate Conda environment with Python 2.7 and opencv-python, execute%the Python script \path{create\_clean\_dataset.py} from the \path{LogosInTheWild-v2/scripts} directory
\begin{lstlisting}[language=bash]
# From LogosInTheWild-v2/scripts directory
python create_clean_dataset.py --roi --in ../data --out ../cleaned-data
\end{lstlisting}
which adjusts the brand names in the XML files and removes XML files without a corresponding JPEG image.
The script reports that 9,428 images and 821 brands were processed, while 1,330 JPEG files were unavailable. In addition, it created 28,007 ROI images in corresponding folders named after the logos.
For our first logo classification task with a standard CNN architecture in Keras and Python 3.6, we can load the image files with the \path{load_files} function from \path{sklearn.datasets}, since the Logos in the Wild dataset is organized in directories with the corresponding brand names. Of course, some images, like the example in \autoref{sampleImage}, contain logos of different brands, so labelling them with the brand directory name may decrease accuracy. Furthermore, we use the \path{image.load_img} and \path{image.img_to_array} functions from \path{keras.preprocessing} to convert the JPEG files to Keras' required 4D tensor format with shape $(1,224,224,3)$. As the last preprocessing step for the first standard CNN architecture, we rescale the images by dividing every pixel in each image by 255.
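A sketch of this loading and preprocessing step is given below; the helper function and the directory path are ours and only illustrate the idea.
\begin{lstlisting}[language=Python]
import numpy as np
from sklearn.datasets import load_files
from keras.preprocessing import image
from keras.utils import np_utils

def path_to_tensor(img_path):
    # load the image, resize it to 224x224 and convert it to a (1, 224, 224, 3) tensor
    img = image.load_img(img_path, target_size=(224, 224))
    return np.expand_dims(image.img_to_array(img), axis=0)

# the brand subdirectories serve as class labels
data = load_files('path/to/brand_directories', load_content=False)  # placeholder path
targets = np_utils.to_categorical(np.array(data['target']))
tensors = np.vstack([path_to_tensor(f) for f in data['filenames']]).astype('float32') / 255
\end{lstlisting}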
Finally, for training a Faster R-CNN with Tensorflow's Object Detection API, we need to convert the Logos in the Wild dataset with its Pascal VOC-style annotations to Tensorflow's TFRecord file format. For this, we copy the Python script \path{create_and_analyze_pascal_tf_record.py} from the Capstone project folder to the LogosInTheWild-v2 directory, rename the folder \path{LogosInTheWild-v2/cleaned_data} to \path{LogosInTheWild-v2/data}, enter "export PYTHONPATH=\$PYTHONPATH:`pwd`:`pwd`/slim" in the \path{tensorflow/models/research} directory, and run the following:
\begin{lstlisting}[language=bash]
# From LogosInTheWild-v2 directory
python create_and_analyze_pascal_tf_record.py --data_dir=./data/voc_format --label_map_path=./data/pascal_label_map.pbtxt --output_path=./data/
\end{lstlisting}
After completion we see that we converted 6,034 images with annotations into the 10 training TFRecord files \path{pascal_train.record-0000i-of-00010}, and 1,508 images with annotations into the 10 validation TFRecord files \path{pascal_val.record-0000i-of-00010} for $i=0,\ldots,9$.
Further, in \path{LogosInTheWild-v2/data} this created \path{test_images.txt} containing the absolute path of the 1,886 images in the test set, and \path{pascal_label_map.pbtxt} containing 821 entries like the following:
\begin{lstlisting}[language=Python]
item {
id: 262
name: 'starbucks-text'
}
\end{lstlisting}
Among other parameters for our Faster R-CNN, in the pipeline configuration file \path{faster_rcnn_inception_logos-locally-on-ubuntu.config} from the Capstone Project we choose random\_horizontal\_flip, random\_vertical\_flip and random\_rotation90 as data\_augmentation\_options.
\subsection*{Implementation}
\subsubsection*{A first CNN model from scratch in Keras \cite{chollet2015keras}}
For the logo classification task, we start to implement a standard Keras CNN architecture in the Jupyter notebook \path{logos.ipynb} located in the Capstone Project repository. Keras' model summary is as follows:
%\begin{table}[htb!]
%\centering
%\caption{Architecture of VGG-like CNN from Keras\cite{chollet2015keras}.}
% \begin{tabular}{ccc}
% \toprule
% Layer (type) & Output Shape & Param \# \\
% \midrule
% conv2d\_1 (Conv2D) & (None, 14, 14, 32) & 320\\
% conv2d\_2 (Conv2D) & (None, 12, 12, 32) & 9248\\
% max\_pooling2d\_1 (MaxPooling2) & (None, 6, 6, 32) & 0\\
% dropout\_1 (Dropout) & (None, 6, 6, 32) & 0\\
% conv2d\_3 (Conv2D) & (None, 4, 4, 64) & 18496\\
% conv2d\_4 (Conv2D) & (None, 2, 2, 64) & 36928\\
% max\_pooling2d\_2 (MaxPooling2) & (None, 1, 1, 64) & 0\\
% dropout\_2 (Dropout) & (None, 1, 1, 64) & 0\\
% flatten\_1 (Flatten) & (None, 64) & 0\\
% dense\_1 (Dense) & (None, 256) & 16640\\
% dropout\_3 (Dropout) & (None, 256) & 0\\
% dense\_2 (Dense) & (None, 10) & 2570\\
% \bottomrule
% \end{tabular}\\
% \label{table:vgg}
%\end{table}
%
\begin{longtable}{ccc}%[htb!]
%\centering
\caption{Architecture of a first CNN for logo classification from Keras.}\\
\hline
Layer (type) & Output Shape & Param \# \\
\hline
\endfirsthead
\multicolumn{3}{c}
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
% \toprule
%\hline
% Layer (type) & Output Shape & Param \# \\
% \hline
\endhead
% \hline
\multicolumn{3}{r}{\textit{Continued on next page}} \\
\endfoot
% \midrule
\hline\endlastfoot
conv2d\_3 (Conv2D) & (None, 223, 223, 32) & 416 \\
max\_pooling2d\_4 (MaxPooling2) & (None, 111, 111, 32) & 0 \\
conv2d\_4 (Conv2D) & (None, 110, 110, 64) & 8256 \\
max\_pooling2d\_5 (MaxPooling2) & (None, 55, 55, 64) & 0 \\
global\_average\_pooling2d\_2 (GAP2D\footnotemark)& (None, 64)& 0 \\
dense\_2 (Dense) & (None, 109) & 7085 %\\
% \bottomrule
% \end{multicolumn}\\
\label{table:modelFromScratch}
\end{longtable}
\footnotetext{GlobalAveragePooling2D}
This model has in total 15,757 parameters that are all trainable. We compile the model with optimizer 'rmsprop' and loss 'categorical\_crossentropy', and choose as metric 'accuracy'. For training, we set epochs to 100, validation\_split to 0.3 and define a \path{ModelCheckpoint} from \path{keras.callbacks} that saves the model with the best validation loss.
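A sketch of this training setup is given below; the weights file name is a placeholder, and \path{train_tensors} and \path{train_targets} denote the preprocessed image tensors and one-hot targets of the training set.
\begin{lstlisting}[language=Python]
from keras.callbacks import ModelCheckpoint

# keep only the weights with the lowest validation loss
checkpointer = ModelCheckpoint(filepath='weights.best.model_from_scratch.hdf5',
                               monitor='val_loss', save_best_only=True, verbose=1)
model_from_scratch.fit(train_tensors, train_targets,
                       validation_split=0.3, epochs=100, batch_size=32,
                       callbacks=[checkpointer], verbose=1)
model_from_scratch.load_weights('weights.best.model_from_scratch.hdf5')
\end{lstlisting}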
After training and loading the model with the best weights, we compute the accuracy on the test images:
\begin{lstlisting}[language=Python]
print('\n', 'Test accuracy of the model from scratch:', model_from_scratch.evaluate(test_tensors, test_targets, verbose=0)[1]*100,'%.')
\end{lstlisting}
After training for 20 epochs, this prints "Test accuracy of the model from scratch: 16.76 \%". Finally, continuing to train for 600 further epochs leads to a test accuracy of 33.99 \%, with the validation loss improving for the last time in epoch 594, to 2.95990.
In the next section, we describe how we fine-tune our first model.
\subsubsection*{A Custom Model on top of the VGG16 model \cite{DBLP:journals/corr/SimonyanZ14a} in Keras}
Now we build a custom Keras CNN model on top of the well-known VGG16 architecture:
\begin{lstlisting}[language=Python]
from keras.applications.vgg16 import VGG16
VGG16_model = VGG16(include_top=True, weights='imagenet', input_shape=train_tensors.shape[1:])
\end{lstlisting}
Let us first summarize the VGG16 model:
\begin{longtable}{ccc}%[htb!]
\caption{Architecture of the VGG16 model from Keras.}\\
\hline
Layer (type) & Output Shape & Param \# \\
\hline
\endfirsthead
\multicolumn{3}{c}
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
% \toprule
%\hline
% Layer (type) & Output Shape & Param \# \\
% \hline
\endhead
% \hline
\multicolumn{3}{r}{\textit{Continued on next page}} \\
\endfoot
% \midrule
\hline\endlastfoot
input\_2 (InputLayer) &(None, 224, 224, 3) &0 \\
block1\_conv1 (Conv2D) &(None, 224, 224, 64) &1792 \\
block1\_conv2 (Conv2D) &(None, 224, 224, 64) &36928 \\
block1\_pool (MaxPooling2D) &(None, 112, 112, 64) &0 \\
block2\_conv1 (Conv2D) &(None, 112, 112, 128) &73856 \\
block2\_conv2 (Conv2D) &(None, 112, 112, 128) &147584 \\
block2\_pool (MaxPooling2D) &(None, 56, 56, 128) &0 \\
block3\_conv1 (Conv2D) &(None, 56, 56, 256) &295168 \\
block3\_conv2 (Conv2D) &(None, 56, 56, 256) &590080 \\
block3\_conv3 (Conv2D) &(None, 56, 56, 256) &590080 \\
block3\_pool (MaxPooling2D) &(None, 28, 28, 256) &0 \\
block4\_conv1 (Conv2D) &(None, 28, 28, 512) &1180160 \\
block4\_conv2 (Conv2D) &(None, 28, 28, 512) &2359808 \\
block4\_conv3 (Conv2D) &(None, 28, 28, 512) &2359808 \\
block4\_pool (MaxPooling2D) &(None, 14, 14, 512) &0 \\
block5\_conv1 (Conv2D) &(None, 14, 14, 512) &2359808 \\
block5\_conv2 (Conv2D) &(None, 14, 14, 512) &2359808 \\
block5\_conv3 (Conv2D) &(None, 14, 14, 512) &2359808 \\
block5\_pool (MaxPooling2D) &(None, 7, 7, 512) &0 \\
flatten (Flatten) &(None, 25088) &0 \\
fc1 (Dense) &(None, 4096) &102764544 \\
fc2 (Dense) &(None, 4096) &16781312 \\
predictions (Dense) &(None, 1000) &4097000
\label{table:vgg}
\end{longtable}
This model has 138,357,544 parameters in total, all of which are trainable.
On top of this VGG16 model, we add some custom layers as follows:
\begin{longtable}{ccc}%[htb!]
\caption{Our custom Architecture on top of the above VGG16 model.}\\
\hline
Layer (type) & Output Shape & Param \# \\
\hline
\endfirsthead
\multicolumn{3}{c}
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
% \toprule
%\hline
% Layer (type) & Output Shape & Param \# \\
% \hline
\endhead
% \hline
\multicolumn{3}{r}{\textit{Continued on next page}} \\
\endfoot
% \midrule
\hline\endlastfoot
input\_2 (InputLayer) &(None, 224, 224, 3) &0 \\
block1\_conv1 (Conv2D) &(None, 224, 224, 64) &1792 \\
block1\_conv2 (Conv2D) &(None, 224, 224, 64) &36928 \\
block1\_pool (MaxPooling2D) &(None, 112, 112, 64) &0 \\
block2\_conv1 (Conv2D) &(None, 112, 112, 128) &73856 \\
block2\_conv2 (Conv2D) &(None, 112, 112, 128) &147584 \\
block2\_pool (MaxPooling2D) &(None, 56, 56, 128) &0 \\
block3\_conv1 (Conv2D) &(None, 56, 56, 256) &295168 \\
block3\_conv2 (Conv2D) &(None, 56, 56, 256) &590080 \\
block3\_conv3 (Conv2D) &(None, 56, 56, 256) &590080 \\
block3\_pool (MaxPooling2D) &(None, 28, 28, 256) &0 \\
block4\_conv1 (Conv2D) &(None, 28, 28, 512) &1180160 \\
block4\_conv2 (Conv2D) &(None, 28, 28, 512) &2359808 \\
block4\_conv3 (Conv2D) &(None, 28, 28, 512) &2359808 \\
block4\_pool (MaxPooling2D) &(None, 14, 14, 512) &0 \\
block5\_conv1 (Conv2D) &(None, 14, 14, 512) &2359808 \\
block5\_conv2 (Conv2D) &(None, 14, 14, 512) &2359808 \\
block5\_conv3 (Conv2D) &(None, 14, 14, 512) &2359808 \\
block5\_pool (MaxPooling2D) &(None, 7, 7, 512) &0 \\
conv2d\_3 (Conv2D) &(None, 5, 5, 64) &294976 \\
max\_pooling2d\_4 (MaxPooling2D)&(None, 2, 2, 64) &0 \\
global\_average\_pooling2d\_2 (GAP2D\footnotemark)&(None, 64)&0 \\
dropout\_1 (Dropout) &(None, 64) &0 \\
dense\_2 (Dense) &(None, 109) &7085
\label{table:vgg_custom}
\end{longtable}
\footnotetext{GlobalAveragePooling2D}
Since we only set our custom layers as trainable, out of the 15,016,749 total parameters, only 302,061 are trainable.
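These parameter counts correspond to a VGG16 base without its fully connected top. A sketch of how such a frozen base with a custom head could be set up is given below; the exact calls differ from our notebook, and the dropout rate is an assumption.
\begin{lstlisting}[language=Python]
from keras.applications.vgg16 import VGG16
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout, Dense

# convolutional VGG16 base without the fully connected top, frozen
vgg16_base = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
for layer in vgg16_base.layers:
    layer.trainable = False

custom_model = Sequential([
    vgg16_base,
    # custom head: only these layers are trainable
    Conv2D(64, kernel_size=3, activation='relu'),
    MaxPooling2D(pool_size=2),
    GlobalAveragePooling2D(),
    Dropout(0.5),
    Dense(109, activation='softmax'),
])
custom_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                     metrics=['accuracy'])
\end{lstlisting}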
After training for 20 epochs, this results in a very good test accuracy of 36.80 \%. However, when continuing to train the custom model for 600 further epochs, the training loss kept decreasing but the validation loss increased. This indicates overfitting, and since after 20 further epochs still no weights with an improved validation loss had been obtained, we stopped the training.
\subsubsection*{A custom Faster R-CNN with Tensorflow's Object Detection API \cite{DBLP:journals/corr/HuangRSZKFFWSG016}}
Finally, we want to improve our model by using Tensorflow's Object Detection API. After the necessary preprocessing steps described in \autoref{dataPreprocessing}, we define our Faster R-CNN with the following configuration file \path{faster_rcnn_inception_logos-locally-on-ubuntu.config} that is located in the Logo Capstone Repository.
\begin{minipage}{.45\textwidth}
\begin{lstlisting}[basicstyle=\fontsize{7.5}{8}\ttfamily,language=Python]
model {
faster_rcnn {
num_classes: 821
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 600
max_dimension: 1024
}
}
feature_extractor {
type: 'faster_rcnn_inception_v2'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
\end{lstlisting}
\end{minipage}\hfill
\noindent\begin{minipage}{.45\textwidth}
\begin{lstlisting}[basicstyle=\fontsize{7.5}{8}\ttfamily,language=Python,firstnumber=71]
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 1
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0002
schedule {
step: 900000
learning_rate: .00002
}
schedule {
step: 1200000
learning_rate: .000002
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
num_steps: 5000
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
random_vertical_flip {
}
}
data_augmentation_options {
random_rotation90 {
}
}
}
train_input_reader {
label_map_path: "PATH_TO/LogosInTheWild-v2/data/pascal_label_map.pbtxt"
tf_record_input_reader {
input_path:"PATH_TO/LogosInTheWild-v2/data/pascal_train.record-?????-of-00010"
}
}
eval_config {
num_examples: 1886
max_evals: 1886
#use_moving_averages: false
metrics_set: "pascal_voc_detection_metrics"
}
eval_input_reader {
label_map_path: "PATH_TO/LogosInTheWild-v2/data/pascal_label_map.pbtxt"
shuffle: false
num_readers: 10
tf_record_input_reader {
input_path: "PATH_TO/LogosInTheWild-v2/data/pascal_val.record-?????-of-00010"
}
}
\end{lstlisting}
\end{minipage}
% #fine_tune_checkpoint: "PATH_TO/LogosInTheWild-v2/models/model/model.ckpt"
% #from_detection_checkpoint: true
% #load_all_detection_checkpoint_vars: true
We make sure that the directory structure is as recommended in \href{https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md}{running\_locally.md}:
\begin{verbatim}
+ "data" directory
- pascal_label_map.pbtxt
- 10 files pascal_train.record-0000i-of-00010 for i = 0,...,9
- 10 files pascal_val.record-0000i-of-00010 i = 0,...,9
+ "models" directory
+ "model" directory
- faster_rcnn_inception_logos-locally-on-ubuntu.config
+ "train" directory
+ "eval" directory,
\end{verbatim}
and train our model by running the following Python script:
\begin{lstlisting}[language=bash]
# From tensorflow/models/research directory
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
python object_detection/model_main.py \
--model_dir=PATH_TO/LogosInTheWild-v2/models/model/ \
--pipeline_config_path=PATH_TO/LogosInTheWild-v2/models/model/faster_rcnn_inception_logos-locally-on-ubuntu.config \
--num_train_steps=50000 --alsologtostderr
\end{lstlisting}
We monitor statistics with
\begin{lstlisting}[language=bash]
tensorboard --logdir=PATH_TO/LogosInTheWild-v2/models/model/.
\end{lstlisting}
After training finishes, we can export our model checkpoint:
\begin{lstlisting}[language=bash]
# From tensorflow/models/research/ directory
# set CHECKPOINT_NUMBER to the number in the exported "model.ckpt-<number>.meta" file
CHECKPOINT_NUMBER=<number>
python object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=PATH_TO/LogosInTheWild-v2/models/model/faster_rcnn_inception_logos-locally-on-ubuntu.config \
--trained_checkpoint_prefix=PATH_TO/LogosInTheWild-v2/models/model/model.ckpt-${CHECKPOINT_NUMBER} \
--output_directory=PATH_TO/LogosInTheWild-v2/export
\end{lstlisting}
%
By opening \path{Logo Capstone Project/adjusted_object_detection_tutorial.ipynb} in a Jupyter notebook, adjusting the paths and executing the cells, we can explore the results on the sample images.
Unfortunately, no logos are detected, neither when evaluating nor when making predictions with the frozen model (see \url{https://github.com/tensorflow/models/issues/6748}).
To overcome this problem, we trained with only four logo classes instead of the total 821 classes, but still no objects were detected. We also tried different Conda environments with Python 2.7, 3.6 or 3.7, on Windows 10, Ubuntu 18.04 and Ubuntu 18.10, and even training on Google Cloud as described \href{https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md}{here}. Using \path{tensorflow/models/research/object_detection/legacy/train.py} and \path{eval.py} instead of \path{model_main.py} for training and evaluating did not lead to any detections either. At first, we used a label map that was created from scratch, but the detections did not improve with an automatically created label map. Furthermore, different pipeline configurations did not lead to object detections either.
\subsection*{Refinement}
As a first step for all our Keras models, we use Keras data augmentation to augment the training data. This is important as many logo classes contain only a few images.
\begin{lstlisting}[basicstyle=\fontsize{7.5}{8}\ttfamily,language=Python]
from keras.preprocessing.image import ImageDataGenerator
datagen_train = ImageDataGenerator(
width_shift_range=0.1, # randomly shift images horizontally (10% of total width)
height_shift_range=0.1, # randomly shift images vertically (10% of total height)
horizontal_flip=True, # randomly flip images horizontally
vertical_flip=True, # randomly flip images vertically
rotation_range=90) #randomly rotate images by up to 90 degrees
datagen_valid = ImageDataGenerator(
width_shift_range=0.1, # randomly shift images horizontally (10% of total width)
height_shift_range=0.1, # randomly shift images vertically (10% of total height)
horizontal_flip=True, # randomly flip images horizontally
vertical_flip=True, # randomly flip images vertically
rotation_range=90) #randomly rotate images by up to 90 degrees
datagen_train.fit(train_tensors)
datagen_valid.fit(val_tensors)
\end{lstlisting}
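A sketch of how training continues with these generators is given below; the batch size of 32 and the variable names \path{train_targets} and \path{val_targets} are assumptions, and \path{checkpointer} is a ModelCheckpoint callback saving the weights with the best validation loss, as before.
\begin{lstlisting}[language=Python]
batch_size = 32
model_from_scratch.fit_generator(
    datagen_train.flow(train_tensors, train_targets, batch_size=batch_size),
    steps_per_epoch=len(train_tensors) // batch_size,
    validation_data=datagen_valid.flow(val_tensors, val_targets, batch_size=batch_size),
    validation_steps=len(val_tensors) // batch_size,
    epochs=50, callbacks=[checkpointer], verbose=1)
\end{lstlisting}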
For 12 images from the training set, we visualize the original image and one augmented version:
\begin{figure}[h!]
\includegraphics[width=\textwidth]{original_images.jpg}
%\label{originalImages}
%\end{figure}
%\begin{figure}[h!]
\includegraphics[width=\textwidth]{augmented_images.jpg}
\label{augmentedImages}
\end{figure}
\vspace{-1em}
First, we note that the test accuracy is 28.79 \% after continuing to train our model from scratch from \autoref{table:modelFromScratch} on the training set with augmented images for 50 epochs. The validation accuracy at epoch 50 is slightly higher than that of epoch 50 when we trained only on the original images -- but of course the results are not really comparable, as we started with pre-trained weights.
Similarly, we report that the test accuracy is 28.10 \% after continuing to train our custom model from \autoref{table:vgg_custom} on the training set with augmented images for 20 epochs. However, here we actually stopped training after 20 epochs, as the accuracy on the training set stays constant at around 17 \% and the training loss does not improve either.
Since the custom model overfits without augmented images and does not learn with augmented images, we decide to fine-tune the model from \autoref{table:modelFromScratch}.
To achieve a higher complexity, we add further layers to this model as follows:
\begin{longtable}{ccc}%[htb!]
%\centering
\caption{Architecture of an extended CNN from scratch.}\\
\hline
Layer (type) & Output Shape & Param \# \\
\hline
\endfirsthead
\multicolumn{3}{c}
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
% \toprule
%\hline
% Layer (type) & Output Shape & Param \# \\
% \hline
\endhead
% \hline
\multicolumn{3}{r}{\textit{Continued on next page}} \\
\endfoot
% \midrule
\hline\endlastfoot
conv2d\_1\_input (InputLayer) & (None, 224, 224, 3) & 0 \\
conv2d\_1 (Conv2D) & (None, 223, 223, 32) & 416 \\
max\_pooling2d\_1 (MaxPooling2) & (None, 111, 111, 32) & 0 \\
conv2d\_2 (Conv2D) & (None, 110, 110, 64) & 8256 \\
max\_pooling2d\_2 (MaxPooling2) & (None, 55, 55, 64) & 0 \\
conv2d\_4 (Conv2D) & (None, 53, 53, 32) & 18464 \\
max\_pooling2d\_4 (MaxPooling2) & (None, 26, 26, 32) & 0 \\
conv2d\_5 (Conv2D) & (None, 25, 25, 64) & 8256 \\
max\_pooling2d\_5 (MaxPooling2) & (None, 8, 8, 64) & 0 \\
dropout\_2 (Dropout) & (None, 8, 8, 64) & 0 \\
conv2d\_6 (Conv2D) & (None, 7, 7, 32) & 8224 \\
max\_pooling2d\_6 (MaxPooling2) & (None, 3, 3, 32) & 0 \\
conv2d\_7 (Conv2D) & (None, 2, 2, 64) & 8256 \\
global\_average\_pooling2d\_3 (GAP2D\footnotemark)&(None, 64) & 0 \\
dropout\_3 (Dropout) & (None, 64) & 0 \\
dense\_3 (Dense) & (None, 109) & 7085 %\\
% \bottomrule
% \end{multicolumn}\\
\label{table:extendedModelFromScratch}
\end{longtable}
\footnotetext{GlobalAveragePooling2D}
This model has 58,957 parameters that are all trainable.
After training for 50 epochs, this extended model reaches an accuracy of 26.03 \% on the test images.
Since Scikit-learn's GridSearchCV hangs after the fourth step when trying to tune the optimizer, activation function and batch size, we test some parameter combinations by hand.
We start by choosing optimizer "Adam" instead of "RMSprop", and a batch size of~1 instead of 32. In the following epochs, both training and validation loss increase while training and validation accuracy decrease. We note that this is similar to what happens when we train the custom model with the augmented images.
In the case of the custom model, we increase the batch size to 128, keep optimizer "RMSprop" and continue to train for 20 further epochs, which leads to an accuracy of 30.91 \% on the test set.
In the case of the extended model from scratch, we increase the batch size to 64, keep optimizer "Adam" and continue to train for 30 further epochs, which leads to an accuracy of 29.85 \% on the test set.
Next, for the extended model from scratch we choose a batch size of 64 and optimizer "adagrad" and continue to train for 30 further epochs, which leads to an accuracy of 29.80 \% on the test set.
Finally, after 100 epochs we achieve a test accuracy of 30.91 \%.
Summing up, the test accuracies of the models are still very similar and the best accuracy seems to be reached when training for a longer time.
Since the results are very similar, for theoretical reasons we decide to choose the extended model from scratch with optimizer Adam.
More precisely, the authors of \cite{DBLP:journals/corr/KingmaB14} designed "Adam" to combine the advantages of two optimizers: AdaGrad \cite{duchi2011adaptive}, which works well with sparse gradients, and RMSProp \cite{tieleman2012lecture}, which works well in non-stationary settings.
However, when we continue to train this model for 600 further epochs, the validation loss does not improve after epoch 138 anymore, so we interrupt training after 408 epochs. Loading the best weights from epoch 138 results in a test accuracy of 31.18 \%.
\section{Results}
\subsection*{Model Evaluation and Validation}
%
As most models performed similarly, we choose the extended model from scratch from \autoref{table:extendedModelFromScratch}, as its architecture has a medium level of complexity and it did not overfit like the custom model on top of the VGG16 model.
Trained on the augmented images, the extended model from scratch shows little sensitivity to changes in the optimizer and batch size. From the training statistics, we get the impression that training it for more epochs would not lead to better results.
\begin{figure}[h!]
\includegraphics[width=\textwidth]{training_statistics.jpg}
\caption{Training statistics of the extended model from scratch.}
\label{trainingStatistics}
\end{figure}
\subsection*{Justification}
Unfortunately, we did not find other published results for logo classification that report accuracy and could be compared to our results.
However, given that the dataset has many logo classes with only a few images, we think that we cannot expect a high overall accuracy.
Still, we would like to reach an accuracy of at least 50 \%, but we do not know which parameters of the different models from the previous sections, or which other CNN architectures, could lead to a test accuracy higher than 40 \%.
On the other hand, our final model is robust, and classifying small logos in larger images is not an easy task, so our final result with a test accuracy of 30.91 \% is adequate.
\section{Conclusion}
\subsection*{Free-Form Visualization}
In this subsection, we visualize some logo classification results on ten test images:
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_100.png}
\label{prediction1}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_101.png}
\label{prediction2}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_102.png}
\label{prediction3}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_103.png}
\label{prediction4}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_104.png}
\label{prediction5}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_105.png}
\label{prediction6}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_106.png}
\label{prediction7}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_107.png}
\label{prediction8}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_108.png}
\label{prediction9}
\end{figure}
\begin{figure}[h!]
\includegraphics[width=\textwidth]{prediction_of_test_file_109.png}
\label{prediction10}
\end{figure}
We note that some logo classes like Volkswagen seem to have a better accuracy than others.
\subsection*{Reflection}
There were several obstacles when training a CNN to classify or detect logos.
First, the local hardware was not sufficient, so the training process often froze or progressed only slowly.
Uploading the large dataset to Amazon Web Services or a Google Cloud bucket also took a very long time.
Furthermore, when training a Faster R-CNN with Tensorflow's Object Detection API, the error log was not sufficiently detailed to identify why there were no detections.
In addition to these technical limitations, further experience with CNNs is needed to choose the right parameters for the models. In the end, the very first two model architectures from \autoref{table:modelFromScratch} and \autoref{table:vgg_custom} already reach the best accuracy.
\subsection*{Improvement}
As a first crucial improvement, one could use Scikit-learn's MultiLabelBinarizer to label the images with potentially multiple logo classes extracted from the corresponding XML files. This would lead to 821 classes instead of 109, but the logo images may be classified more accurately.
How to train a CNN for multi-label classification is described \href{https://www.pyimagesearch.com/2018/05/07/multi-label-classification-with-keras/}{here}.
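A sketch of how the multi-hot label matrix could be built from the XML annotations with MultiLabelBinarizer is given below; the glob pattern is a placeholder.
\begin{lstlisting}[language=Python]
import glob
import xml.etree.ElementTree as ET
from sklearn.preprocessing import MultiLabelBinarizer

# collect the set of brand names annotated in each image's XML file
labels_per_image = []
for xml_file in sorted(glob.glob('LogosInTheWild-v2/data/*/*.xml')):  # placeholder path
    brands = {obj.find('name').text for obj in ET.parse(xml_file).findall('object')}
    labels_per_image.append(brands)

mlb = MultiLabelBinarizer()
# multi-hot matrix: one row per image, one column per logo class
multi_hot_targets = mlb.fit_transform(labels_per_image)
print(multi_hot_targets.shape, len(mlb.classes_))
\end{lstlisting}
Training then requires a sigmoid output layer with binary cross-entropy loss instead of softmax with categorical cross-entropy, as described in the linked tutorial.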
The accuracy may also be improved by (additionally or pre-)training with the ROI images of the Logos in the Wild dataset, which are generated by the Python script \path{create_clean_dataset.py}. We did not choose this option as our goal is to classify logos in arbitrary images. By construction, CNNs are relatively translation invariant, so this goal seems to be within reach.
Moreover, due to time limits we did not train our models for very long. More epochs could also lead to higher accuracy and possibly to choosing a different model than we did.
Similarly, there are many more reasonable parameter combinations to explore than we could in this short time, for instance different activation functions, various filter sizes or decreasing learning rates.
Last, training an object detection model may of course lead to better results like the ones mentioned in our benchmark section. We do not know why we could not detect any logos with our Faster R-CNN, but it makes sense to investigate this further.
%\nocite{*}
\bibliography{capstone_project}
\bibliographystyle{alpha}
\end{document}