\begin{comment}
fixtex --reformat --fpaths chapter4-pairclf.tex --print
fixtex --fpaths chapter4-pairclf.tex --outline --asmarkdown --numlines=999
fixtex --fpaths chapter4-pairclf.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter4-pairclf.md
https://www.languagetool.org/
\end{comment}
\newcommand{\nan}{\text{nan}}
\chapter{PAIRWISE VERIFICATION}\label{chap:pairclf}
In this chapter we consider the problem of verifying if two annotations are from the same animal or from different
animals. By addressing this problem we improve upon the ranking algorithm from \cref{chap:ranking} --- which ranks
the \names{} in a database based on similarity to a query --- by making semi-automatic decisions about results
returned in the ranked lists. The algorithms introduced in this chapter will assign a confidence to results in the
ranked list, and any pair above a confidence threshold can be automatically reviewed. We will demonstrate that our
decision algorithms can significantly reduce the number of manual interactions required to identify all individuals
in an unlabeled set of annotations.
To make semi-automatic decisions up to a specified confidence we develop a \emph{pairwise probabilistic
classifier} that predicts a probability distribution over a set of events given two annotations (typically a
query annotation and one of its top results in a ranked list).
Given only the information in two annotations, there are three possible decisions that can be made.
A pair of annotations is either:
\begin{enumln}
\item incomparable --- the annotations are not visually comparable,
\item positive --- the annotations are visually comparable and the same individual, or
\item negative --- the annotations are visually comparable and different individuals.
\end{enumln}
Two annotations can be incomparable if the annotations show different parts or sides of an animal, or if the
distinguishing information on an animal is obscured or occluded.
The positive and negative states each require distinguishing information to be present.
These mutually exclusive ``match-states'' are illustrated in \cref{fig:MatchStateExample}.
The multi-label classifier then predicts the probability of each of the three states, with the probabilities
necessarily summing to $1$.
\MatchStateExample{}
To construct a pairwise probabilistic classifier we turn towards supervised machine learning.
This requires that we:
\begin{enumin}
\item determine a set of labeled annotation pairs for training,
\item construct a fixed-length feature vector to represent a pair of annotations, and
\item choose a probabilistic learning algorithm.
\end{enumin}
The first requirement can be satisfied by carefully selecting representative annotation pairs, and the last
requirement is satisfied by many pre-existing algorithms (\eg{} random forests and neural networks).
The second requirement --- constructing an appropriate fixed-length feature vector --- is the most challenging.
Given enough training data and a technique to align the animals in two annotations, using image data with a
Siamese or triplicate network~\cite{taigman_deepface_2014,schroff_facenet_2015} might be appropriate, but without
both of these pre-conditions we must turn towards more traditional methods.
Recall from \cref{sec:annotrepr} that our annotation representation is an unordered bag-of-features, which cannot
be directly fed to most learning algorithms.
Therefore, we develop a method for constructing a fixed length \glossterm{pairwise feature vector} for a pair of
annotations.
This novel feature vector will take into account local matching information as well as more global information
such as GPS and viewpoint.
A collection of these features from multiple labeled annotation pairs is used to fit a random
forest~\cite{breiman_random_2001} which implements our pairwise classifier.
We choose to use random forest classifiers, in part because they are fast to train, insensitive to feature
scaling, robust to overfitting, and naturally output probabilities in a multiclass setting, and in part because
they can handle (and potentially take advantage of) missing data --- \ie{} \nan{} values in feature vectors ---
using the ``separate class'' method~\cite{ding_investigation_2010}.
A final concern investigated in this chapter is the issue of image challenges that may confound the match-state
pairwise classifier.
Recall from~\cref{sub:exptfail} that {photobombs} --- pairs of annotations where feature correspondences are caused
by a secondary animal --- are the most notable cause of such challenges.
Photobombs appear very similar to positive matches, and this similarity could confuse the match-state classifier.
However, because photobombs are inherently a pairwise property between annotations, it should be possible to
learn a separate classifier explicitly tasked with the challenge.
Therefore, we also learn a photobomb classifier using the same sort of pairwise feature vector and random forest
classifier.
This supporting classifier will allow us to increase the accuracy of our identification by restricting automatic
classification to pairs where the decision is straightforward.
The outline of this chapter is as follows.
\Cref{sec:pairfeat} details the construction of the feature vector that we use as input to the pairwise
classifier.
\Cref{sec:learnclf} describes the process of collecting training data and learning the match-state pairwise
classifier.
\Cref{sec:learnpb} extends this approach to predict secondary attributes (\eg{} if a pair is a photobomb) beyond
just the matching state.
\Cref{sec:pairexpt} presents a set of experiments that evaluate the pairwise classifier.
\Cref{sec:pairconclusion} summarizes and concludes this chapter.
\section{CONSTRUCTING THE PAIRWISE FEATURE VECTOR}\label{sec:pairfeat}
In order to use the random forest learning algorithm to address the problem of pairwise verification, we must
construct a feature vector to represent a pair of annotations.
This feature vector must contain enough information to differentiate between the positive, negative, and
incomparable classes.
In contrast to the unordered bag-of-features used to represent an annotation, the dimensions in this feature
vector must be ordered and each dimension should correspond to a specific measurement.
In practice this means that the feature vector must be ordered and have a fixed length.
We construct this feature vector to contain both global and local information.
Global information is high level metadata about the annotation pair and includes non-visual information.
The local information aggregates statistics about feature correspondences between the two annotations.
The local and global vectors are constructed separately and then concatenated to form the final pairwise feature
vector.
The remainder of this section discusses the construction of these vectors.
\subsection{The global feature vector}
The global feature vector contains information that will allow the classifier to take advantage of semantic
labels and non-visual attributes of our data to solve the verification problem.
Semantic labels such as quality and viewpoint are derived from visual information and can provide abstract
knowledge to help the classifier make a decision.
Non-visual attributes such as GPS and timestamp can be extracted from EXIF metadata and may help determine facts
not discernible from visual data alone.
The global feature vector is derived from the following ``unary'' attributes extracted from each annotation individually:
\begin{enumln}
\item Timestamp, represented in POSIX format as a float.
\item GPS latitude and longitude, represented in radians as two floats.
\item Viewpoint classification label, given as a categorical integer ranging from $1$ to $8$.
\item Quality classification label, given as a categorical integer ranging from $1$ to $5$.
\end{enumln}
We gather the GPS and timestamp attributes from image EXIF data, and the viewpoint and quality labels are outputs
of the deep classifiers previously discussed in \cref{subsec:introdataprocess}.
The GPS and timestamp attributes inform the classifier of when it is not possible for two annotations to match
(\eg{} when a pair of annotations is close in time but far apart in space) and of when two annotations were taken
around the same place and time.
Pairs of annotations taken around the same place and time tend to have a higher similarity and are more likely to
contain photobombs and scenery matches.
The viewpoint and quality attributes should help the classifier predict when pairs of annotations are not
comparable --- forcing there to be stronger evidence to form a match, such as strong correspondences on a face
turned toward the camera in both a left and right side view.
An example illustrating such a case --- where two annotations with different viewpoints are a positive match ---
is illustrated in \cref{fig:LeftRightFace}.
\LeftRightFace{}
These four ``unary'' attributes are gathered for each annotation.
Thus, for each attribute we have two measurements, one for each annotation.
However, we do not use them directly because the ordering of the annotations in each pair is arbitrary.
For each unary attribute, we either ignore it (as in the case of GPS and time) or record the minimum of the two
values in one feature dimension and the maximum in another (as is done with viewpoint and quality).
This results in $4$ unary measurements, $2$ for viewpoint and $2$ for quality.
These measurements are appended to the front of the global feature vector.
The remaining dimensions of the global feature vector are constructed by encoding relationships between pairs of
unary attributes using distance measurements.
In the case of GPS coordinates we use the haversine distance (as detailed in \cref{app:occurgroup}).
In the case of viewpoint we use a cyclic absolute difference --- \ie{} the distance between viewpoints $v_1$ and
$v_2$ is $\min(|v_1 - v_2|, 8 - |v_1 - v_2|)$.
For quality and time we simply use the absolute difference between their values.
This results in $4$ pairwise measurements, one for each global attribute.
Lastly, we include the ``speed'' of the pair, which is the GPS-distance divided by the time-delta.
Thus, the total number of global measurements is $4 + 4 + 1 = 9$.
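For illustration, a minimal sketch of assembling these nine global measurements is shown below; the annotation field names, the Earth radius used for the haversine distance, and the handling of a zero time delta are assumptions made for this sketch rather than details of the actual implementation.
\begin{minted}{python}
import math

import numpy as np


def global_feature_vector(a1, a2):
    """Sketch of the 9 global measurements for an annotation pair.

    ``a1`` and ``a2`` are assumed to be dicts with optional keys
    'time' (POSIX float), 'lat'/'lon' (radians), 'view' (1-8), 'qual' (1-5);
    missing attributes are represented as ``np.nan``.
    """
    def haversine(lat1, lon1, lat2, lon2, radius_km=6367.0):
        # Great-circle distance between two points given in radians.
        a = (math.sin((lat2 - lat1) / 2) ** 2 +
             math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * radius_km * math.asin(math.sqrt(a))

    # Unary measurements: min/max of viewpoint and quality (order independent).
    view_min, view_max = np.sort([a1['view'], a2['view']])
    qual_min, qual_max = np.sort([a1['qual'], a2['qual']])

    # Pairwise measurements: one distance per global attribute.
    time_delta = abs(a1['time'] - a2['time'])
    gps_delta = haversine(a1['lat'], a1['lon'], a2['lat'], a2['lon'])
    dv = abs(a1['view'] - a2['view'])
    view_delta = min(dv, 8 - dv)          # cyclic difference over 8 viewpoints
    qual_delta = abs(a1['qual'] - a2['qual'])

    # "Speed" of the pair: GPS distance divided by the time delta.
    speed = gps_delta / time_delta if time_delta > 0 else np.nan

    return np.array([view_min, view_max, qual_min, qual_max,
                     time_delta, gps_delta, view_delta, qual_delta, speed])
\end{minted}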
In the event that an attribute is not provided or not known (\eg{} the EXIF data is missing), then a measurement
cannot be made, so a \nan{} value is recorded instead.
To apply random forests learning, these \nan{} values must be handled by either modifying the learning algorithm
or replacing them with a number.
Ding and Simonoff investigate several methods for handling missing data in~\cite{ding_investigation_2010}, and
they conclude that the best choice is application dependent.
For our application we choose the ``separate class'' method because their experiments demonstrate that it
performs the best when \nan{} values are in both the training and testing data, which is the case for our data.
In addition to being the best fit for our application, the separate class method is simple.
The idea is to replace all \nan{} measurements with either an extremely large or extremely small number.
The choice of large or small is made independently at each node in the decision tree, depending on which choice
is best.
In this way the \nan{} values are essentially treated as a separate category because a test can always be chosen
that separates the measured and unmeasured data.
This has several effects.
In the case that a \nan{} measurement in a feature dimension is informative (\eg{} if images without timestamps
are less likely to match other annotations), the random forest can take advantage of that dimension.
However, in the more likely case that the same \nan{} measurement is uninformative, the dimension can still be
used, but it is penalized proportional to the fraction of samples where it takes a \nan{} value.
This captures the idea that a feature dimension is less likely to be informative if it cannot be measured
reliably.
However, if that feature dimension is highly informative for samples where it has a numeric value, then a node in
a decision tree can still make use of it, and the samples with \nan{} values can be split by nodes deeper in the
tree.
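The separate-class method itself requires the tree learner to try placing the missing samples on both sides of each candidate split. As a rough approximation with an off-the-shelf learner, one can duplicate every incomplete feature dimension and impute it once with a very large and once with a very small sentinel, so that any node can route the \nan{} samples to either side of its split; the column duplication and the sentinel value in the sketch below are illustrative choices, not part of the described method.
\begin{minted}{python}
import numpy as np


def separate_class_expand(X, big=1e9):
    """Approximate "separate class" handling of nan values for a standard
    decision-tree learner: each column containing nan is duplicated and
    imputed once with a very large and once with a very small sentinel.
    """
    cols = []
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if not missing.any():
            cols.append(col)
            continue
        cols.append(np.where(missing, +big, col))   # nan pushed far right
        cols.append(np.where(missing, -big, col))   # nan pushed far left
    return np.column_stack(cols)
\end{minted}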
\subsection{The local feature vector}
The local feature vector distills two orderless bag-of-features representations into a fixed length vector
containing matching information about a pair of annotations.
Three basic steps are needed to construct the local feature vector.
First we determine feature correspondences between the two annotations.
Then for each correspondence we make several measurements (\eg{} descriptor distance and spatial position).
Finally, we aggregate these measurements over all correspondences using summary statistics (\eg{} mean, sum,
std).
Later we augment this basic scheme by constructing multiple sets of feature correspondences.
Thus, the total length of the feature vector is the number of measurements times the number of summary statistics
times the number of ways feature correspondences are established.
\subsubsection{Establishing feature correspondences}
To determine feature correspondences between two annotations, $\qaid$ and $\daid$, we use what we refer to as a
one-vs-one matching algorithm.
Each annotation's descriptors are indexed for fast nearest neighbor search~\cite{muja_fast_2009}.
Keypoint correspondences are formed by searching for the reciprocal nearest neighbors between annotation
descriptors~\cite{qin_hello_2011}.
For each feature in each correspondence, the next nearest neighbor is used as a normalizer for Lowe's ratio
test~\cite{lowe_distinctive_2004}.
Because matching is symmetric, each feature correspondence is associated with two normalizing neighbors.
The feature / normalizer pair with the minimum descriptor distance is used as a normalizing pair.
If the ratio of the descriptor distance between a correspondence to the distance between the normalizing pair is
above a threshold, the correspondence is regarded as non-distinct and removed.
For the simplicity of the description we consider just one ratio threshold for now, but later we will describe
this process using multiple thresholds.
Spatial verification~\cite{philbin_object_2007} is applied to further refine the correspondences.
This one-vs-one matching algorithm results in a richer set of correspondences between annotations than would be
found using the LNBNN ranking algorithm.
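A simplified sketch of this one-vs-one matching step is given below, using scikit-learn's nearest-neighbor index in place of the FLANN index used in practice and omitting spatial verification.
\begin{minted}{python}
from sklearn.neighbors import NearestNeighbors


def one_vs_one_matches(desc_q, desc_d, ratio_thresh=0.8, norm_rank=1):
    """Sketch: reciprocal nearest-neighbor correspondences plus a ratio test.

    ``desc_q`` and ``desc_d`` are (n, 128) descriptor arrays for the query and
    database annotation.  ``norm_rank`` selects which neighbor acts as the
    normalizer (1 = second nearest; 2 when the orientation heuristic triples
    the descriptors).  Spatial verification is omitted from this sketch.
    """
    k = norm_rank + 1
    nn_d = NearestNeighbors(n_neighbors=k).fit(desc_d)
    nn_q = NearestNeighbors(n_neighbors=k).fit(desc_q)
    dist_qd, idx_qd = nn_d.kneighbors(desc_q)   # query -> database
    dist_dq, idx_dq = nn_q.kneighbors(desc_d)   # database -> query

    matches = []
    for i in range(len(desc_q)):
        j = idx_qd[i, 0]
        if idx_dq[j, 0] != i:                   # keep only mutual neighbors
            continue
        match_dist = dist_qd[i, 0]
        # Both sides supply a normalizer; the feature / normalizer pair with
        # the smaller descriptor distance is used in the ratio test.
        norm_dist = min(dist_qd[i, norm_rank], dist_dq[j, norm_rank])
        if norm_dist > 0 and match_dist / norm_dist < ratio_thresh:
            matches.append((i, j, match_dist, norm_dist))
    return matches
\end{minted}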
\subsubsection{Local measurements}
After the one-vs-one matching stage, several measurements are made for each feature correspondence.
Before describing these measurements, it will be useful to set up some notation.
Given two annotations $\qaid$ and $\daid$, consider a feature correspondence between two features $i$ and $j$
with descriptors $\desc_i$ and $\desc_j$.
Let $\descnorm_i$ be the normalizer for $i$, and let $\descnorm_j$ be the normalizer for $j$.
Note that while $i$ is from $\qaid$, its normalizer, $\descnorm_i$, is a descriptor from $\daid$.
The converse is true for $j$.
Let $c \in \curly{i, j}$ indicate which feature / normalizer pair is used in the ratio test. %
Thus, $c = \argmin{c \in \curly{i, j}}{\elltwo{\desc_c - \descnorm_c}}$.
Given these definitions, the measurements we consider are:
\begin{itemln}
\item Foregroundness score:
This is the geometric mean of the features' foregroundness measures, $\sqrt{w_i w_j}$.
This adds $1$ measurement, denoted as $\tt{fgweight}$, for each correspondence.
\item Correspondence distance:
This is the Euclidean distance between the corresponding descriptors, $\elltwo{\desc_i - \desc_j} / Z$.
This serves as a measure of visual similarity between the features.
(Recall $Z\tighteq\sqrt{2}$ for SIFT descriptors).
This adds $1$ measurement, denoted as $\tt{match\_dist}$, for each correspondence.
\item Normalizer distance:
This is the distance between a matching descriptor and the normalizing descriptor, %
$\elltwo{\desc_c - \descnorm_c} / Z$.
This serves as a measure of visual distinctiveness of the features.
We also include a weighted version of this measurement by multiplying it with the foregroundness score.
This adds $2$ measurements, denoted as $\tt{norm\_dist}$ and $\tt{weighted\_norm\_dist}$, for each
correspondence.
\item Ratio score:
This is one minus the ratio of the correspondence distance to the normalizer distance, %
$1 - \elltwo{\desc_i - \desc_j} / \elltwo{\desc_c - \descnorm_c}$.
This combines both similarity and distinctiveness.
Note that this is one minus the measure used to filter correspondences in the ratio test.
We perform this subtraction in order to obtain a score that varies directly with distinctiveness --- \ie{}
the ratio score can increase by decreasing the visual difference or increasing the distinctiveness.
We also include a weighted version of this measurement by multiplying it with the foregroundness score.
This adds $2$ measurements, denoted as $\tt{ratio\_score}$ and $\tt{weighted\_ratio\_score}$, for each
correspondence.
\item Spatial verification error:
This is the error in location, scale, and orientation, $( %
\Delta \pt_{i, j}, %
\Delta \scale_{i, j}, %
\Delta \ori_{i, j})$, as measured using~\cref{eqn:inlierdelta}.
This adds $3$ measurements, denoted as $\tt{sver\_err\_xy}$, $\tt{sver\_err\_scale}$, and
$\tt{sver\_err\_ori}$, for each correspondence.
These measurements carry information about the alignment between the feature correspondences.
\item Keypoint relative locations:
These are the xy-locations of the keypoints divided by the width and height of the annotation %
$x_i / w_{\qaid}, y_i / h_{\qaid}$ and $x_j / w_{\daid}, y_j / h_{\daid}$.
These measurements carry information about the spatial distribution of the feature correspondences.
This will be useful for determining if a pair of annotations is a photobomb because often the
photobombing animal is not in the center of the annotation.
This adds $4$ measurements, denoted as $\tt{norm\_x1}$, $\tt{norm\_y1}$, $\tt{norm\_x2}$, and
$\tt{norm\_y2}$, for each correspondence.
Note that unlike the global quality and viewpoint measures, we do not make an effort to account for the
arbitrary ordering of annotations when recording these local features.
This is to preserve the association between the spatial dimensions of each annotation.
The same is true for the next feature.
\item Keypoint scales:
These are the keypoint scale parameters $\sigma_i$ and $\sigma_j$, as measured
using~\cref{eqn:affinewarp}.
The scales indicate the size of each keypoint with respect to its annotation.
This may be useful for disregarding matches between large coarsely described features that do not carry a
significant amount of individually distinctive information.
This adds $2$ measurements, denoted as $\tt{scale1}$ and $\tt{scale2}$, for each correspondence.
\end{itemln}
\subsubsection{Summary statistics}
Once these $15$ measurements have been made for each keypoint correspondence, we aggregate them using summary
statistics.
We consider the sum, median, mean, and standard deviation over all correspondences.
%We consider the sum, inverse-sum, mean, and standard deviation over all correspondences.
We have also considered taking values from individual correspondences based on rankings and percentiles with
respect to a particular attribute (\eg{} ratio score), however we found that these did not improve the
performance of our classifiers.
In practice the summary statistics work quite well.
The resulting measurements are stacked to form the local feature vector.
This results in $15 \times 4 = 60$ measurements.
A final step we have found useful is to append an extra dimension simply indicating the total number of feature
correspondences.
So, in total there are $61$ summary statistics computed for a set of feature correspondences.
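A sketch of this aggregation step is given below; it assumes the per-correspondence measurements have already been collected into arrays keyed by the names listed above and that at least one correspondence exists.
\begin{minted}{python}
import numpy as np


def summarize_correspondences(measures):
    """Sketch: collapse per-correspondence measurements into summary statistics.

    ``measures`` is assumed to map each of the 15 measurement names listed
    above (e.g. 'ratio_score', 'norm_dist', 'sver_err_xy', ...) to a 1-d array
    with one value per feature correspondence (at least one correspondence).
    """
    stats = {}
    for name, values in measures.items():
        values = np.asarray(values, dtype=float)
        stats['sum({})'.format(name)] = np.sum(values)
        stats['median({})'.format(name)] = np.median(values)
        stats['mean({})'.format(name)] = np.mean(values)
        stats['std({})'.format(name)] = np.std(values)
    # One extra dimension: the total number of correspondences.
    stats['len(matches)'] = len(next(iter(measures.values())))
    return stats  # 15 measurements x 4 statistics + 1 = 61 values
\end{minted}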
\subsubsection{Multiple ratio thresholds}
As previously noted we establish multiple groups of feature correspondences for different threshold values of the
ratio test.
We do this because we have observed that some positive annotation pairs had all of their correspondences filtered
by the ratio test.
However, when we increase the ratio threshold, the overall classification performance decreases.
Therefore, we include both small and large thresholds to allow the classifier to have access to both types of
information.
Including larger threshold values helps ensure that most pairs generate at least a few correspondences, while
smaller threshold values capture the information in highly distinctive correspondences.
Overall, this softens the impact of the ratio test's binary threshold and adds robustness to viewpoint and pose
variations that may cause correspondences to appear slightly less distinctive.
The details of this process are as follows:
Once we have assigned feature correspondences using symmetric nearest neighbors, we create a group of
correspondences for each ratio value.
The members of each group are all correspondences with a ratio score less than that value.
Each group is then spatially verified, and the union of the groups is the final set of correspondences.
When measuring spatial verification errors, each keypoint may be associated with multiple values.
Therefore, we use the minimum spatial verification error over all values of the ratio threshold.
The local feature vector is constructed by applying summary statistics to each of these groups independently and
then concatenating the results.
Thus, the size of the local feature vector is multiplied by the number of ratio thresholds used.
While using multiple values of the ratio test can further enrich the pairwise local feature vector, there are two
trade-offs that must be taken into account.
First, spatial verification must be run multiple times, which noticeably increases computation time.
Second, the resulting size of the feature vector is larger, which can make learning more difficult due to the
curse of dimensionality.
Therefore, in our implementation we choose to use only two ratio threshold values of $0.625$ and $0.8$.
Thus, the total number of local measurements is $2 \times 61 = 122$.
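Continuing the sketches above, the grouping by ratio threshold and the concatenation of the per-group summaries might look as follows; spatial verification of each group and the minimum-error rule for the spatial verification measurements are omitted, and the correspondences are assumed to have been established with the largest ratio threshold.
\begin{minted}{python}
import numpy as np


def local_feature_vector(matches, ratio_threshs=(0.625, 0.8)):
    """Sketch: one correspondence group per ratio threshold, summarized and
    concatenated.  ``matches`` reuses the (i, j, match_dist, norm_dist)
    tuples from the matching sketch above.
    """
    local = []
    for thresh in ratio_threshs:
        # Members of a group are correspondences passing this ratio threshold.
        group = [m for m in matches if m[2] / m[3] < thresh]
        # Example measurements for this group (a subset of the 15 listed above);
        # empty groups would yield nan statistics in this sketch.
        measures = {
            'ratio_score': np.array([1.0 - m[2] / m[3] for m in group]),
            # Z = sqrt(2) for unit-normalized SIFT descriptors (see above).
            'match_dist': np.array([m[2] for m in group]) / np.sqrt(2),
        }
        local.extend(summarize_correspondences(measures).values())
    return np.array(local)
\end{minted}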
\subsubsection{Additional notes}
We have found that, for some species like plains zebras, it is important to use the keypoint orientation
heuristic described in \cref{sec:annotrepr} when computing one-vs-one matches.
This heuristic causes each keypoint to extract $3$ descriptors instead of $1$.
In this case we should not use the second nearest neighbor as the normalizer for the ratio test, because the
augmented keypoints may have similar descriptors.
We account for this by using the $3\rd{}$ nearest neighbor as the normalizer instead.
% NOTE: Old way.
%Once we have assigned feature correspondences, we filter these correspondences using the ratio test with the
%maximum value of the ratio threshold.
%Note that these threshold points are with respect to the original ratio values, and not the ratio scores used in
%the feature vector.
%Then the remaining feature correspondences are spatially verified as normal.
%At this point, we create a group of correspondences for each value of the ratio threshold.
%The member of each group are all correspondences with a ratio score less than that value.
\FloatBarrier{}
\subsection{The final pairwise feature vector}
The final pairwise feature vector is constructed by concatenating the local and the global vector.
This results in a $131$ dimensional vector containing information that a learning algorithm can use to predict if
a pair of annotations is positive, negative, or incomparable.
Of these dimensions, $9$ are from global measurements and $122$ are from local measurements.
The example in~\cref{fig:PairFeatVec} illustrates part of a final feature vector.
%\PairFeatVec{}
\begin{figure}
\begin{minted}[gobble=4]{python}
OrderedDict([('global(qual_min)', 3),
('global(qual_max)', nan),
('global(qual_delta)', nan),
('global(gps_delta)', 5.79),
('len(matches[ratio < .8])', 20),
('sum(ratio_score[ratio < .8])', 10.05),
('mean(ratio_score[ratio < .8])', 0.50),
('std(ratio_score[ratio < .8])', 0.09)])
\end{minted}
\captext[\caplbl{PairFeatVec}A pairwise feature vector]{
% ---
This is an example of a small pairwise feature vector containing local and global information.
The feature vector we use in practice contains $131$ dimensions.
Note the summary statistics in this example are all computed for correspondences with a ratio value that is less
than $0.8$.
% ---
}
\label{fig:PairFeatVec}
\end{figure}
\section{LEARNING THE MATCH-STATE CLASSIFIER}\label{sec:learnclf}
Having defined the pairwise feature vector there are two remaining steps to constructing the pairwise
classifier.
We must:
(1) choose a probabilistic learning algorithm, and
(2) select a sample of labeled training data.
We have previously stated that we will use the random forest~\cite{breiman_random_2001} as our probabilistic
classifier.
Therefore, we briefly review the details of random forest learning and prediction.
Then we describe the sampling procedure used to generate a dataset of labeled annotation pairs.
\subsection{The random forest learning algorithm}
The random forest learning algorithm~\cite{breiman_random_2001} is well understood, so we only provide a
brief overview.
A random forest is constructed by learning a forest of decision trees.
Learning begins by initializing each decision tree as a single node.
Each root node is assigned a random sample of the training data with replacement, and then a recursive
algorithm is used to grow each root node into a decision tree.
Each iteration of the recursive algorithm is given a leaf node, and will choose a test to split the training
data at the node into two child nodes.
The test is constructed by first choosing a random subset of feature dimensions.
We then choose a dimension and threshold to maximize the decrease in class-label entropy.
Note that when using the ``separate class'' method, the algorithm tests placing samples with missing data on
both the left and right side of the split.
The algorithm is then recursively executed on the right and left node until a leaf is assigned fewer than a
minimum number of training examples.
To select a test for a node, the number of candidate feature dimensions we choose is the square root of the
total number of feature dimensions.
Each decision tree predicts a probability distribution over all classes for a testing example by descending
the tree, choosing left or right based on the test chosen at the node until it reaches a leaf node.
The predicted probabilities are the proportions of training class-labels at that leaf node.
The probability prediction of the random forest is the average of the probabilities predicted by all decision
trees.
We use the random forest implementation provided by Scikit Learn~\cite{pedregosa_scikit_learn_2011}.
To choose hyper-parameters, we perform a grid search.
We find that it works best to use $256$ decision trees, and to stop branch growth once a leaf node contains
$5$ or fewer training samples.
For other parameters we use the implementation defaults.
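For reference, these hyper-parameters translate into a scikit-learn configuration along the following lines; the random seed is an illustrative addition, and \nan{} values are assumed to have been handled beforehand (\eg{} by the separate-class scheme).
\begin{minted}{python}
from sklearn.ensemble import RandomForestClassifier

# 256 trees, sqrt(n_features) candidate dimensions per split, and growth
# stopped once a node holds 5 or fewer training samples.
match_clf = RandomForestClassifier(
    n_estimators=256,
    min_samples_split=6,    # nodes with 5 or fewer samples become leaves
    max_features='sqrt',
    bootstrap=True,
    random_state=0,         # illustrative; not specified in the text
)

# X_train: (n_pairs, 131) pairwise feature vectors with nan values already
# handled; y_train: match-state labels.
# match_clf.fit(X_train, y_train)
# probs = match_clf.predict_proba(X_test)   # rows sum to 1 per pair
\end{minted}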
\subsection{Sampling a labeled dataset of annotation pairs}
Now that we have chosen a learning algorithm, the last remaining step is the selection of training data and
generation of labels.
Recall that the purpose of our classifier is to output a probability distribution over three labels:
positive, negative and incomparable.
Given a pair of annotations we need to assign one of these three labels using ground truth data.
In recent versions of our system, this ground truth label is stored along with each unordered pair of
annotations that has been manually reviewed, but because this is a new feature, only a few pairs have been
assigned an explicit three-state label.
Therefore, we must make use of the name and viewpoint labels assigned to each annotation by previous versions
of the system.
This allows us to determine if an annotation pair shows the same or different animals, but it does not allow
us to determine if the pair is comparable.
To account for this we use heuristics to assign the incomparable label using the viewpoints, and if either
annotation is not assigned a viewpoint it is assumed that they are comparable because most images in our
datasets are taken from a consistent viewpoint (\ie{} collection events were designed to reduce
incomparability).
Given a pair of annotations, a training label is assigned as follows.
First, if an explicit three-state label exists, return it.
Otherwise, we must heuristically decide if the pair is comparable based on viewpoint information.
If the heuristics determine that a pair is not comparable, then return incomparable.
In all other cases return positive if the annotations share a name label and negative if they do not.
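A sketch of this label-assignment rule is given below; the lookup containers and the comparability predicate are hypothetical stand-ins for the corresponding database queries and viewpoint heuristics.
\begin{minted}{python}
def training_label(pair, explicit_labels, names, comparable):
    """Sketch of the ground-truth label assignment described above.

    ``explicit_labels`` maps an unordered annotation pair to a manually
    reviewed three-state label when one exists; ``names`` maps an annotation
    to its name label; ``comparable(a1, a2)`` encapsulates the viewpoint
    heuristics (returning True when viewpoints are unknown).
    """
    a1, a2 = pair
    key = frozenset(pair)
    # 1) An explicit manual review always wins.
    if key in explicit_labels:
        return explicit_labels[key]
    # 2) Otherwise fall back on the viewpoint heuristics.
    if not comparable(a1, a2):
        return 'incomparable'
    # 3) Finally, use the name labels.
    return 'positive' if names[a1] == names[a2] else 'negative'
\end{minted}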
In order to select pairs from our ground truth dataset, we sample representative pairs of annotations guided
by the principle of selecting examples that exhibit a range of difficulties~\cite{shi_embedding_2016} (\eg
hard-negatives and moderate positives).
We use the LNBNN ranking algorithm to estimate how easy or difficult it might be to predict the match-state
of a pair.
Pairs with higher LNBNN scores will be easier to classify for positive examples and will be more difficult
for negative examples, and lower scores will have the reverse effect.
Specifically, to sample a dataset for learning, we first rank the database for each query image using the
ranking algorithm.
We partition the ranked lists into two parts:
a list of correct matches and a list of incorrect matches.
We select annotations randomly from the top, middle, and bottom of each list.
For positive examples we select $4$ from the top, $2$ from the middle, $2$ from the bottom, and $2$ randomly.
For negative examples we select $3$ from the top, $2$ from the middle, $1$ from the bottom, and $2$ randomly.
If there are not enough examples to do this, then all are taken.
We include all pairs explicitly labeled as incomparable because there are only a few such examples.
If this was not the case, then we would include an additional partition for incomparable cases.
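The sampling for one partition of a ranked list might be sketched as follows; the division of the list into thirds and the random seed are illustrative assumptions, since the text only specifies how many examples are drawn from the top, middle, bottom, and at random.
\begin{minted}{python}
import random


def sample_from_ranked_list(ranked, counts=(4, 2, 2, 2), seed=0):
    """Sketch: sample from the top, middle, and bottom of a ranked list plus a
    few fully at random.  ``counts`` gives the number taken from each region;
    the text uses (4, 2, 2, 2) for positives and (3, 2, 1, 2) for negatives.
    If there are not enough examples, everything is taken.
    """
    if len(ranked) <= sum(counts):
        return list(ranked)
    rng = random.Random(seed)
    third = len(ranked) // 3
    regions = [ranked[:third], ranked[third:2 * third], ranked[2 * third:], ranked]
    chosen = []
    for region, k in zip(regions, counts):
        chosen.extend(rng.sample(region, min(k, len(region))))
    # Remove duplicates while preserving order (annotation ids are hashable).
    return list(dict.fromkeys(chosen))
\end{minted}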
\section{SECONDARY CLASSIFIER TO ADDRESS PHOTOBOMBS}\label{sec:learnpb}
It is useful to augment the primary match-state pairwise classifier with a secondary classifier able to
determine if a pair of annotations contains information that might confuse the main classifier.
These confusing annotation pairs should not be considered for automatic review.
One of the most challenging of these secondary states is one that we refer to as a {photobomb}.
A pair of annotations is a photobomb if a secondary animal in one annotation matches an animal in another
annotation (\eg see~\cref{fig:PhotobombExample}).
Only the primary animal in each annotation should be used to determine identity, but photobombs provide
strong evidence of matching that can confuse a verification algorithm.
\PhotobombExample{}
\FloatBarrier{}
During events like the \GZC{} we labeled several annotation pairs as photobombs.
Using these labels we will construct a classifier in the same way that we constructed the primary match-state
classifier.
We start with the same set of training data used to learn the primary classifier.
Because we only have a small set of explicitly labeled photobomb pairs, we include all such pairs in the
training dataset.
Any pair in this set that is explicitly labeled as a photobomb is given that label, otherwise it is labeled
as not a photobomb.
\section{PAIRWISE CLASSIFICATION EXPERIMENTS}\label{sec:pairexpt}
We evaluate the pairwise classifiers on two datasets, one of plains zebras and another of Grévy's zebras
with $5720$ and $2283$ annotations, respectively.
The details of these datasets were previously described in \cref{sub:datasets}.
To evaluate our pairwise classifier, we choose a sample of annotation pairs from these datasets as detailed
in \cref{sec:learnclf}.
This results in $47312$ pairs for plains zebras and $18010$ pairs for Grévy's zebras.
The number of pairs per class is detailed in \cref{tbl:PairDBStats}.
Note that our datasets only contain a small number of labeled incomparable and photobomb cases.
%This is because these classes must be explicitly labeled between pairs of annotations, whereas positive and
% negative labels can be inferred depending on if a pair of annotation shares a name label.
For plains zebras only $53$ incomparable cases were explicitly labeled, while the other $300$ were generated
using heuristics.
For Grévy's zebras, there are no incomparable cases because all annotations have a right-side viewpoint.
Therefore, the primary focus of our experiments will be separating positive from negative matching states.
\PairDBStats{}
After sampling, we have a set of annotation pairs and each is associated with a ground truth matching state
label of either positive, negative, or incomparable.
Additionally, each pair is also labeled as either a photobomb or not a photobomb.
For each pair we construct a pairwise feature vector as described in \cref{sec:pairfeat}.
We split this dataset of labeled annotation pairs into multiple disjoint training and testing sets using
grouped stratified $k$-fold cross validation (we use $3$ folds).
Note that this grouping introduces a slight variation on standard stratified $k$-fold cross validation.
First, we enforce that all samples (pairs of annotations) within the same name or between the same two names are
placed in the same group.
Then, the cross validation is constrained such that all samples in a group are either in the training or
testing set for each split.
In other words, this means that a specific individual cannot appear in both the training and testing set.
The same is true for specific pairs of individuals.
By grouping our cross-validation folds in this way, we ensure that the classifier cannot exploit individual
specific information to improve its predictions on the test set.
This increases our confidence that our results will generalize to new individuals.
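A sketch of this grouped cross validation using scikit-learn is shown below; the group key is the unordered pair of name labels, \texttt{StratifiedGroupKFold} requires a recent scikit-learn release, and a plain \texttt{GroupKFold} would provide the grouping without stratification.
\begin{minted}{python}
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold


def grouped_folds(pairs, names, y, n_splits=3, seed=0):
    """Sketch: grouped stratified k-fold splits for annotation pairs.

    Every sample within the same name or between the same two names shares a
    group, so whole groups stay on one side of each split.  ``names`` maps an
    annotation to its name label (a hypothetical lookup).
    """
    keys = ['|'.join(sorted((str(names[a]), str(names[b])))) for a, b in pairs]
    group_ids = np.unique(keys, return_inverse=True)[1]
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    X_placeholder = np.zeros((len(pairs), 1))   # only y and groups drive the split
    return list(cv.split(X_placeholder, y, groups=group_ids))
\end{minted}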
For each cross validation split, we train the matching state and photobomb-state classifier on the training
set and then predict probabilities on each sample in the testing set.
Because the cross validation is $k$-fold and the splits are disjoint, each sample appears in a testing set
exactly once.
This means that each sample in our dataset is assigned match-state and photobomb-state probabilities exactly
once.
Because each prediction is made using classifiers trained on disjoint data, the predictions will be unbiased.
Thus, each sample in the dataset has a match-state and photobomb-state probability.
Therefore, we can use all sample pairs to evaluate the performance of each classifier.
In our experiments we compare these predicted match-state probabilities to the scores generated by LNBNN.
In order to do this, we must generate an LNBNN score for each pair in our sample.
This is done by first taking the unique set of annotations in the sample of pairs.
Then, we use these annotations as a database.
We issue each annotation as a query, taking care to ignore self-matches.
Any (undirected) pair in our dataset that appears as a query / database pair in the ranked list is assigned
that LNBNN score.
If the same pair is in two ranked lists, then the maximum of the two scores is used.
Any pair that does not appear in any ranked list (because LNBNN failed to return it) is implicitly given a
score of zero.
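A sketch of this scoring rule is given below, assuming the ranked lists are available as a mapping from each query annotation to its scored results.
\begin{minted}{python}
def lnbnn_pair_scores(ranked_lists, pairs):
    """Sketch: assign each undirected sample pair the maximum LNBNN score it
    received in any ranked list, and zero if it never appeared.

    ``ranked_lists`` maps a query annotation to a list of
    (database annotation, LNBNN score) results; ``pairs`` is the set of
    sampled annotation pairs.
    """
    scores = {frozenset(p): 0.0 for p in pairs}
    for query, results in ranked_lists.items():
        for daid, score in results:
            if daid == query:
                continue                    # ignore self-matches
            key = frozenset((query, daid))
            if key in scores:
                scores[key] = max(scores[key], score)
    return scores
\end{minted}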
Note that the scores from the LNBNN ranking algorithm can only be used to distinguish positive cases from
non-positive pairs.
Unlike the match-state classifier, the LNBNN scores cannot be used to distinguish non-positive cases as
either negative or incomparable.
Later in \cref{chap:graphid}, it will be vitally important to distinguish these cases, but for now, in order
to fairly compare the two algorithms, we only consider positive probabilities from the match-state
classifier.
We have now predicted match-state probabilities, photobomb-state probabilities, and one-vs-many LNBNN scores
for each pair in our dataset.
In the next subsections we will analyze the predictions of each classifier.
For both the match-state and photobomb-state classifiers we will measure the raw number of classification
errors and successes in a confusion matrix.
We will then use standard classification metrics like precision, recall, and the Matthews correlation
coefficient to summarize these confusion matrices.
We will also inspect the importance of each feature dimension of our pairwise feature vector as measured
during random forest learning.
For each classifier we will present several examples of failure cases to illustrate where improvements can be
made.
Additionally, we will compare the match-state classifier to LNBNN in two ways.
First, we will compare the original LNBNN ranking against a re-ranking using the positive probabilities.
Second, we will compare the ability of LNBNN and the match-state classifier to predict if a pair is positive
or not by looking at the distribution of positive and non-positive scores as well as the ROC curves.
\FloatBarrier{}
\subsection{Evaluating the match-state classifier}
The primary classifier predicts the matching state (positive, negative, incomparable) of a pair of
annotations.
Each pair of annotations is assigned a probability for each of these states, and those probabilities sum
to one.
In this context we classify a pair as the state with the maximum predicted probability.
However, in practice we will choose thresholds for automatic classification where the false positive rate
is sufficiently low.
From these multiclass classifications we build a confusion matrix for plains and Grévy's zebras.
These confusion matrices are shown in \cref{tbl:ConfusionMatch}.
In \cref{tbl:EvalMetricsMatch} we summarize the information in each confusion matrix by computing the
precision, recall, and Matthews correlation coefficient (MCC)~\cite{powers_evaluation_2011} for each
class.
The MCC provides a measurement of overall multiclass classification performance not biased by the number
of samples contained in each class.
An MCC ranges from $+1$, indicating perfect predictions, to $-1$, indicating pathological inverse
predictions, and $0$ represents random predictions.
The measured MCC of $0.83$ for plains zebras and $0.91$ for Grévy's zebras, indicates that our
match-state classifiers have strong predictive power.
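As a point of reference, the confusion matrix and the MCC can be computed directly with scikit-learn, which implements the multiclass generalization of the MCC; the labels below are a toy illustration, not the experimental data.
\begin{minted}{python}
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy illustration with hypothetical labels.
y_true = ['positive', 'positive', 'negative', 'negative', 'incomparable']
y_pred = ['positive', 'negative', 'negative', 'negative', 'incomparable']

states = ['positive', 'negative', 'incomparable']
cm = confusion_matrix(y_true, y_pred, labels=states)  # rows: truth, cols: prediction
mcc = matthews_corrcoef(y_true, y_pred)               # +1 perfect, 0 random, -1 inverted
\end{minted}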
\ConfusionMatch{}
\EvalMetricsMatch{}
\FloatBarrier{}
In addition to being strong predictors of match state, we show that the positive probabilities from our
classifiers can be used to re-rank the ranked list produced by LNBNN.
Using the procedure from our experiments in~\cref{sec:rankexpt}, we issue each annotation in our testing
set as a query and obtain a ranked list.
Using the fraction of correct results found at each rank, we construct a cumulative match characteristic
(CMC) curve~\cite{decann_relating_2013}.
We denote this CMC curve as \pvar{ranking}.
Then we take each ranked list and compute match-state probabilities for each top ranked pair of
query/database annotations.
We use the positive probabilities to re-rank the lists, and then we construct another CMC curve
corresponding to these new ranks.
This re-ranked CMC curve is denoted as \pvar{rank+clf}.
The results of this experiment --- illustrated in~\cref{fig:ReRank} --- clearly demonstrate that the
number of correct matches returned at rank $1$ is improved by re-ranking with our pairwise classifier.
\ReRank{}
\FloatBarrier{}
\subsubsection{Binary positive classification}
Although the accuracy of multiclass predictions is important, the most important task in animal
identification is to determine when a pair of annotations is positive and when it is not.
Therefore, we design an experiment that tests how well our match-state classifier can distinguish
positive pairs from other cases.
We will compare our learned classifiers to a baseline method that simply uses the LNBNN scores.
%Ideally our learned classifiers will result in superior separation when compared to using just the LNBNN
% scores.
We begin this comparison by illustrating histograms of LNBNN scores and positive pairwise probabilities
for positive and non-positive cases in \cref{fig:PositiveHist}.
The advantages of the pairwise scores are immediately noticeable.
The pairwise scores provide a superior separation between positive and non-positive cases.
Furthermore, the pairwise scores range from zero to one, which makes them easier to interpret than the
unbounded LNBNN scores.
We can make a more direct and precise comparison by considering both LNBNN scores and positive pairwise
probabilities as binary classifiers.
In this context, we can measure the separability of each method using the area under an ROC curve.
The ROC curves comparing the LNBNN scores and the learned probabilities are illustrated
in~\cref{fig:PositiveROC}.
In all cases the learned AUC is significantly better than the AUC of LNBNN.
For plains zebras, the pairwise AUC is $0.97$ and the LNBNN AUC is $0.82$.
For Grévy's zebras the pairwise AUC is $0.99$, and the LNBNN AUC is $0.89$.
These experiments clearly demonstrate that the pairwise classifier outperforms LNBNN in this
classification task.
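A sketch of this comparison is given below; the array names are hypothetical, and the positive class is encoded as $1$ against all non-positive pairs.
\begin{minted}{python}
from sklearn.metrics import roc_auc_score, roc_curve


def compare_separability(y_binary, p_positive, s_lnbnn):
    """Sketch: compare positive probabilities and LNBNN scores as binary
    classifiers for the same pairs via ROC analysis.
    """
    auc_clf = roc_auc_score(y_binary, p_positive)
    auc_lnbnn = roc_auc_score(y_binary, s_lnbnn)
    # An operating point can then be read off the classifier's ROC curve,
    # e.g. the threshold whose false positive rate stays below a target.
    fpr, tpr, thresholds = roc_curve(y_binary, p_positive)
    return auc_clf, auc_lnbnn, (fpr, tpr, thresholds)
\end{minted}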
In addition to illustrating the effectiveness of our classifiers when classifications are made for all
samples, the ROC curves --- which plot the true positive rate and the false positive rate as a function
of a threshold --- demonstrate that an operating point can be chosen to automatically classify a
significant number of positive pairs while making very few mistakes.
For plains zebras, a true positive rate of $0.25$ results in $4146$ true positives and only $38$ false
positives.
For Grévy's, an operating point with a true positive rate of $0.5$, results in $2501$ true positives and
only $28$ false positives.
Therefore, by carefully selecting an operating point we can bypass a significant number of manual reviews
without making a significant number of mistakes when adding new annotations to our dataset.
\PositiveHist{}
\PositiveROC{}
\FloatBarrier{}
\subsubsection{Feature importance}
To better understand which feature dimensions are the most useful for classification, we measure the
``mean decrease impurity'' (MDI)~\cite{louppe_understanding_2014} of each feature dimension.
The MDI is a measure of feature importance that is computed during the training phase.
As each decision tree is grown, for each node, we record the number of training samples of each class
that reach it.
This is used to compute the impurity of each node, \ie{} the entropy of class labels.
Each node is weighted using the fraction of total samples that reach it.
The weighted impurity decrease of a node is its weighted impurity minus the weighted sum of its
children's impurity.
The MDI for a single feature dimension in a single tree is computed as the weighted impurity decrease of
all nodes using that feature.
The overall MDI for the forest is obtained by averaging over all trees.
Using the MDI we ``prune'' the dimensions of our sample feature vectors by removing the least important
dimensions.
We measure the effect of pruning on classification accuracy using a greedy algorithm.
First we learn a random forest on a training set and compute the MCC on a test set.
Then we find the feature dimension with the lowest MDI and remove it from the dataset.
We repeat these two steps until there is only a single feature dimension remaining.
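A sketch of this greedy pruning loop is given below, using the \texttt{feature\_importances\_} attribute of a fitted scikit-learn forest as the MDI.
\begin{minted}{python}
import numpy as np
from sklearn.base import clone
from sklearn.metrics import matthews_corrcoef


def greedy_mdi_prune(clf, X_train, y_train, X_test, y_test):
    """Sketch: repeatedly drop the dimension with the lowest MDI importance
    and record the test-set MCC after each removal.
    """
    keep = list(range(X_train.shape[1]))
    history = []
    while keep:
        model = clone(clf).fit(X_train[:, keep], y_train)
        mcc = matthews_corrcoef(y_test, model.predict(X_test[:, keep]))
        history.append((len(keep), mcc))
        if len(keep) == 1:
            break
        # feature_importances_ holds the MDI of each remaining dimension.
        worst = int(np.argmin(model.feature_importances_))
        del keep[worst]
    return history
\end{minted}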
The impact of pruning on classification accuracy is illustrated in~\cref{fig:MatchPrune}, where we plot
the MCC as a function of the number of feature dimensions remaining.
The results of this experiment indicate that there is a small increase in classification accuracy from
pruning feature dimensions.
We find that reducing the number of feature dimensions to $25$ increases the MCC by $0.0155$ for plains
zebras, and using $61$ dimensions increases the MCC by $0.0048$ for Grévy's.
Note that once the number of feature dimensions falls below ${\sim}20$ the performance starts to degrade
and suffers a harsh drop at ${\sim}10$ dimensions.
This suggests that these top features are the most important to making a match-state prediction.
The numeric importance of these top $10$ pruned feature dimensions is reported in
\cref{tbl:ImportantMatchFeatPrune}.
For both species we find that the most important features are statistics involving the ratio score.
These statistics indicate the distinctiveness and similarity of a pair of annotations.
Statistics about the spatial verification error, which signifies when the matches are not well aligned,
are also important for both species.
For plains zebras, the global viewpoint is important because the dataset contains incomparable examples.
Even though we have shown that a small improvement can be made by pruning to only the most important
feature dimensions, we choose to use the full $131$ dimensions in the remainder of our experiments because
the overall performance gain is small.
\MatchPrune{}
\ImportantMatchFeatPrune{}
\FloatBarrier{}
\subsubsection{Failure cases}
Lastly we investigate several causes of failure.
The primary reasons that cause the match-state classifier to fail are similar to those that create
difficulty for the ranking algorithm.
These were previously discussed in~\cref{sub:exptfail}.
Factors such as viewpoint, quality, and occlusion are inherently challenging because they reduce the
similarity between matching descriptors, and increase the disparity between the annotations.
This makes it difficult to correctly establish and spatially verify feature correspondences.
Photobombing animals and scenery matches also pose a challenge to the match-state classifier, often
causing it to produce ambiguous probabilities.
We separate failure cases into three main categories:
(1) failure to classify a pair as positive,
(2) failure to classify a pair as negative, and
(3) failure to classify a pair as incomparable.
The examples in~\cref{fig:PairFailPN} illustrate the first and most important failure case category.
Due to occlusion, quality, and viewpoint, the established correspondences were not distinctive enough for
the classifier to confidently predict positive.
However, in all but the last case the probabilities are ambiguous, which implies that improvements could
be made to fix these failures.
In the last case, only a few distinctive matches were made, which suggests that the procedure that establishes
feature correspondences could be improved.
Examples of the second case are illustrated in~\cref{fig:PairFailNP}.
These examples were incorrectly classified as positive, even though they are negative.
In two cases this is due to scenery matches and photobombing animals.
In the other cases the pairwise classifier matches similar regions of the animal, but it is unable to key
in on the strong negative evidence provided by different distinctive patterns on corresponding body parts
of the animal.
In the future, algorithms powered by convolutional neural networks may be able to take advantage of this
strong negative evidence.
For the third case, it is not surprising that the examples in~\cref{fig:PairFailIN} were not labeled as
incomparable because only a small amount of incomparable training data was available.
Furthermore, photobombs and scenery matches also seem to cause a problem for incomparable examples.
Note that in the majority of all three types of failure, the examples have non-extreme probabilities
assigned to each state.
This demonstrates that the classifier is not confident in these failed predictions.
%We will take advantage of this in the next chapter.
Interestingly, inspection of the failure cases revealed several labeling errors in the database.
The examples illustrated in ~\cref{fig:MatchLabelErrors} all show pairs of annotations with non-positive
labels that should have been labeled as positive.
Note that the positive probabilities from two of these examples are very close to $1.0$, indicating that
the classifier is confident that these ground truth labels are incorrect.
Furthermore, these success cases were found in the context of noisy ground truth labels, showing that our
classifier is robust to errors in ground truth labels.
\PairFailPN{}
\PairFailNP{}
\PairFailIN{}
\MatchLabelErrors{}
\FloatBarrier{}
%---------
\FloatBarrier{}
\subsection{Evaluating the photobomb-state classifier}
In this subsection we perform a set of experiments --- similar to those used to evaluate the match-state
classifier --- to demonstrate the effectiveness of our photobomb-state classifier.
Because there are only a few photobomb training examples, the random forest is not able to learn strong
probabilities.
In the initial design of these experiments, a pair was classified as a photobomb if its probability was
greater than $0.5$, but we found that this caused the MCC of Grévy's zebras to drop to $0.0$.
However, because this is a binary classification problem we can choose any threshold as an operating
point and classify a pair as a photobomb if its probability is above that threshold and as not a
photobomb otherwise.
Therefore, for each dataset, we choose a photobomb-state probability threshold to maximize the MCC as
illustrated in \cref{fig:PBThreshMCC}.
We classify a pair as a photobomb if its probability is above $0.13$ for plains zebras and $0.17$ for
Grévy's zebras.
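A sketch of this threshold selection is given below; the $101$-point grid over $[0, 1]$ is an illustrative choice.
\begin{minted}{python}
import numpy as np
from sklearn.metrics import matthews_corrcoef


def best_photobomb_threshold(y_true, p_photobomb, grid=np.linspace(0.0, 1.0, 101)):
    """Sketch: scan candidate thresholds on the photobomb probability and keep
    the one whose binary classification maximizes the MCC.
    """
    scores = [matthews_corrcoef(y_true, p_photobomb >= t) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
\end{minted}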
Using these adjusted thresholds, the overall performance is measured using the classification confusion
matrices in \cref{tbl:ConfusionPhotobombII} and the evaluation metrics in \cref{tbl:EvalMetricsPhotobombII}.
\PBThreshMCC{}
\ConfusionPhotobombII{}
\EvalMetricsPhotobombII{}
Due to the small amount of available training data, the performance of the photobomb-state classifier is
weaker than the match-state classifier.
By choosing the appropriate thresholds we achieve an MCC of $0.34$ for plains zebras and $0.40$ for
Grévy's zebras.
These scores indicate that each photobomb-state classifier has weak but significant predictive power.
While these MCCs are not overwhelmingly strong, they do demonstrate that each classifier is able to learn
from only a few labeled training examples, and it seems likely that the MCCs would significantly improve
with more labeled training data.
From these measurements we can conclude that the photobomb-state classifier is learning.
\FloatBarrier{}
Our conclusion that the photobomb-state classifier is learning is supported by the top $10$ most important
features, as measured using the MDI.
It makes intuitive sense that the important features --- illustrated in~\cref{tbl:ImportantPBFeat} ---
would be the ones selected by the random forest learning algorithm.
Because pairs of annotations taken around the same place and time are more likely to be photobombs, each
random forest places the most weight on global feature dimensions such as time delta, GPS delta, and
speed.
The classifiers also make use of the spatial position of the feature correspondences, which can be
indicative of a photobomb (\eg{} when all the matches are in the top left corner of an annotation).
Because the random forest seems to be selecting reasonable features, the main source of weakness is
likely the limited amount of training data.
\ImportantPBFeat{}
We gain further insight into the photobomb-state classifier by considering several failure case
examples, which are illustrated in \cref{fig:PBFailures}.
In each example we observe that the classifier is either confused or it does not have enough information
to label a pair as a photobomb.
Furthermore, the classifier was able to find several new photobomb pairs that were mislabeled in the
database.
We illustrate several of these mislabeled cases in~\cref{fig:PBLabelErrors}.
Notice that many of the probabilities predicted for these cases are well above the thresholds.
Predicting high probabilities for these undiscovered photobomb cases indicates that the photobomb-state
classifier is stronger than our measurements suggest.
These mislabeled ground truth cases also help explain why the predicted probabilities are so low; the
learning algorithm is encouraged to incorrectly classify them.
Reviewing and relabeling all of these cases would improve results by both removing noise from the
training set and increasing the number of labeled examples.
\PBFailures{}
\PBLabelErrors{}
\FloatBarrier{}
\subsection{Classifier experiment conclusions}
In these experiments we have demonstrated that our pairwise match-state classifier is able to reliably
separate positive from negative and incomparable cases.
When making automatic decisions, these probabilities have several advantages over the LNBNN scores
from~\cref{chap:ranking}.
Not only do they have more predictive power, they are interpretable, and they always range between $0$
and $1$.
We will use these classifiers to make automatic decisions about pairs of annotations where the
match-state probability is above a threshold.
Our experiments have shown that it is possible to select a threshold where the false positive rate is
sufficiently low.
Furthermore, the classifier is able to improve the ranking algorithm by re-ranking its results.
Lastly, because the classifier predicts probabilities independent of their position in the ranked list,
it can be used to determine when a query individual is new --- \ie{} does not have a correct match in the
database.
The performance of the secondary photobomb classifier is weaker, but this is likely due to a small amount of
training data.
Even in its weak state, it can be used to prevent automatic review of some photobomb cases by adjusting
the classification threshold to be less than $0.5$.
Because the photobomb classifier can only prevent automatic decisions and does not make them, the cost of
including it in our algorithms is small, and by doing so we will increase the amount of labeled training
data from which a stronger photobomb classifier can be bootstrapped.
\section{SUMMARY OF PAIRWISE CLASSIFICATION}\label{sec:pairconclusion}
In this chapter we have constructed a verification mechanism that can predict the probability that a pair of
annotations is positive, negative, or incomparable.
We have also constructed a secondary classifier that can predict when --- namely in the case of photobombs
--- a pair of annotations might confuse the primary match-state classifier.
This was done by constructing a feature vector that contains matching information about a pair of
annotations.
We have constructed a representative training set by selecting hard, moderate, and easy training examples.
We used the random forest learning algorithm to train our classifiers.
Our experiments demonstrate that the match-state classifier is able to strongly separate positive and
negative cases.
The performance of the photobomb-state classifier was weaker, but could likely be improved with more training
data.
Based on our experiments, it is clear that the ranking algorithm is improved by an automatic verifier, but by
themselves ranking and verification are not enough to robustly address animal identification.
There is no mechanism for error recovery, nor is there a mechanism for determining when identification is
complete.
This means the operating point for the automatic review threshold must be set conservatively to avoid any
errors, which results in more work for a human reviewer.
These issues are addressed in \cref{chap:graphid} using a graph-based framework to manage the identification
process.
This framework will use graph connectivity to further reduce the number of required manual reviews, detect
when errors have occurred, recover from the errors, and stop the identification
process in a timely manner.