\begin{comment}
fixtex --reformat --fpaths chapter4-pairclf.tex --print
fixtex --fpaths chapter4-pairclf.tex --outline --asmarkdown --numlines=999
fixtex --fpaths chapter4-pairclf.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter4-pairclf.md
https://www.languagetool.org/
\end{comment}
\newcommand{\nan}{\text{nan}}
\chapter{PAIRWISE VERIFICATION}\label{chap:pairclf}
In this chapter we consider the problem of verifying if two annotations are from the same animal or from different
animals. By addressing this problem we improve upon the ranking algorithm from \cref{chap:ranking} --- which ranks
the \names{} in a database based on similarity to a query --- by making semi-automatic decisions about results
returned in the ranked lists. The algorithms introduced in this chapter will assign a confidence to results in the
ranked list, and any pair above a confidence threshold can be automatically reviewed. We will demonstrate that our
decision algorithms can significantly reduce the number of manual interactions required to identify all individuals
in an unlabeled set of annotations.
To make semi-automatic decisions up to a specified confidence we develop a \emph{pairwise probabilistic
classifier} that predicts a probability distribution over a set of events given two annotations (typically a
query annotation and one of its top results in a ranked list).
Given only the information in two annotations, there are three possible decisions that can be made.
A pair of annotations is either:
\begin{enumln}
\item incomparable --- the annotations are not visually comparable,
\item positive --- the annotations are visually comparable and the same individual, or
\item negative --- the annotations are visually comparable and different individuals.
\end{enumln}
Two annotations can be incomparable if the annotations show different parts or sides of an animal, or if the
distinguishing information on an animal is obscured or occluded.
The positive and negative states each require distinguishing information to be present.
These mutually exclusive ``match-states'' are illustrated in \cref{fig:MatchStateExample}.
The multi-label classifier then predicts the probability of each of the three states, with the probabilities
necessarily summing to $1$.
\MatchStateExample{}
To construct a pairwise probabilistic classifier we turn towards supervised machine learning.
This requires that we:
\begin{enumin}
\item determine a set of labeled annotation pairs for training,
\item construct a fixed-length feature vector to represent a pair of annotations, and
\item choose a probabilistic learning algorithm.
\end{enumin}
The first requirement can be satisfied by carefully selecting representative annotation pairs, and the last
requirement is satisfied by many pre-existing algorithms (\eg{} random forests and neural networks).
The second requirement --- constructing an appropriate fixed-length feature vector --- is the most challenging.
Given enough training data and a technique to align the animals in two annotations, using image data with a
Siamese or triplicate network~\cite{taigman_deepface_2014,schroff_facenet_2015} might be appropriate, but without
both of these pre-conditions we must turn towards more traditional methods.
Recall from \cref{sec:annotrepr} that our annotation representation is an unordered bag-of-features, which cannot
be directly fed to most learning algorithms.
Therefore, we develop a method for constructing a fixed length \glossterm{pairwise feature vector} for a pair of
annotations.
This novel feature vector will take into account local matching information as well as more global information
such as GPS and viewpoint.
A collection of these features from multiple labeled annotation pairs is used to fit a random
forest~\cite{breiman_random_2001} which implements our pairwise classifier.
We choose to use random forest classifiers, in part because they are fast to train, insensitive to feature
scaling, robust to overfitting, and naturally output probabilities in a multiclass setting, and in part because
they can handle (and potentially take advantage of) missing data --- \ie{} \nan{} values in feature vectors ---
using the ``separate class'' method~\cite{ding_investigation_2010}.
A final concern investigated in this chapter is the issue of image challenges that may confound the match-state
pairwise classifier.
Recall from~\cref{sub:exptfail} that {photobombs} --- pairs of annotations where feature correspondences are caused
by a secondary animal --- are the most notable cause of such challenges.
Photobombs appear very similar to positive matches, and this similarity could confuse the match-state classifier.
However, because photobombs are inherently a pairwise property between annotations, it should be possible to
learn a separate classifier explicitly tasked with the challenge.
Therefore, we also learn a photobomb classifier using the same sort of pairwise feature vector and random forest
classifier.
This supporting classifier will allow us to increase the accuracy of our identification by restricting automatic
classification to pairs where the decision is straightforward.
The outline of this chapter is as follows.
\Cref{sec:pairfeat} details the construction of the feature vector that we use as input to the pairwise
classifier.
\Cref{sec:learnclf} describes the process of collecting training data and learning the match-state pairwise
classifier.
\Cref{sec:learnpb} extends this approach to predict secondary attributes (\eg{} if a pair is a photobomb) beyond
just the matching state.
\Cref{sec:pairexpt} presents a set of experiments that evaluate the pairwise classifier.
\Cref{sec:pairconclusion} summarizes and concludes this chapter.
\section{CONSTRUCTING THE PAIRWISE FEATURE VECTOR}\label{sec:pairfeat}
In order to use the random forest learning algorithm to address the problem of pairwise verification, we must
construct a feature vector to represent a pair of annotations.
This feature vector must contain enough information to differentiate between the positive, negative, and
incomparable classes.
In contrast to the unordered bag-of-features used to represent an annotation, the dimensions in this feature
vector must be ordered and each dimension should correspond to a specific measurement.
In practice this means that the feature vector must be ordered and have a fixed length.
We construct this feature vector to contain both global and local information.
Global information is high level metadata about the annotation pair and includes non-visual information.
The local information aggregates statistics about feature correspondences between the two annotations.
The local and global vectors are constructed separately and then concatenated to form the final pairwise feature
vector.
The remainder of this section discusses the construction of these vectors.
\subsection{The global feature vector}
The global feature vector contains information that will allow the classifier to take advantage of semantic
labels and non-visual attributes of our data to solve the verification problem.
Semantic labels such as quality and viewpoint are derived from visual information and can provide abstract
knowledge to help the classifier make a decision.
Non-visual attributes such as GPS and timestamp can be extracted from EXIF metadata and may help determine facts
not discernible from visual data alone.
The global feature vector is derived from the following ``unary'' attributes extracted from each annotation individually:
\begin{enumln}
\item Timestamp, represented in POSIX format as a float.
\item GPS latitude and longitude, represented in radians as two floats.
\item Viewpoint classification label, given as a categorical integer ranging from $1$ to $8$.
\item Quality classification label, given as a categorical integer ranging from $1$ to $5$.
\end{enumln}
We gather the GPS and timestamp attributes from image EXIF data, and the viewpoint and quality labels are outputs
of the deep classifiers previously discussed in \cref{subsec:introdataprocess}.
The GPS and timestamp attributes inform the classifier of when it is not possible for two annotations to match
(\eg{} when a pair of annotations is close in time but far apart in space) and of when two annotations were taken
around the same place and time.
Pairs of annotations taken around the same place and time tend to have a higher similarity and are more likely to
contain photobombs and scenery matches.
The viewpoint and quality attributes should help the classifier predict when pairs of annotations are not
comparable --- forcing there to be stronger evidence to form a match, such as strong correspondences on a face
turned toward the camera in both a left and right side view.
An example illustrating such a case --- where two annotations with different viewpoints are a positive match ---
is illustrated in \cref{fig:LeftRightFace}.
\LeftRightFace{}
These four ``unary'' attributes are gathered for each annotation.
Thus, for each attribute we have two measurements, one for each annotation.
However, we do not use them directly because the ordering of the annotations in each pair is arbitrary.
For each unary attribute, we either ignore it (as in the case of GPS and time) or record the minimum of the two
values in one feature dimension and the maximum in another (as is done with viewpoint and quality).
This results in $4$ unary measurements, $2$ for viewpoint and $2$ for quality.
These measurements are appended to the front of the global feature vector.
The remaining dimensions of the global feature vector are constructed by encoding relationships between pairs of
unary attributes using distance measurements.
In the case of GPS coordinates we use the haversine distance (as detailed in \cref{app:occurgroup}).
In the case of viewpoint we use a cyclic absolute difference --- \ie{} the distance between viewpoints $v_1$ and
$v_2$ is $\min(|v_1 - v_2|, 8 - |v_1 - v_2|)$.
For quality and time we simply use the absolute difference between their values.
This results in $4$ pairwise measurements, one for each global attribute.
Lastly, we include the ``speed'' of the pair, which is the GPS-distance divided by the time-delta.
Thus, the total number of global measurements is $4 + 4 + 1 = 9$.
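For illustration, a minimal sketch of assembling these nine global measurements is shown below; the annotation field names, the Earth radius used for the haversine distance, and the handling of a zero time delta are assumptions made for this sketch rather than details of the actual implementation.
\begin{minted}{python}
import math

import numpy as np


def global_feature_vector(a1, a2):
    """Sketch of the 9 global measurements for an annotation pair.

    ``a1`` and ``a2`` are assumed to be dicts with optional keys
    'time' (POSIX float), 'lat'/'lon' (radians), 'view' (1-8), 'qual' (1-5);
    missing attributes are represented as ``np.nan``.
    """
    def haversine(lat1, lon1, lat2, lon2, radius_km=6367.0):
        # Great-circle distance between two points given in radians.
        a = (math.sin((lat2 - lat1) / 2) ** 2 +
             math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * radius_km * math.asin(math.sqrt(a))

    # Unary measurements: min/max of viewpoint and quality (order independent).
    view_min, view_max = np.sort([a1['view'], a2['view']])
    qual_min, qual_max = np.sort([a1['qual'], a2['qual']])

    # Pairwise measurements: one distance per global attribute.
    time_delta = abs(a1['time'] - a2['time'])
    gps_delta = haversine(a1['lat'], a1['lon'], a2['lat'], a2['lon'])
    dv = abs(a1['view'] - a2['view'])
    view_delta = min(dv, 8 - dv)          # cyclic difference over 8 viewpoints
    qual_delta = abs(a1['qual'] - a2['qual'])

    # "Speed" of the pair: GPS distance divided by the time delta.
    speed = gps_delta / time_delta if time_delta > 0 else np.nan

    return np.array([view_min, view_max, qual_min, qual_max,
                     time_delta, gps_delta, view_delta, qual_delta, speed])
\end{minted}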
In the event that an attribute is not provided or not known (\eg{} the EXIF data is missing), then a measurement
cannot be made, so a \nan{} value is recorded instead.
To apply random forests learning, these \nan{} values must be handled by either modifying the learning algorithm
or replacing them with a number.
Ding and Simonoff investigate several methods for handling missing data in~\cite{ding_investigation_2010}, and
they conclude that the best choice is application dependent.
For our application we choose the ``separate class'' method because their experiments demonstrate that it
performs the best when \nan{} values are in both the training and testing data, which is the case for our data.
In addition to being the best fit for our application, the separate class method is simple.
The idea is to replace all \nan{} measurements with either an extremely large or extremely small number.
The choice of large or small is made independently at each node in the decision tree, depending on which choice
is best.
In this way the \nan{} values are essentially treated as a separate category because a test can always be chosen
that separates the measured and unmeasured data.
This has several effects.
In the case that a \nan{} measurement in a feature dimension is informative (\eg{} if images without timestamps
are less likely to match other annotations), the random forest can take advantage of that dimension.
However, in the more likely case that the same \nan{} measurement is uninformative, the dimension can still be
used, but it is penalized proportional to the fraction of samples where it takes a \nan{} value.
This captures the idea that a feature dimension is less likely to be informative if it cannot be measured
reliably.
However, if that feature dimension is highly informative for samples where it has a numeric value, then a node in
a decision tree can still make use of it, and the samples with \nan{} values can be split by nodes deeper in the
tree.
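The separate-class method itself requires the tree learner to try placing the missing samples on both sides of each candidate split. As a rough approximation with an off-the-shelf learner, one can duplicate every incomplete feature dimension and impute it once with a very large and once with a very small sentinel, so that any node can route the \nan{} samples to either side of its split; the column duplication and the sentinel value in the sketch below are illustrative choices, not part of the described method.
\begin{minted}{python}
import numpy as np


def separate_class_expand(X, big=1e9):
    """Approximate "separate class" handling of nan values for a standard
    decision-tree learner: each column containing nan is duplicated and
    imputed once with a very large and once with a very small sentinel.
    """
    cols = []
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if not missing.any():
            cols.append(col)
            continue
        cols.append(np.where(missing, +big, col))   # nan pushed far right
        cols.append(np.where(missing, -big, col))   # nan pushed far left
    return np.column_stack(cols)
\end{minted}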
\subsection{The local feature vector}
The local feature vector distills two orderless bag-of-features representations into a fixed length vector
containing matching information about a pair of annotations.
Three basic steps are needed to construct the local feature vector.
First we determine feature correspondences between the two annotations.
Then for each correspondence we make several measurements (\eg{} descriptor distance and spatial position).
Finally, we aggregate these measurements over all correspondences using summary statistics (\eg{} mean, sum,
std).
Later we augment this basic scheme by constructing multiple sets of feature correspondences.
Thus, the total length of the feature vector is the number of measurements times the number of summary statistics
times the number of ways feature correspondences are established.
\subsubsection{Establishing feature correspondences}
To determine feature correspondences between two annotations, $\qaid$ and $\daid$, we use what we refer to as a
one-vs-one matching algorithm.
Each annotation's descriptors are indexed for fast nearest neighbor search~\cite{muja_fast_2009}.
Keypoint correspondences are formed by searching for the reciprocal nearest neighbors between annotation
descriptors~\cite{qin_hello_2011}.
For each feature in each correspondence, the next nearest neighbor is used as a normalizer for Lowe's ratio
test~\cite{lowe_distinctive_2004}.
Because matching is symmetric, each feature correspondence is associated with two normalizing neighbors.
The feature / normalizer pair with the minimum descriptor distance is used as a normalizing pair.
If the ratio of the descriptor distance between a correspondence to the distance between the normalizing pair is
above a threshold, the correspondence is regarded as non-distinct and removed.
For the simplicity of the description we consider just one ratio threshold for now, but later we will describe
this process using multiple thresholds.
Spatial verification~\cite{philbin_object_2007} is applied to further refine the correspondences.
This one-vs-one matching algorithm results in a richer set of correspondences between annotations than would be
found using the LNBNN ranking algorithm.
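A simplified sketch of this one-vs-one matching step is given below, using scikit-learn's nearest-neighbor index in place of the FLANN index used in practice and omitting spatial verification.
\begin{minted}{python}
from sklearn.neighbors import NearestNeighbors


def one_vs_one_matches(desc_q, desc_d, ratio_thresh=0.8, norm_rank=1):
    """Sketch: reciprocal nearest-neighbor correspondences plus a ratio test.

    ``desc_q`` and ``desc_d`` are (n, 128) descriptor arrays for the query and
    database annotation.  ``norm_rank`` selects which neighbor acts as the
    normalizer (1 = second nearest; 2 when the orientation heuristic triples
    the descriptors).  Spatial verification is omitted from this sketch.
    """
    k = norm_rank + 1
    nn_d = NearestNeighbors(n_neighbors=k).fit(desc_d)
    nn_q = NearestNeighbors(n_neighbors=k).fit(desc_q)
    dist_qd, idx_qd = nn_d.kneighbors(desc_q)   # query -> database
    dist_dq, idx_dq = nn_q.kneighbors(desc_d)   # database -> query

    matches = []
    for i in range(len(desc_q)):
        j = idx_qd[i, 0]
        if idx_dq[j, 0] != i:                   # keep only mutual neighbors
            continue
        match_dist = dist_qd[i, 0]
        # Both sides supply a normalizer; the feature / normalizer pair with
        # the smaller descriptor distance is used in the ratio test.
        norm_dist = min(dist_qd[i, norm_rank], dist_dq[j, norm_rank])
        if norm_dist > 0 and match_dist / norm_dist < ratio_thresh:
            matches.append((i, j, match_dist, norm_dist))
    return matches
\end{minted}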
\subsubsection{Local measurements}
After the one-vs-one matching stage, several measurements are made for each feature correspondence.
Before describing these measurements, it will be useful to set up some notation.
Given two annotations $\qaid$ and $\daid$, consider a feature correspondence between two features $i$ and $j$
with descriptors $\desc_i$ and $\desc_j$.
Let $\descnorm_i$ be the normalizer for $i$, and let $\descnorm_j$ be the normalizer for $j$.
Note that while $i$ is from $\qaid$, its normalizer, $\descnorm_i$, is a descriptor from $\daid$.
The converse is true for $j$.
Let $c \in \curly{i, j}$ indicate which feature / normalizer pair is used in the ratio test. %
Thus, $c = \argmin{c \in \curly{i, j}}{\elltwo{\desc_c - \descnorm_c}}$.
Given these definitions, the measurements we consider are:
\begin{itemln}
\item Foregroundness score:
This is the geometric mean of the features' foregroundness measures, $\sqrt{w_i w_j}$.
This adds $1$ measurement, denoted as $\tt{fgweight}$, for each correspondence.
\item Correspondence distance:
This is the Euclidean distance between the corresponding descriptors, $\elltwo{\desc_i - \desc_j} / Z$.
This serves as a measure of visual similarity between the features.
(Recall $Z\tighteq\sqrt{2}$ for SIFT descriptors).
This adds $1$ measurement, denoted as $\tt{match\_dist}$, for each correspondence.
\item Normalizer distance:
This is the distance between a matching descriptor and the normalizing descriptor, %
$\elltwo{\desc_c - \descnorm_c} / Z$.
This serves as a measure of visual distinctiveness of the features.
We also include a weighted version of this measurement by multiplying it with the foregroundness score.
This adds $2$ measurements, denoted as $\tt{norm\_dist}$ and $\tt{weighted\_norm\_dist}$, for each
correspondence.
\item Ratio score:
This is one minus the ratio of the correspondence distance to the normalizer distance, %
$1 - \elltwo{\desc_i - \desc_j} / \elltwo{\desc_c - \descnorm_c}$.
This combines both similarity and distinctiveness.
Note that this is one minus the measure used to filter correspondences in the ratio test.
We perform this subtraction in order to obtain a score that varies directly with distinctiveness --- \ie{}
the ratio score can increase by decreasing the visual difference or increasing the distinctiveness.
We also include a weighted version of this measurement by multiplying it with the foregroundness score.
This adds $2$ measurements, denoted as $\tt{ratio\_score}$ and $\tt{weighted\_ratio\_score}$, for each
correspondence.
\item Spatial verification error:
This is the error in location, scale, and orientation, $( %
\Delta \pt_{i, j}, %
\Delta \scale_{i, j}, %
\Delta \ori_{i, j})$, as measured using~\cref{eqn:inlierdelta}.
This adds $3$ measurements, denoted as $\tt{sver\_err\_xy}$, $\tt{sver\_err\_scale}$, and
$\tt{sver\_err\_ori}$, for each correspondence.
These measurements carry information about the alignment between the feature correspondences.
\item Keypoint relative locations:
These are the xy-locations of the keypoints divided by the width and height of the annotation %
$x_i / w_{\qaid}, y_i / h_{\qaid}$ and $x_j / w_{\daid}, y_j / h_{\daid}$.
These measurements carry information about the spatial distribution of the feature correspondences.
This will be useful for determining if a pair of annotations is a photobomb because often the
photobombing animal is not in the center of the annotation.
This adds $4$ measurements, denoted as $\tt{norm\_x1}$, $\tt{norm\_y1}$, $\tt{norm\_x2}$, and
$\tt{norm\_y2}$, for each correspondence.
Note that unlike the global quality and viewpoint measures, we do not make an effort to account for the
arbitrary ordering of annotations when recording these local features.
This is to preserve the association between the spatial dimensions of each annotation.
The same is true for the next feature.
\item Keypoint scales:
These are the keypoint scale parameters $\sigma_i$ and $\sigma_j$, as measured
using~\cref{eqn:affinewarp}.
The scales indicate the size of each keypoint with respect to its annotation.
This may be useful for disregarding matches between large coarsely described features that do not carry a
significant amount of individually distinctive information.
This adds $2$ measurements, denoted as $\tt{scale1}$ and $\tt{scale2}$, for each correspondence.
\end{itemln}
\subsubsection{Summary statistics}
Once these $15$ measurements have been made for each keypoint correspondence, we aggregate them using summary
statistics.
We consider the sum, median, mean, and standard deviation over all correspondences.
%We consider the sum, inverse-sum, mean, and standard deviation over all correspondences.
We have also considered taking values from individual correspondences based on rankings and percentiles with
respect to a particular attribute (\eg{} ratio score), however we found that these did not improve the
performance of our classifiers.
In practice the summary statistics work quite well.
The resulting measurements are stacked to form the local feature vector.
This results in $15 \times 4 = 60$ measurements.
A final step we have found useful is to append an extra dimension simply indicating the total number of feature
correspondences.
So, in total there are $61$ summary statistics computed for a set of feature correspondences.
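A sketch of this aggregation step is given below; it assumes the per-correspondence measurements have already been collected into arrays keyed by the names listed above and that at least one correspondence exists.
\begin{minted}{python}
import numpy as np


def summarize_correspondences(measures):
    """Sketch: collapse per-correspondence measurements into summary statistics.

    ``measures`` is assumed to map each of the 15 measurement names listed
    above (e.g. 'ratio_score', 'norm_dist', 'sver_err_xy', ...) to a 1-d array
    with one value per feature correspondence (at least one correspondence).
    """
    stats = {}
    for name, values in measures.items():
        values = np.asarray(values, dtype=float)
        stats['sum({})'.format(name)] = np.sum(values)
        stats['median({})'.format(name)] = np.median(values)
        stats['mean({})'.format(name)] = np.mean(values)
        stats['std({})'.format(name)] = np.std(values)
    # One extra dimension: the total number of correspondences.
    stats['len(matches)'] = len(next(iter(measures.values())))
    return stats  # 15 measurements x 4 statistics + 1 = 61 values
\end{minted}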
\subsubsection{Multiple ratio thresholds}
As previously noted we establish multiple groups of feature correspondences for different threshold values of the
ratio test.
We do this because we have observed that some positive annotation pairs had all of their correspondences filtered
by the ratio test.
However, when we increase the ratio threshold, the overall classification performance decreases.
Therefore, we include both small and large thresholds to allow the classifier to have access to both types of
information.
Including larger threshold values helps ensure that most pairs generate at least a few correspondences, while
smaller threshold values capture the information in highly distinctive correspondences.
Overall, this softens the impact of the ratio test's binary threshold and adds robustness to viewpoint and pose
variations that may cause correspondences to appear slightly less distinctive.
The details of this process are as follows:
Once we have assigned feature correspondences using symmetric nearest neighbors, we create a group of
correspondences for each ratio value.
The members of each group are all correspondences with a ratio score less than that value.
Each group is then spatially verified, and the union of the groups is the final set of correspondences.
When measuring spatial verification errors, each keypoint may be associated with multiple values.
Therefore, we use the minimum spatial verification error over all values of the ratio threshold.
The local feature vector is constructed by applying summary statistics to each of these groups independently and
then concatenating the results.
Thus, the size of the local feature vector is multiplied by the number of ratio thresholds used.
While using multiple values of the ratio test can further enrich the pairwise local feature vector, there are two
trade-offs that must be taken into account.
First, spatial verification must be run multiple times, which noticeably increases computation time.
Second, the resulting size of the feature vector is larger, which can make learning more difficult due to the
curse of dimensionality.
Therefore, in our implementation we choose to use only two ratio threshold values of $0.625$ and $0.8$.
Thus, the total number of local measurements is $2 \times 61 = 122$.
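Continuing the sketches above, the grouping by ratio threshold and the concatenation of the per-group summaries might look as follows; spatial verification of each group and the minimum-error rule for the spatial verification measurements are omitted, and the correspondences are assumed to have been established with the largest ratio threshold.
\begin{minted}{python}
import numpy as np


def local_feature_vector(matches, ratio_threshs=(0.625, 0.8)):
    """Sketch: one correspondence group per ratio threshold, summarized and
    concatenated.  ``matches`` reuses the (i, j, match_dist, norm_dist)
    tuples from the matching sketch above.
    """
    local = []
    for thresh in ratio_threshs:
        # Members of a group are correspondences passing this ratio threshold.
        group = [m for m in matches if m[2] / m[3] < thresh]
        # Example measurements for this group (a subset of the 15 listed above);
        # empty groups would yield nan statistics in this sketch.
        measures = {
            'ratio_score': np.array([1.0 - m[2] / m[3] for m in group]),
            # Z = sqrt(2) for unit-normalized SIFT descriptors (see above).
            'match_dist': np.array([m[2] for m in group]) / np.sqrt(2),
        }
        local.extend(summarize_correspondences(measures).values())
    return np.array(local)
\end{minted}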
\subsubsection{Additional notes}
We have found that, for some species like plains zebras, it is important to use the keypoint orientation
heuristic described in \cref{sec:annotrepr} when computing one-vs-one matches.
This heuristic causes each keypoint to extract $3$ descriptors instead of $1$.
In this case we should not use the second nearest neighbor as the normalizer for the ratio test, because the
augmented keypoints may have similar descriptors.
We account for this by using the $3\rd{}$ nearest neighbor as the normalizer instead.
% NOTE: Old way.
%Once we have assigned feature correspondences, we filter these correspondences using the ratio test with the
%maximum value of the ratio threshold.
%Note that these threshold points are with respect to the original ratio values, and not the ratio scores used in
%the feature vector.
%Then the remaining feature correspondences are spatially verified as normal.
%At this point, we create a group of correspondences for each value of the ratio threshold.
%The member of each group are all correspondences with a ratio score less than that value.
\FloatBarrier{}
\subsection{The final pairwise feature vector}
The final pairwise feature vector is constructed by concatenating the local and the global vector.
This results in a $131$ dimensional vector containing information that a learning algorithm can use to predict if
a pair of annotations is positive, negative, or incomparable.
Of these dimensions, $9$ are from global measurements and $122$ are from local measurements.
The example in~\cref{fig:PairFeatVec} illustrates part of a final feature vector.
%\PairFeatVec{}
\begin{figure}
\begin{minted}[gobble=4]{python}
OrderedDict([('global(qual_min)', 3),
('global(qual_max)', nan),
('global(qual_delta)', nan),
('global(gps_delta)', 5.79),
('len(matches[ratio < .8])', 20),
('sum(ratio_score[ratio < .8])', 10.05),
('mean(ratio_score[ratio < .8])', 0.50),
('std(ratio_score[ratio < .8])', 0.09)])
\end{minted}
\captext[\caplbl{PairFeatVec}A pairwise feature vector]{
% ---
This is an example of a small pairwise feature vector containing local and global information.
The feature vector we use in practice contains $131$ dimensions.
Note the summary statistics in this example are all computed for correspondences with a ratio value that is less
than $0.8$.
% ---
}
\label{fig:PairFeatVec}
\end{figure}
\section{LEARNING THE MATCH-STATE CLASSIFIER}\label{sec:learnclf}
Having defined the pairwise feature vector there are two remaining steps to constructing the pairwise
classifier.
We must:
(1) choose a probabilistic learning algorithm, and
(2) select a sample of labeled training data.
We have previously stated that we will use the random forest~\cite{breiman_random_2001} as our probabilistic
classifier.
Therefore, we briefly review the details of random forest learning and prediction.
Then we describe the sampling procedure used to generate a dataset of labeled annotation pairs.
\subsection{The random forest learning algorithm}
The random forest learning algorithm~\cite{breiman_random_2001} is well understood, so we only provide a
brief overview.
A random forest is constructed by learning a forest of decision trees.
Learning begins by initializing each decision tree as a single node.
Each root node is assigned a random sample of the training data with replacement, and then a recursive
algorithm is used to grow each root node into a decision tree.
Each iteration of the recursive algorithm is given a leaf node, and will choose a test to split the training
data at the node into two child nodes.
The test is constructed by first choosing a random subset of feature dimensions.
We then choose a dimension and threshold to maximize the decrease in class-label entropy.
Note that when using the ``separate class'' method, the algorithm tests placing samples with missing data on
both the left and right side of the split.
The algorithm is then recursively executed on the right and left node until a leaf is assigned fewer than a
minimum number of training examples.
To select a test for a node, the number of candidate feature dimensions we choose is the square root of the
total number of feature dimensions.
Each decision tree predicts a probability distribution over all classes for a testing example by descending
the tree, choosing left or right based on the test chosen at the node until it reaches a leaf node.
The predicted probabilities are the proportions of training class-labels at that leaf node.
The probability prediction of the random forest is the average of the probabilities predicted by all decision
trees.
We use the random forest implementation provided by Scikit Learn~\cite{pedregosa_scikit_learn_2011}.
To choose hyper-parameters, we perform a grid search.
We find that it works best to use $256$ decision trees, and to stop branch growth once a leaf node contains
$5$ or fewer training samples.
For other parameters we use the implementation defaults.
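For reference, these hyper-parameters translate into a scikit-learn configuration along the following lines; the random seed is an illustrative addition, and \nan{} values are assumed to have been handled beforehand (\eg{} by the separate-class scheme).
\begin{minted}{python}
from sklearn.ensemble import RandomForestClassifier

# 256 trees, sqrt(n_features) candidate dimensions per split, and growth
# stopped once a node holds 5 or fewer training samples.
match_clf = RandomForestClassifier(
    n_estimators=256,
    min_samples_split=6,    # nodes with 5 or fewer samples become leaves
    max_features='sqrt',
    bootstrap=True,
    random_state=0,         # illustrative; not specified in the text
)

# X_train: (n_pairs, 131) pairwise feature vectors with nan values already
# handled; y_train: match-state labels.
# match_clf.fit(X_train, y_train)
# probs = match_clf.predict_proba(X_test)   # rows sum to 1 per pair
\end{minted}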
\subsection{Sampling a labeled dataset of annotation pairs}
Now that we have chosen a learning algorithm, the last remaining step is the selection of training data and
generation of labels.
Recall that the purpose of our classifier is to output a probability distribution over three labels:
positive, negative and incomparable.
Given a pair of annotations we need to assign one of these three labels using ground truth data.
In recent versions of our system, this ground truth label is stored along with each unordered pair of
annotations that has been manually reviewed, but because this is a new feature, only a few pairs have been
assigned an explicit three-state label.
Therefore, we must make use of the name and viewpoint labels assigned to each annotation by previous versions
of the system.
This allows us to determine if an annotation pair shows the same or different animals, but it does not allow
us to determine if the pair is comparable.
To account for this we use heuristics to assign the incomparable label using the viewpoints, and if either
annotation is not assigned a viewpoint it is assumed that they are comparable because most images in our
datasets are taken from a consistent viewpoint (\ie{} collection events were designed to reduce
incomparability).
Given a pair of annotations, a training label is assigned as follows.
First, if an explicit three-state label exists, return it.
Otherwise, we must heuristically decide if the pair is comparable based on viewpoint information.
If the heuristics determine that a pair is not comparable, then return incomparable.
In all other cases return positive if the annotations share a name label and negative if they do not.
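A sketch of this label-assignment rule is given below; the lookup containers and the comparability predicate are hypothetical stand-ins for the corresponding database queries and viewpoint heuristics.
\begin{minted}{python}
def training_label(pair, explicit_labels, names, comparable):
    """Sketch of the ground-truth label assignment described above.

    ``explicit_labels`` maps an unordered annotation pair to a manually
    reviewed three-state label when one exists; ``names`` maps an annotation
    to its name label; ``comparable(a1, a2)`` encapsulates the viewpoint
    heuristics (returning True when viewpoints are unknown).
    """
    a1, a2 = pair
    key = frozenset(pair)
    # 1) An explicit manual review always wins.
    if key in explicit_labels:
        return explicit_labels[key]
    # 2) Otherwise fall back on the viewpoint heuristics.
    if not comparable(a1, a2):
        return 'incomparable'
    # 3) Finally, use the name labels.
    return 'positive' if names[a1] == names[a2] else 'negative'
\end{minted}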
In order to select pairs from our ground truth dataset, we sample representative pairs of annotations guided
by the principle of selecting examples that exhibit a range of difficulties~\cite{shi_embedding_2016} (\eg
hard-negatives and moderate positives).
We use the LNBNN ranking algorithm to estimate how easy or difficult it might be to predict the match-state
of a pair.
Pairs with higher LNBNN scores will be easier to classify for positive examples and will be more difficult
for negative examples, and lower scores will have the reverse effect.
Specifically, to sample a dataset for learning, we first rank the database for each query image using the
ranking algorithm.
We partition the ranked lists into two parts:
a list of correct matches and a list of incorrect matches.
We select annotations randomly from the top, middle, and bottom of each list.
For positive examples we select $4$ from the top, $2$ from the middle, $2$ from the bottom, and $2$ randomly.
For negative examples we select $3$ from the top, $2$ from the middle, $1$ from the bottom, and $2$ randomly.
If there are not enough examples to do this, then all are taken.
We include all pairs explicitly labeled as incomparable because there are only a few such examples.
If this was not the case, then we would include an additional partition for incomparable cases.
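The sampling for one partition of a ranked list might be sketched as follows; the division of the list into thirds and the random seed are illustrative assumptions, since the text only specifies how many examples are drawn from the top, middle, bottom, and at random.
\begin{minted}{python}
import random


def sample_from_ranked_list(ranked, counts=(4, 2, 2, 2), seed=0):
    """Sketch: sample from the top, middle, and bottom of a ranked list plus a
    few fully at random.  ``counts`` gives the number taken from each region;
    the text uses (4, 2, 2, 2) for positives and (3, 2, 1, 2) for negatives.
    If there are not enough examples, everything is taken.
    """
    if len(ranked) <= sum(counts):
        return list(ranked)
    rng = random.Random(seed)
    third = len(ranked) // 3
    regions = [ranked[:third], ranked[third:2 * third], ranked[2 * third:], ranked]
    chosen = []
    for region, k in zip(regions, counts):
        chosen.extend(rng.sample(region, min(k, len(region))))
    # Remove duplicates while preserving order (annotation ids are hashable).
    return list(dict.fromkeys(chosen))
\end{minted}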
\section{SECONDARY CLASSIFIER TO ADDRESS PHOTOBOMBS}\label{sec:learnpb}
It is useful to augment the primary match-state pairwise classifier with a secondary classifier able to
determine if a pair of annotations contains information that might confuse the main classifier.
These confusing annotation pairs should not be considered for automatic review.
One of the most challenging of these secondary states is one that we refer to as a {photobomb}.
A pair of annotations is a photobomb if a secondary animal in one annotation matches an animal in another
annotation (\eg see~\cref{fig:PhotobombExample}).
Only the primary animal in each annotation should be used to determine identity, but photobombs provide
strong evidence of matching that can confuse a verification algorithm.
\PhotobombExample{}
\FloatBarrier{}
During events like the \GZC{} we labeled several annotation pairs as photobombs.
Using these labels we will construct a classifier in the same way that we constructed the primary match-state
classifier.
We start with the same set of training data used to learn the primary classifier.
Because we only have a small set of explicitly labeled photobomb pairs, we include all such pairs in the
training dataset.
Any pair in this set that is explicitly labeled as a photobomb is given that label, otherwise it is labeled
as not a photobomb.
\section{PAIRWISE CLASSIFICATION EXPERIMENTS}\label{sec:pairexpt}
We evaluate the pairwise classifiers on two datasets, one of plains zebras and another of Grévy's zebras
with $5720$ and $2283$ annotations, respectively.
The details of these datasets were previously described in \cref{sub:datasets}.
To evaluate our pairwise classifier, we choose a sample of annotation pairs from these datasets as detailed
in \cref{sec:learnclf}.
This results in $47312$ pairs for plains zebras and $18010$ pairs for Grévy's zebras.
The number of pairs per class is detailed in \cref{tbl:PairDBStats}.
Note that our datasets only contain a small number of labeled incomparable and photobomb cases.
%This is because these classes must be explicitly labeled between pairs of annotations, whereas positive and
% negative labels can be inferred depending on if a pair of annotation shares a name label.
For plains zebras only $53$ incomparable cases were explicitly labeled, while the other $300$ were generated
using heuristics.
For Grévy's zebras, there are no incomparable cases because all annotations have a right-side viewpoint.
Therefore, the primary focus of our experiments will be separating positive from negative matching states.
\PairDBStats{}
After sampling, we have a set of annotation pairs and each is associated with a ground truth matching state
label of either positive, negative, or incomparable.
Additionally, each pair is also labeled as either a photobomb or not a photobomb.
For each pair we construct a pairwise feature vector as described in \cref{sec:pairfeat}.
We split this dataset of labeled annotation pairs into multiple disjoint training and testing sets using
grouped stratified $k$-fold cross validation (we use $3$ folds).
Note that this grouping introduces a slight variation on standard stratified $k$-fold cross validation.
First, we enforce that all samples (pairs of annotations) within the same name or between the same two names are
placed in the same group.
Then, the cross validation is constrained such that all samples in a group are either in the training or
testing set for each split.
In other words, this means that a specific individual cannot appear in both the training and testing set.
The same is true for specific pairs of individuals.
By grouping our cross-validation folds in this way, we ensure that the classifier cannot exploit individual
specific information to improve its predictions on the test set.
This increases our confidence that our results will generalize to new individuals.
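A sketch of this grouped cross validation using scikit-learn is shown below; the group key is the unordered pair of name labels, \texttt{StratifiedGroupKFold} requires a recent scikit-learn release, and a plain \texttt{GroupKFold} would provide the grouping without stratification.
\begin{minted}{python}
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold


def grouped_folds(pairs, names, y, n_splits=3, seed=0):
    """Sketch: grouped stratified k-fold splits for annotation pairs.

    Every sample within the same name or between the same two names shares a
    group, so whole groups stay on one side of each split.  ``names`` maps an
    annotation to its name label (a hypothetical lookup).
    """
    keys = ['|'.join(sorted((str(names[a]), str(names[b])))) for a, b in pairs]
    group_ids = np.unique(keys, return_inverse=True)[1]
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    X_placeholder = np.zeros((len(pairs), 1))   # only y and groups drive the split
    return list(cv.split(X_placeholder, y, groups=group_ids))
\end{minted}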
For each cross validation split, we train the matching state and photobomb-state classifier on the training
set and then predict probabilities on each sample in the testing set.
Because the cross validation is $k$-fold and the splits are disjoint, each sample appears in a testing set
exactly once.
This means that each sample in our dataset is assigned match-state and photobomb-state probabilities exactly
once.
Because each prediction is made using classifiers trained on disjoint data, the predictions will be unbiased.
Thus, each sample in the dataset has a match-state and photobomb-state probability.
Therefore, we can use all sample pairs to evaluate the performance of each classifier.
In our experiments we compare these predicted match-state probabilities to the scores generated by LNBNN.
In order to do this, we must generate an LNBNN score for each pair in our sample.
This is done by first taking the unique set of annotations in the sample of pairs.
Then, we use these annotations as a database.
We issue each annotation as a query, taking care to ignore self-matches.
Any (undirected) pair in our dataset that appears as a query / database pair in the ranked list is assigned
that LNBNN score.
If the same pair is in two ranked lists, then the maximum of the two scores is used.
Any pair that does not appear in any ranked list (because LNBNN failed to return it) is implicitly given a
score of zero.
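A sketch of this scoring rule is given below, assuming the ranked lists are available as a mapping from each query annotation to its scored results.
\begin{minted}{python}
def lnbnn_pair_scores(ranked_lists, pairs):
    """Sketch: assign each undirected sample pair the maximum LNBNN score it
    received in any ranked list, and zero if it never appeared.

    ``ranked_lists`` maps a query annotation to a list of
    (database annotation, LNBNN score) results; ``pairs`` is the set of
    sampled annotation pairs.
    """
    scores = {frozenset(p): 0.0 for p in pairs}
    for query, results in ranked_lists.items():
        for daid, score in results:
            if daid == query:
                continue                    # ignore self-matches
            key = frozenset((query, daid))
            if key in scores:
                scores[key] = max(scores[key], score)
    return scores
\end{minted}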
Note that the scores from the LNBNN ranking algorithm can only be used to distinguish positive cases from
non-positive pairs.
Unlike the match-state classifier, the LNBNN scores cannot be used to distinguish non-positive cases as
either negative or incomparable.
Later in \cref{chap:graphid}, it will be vitally important to distinguish these cases, but for now, in order
to fairly compare the two algorithms, we only consider positive probabilities from the match-state
classifier.
We have now predicted match-state probabilities, photobomb-state probabilities, and one-vs-many LNBNN scores
for each pair in our dataset.
In the next subsections we will analyze the predictions of each classifier.
For both the match-state and photobomb-state classifiers we will measure the raw number of classification
errors and successes in a confusion matrix.
We will then use standard classification metrics like precision, recall, and the Matthews correlation
coefficient to summarize these confusion matrices.
We will also inspect the importance of each feature dimension of our pairwise feature vector as measured
during random forest learning.
For each classifier we will present several examples of failure cases to illustrate where improvements can be
made.
Additionally, we will compare the match-state classifier to LNBNN in two ways.
First, we will compare the original LNBNN ranking against a re-ranking using the positive probabilities.
Second, we will compare the ability of LNBNN and the match-state classifier to predict if a pair is positive
or not by looking at the distribution of positive and non-positive scores as well as the ROC curves.
\FloatBarrier{}
\subsection{Evaluating the match-state classifier}
The primary classifier predicts the matching state (positive, negative, incomparable) of a pair of
annotations.
Each pair of annotations is assigned a probability for each of these states, and those probabilities sum
to one.
In this context we classify a pair as the state with the maximum predicted probability.
However, in practice we will choose thresholds for automatic classification where the false positive rate
is sufficiently low.
From these multiclass classifications we build a confusion matrix for plains and Grévy's zebras.
These confusion matrices are shown in \cref{tbl:ConfusionMatch}.
In \cref{tbl:EvalMetricsMatch} we summarize the information in each confusion matrix by computing the
precision, recall, and Matthews correlation coefficient (MCC)~\cite{powers_evaluation_2011} for each
class.
The MCC provides a measurement of overall multiclass classification performance not biased by the number
of samples contained in each class.
An MCC ranges from $+1$, indicating perfect predictions, to $-1$, indicating pathological inverse
predictions, and $0$ represents random predictions.
The measured MCC of $0.83$ for plains zebras and $0.91$ for Grévy's zebras, indicates that our
match-state classifiers have strong predictive power.
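As a point of reference, the confusion matrix and the MCC can be computed directly with scikit-learn, which implements the multiclass generalization of the MCC; the labels below are a toy illustration, not the experimental data.
\begin{minted}{python}
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy illustration with hypothetical labels.
y_true = ['positive', 'positive', 'negative', 'negative', 'incomparable']
y_pred = ['positive', 'negative', 'negative', 'negative', 'incomparable']

states = ['positive', 'negative', 'incomparable']
cm = confusion_matrix(y_true, y_pred, labels=states)  # rows: truth, cols: prediction
mcc = matthews_corrcoef(y_true, y_pred)               # +1 perfect, 0 random, -1 inverted
\end{minted}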
\ConfusionMatch{}
\EvalMetricsMatch{}
\FloatBarrier{}
In addition to being strong predictors of match state, we show that the positive probabilities from our
classifiers can be used to re-rank the ranked list produced by LNBNN.
Using the procedure from our experiments in~\cref{sec:rankexpt}, we issue each annotation in our testing
set as a query and obtain a ranked list.
Using the fraction of correct results found at each rank, we construct a cumulative match characteristic
(CMC) curve~\cite{decann_relating_2013}.
We denote this CMC curve as \pvar{ranking}.
Then we take each ranked list and compute match-state probabilities for each top ranked pair of
query/database annotations.
We use the positive probabilities to re-rank the lists, and then we construct another CMC curve
corresponding to these new ranks.
This re-ranked CMC curve is denoted as \pvar{rank+clf}.
The results of this experiment --- illustrated in~\cref{fig:ReRank} --- clearly demonstrate that the
number of correct matches returned at rank $1$ is improved by re-ranking with our pairwise classifier.
\ReRank{}
\FloatBarrier{}
\subsubsection{Binary positive classification}
Although the accuracy of multiclass predictions is important, the most important task in animal
identification is to determine when a pair of annotations is positive and when it is not.
Therefore, we design an experiment that tests how well our match-state classifier can distinguish
positive pairs from other cases.
We will compare our learned classifiers to a baseline method that simply uses the LNBNN scores.
%Ideally our learned classifiers will result in superior separation when compared to using just the LNBNN
% scores.
We begin this comparison by illustrating histograms of LNBNN scores and positive pairwise probabilities
for positive and non-positive cases in \cref{fig:PositiveHist}.
The advantages of the pairwise scores are immediately noticeable.
The pairwise scores provide a superior separation between positive and non-positive cases.
Furthermore, the pairwise scores range from zero to one, which makes them easier to interpret than the
unbounded LNBNN scores.
We can make a more direct and precise comparison by considering both LNBNN scores and positive pairwise
probabilities as binary classifiers.
In this context, we can measure the separability of each method using the area under an ROC curve.
The ROC curves comparing the LNBNN scores and the learned probabilities are illustrated
in~\cref{fig:PositiveROC}.
In all cases the learned AUC is significantly better than the AUC of LNBNN.
For plains zebras, the pairwise AUC is $0.97$ and the LNBNN AUC is $0.82$.
For Grévy's zebras the pairwise AUC is $0.99$, and the LNBNN AUC is $0.89$.
These experiments clearly demonstrate that the pairwise classifier outperforms LNBNN in this
classification task.
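A sketch of this comparison is given below; the array names are hypothetical, and the positive class is encoded as $1$ against all non-positive pairs.
\begin{minted}{python}
from sklearn.metrics import roc_auc_score, roc_curve


def compare_separability(y_binary, p_positive, s_lnbnn):
    """Sketch: compare positive probabilities and LNBNN scores as binary
    classifiers for the same pairs via ROC analysis.
    """
    auc_clf = roc_auc_score(y_binary, p_positive)
    auc_lnbnn = roc_auc_score(y_binary, s_lnbnn)
    # An operating point can then be read off the classifier's ROC curve,
    # e.g. the threshold whose false positive rate stays below a target.
    fpr, tpr, thresholds = roc_curve(y_binary, p_positive)
    return auc_clf, auc_lnbnn, (fpr, tpr, thresholds)
\end{minted}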
In addition to illustrating the effectiveness of our classifiers when classifications are made for all
samples, the ROC curves --- which plot the true positive rate and the false positive rate as a function
of a threshold --- demonstrate that an operating point can be chosen to automatically classify a
significant number of positive pairs while making very few mistakes.
For plains zebras, a true positive rate of $0.25$ results in $4146$ true positives and only $38$ false
positives.
For Grévy's, an operating point with a true positive rate of $0.5$, results in $2501$ true positives and
only $28$ false positives.
Therefore, by carefully selecting an operating point we can bypass a significant number of manual reviews
without making a significant number of mistakes when adding new annotations to our dataset.
\PositiveHist{}
\PositiveROC{}
\FloatBarrier{}
\subsubsection{Feature importance}
To better understand which feature dimensions are the most useful for classification, we measure the
``mean decrease impurity'' (MDI)~\cite{louppe_understanding_2014} of each feature dimension.
The MDI is a measure of feature importance that is computed during the training phase.
As each decision tree is grown, for each node, we record the number of training samples of each class
that reach it.
This is used to compute the impurity of each node, \ie{} the entropy of class labels.
Each node is weighted using the fraction of total samples that reach it.
The weighted impurity decrease of a node is its weighted impurity minus the weighted sum of its
children's impurity.
The MDI for a single feature dimension in a single tree is computed as the weighted impurity decrease of
all nodes using that feature.
The overall MDI for the forest is obtained by averaging over all trees.
Using the MDI we ``prune'' the dimensions of our sample feature vectors by removing the least important
dimensions.
We measure the effect of pruning on classification accuracy using a greedy algorithm.
First we learn a random forest on a training set and compute the MCC on a test set.
Then we find the feature dimension with the lowest MDI and remove it from the dataset.
We repeat these two steps until there is only a single feature dimension remaining.
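A sketch of this greedy pruning loop is given below, using the \texttt{feature\_importances\_} attribute of a fitted scikit-learn forest as the MDI.
\begin{minted}{python}
import numpy as np
from sklearn.base import clone
from sklearn.metrics import matthews_corrcoef


def greedy_mdi_prune(clf, X_train, y_train, X_test, y_test):
    """Sketch: repeatedly drop the dimension with the lowest MDI importance
    and record the test-set MCC after each removal.
    """
    keep = list(range(X_train.shape[1]))
    history = []
    while keep:
        model = clone(clf).fit(X_train[:, keep], y_train)
        mcc = matthews_corrcoef(y_test, model.predict(X_test[:, keep]))
        history.append((len(keep), mcc))
        if len(keep) == 1:
            break
        # feature_importances_ holds the MDI of each remaining dimension.
        worst = int(np.argmin(model.feature_importances_))
        del keep[worst]
    return history
\end{minted}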
The impact of pruning on classification accuracy is illustrated in~\cref{fig:MatchPrune}, where we plot
the MCC as a function of the number of feature dimensions remaining.
The results of this experiment indicate that there is a small increase in classification accuracy from
pruning feature dimensions.
We find that reducing the number of feature dimensions to $25$ increases the MCC by $0.0155$ for plains
zebras, and using $61$ dimensions increases the MCC by $0.0048$ for Grévy's.
Note that once the number of feature dimensions falls below ${\sim}20$ the performance starts to degrade
and suffers a harsh drop at ${\sim}10$ dimensions.
This suggests that these top features are the most important to making a match-state prediction.
The numeric importance of these top $10$ pruned feature dimensions is reported in
\cref{tbl:ImportantMatchFeatPrune}.
For both species we find that the most important features are statistics involving the ratio score.
These statistics indicate the distinctiveness and similarity of a pair of annotations.
Statistics about the spatial verification error, which signifies when the matches are not well aligned,
are also important for both species.
For plains zebras, the global viewpoint is important because the dataset contains incomparable examples.
Even though we have shown that a small improvement can be made by pruning to only the most important
feature dimensions, we choose to use the full $131$ dimensions in the remainder of our experiments because
the overall performance gain is small.
\MatchPrune{}
\ImportantMatchFeatPrune{}
\FloatBarrier{}
\subsubsection{Failure cases}
Lastly we investigate several causes of failure.
The primary reasons that cause the match-state classifier to fail are similar to those that create
difficulty for the ranking algorithm.
These were previously discussed in~\cref{sub:exptfail}.
Factors such as viewpoint, quality, and occlusion are inherently challenging because they reduce the
similarity between matching descriptors, and increase the disparity between the annotations.
This makes it difficult to correctly establish and spatially verify feature correspondences.
Photobombing animals and scenery matches also pose a challenge to the match-state classifier, often
causing it to produce ambiguous probabilities.
We separate failure cases into three main categories:
(1) failure to classify a pair as positive,
(2) failure to classify a pair as negative, and
(3) failure to classify a pair as incomparable.
The examples in~\cref{fig:PairFailPN} illustrate the first and most important failure case category.
Due to occlusion, quality, and viewpoint, the established correspondences were not distinctive enough for
the classifier to confidently predict positive.
However, in all but the last case the probabilities are ambiguous, which implies that improvements could
be made to fix these failures.
In the last case, only a few distinctive matches were made, which suggests that the procedure that establishes
feature correspondences could be improved.
Examples of the second case are illustrated in~\cref{fig:PairFailNP}.
These examples were incorrectly classified as positive, even though they are negative.
In two cases this is due to scenery matches and photobombing animals.
In the other cases the pairwise classifier matches similar regions of the animal, but it is unable to key
in on the strong negative evidence provided by different distinctive patterns on corresponding body parts
of the animal.
In the future, algorithms powered by convolutional neural networks may be able to take advantage of this
strong negative evidence.
For the third case, it is not surprising that the examples in~\cref{fig:PairFailIN} were not labeled as
incomparable because only a small amount of incomparable training data was available.
Furthermore, photobombs and scenery matches also seem to cause a problem for incomparable examples.
Note that in the majority of all three types of failure, the examples have non-extreme probabilities
assigned to each state.
This demonstrates that the classifier is not confident in these failed predictions.
%We will take advantage of this in the next chapter.
Interestingly, inspection of the failure cases revealed several labeling errors in the database.
The examples illustrated in ~\cref{fig:MatchLabelErrors} all show pairs of annotations with non-positive
labels that should have been labeled as positive.
Note that the positive probabilities from two of these examples are very close to $1.0$, indicating that
the classifier is confident that these ground truth labels are incorrect.
Furthermore, these success cases were found in the context of noisy ground truth labels, showing that our
classifier is robust to errors in ground truth labels.
\PairFailPN{}
\PairFailNP{}
\PairFailIN{}
\MatchLabelErrors{}
\FloatBarrier{}
%---------
\FloatBarrier{}
\subsection{Evaluating the photobomb-state classifier}
In this subsection we perform a set of experiments --- similar to those used to evaluate the match-state
classifier --- to demonstrate the effectiveness of our photobomb-state classifier.
Because there are only a few photobomb training examples, the random forest is not able to learn strong
probabilities.
In the initial design of these experiments, a pair was classified as a photobomb if its probability was
greater than $0.5$, but we found that this caused the MCC of Grévy's zebras to drop to $0.0$.
However, because this is a binary classification problem we can choose any threshold as an operating
point and classify a pair as a photobomb if its probability is above that threshold and as not a
photobomb otherwise.
Therefore, for each dataset, we choose a photobomb-state probability threshold to maximize the MCC as
illustrated in \cref{fig:PBThreshMCC}.
We classify a pair as a photobomb if its probability is above $0.13$ for plains zebras and $0.17$ for
Grévy's zebras.
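A sketch of this threshold selection is given below; the $101$-point grid over $[0, 1]$ is an illustrative choice.
\begin{minted}{python}
import numpy as np
from sklearn.metrics import matthews_corrcoef


def best_photobomb_threshold(y_true, p_photobomb, grid=np.linspace(0.0, 1.0, 101)):
    """Sketch: scan candidate thresholds on the photobomb probability and keep
    the one whose binary classification maximizes the MCC.
    """
    scores = [matthews_corrcoef(y_true, p_photobomb >= t) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
\end{minted}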
Using these adjusted thresholds, the overall performance is measured using the classification confusion
matrices in \cref{tbl:ConfusionPhotobombII} and the evaluation metrics in \cref{tbl:EvalMetricsPhotobombII}.
\PBThreshMCC{}
\ConfusionPhotobombII{}
\EvalMetricsPhotobombII{}
Due to the small amount of available training data, the performance of the photobomb-state classifier is
weaker than the match-state classifier.
By choosing the appropriate thresholds we achieve an MCC of $0.34$ for plains zebras and $0.40$ for
Grévy's zebras.
These scores indicate that each photobomb-state classifier has weak but significant predictive power.
While these MCCs are not overwhelmingly strong, they do demonstrate that each classifier is able to learn
from only a few labeled training examples, and it seems likely that the MCCs would significantly improve
with more labeled training data.
From these measurements we can conclude that the photobomb-state classifier is learning.
\FloatBarrier{}
Our conclusion that the photobomb-state classifier is learning is supported by the top $10$ most important
features, as measured using the MDI.
It makes intuitive sense that the important features --- illustrated in~\cref{tbl:ImportantPBFeat} ---
would be the ones selected by the random forest learning algorithm.
Because pairs of annotations taken around the same place and time are more likely to be photobombs, each
random forest places the most weight on global feature dimensions such as time delta, GPS delta, and
speed.
The classifiers also make use of the spatial position of the feature correspondences, which can be
indicative of a photobomb (\eg{} when all the matches are in the top left corner of an annotation).
Because the random forest seems to be selecting reasonable features, the main source of weakness is
likely the limited amount of training data.
\ImportantPBFeat{}
We gain further insight into the photobomb-state classifier by considering several failure case
examples, which are illustrated in \cref{fig:PBFailures}.
In each example we observe that the classifier is either confused or it does not have enough information
to label a pair as a photobomb.
Furthermore, the classifier was able to find several new photobomb pairs that were mislabeled in the
database.
We illustrate several of these mislabeled cases in~\cref{fig:PBLabelErrors}.
Notice that many of the probabilities predicted for these cases are well above the thresholds.
Predicting high probabilities for these undiscovered photobomb cases indicates that the photobomb-state
classifier is stronger than our measurements suggest.
These mislabeled ground truth cases also help explain why the predicted probabilities are so low; the
learning algorithm is encouraged to incorrectly classify them.
Reviewing and relabeling all of these cases would improve results by both removing noise from the
training set and increasing the number of labeled examples.
\PBFailures{}
\PBLabelErrors{}
\FloatBarrier{}
\subsection{Classifier experiment conclusions}
In these experiments we have demonstrated that our pairwise match-state classifier is able to reliably
separate positive from negative and incomparable cases.
When making automatic decisions, these probabilities have several advantages over the LNBNN scores
from~\cref{chap:ranking}.
Not only do they have more predictive power, they are interpretable, and they always range between $0$
and $1$.
We will use these classifiers to make automatic decisions about pairs of annotations where the
match-state probability is above a threshold.
Our experiments have shown that it is possible to select a threshold where the false positive rate is
sufficiently low.
Furthermore, the classifier is able to improve the ranking algorithm by re-ranking its results.
Lastly, because the classifier predicts probabilities independent of their position in the ranked list,
it can be used to determine when a query individual is new --- \ie{} does not have a correct match in the
database.
The performance of the secondary photobomb classifier is weaker, but this is likely due to a small amount of
training data.
Even in its weak state, it can be used to prevent automatic review of some photobomb cases by adjusting
the classification threshold to be less than $0.5$.
Because the photobomb classifier can only prevent automatic decisions and does not make them, the cost of
including it in our algorithms is small, and by doing so we will increase the amount of labeled training
data from which a stronger photobomb classifier can be bootstrapped.
\section{SUMMARY OF PAIRWISE CLASSIFICATION}\label{sec:pairconclusion}
In this chapter we have constructed a verification mechanism that can predict the probability that a pair of
annotations is positive, negative, or incomparable.
We have also constructed a secondary classifier that can predict when --- namely in the case of photobombs
--- a pair of annotations might confuse the primary match-state classifier.
This was done by constructing a feature vector that contains matching information about a pair of
annotations.
We have constructed a representative training set by selecting hard, moderate, and easy training examples.
We used the random forest learning algorithm to train our classifiers.
Our experiments demonstrate that the match-state classifier is able to strongly separate positive and
negative cases.
The performance of the photobomb-state classifier was weaker, but could likely be improved with more training
data.
Based on our experiments, it is clear that the ranking algorithm is improved by an automatic verifier, but by
themselves ranking and verification are not enough to robustly address animal identification.
There is no mechanism for error recovery, nor is there a mechanism for determining when identification is
complete.
This means the operating point for the automatic review threshold must be set conservatively to avoid any
errors, which results in more work for a human reviewer.
These issues are addressed in \cref{chap:graphid} using a graph-based framework to manage the identification
process.
This framework will use graph connectivity to further reduce the number of required manual reviews, detect
when errors have occurred, recover from the errors, and stop the identification
process in a timely manner.