Institut für Angewandte Informatik
und Formale Beschreibungsverfahren

Estimation of Vehicle Orientations using
Histogram-based Image Descriptors

Master's thesis
by
Steffen Kirres
at the Fakultät für Wirtschaftswissenschaften
in the degree program Informationswirtschaft

submitted on January 14, 2016 to the
Institut für Angewandte Informatik
und Formale Beschreibungsverfahren
of the Karlsruher Institut für Technologie

Reviewer: Prof. Dr. Rudi Studer
Advisor: Prof. Dr. Brahim Chaib-draa

KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association
Declaration of Authorship

I hereby truthfully declare that I have written this thesis and all of its parts independently, that I have fully and precisely indicated all aids used, and that I have marked everything that was taken from the work of others, whether unchanged or modified.

Karlsruhe, January 14, 2016                                    Steffen Kirres
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Organization

2 Related Work
  2.1 Machine Learning
    2.1.1 Learning Styles
    2.1.2 Prediction Tasks
    2.1.3 Algorithms
  2.2 Driver Assistance Systems
  2.3 Orientations of Vehicles
  2.4 Survey on Orientation Estimators for Vehicles

3 Histogram-based Image Descriptors
  3.1 Histograms of Oriented Gradients
    3.1.1 Image Gradients
    3.1.2 Histogram Calculation
  3.2 Histograms of Sparse Codes (HSC)
    3.2.1 Patch Preprocessing
    3.2.2 Dictionary Learning
    3.2.3 Sparse Reconstruction
    3.2.4 Feature Construction
    3.2.5 Dimensionality of HSC

4 Estimation of Vehicle Orientations
  4.1 Problem Formulation
    4.1.1 Regression
    4.1.2 Multi-Class Classification
    4.1.3 Combined Classification and Regression
  4.2 Classifier Modifications
    4.2.1 Weighted Voting
    4.2.2 Specialized Classifiers
    4.2.3 Second Layer Probability Classification
  4.3 Class Imbalance
    4.3.1 Under-sampling Data Points
    4.3.2 Over-sampling Data Points
    4.3.3 Mirroring Images
    4.3.4 Changing the Value Distribution

5 Experiments
  5.1 Datasets
    5.1.1 KITTI Vision Dataset
    5.1.2 Evaluation Metrics
  5.2 Orientation Estimation
    5.2.1 Histogram-based Features
    5.2.2 Regression vs. Classification
    5.2.3 Results for Classifier Modifications
  5.3 Joint Object Detection and Orientation Estimation
    5.3.1 Processing Pipeline
    5.3.2 Submission Results

6 Conclusion & Future Work

References
List of Figures

1 Detected vehicles from forefield image.
2 Pascal3D vehicle pose definition.
3 Gradient visualization.
4 Example image with associated histogram of gradients.
5 Car image with associated HOG per cell.
6 Sliding through cells to create blocks.
7 Patch preprocessing steps.
8 Learned dictionary of frequent patches.
9 Reconstruction with dictionary elements.
10 Sliding through the image to extract image patches.
11 Construction of the HSC feature.
12 Factors deciding the feature length of HSC.
13 Angle representations.
14 Angles divided into 16 classes.
15 Process of combined method.
16 Images per class.
17 Quasi-duplicates of consecutive frames.
18 Image mirroring.
19 Sensors on recording car [GeLU12].
20 Exemplary forefield images.
21 Evaluation functions.
22 Cropping cars using the provided annotations.
23 HOG results.
24 Examples of badly predicted images.
25 Neighborhood of badly predicted image.
26 HSC results with 256 dictionary elements.
27 Dictionary length vs. accuracy vs. time.
28 Comparison of regression results.
29 Tuning the cost parameter for the linear kernel.
30 Comparison of regression and classification.
31 Results of combined classification and regression.
32 Joint detection and orientation estimation.
List of Tables

1 Confusion matrix.
2 Responses of one-vs-one classifiers.
3 Multi-Class approaches.
4 Explanation of dictionary learning variables.
5 Exemplary angles and orientations.
6 Example for weight matrix.
7 Responses of six-class problem.
8 Responses of specialized classifiers.
9 Probability predictions of first layer classifiers.
10 Characteristics of methods to balance classes.
11 Difficulties of the vehicle images.
12 Comparison of HOG and HSC.
13 Comparison of regression and classification methods.
14 Comparison of classifier modifications.
15 Results of trained detector.
16 Results with provided detection.
17 Comparison of orientation estimators.
1 Introduction
Researchers have worked on the development of autonomous cars for many years. One of the first larger events in this field was the DARPA Grand Challenge in 2004 [DARP04]. None of the teams finished the race, but since then a lot of progress has been made. Nowadays, many researchers from academia, traditional car makers and IT companies are working on autonomous cars [Hars15]. Recently, success stories from various researchers drew public interest to these topics (for instance [Hsu15]). On the one hand, autonomous cars driving interconnected with other autonomous cars can benefit the environment because they could lead to a more constant flow of traffic, thereby preventing jams and reducing carbon emissions [ITSA14]. On the other hand, a main cause of accidents is human failure [Sing15]. The reasons for failure range from fatigue and alcohol consumption to distraction caused by the use of electronic devices, which poses a growing problem [Worl15]. Autonomous cars do not get tired or lose focus and can hence be valuable in reducing the number of accidents.
An autonomous vehicle typically consists of a multitude of subsystems and is equipped with different sensors such as cameras or radars. Each subsystem should be able to work independently and deliver reliable information. To ensure a fail-safe system – even in case of a failure of one subsystem – the information from the individual subsystems should be fused together. The redundant information can thus enable a reliably working system.
1.1 Motivation
One of these subsystems must be aware of the surrounding cars at any given time. Not only is the current position of importance; the system should also be able to predict the positions of the surrounding cars in the next moment. Knowing where surrounding cars will be located is useful information not only in the context of autonomous cars. If the technology is embedded in a braking assistance system, it can inform the human driver of potentially dangerous situations.

Nowadays, braking assistants detect other vehicles through distance sensors. These sensors, however, are directed to the front only and warn the driver in case a vehicle is too close in the direct forefield of the car. Vehicles which are not yet directly in front of another vehicle but are about to cross this vehicle's driving lane are not detected. For instance, a car which is changing lanes in the immediate forefield might be detected too late by a conventional braking assistant, potentially resulting in an accident.

A braking assistant which can predict the future position of a vehicle is able to warn the driver a split second earlier, giving the human driver or the autonomous system enough time to hit the brakes in order to prevent a crash. Employing such a braking assistant can reduce the risks which arise from unforeseen lane changes. This kind of braking assistant needs to incorporate the orientation as well as the velocity of and distance to the surrounding vehicles.
1.2 Objectives
The goal of this thesis is to develop a model which can reliably predict the orientation of a vehicle. This model is vision-based and uses only color images. In this context, different image descriptors and prediction approaches shall be investigated, compared and improved.

Experiments on an existing data set shall show that the orientation estimation works not only for cropped car images but also in conjunction with any independent object detector. Such a detector only has to inform the orientation estimator about the location of a vehicle in a recorded image.
1.3 Thesis Organization
Section 2 gives an overview of the subject and includes a survey on previous work on the estimation of vehicle orientations. In Section 3, the fundamentals of the two image descriptors which are employed throughout this thesis are explained. Different approaches for predicting vehicle orientations are given in Section 4. Problems arising from the used data set as well as modifications of existing classification approaches are described in detail in that section as well. Section 5 reports the extensive experiments which were conducted for this work and also gives a detailed description of the data set. Section 6 concludes the work with a summary of the results and some remarks on future work.
2 Related Work
This section gives a brief overview of several related topics. Firstly, the terminology and different methods in machine learning are introduced. Secondly, it is discussed which information on surrounding cars is useful for driver assistance systems. Furthermore, two ways of describing the orientation of a vehicle are given and a survey on orientation estimators for vehicles is conducted.
2.1 Machine Learning
As this work uses machine learning techniques, this section briefly describes the different learning styles, tasks and approaches related to this topic. The focus is only on those parts which are used in the remainder of this work.
2.1.1 Learning Styles
In machine learning, problems can be differentiated according to the style of learning which is used to solve a certain problem. The differentiation can be made according to whether a human is involved in the learning process or not.
2.1.1.1 Supervised Learning. In this learning style, a supervisor defines classes and labels of training data, which is why it is called supervised [MaRS08]. With the accordingly labeled training data, a model shall be learned which can make accurate predictions [Barb10]. It is of particular interest that this model predicts unseen data well.
The typical prediction tasks which belong to this style are classification and regression
(see Section 2.1.2).
2.1.1.2 Unsupervised Learning. A different learning style, which focuses on finding an accurate and compact description of data, is unsupervised learning [Barb10]. For this, no labeled data is used, nor is it necessary, because the goal is to find out how the data is organized or clustered [HaTF09].

According to [MaRS08], the most common form of unsupervised learning is clustering. In this task, areas with an accumulation of data points need to be identified, so that the points can be grouped into clusters.
2.1.2 Prediction Tasks
Another way to differentiate is to consider what kind of task is solved to reach a certain
goal. In this work, a model has to be learned which can predict a value given an unseen
data point. Therefore only classification and regression tasks are described here. The
main difference between these two tasks lies in the output they produce.
2.1.2.1 Classification. In a classification task, the learned model predicts a discrete value or a class [Barb10], and the training examples are labeled with such a discrete value or class. If the labels only contain two distinct classes, it is referred to as binary classification [SmVi08]. The prediction then only answers whether a data point belongs to one class or the other. If more than two classes exist, it is referred to as multi-class classification [SmVi08] (see Section 2.1.2.2).

The prediction of a classifier can be either correct or incorrect, but the incorrect predictions can further be differentiated into false positives and false negatives. This is typically illustrated in a confusion matrix (see Table 1). In the depicted example, a false positive occurs when a data point actually belongs to class B but is predicted as an instance of class A. In the case that a data point belongs to class A but is predicted as class B, it is called a false negative. The correct predictions are differentiated into true positives and true negatives.
Table 1: Confusion matrix.

                            Truth
                            Class A            Class B
  Prediction   Class A      true positives     false positives
               Class B      false negatives    true negatives
This differentiation is necessary because in some cases a false negative can be a lot more severe than a false positive. This can be explained by means of an earthquake warning system. The outcome of predicting "no earthquake" when in fact an earthquake occurred is a lot worse than predicting an earthquake when in reality there has not been one. The first case is a false negative, the second one a false positive. When a false negative occurs, people would not take the necessary precautions, which could have fatal consequences for them. False positives do not have this kind of bad outcome in this particular scenario.

It is possible to associate costs with false negatives and false positives when learning a classifier. This is referred to as misclassification cost [Elka01] and has been studied for different models: [FSZC99] for AdaBoost, [KoAl01] for SVM and [Ting02] for decision trees. In the earthquake scenario, the cost associated with false negatives should be a lot higher than for false positives. This results in a more conservative classifier which is more likely to predict false positives; however, the number of false negatives is reduced, ideally to zero. This property of conservatism is used in Section 4.2.2.
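The expected-cost reasoning behind such asymmetric costs can be made concrete with a small sketch (a minimal illustration, not from this thesis; the cost values are hypothetical):

```python
def expected_cost_decision(p_event, cost_fn, cost_fp):
    """Return True (raise the alarm) iff the expected cost of staying
    silent exceeds the expected cost of a (possibly false) alarm.

    p_event: estimated probability that the event (e.g. an earthquake) occurs.
    cost_fn: cost of a false negative (missed event).
    cost_fp: cost of a false positive (needless alarm).
    """
    # Expected cost of predicting "no event": p_event * cost_fn
    # Expected cost of predicting "event":    (1 - p_event) * cost_fp
    return p_event * cost_fn > (1 - p_event) * cost_fp

# With symmetric costs the usual 0.5 threshold is recovered; if a false
# negative is 100x as costly, even a 5% risk already triggers the alarm.
print(expected_cost_decision(0.05, cost_fn=1, cost_fp=1))    # False
print(expected_cost_decision(0.05, cost_fn=100, cost_fp=1))  # True
```

The comparison is equivalent to thresholding the probability at cost_fp / (cost_fp + cost_fn), which is exactly how misclassification costs make a classifier more conservative.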
2.1.2.2 Multi-Class Classification. In the presence of more than two classes, the aforementioned classification task can be extended by using multiple binary classifiers. Multi-class classification can be achieved by methods like one-vs-one or one-vs-rest. Some notes on these methods follow in this section; a detailed investigation of these methods is not conducted in this work but can be found in [HsLi02].
A. The one-vs-rest Approach. In this approach, all data points are divided into two sets: one set contains only the data points from a single class, while the other contains the data points from all remaining classes [Bish06]. These two sets are considered as the new two classes. Then, a classifier is learned which is able to differentiate between these two classes, and the response of the classifier is a confidence value. This value tells how certain the classifier is that the predicted class was seen.

This division is repeated so that every class is once the single-class set. Assuming that the classes 1, 2 and 3 are present, three classifiers have to be learned: 1-vs-(2 and 3), 2-vs-(1 and 3) and 3-vs-(1 and 2). The final prediction can be made by using the prediction with the highest confidence value.
A problem which can arise from this approach is that when grouping classes together, the amount of data points in the grouped class can be a lot larger than the amount of data points in the single class. This results in class imbalance, which is also addressed in Section 4.3. Furthermore, the confidence values of the individual classifiers are scaled differently and are not directly comparable, which can be a drawback. [LiZh05] introduces a method which can overcome this issue.
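The label construction and the highest-confidence decision rule can be sketched as follows (a minimal illustration with made-up confidence values, not code from this thesis):

```python
def one_vs_rest_labels(y, positive_class):
    """Relabel a multi-class label list for one binary one-vs-rest
    classifier: +1 for the chosen class, -1 for all remaining classes."""
    return [1 if label == positive_class else -1 for label in y]

def one_vs_rest_predict(confidences):
    """Pick the class whose one-vs-rest classifier is most confident.
    confidences: dict mapping class -> confidence value."""
    return max(confidences, key=confidences.get)

# Three classes -> three binary relabelings; here the one for class 2:
y = [1, 2, 3, 2, 1]
print(one_vs_rest_labels(y, 2))  # [-1, 1, -1, 1, -1]

# Made-up confidence values of the three classifiers for one test point:
print(one_vs_rest_predict({1: 0.2, 2: 0.9, 3: -0.1}))  # 2
```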
B. The one-vs-one Approach. For this approach, only a subset of the complete dataset is considered in the learning stage. This subset contains all data points from only two classes, so it can be treated as a binary classification problem. A classifier is learned which can discriminate between these two classes. This procedure is repeated for all possible combinations of class pairs. As a consequence, for the one-vs-one approach k(k − 1)/2 classifiers have to be learned, where k is the amount of different classes. Assuming that four classes exist, thus k = 4, it follows that 6 classifiers need to be learned.
When a prediction has to be made with the one-vs-one approach, the new data point is handed to every single classifier created earlier. The responses of the classifiers can be taken from Table 2; for instance, the 1-vs-2 classifier responds that an example of class 2 has been detected. The final decision, which class to use, can be made by the majority rule described in the following section.
Table 2: Responses of one-vs-one classifiers.

                  Class
              1     2     3     4
  Class  1    -     2     3     4
         2    -     -     2     2
         3    -     -     -     4
         4    -     -     -     -
C. Majority Voting. This is a popular approach, and not only in democratic procedures. Even though it is a simple rule, it has been shown that good results can be achieved with it [LaSu97]. The rule says that the class which has accumulated the most votes is chosen as the final prediction. In the example given in Table 2, class 2 would be chosen as the final prediction because it has three votes. Class 1 has no votes, class 3 has only one vote and class 4 has two votes.
In the case that two or more classes have the same amount of votes, the class with
the smallest index can be taken. Instead of applying the smallest index rule, it is also
conceivable to choose one class among the classes with the most votes at random.
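The majority rule with the smallest-index tie-break can be sketched as follows, using the classifier responses from Table 2 as votes (a minimal illustration, not code from this thesis):

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(votes):
    """votes: class labels returned by the k(k-1)/2 pairwise classifiers.
    Applies majority voting; ties are broken by the smallest class index."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)

# k = 4 classes -> 6 pairwise classifiers:
print(len(list(combinations([1, 2, 3, 4], 2))))  # 6

# Responses from Table 2: (1,2)->2, (1,3)->3, (1,4)->4, (2,3)->2, (2,4)->2, (3,4)->4
print(one_vs_one_predict([2, 3, 4, 2, 2, 4]))  # 2

# Tie between classes 1 and 3 -> the smallest index wins:
print(one_vs_one_predict([1, 3, 1, 3, 2, 4]))  # 1
```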
Table 3 gives an overview of the explained approaches for Multi-Class Classification
problems.
Table 3: Multi-Class approaches.

  Approach       Classifiers     Decision Rule
  One-vs-rest    k               Highest confidence value
  One-vs-one     k(k − 1)/2      Majority rule plus smallest index
2.1.2.3 Regression. The goal of regression is to predict a continuous value [SmVi08], whereas classification predicts a discrete or categorical value. The regression model is learned based on labeled training data, so it is also a supervised technique. The learned regression model takes a vector as input and outputs one value. This value is the prediction of the model.
Different approaches exist to learn such a model. In statistics, a very common way to determine a regression function is to fit a linear regression model with the least squares method [MoPV12]. Other popular approaches in the area of machine learning are described in the following section.
2.1.3 Algorithms
For classification and regression tasks, a plethora of different approaches and algorithms exists. Here, only those which are used in the experimental part of this work are briefly introduced.
2.1.3.1 Support Vector Machines (SVM). This method dates back to work from the sixties, described in [Vapn63]. With an SVM, a decision boundary, or hyperplane, can be learned which separates data points. To ensure that this boundary generalizes well to unseen data, [Vapn63] suggests that the margin between the boundary and the closest data points shall be maximized.
The problem formulation for a binary classification problem as described in [CoVa95] can be taken from Formula (1). The x_i are the feature vectors, the y_i contain the class assignments, and the ξ_i are slack variables. Slack variables are necessary to make the problem solvable when the dataset is not linearly separable, which is often the case. ξ_i corresponds to the violation of the i-th constraint.
\[
\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \dots, l.
\end{aligned}
\tag{1}
\]
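An intuition for Formula (1) can be gained by minimizing the equivalent hinge-loss objective 1/2‖w‖² + C Σ max(0, 1 − y_i(wᵀx_i + b)) with stochastic subgradient descent. The following sketch does exactly that on a toy dataset; it is an illustration of the objective only, not the solver used in this thesis, and the learning rate and epoch count are arbitrary choices:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Stochastic subgradient descent on the primal SVM objective
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w.x_i + b))."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point violates the margin: hinge term is active
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # only the regularizer contributes
                w = [wj - lr * wj for wj in w]
    return w, b

# Toy linearly separable data in 2D
X = [[2.0, 2.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -2.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
print([1 if s > 0 else -1 for s in scores])  # [1, 1, -1, -1]
```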
2.1.3.2 Support Vector Regression (SVR). As an extension of Support Vector Machines (SVM), SVR is used for solving regression problems. [VaVa98] introduced a version of SVR which can be taken from Formula (2). The notation is similar to the one used before for SVM; however, for SVR the z_i contain the target values, which are continuous.
\[
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^* \\
\text{subject to} \quad & w^T \phi(x_i) + b - z_i \le \varepsilon + \xi_i, \\
& z_i - w^T \phi(x_i) - b \le \varepsilon + \xi_i^*, \\
& \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \dots, l.
\end{aligned}
\tag{2}
\]
The goal of the minimization problem consists of two parts. One is the minimization of the sum of constraint violations, which is multiplied by a regularization constant C. The second goal is to minimize the quadratic norm w^T w, which, together with the other constraints, corresponds to maximizing the smallest distance to the hyperplane. Solving the above stated problem should find a function, or the support vectors which span the function. All the points should lie in the ε-area above or underneath the function [SmVi08]. This is expressed by the first two constraints. The main difference between SVM and SVR is that for SVM a function is learned which separates two classes, whereas for SVR a function is learned which, in the best case, contains all the data points in its ε-area [SmSc04].
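Analogously to the SVM case, Formula (2) can be approached by stochastic subgradient descent on the ε-insensitive loss 1/2‖w‖² + C Σ max(0, |wᵀx_i + b − z_i| − ε). The following toy example fits a linear function; again an illustration only, with arbitrarily chosen hyperparameters:

```python
def train_linear_svr(X, z, C=10.0, eps=0.1, lr=0.002, epochs=800):
    """Stochastic subgradient descent on the primal SVR objective
    0.5*||w||^2 + C * sum_i max(0, |w.x_i + b - z_i| - eps)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, zi in zip(X, z):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - zi
            # Subgradient of the epsilon-insensitive loss: zero inside
            # the eps-tube, +/-1 outside of it.
            g = 1.0 if err > eps else (-1.0 if err < -eps else 0.0)
            w = [wj - lr * (wj + C * g * xj) for wj, xj in zip(w, xi)]
            b -= lr * C * g
    return w, b

# Fit the noiseless relation z = 2 * x
X = [[0.0], [1.0], [2.0], [3.0]]
z = [0.0, 2.0, 4.0, 6.0]
w, b = train_linear_svr(X, z)
prediction = w[0] * 2.0 + b
print(abs(prediction - 4.0) < 0.5)  # True
```

Note that, in contrast to the SVM sketch, nothing pushes the points away from the function; the subgradient only becomes active once a point leaves the ε-tube.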
2.1.3.3 Decision Trees. This method is an easy and intuitive way to determine decisions or to group and separate data. To determine decisions, the tree consists of nodes with conditions. Depending on whether a condition is fulfilled, a different branch has to be followed until a leaf is reached. The leaf contains the final decision.

Trees can also be used to separate data. The data points are split in a way that the two resulting datasets have a high intra-class similarity and a low inter-class similarity. This means that instances within one resulting dataset should be similar to each other, while instances from different datasets should be very different from each other. In each step of the process of building a decision tree, the splitting attribute and value have to be chosen which ensure these conditions best.
Trees were introduced for classification and regression tasks by [BFSO84]. In this section, the focus is on regression trees, which are able to predict values from a certain range. However, the model can only predict values which were induced by at least one example in the training stage. Values with no exemplary data points are not predicted by the model. Thus, the predictions cannot include every single continuous value in a certain range and are limited to a discrete yet large subset of numbers in the range.
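This limitation can be illustrated with a depth-one regression tree (a stump) fitted by exhaustive search over split points: the leaf predictions are the means of the training targets falling into each leaf, so the model can only ever output those values (a minimal sketch, not the tree learner used later in this work):

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D data by minimizing the
    summed squared error of the two leaf means."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    candidates = sorted(set(xs))
    thresholds = [(a + c) / 2 for a, c in zip(candidates, candidates[1:])]
    best = None
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

predict = fit_stump([1.0, 2.0, 8.0, 9.0], [1.0, 1.2, 7.8, 8.0])
# Only the two leaf means 1.1 and 7.9 can ever be predicted:
print(round(predict(0.0), 2), round(predict(5.0), 2), round(predict(100.0), 2))
```

A deeper tree has more leaves and hence a larger set of possible outputs, but the set always stays discrete.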
2.2 Driver Assistance Systems
For autonomously driving cars or driver assistance systems, it is of great importance that all information on surrounding vehicles is available. Not only the current position but also the future position of these vehicles constitutes important information. To predict the future position of a surrounding vehicle, it is necessary to know, firstly, where the vehicle is currently located; secondly, in which direction the vehicle is heading; and thirdly, how fast the vehicle is driving.
1. The research on object detection addresses the first issue and answers the question
where a vehicle is located at the current time. Exemplary results of an object
detector with the bounding boxes of found vehicles can be taken from Figure 1.
2. How the driving direction of a vehicle can be determined depends on what kind of data is used. If sequential data is available, object tracking or optical flow approaches can help solve the problem. For this, an object is followed in successive frames, and from the differences between the sequential frames the velocity and the driving direction can be inferred. If, however, no such sequential data is available, the direction has to be determined using only a single frame. Determining the orientation of a certain vehicle in a single frame is referred to as vision-based orientation estimation.
3. For the third issue, the determination of vehicle velocities, again approaches from
the field of object tracking and optical flow can be used. Sequential images are
mandatory for these vision-based approaches. However, velocities can also be
measured using other techniques, for example, by employing a LIDAR (Light
Detection and Ranging) system which can measure distances and velocities of
the surrounding cars.
Figure 1: Detected vehicles from forefield image.
This work focuses on orientation estimation; therefore only approaches related to this area are described in this section. A detailed survey on approaches in the areas of object detection, tracking and behavior analysis can be found in [SiTr13].
2.3 Orientations of Vehicles
The orientation of a vehicle in an image can be described in different ways. A more detailed description is referred to as the vehicle pose, whereas a simpler definition is referred to as the vehicle orientation. The definitions are as follows:
1. In [XiMS14], the pose is described as a combination of three values: the azimuth, the elevation and the distance (see Figure 2)1. The azimuth value implies from which side the car is seen. The elevation angle indicates whether the vehicle image was taken from the top or the bottom. The distance value tells how far away the vehicle was from the recording camera.

2. When it is only of interest which side of a car is seen, only one value, the observation angle, is necessary. The observation angle is equivalent to the azimuth in the pose definition. For the prediction challenge described in Section 5.3, only the observation angle needs to be predicted, and most of the related work described below also uses this one-value definition of the vehicle orientation.
Figure 2: Pascal3D vehicle pose definition.
2.4 Survey on Orientation Estimators for Vehicles
In a lot of existing approaches, the estimation of vehicle orientations is solved simultaneously with the detection of the vehicle. HOG [DaTr05], Haar-like features [ViJo01] and Gabor filters are often used together with an SVM [Vapn63] or AdaBoost [FrSc97] for the detection and therefore also for the orientation estimation. Different approaches related to the estimation of vehicle orientations are covered below.
In [RHMH10], a LIDAR is used to detect a vehicle candidate; HOG features are then computed for the candidate. The orientation is determined with a multi-class SVM, but only eight distinct viewpoints are used. In [NPSB10], the authors detect the lights of the surrounding vehicles. By tracking the changes in the geometry of the lights, the orientation of a vehicle can be estimated. As tracking is involved, sequential frames from a video are necessary.
1http://cvgl.stanford.edu/projects/pascal3d.html [Accessed December 18, 2015]
[GLOH11] produces two bounding boxes for each detected vehicle. The outer bounding box comprises the whole extent of the vehicle; the inner bounding box only includes the corresponding rear or front section, e.g. from the left to the right taillight. By analyzing the relative positions of the two bounding boxes, the vehicle orientation can be estimated using a tree-based classifier. [GeWU11] infers a scene layout and estimates orientations by aligning tracklets with detected lanes. However, this approach only works for moving cars.
[WiDe11] learns detectors for eight distinct viewpoints by using HOG features together with a simple linear classifier. The detector with the highest detection score determines the orientation of the vehicle. [YTAS11] performs object detection and orientation estimation simultaneously by training a family of detectors using multiplicative kernels, HOG features and SVM classification.
[YBAL14] uses a discriminatively trained part-based model (DPM) [FGMR10]. For this model, parts of vehicles are detected, and if certain parts occur together in a certain way, a vehicle detection can be derived. The observation angles are grouped into 16 different viewpoints, and for each viewpoint a set of parts is learned. The parts are described using HOG features. In [YeBG15], the DPM was extended to also incorporate 3D information by using stereo images.
A DPM with HOG features is used in [PSGS13], which also considers occlusion of
vehicles. Occlusion patterns are detected, and depending on which patterns are found,
the detection of occluded vehicles is improved. The work is extended in
[PSGS15], where a 3D-DPM model is used for the detection, but only eight different
viewpoints are taken into account for the orientation estimation. Again, HOG features
are used to describe the parts. Occlusion is also considered in [LiWZ14], where the
configurations of parts and components are modeled by an AND-OR structure with
varying viewpoint-occlusion patterns.
In [OBTr14], a clustering on color and gradient features is performed, and detectors for
subcategories of varying orientations, occlusions and truncations are learned. The
detectors are trained using AdaBoost. This work was extended in [OBTr15], where
the detection scores of the single detectors are used as a feature to determine the
orientation.
[XCLS15] creates exemplar 3D models which vary in observation angle, occlusion
and truncation. Then a clustering is conducted for every exemplar model,
resulting in a set of images per exemplar model. The images from a cluster
are used to train a detector which is then able to predict typical patterns seen in the
cluster. As a detector is created for every exemplar model, multiple detectors exist. If a
detector triggers a detection, the degree of occlusion and truncation is predicted in
addition to the orientation angle.
Recently, Deep Learning approaches [HiOT06], [KrSH12], [SLJS15] have performed very
well on computer vision tasks. An advantage of such deep networks is that no hand-engineered
features are necessary; the features are learned by the network itself. However,
for the estimation of vehicle orientations, little work has been published yet.
[CKZB15] learns the location of an object and its orientation simultaneously using
the Fast Region-based CNN [Girs15].
In most of the above-mentioned work, the orientation is directly inferred from the
detection, and orientation estimation is not considered as an independent
problem. The advantage of solving the orientation estimation independently is that
it can be plugged onto any object detector, which ensures modularity. If,
for example, a better detector is developed, no changes to the orientation estimator
need to be made; the new detector can simply be used together with the original
orientation estimator. The orientation estimator developed in this work requires, apart
from the bounding box of the detected vehicle, no prior information from the detector
to estimate the vehicle orientation.
3 Histogram-based Image Descriptors
Understanding an image comprises the tasks of object detection and orientation estimation,
as explained in the previous chapter. If a machine is to be taught to execute these
tasks automatically, the image needs to be represented in a machine-understandable
way.
An image is typically represented as a 3-dimensional matrix of pixel values in the
RGB color space. Concatenating these matrices into a vector would not only result in
a large vector, but this vector would also be ill-suited for the prediction task.
It is necessary to describe images in an abstract way for a machine to understand.
This representation should ideally capture all the necessary information and should
be discriminative enough to distinguish images of different kinds. The representation
of an image as a vector is called an image descriptor or, more generally, a feature vector. This
feature can be engineered (Feature Engineering) or automatically generated (Feature
Learning).
Studies have shown that the human brain detects objects by recognizing shapes and
edges [TrKa98]. The same approach is followed when teaching a machine to understand
an image. Thus, a plethora of different feature descriptors exist which make use of
edges. Some descriptors are keypoint-based, like SIFT [Lowe99] and SURF
[BETV08]. In this work, the focus is on descriptors which aggregate the number of
edges from different orientations. They are described in the following subsections.
3.1 Histograms of Oriented Gradients
Histograms of Oriented Gradients (HOG) [DaTr05] were introduced for the detection
of humans but also became popular for the detection of other objects. Further work
was conducted to extend and improve the performance of HOG in [ZYCA06] and
[WaHY09].
3.1.1 Image Gradients
Image gradients measure the change in intensity in a specific direction. For every
pixel in an image, it can be calculated in which direction the intensity, and consequently
the color, changes the fastest. Supposing a gray-scale image, the gradient points in
the direction in which the image gets brighter fastest. In Figure 3, the gradients are
visualized by red arrows. In the image, the pixel values increase towards the
middle; thus the gradients point towards the middle plane from both sides.
Figure 3: Gradient visualization.
If two neighboring pixels have almost the same gray value, the gradient is small. If the
neighboring pixels are black and white, the gradient is large. Implicitly, this means that
image gradients are able to capture the edges contained in an image.
3.1.1.1 Oriented Gradients. Gradients come with a direction; the gradients in
Figure 4a point either to the bottom or to the top. However, it was found in [DaTr05]
that using the direction does not necessarily improve the results. Rather, the
plane of the gradient is of interest. For example, the gradients pointing to the bottom
and to the top both lie in the vertical plane and can be aggregated. If the exact directions
are used, one speaks of signed gradients. When the gradients are summed
up per plane, they are referred to as unsigned gradients.
Furthermore, it was found that choosing nine of these planes or orientations performs
best. This means that the histogram consists of nine bins and each bin covers a
range of gradient orientations. Each orientation range is 180°/9 = 20° wide.
(a) Stripe image. (b) Histogram of gradients.
Figure 4: Example image with associated histogram of gradients.
3.1.2 Histogram Calculation
By summing up all gradients of a specific orientation, a histogram is obtained. Each
bin of the histogram contains the number of gradients of this orientation; e.g., using
nine orientations results in a histogram with nine bins. In the case of signed gradients,
the number of gradients pointing to the left, to the right, etc. is computed. In the case of
unsigned gradients, the number of horizontal, vertical and diagonal gradients is summed
up.
Considering the stripe image from above, all gradients are in the vertical orientation.
So, only the bin which contains the gradients of the vertical orientation has a non-zero
value. The resulting histogram is shown in Figure 4b; the fifth bin contains
all the gradients.
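The computation of such a histogram can be sketched with NumPy; the stripe image and the bin layout below are illustrative assumptions, not the exact implementation used in this work:

```python
import numpy as np

def unsigned_gradient_histogram(gray, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    # unsigned gradients: fold all directions into [0, 180) degrees
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / bins                          # 20 degrees per bin
    idx = np.minimum((angle / bin_width).astype(int), bins - 1)
    hist = np.zeros(bins)
    np.add.at(hist, idx.ravel(), magnitude.ravel())
    return hist

# stripe image: intensity changes only vertically, so every gradient
# lies in the vertical plane
stripe = np.tile(np.linspace(0, 255, 8)[:, None], (1, 8))
hist = unsigned_gradient_histogram(stripe)
```

For the stripe image, only the fifth bin (index 4, covering 80°–100°) is non-zero, consistent with Figure 4b.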
Obviously, a single histogram with nine orientations is not enough to describe
an image entirely, because such a descriptor would lack discriminative power: the
resulting histogram would be quite similar for any natural image. To overcome this
problem, the image is separated into cells.
(a) Separation into cells. (b) Histograms of each cell.
(c) Alternative representation as roses.
Figure 5: Car image with associated HOG per cell.
The car image was divided into 4 × 6 = 24 cells (see Figure 5a). For every cell, a histogram
of oriented gradients is computed, resulting in the 24 histograms depicted in Figure
5b. For HOG, an alternative rose-like representation of the histograms is often used;
it makes it easier to recognize the predominant gradient plane, as shown in Figure 5c.
Additionally, the cell histograms are grouped into blocks; for example, one block contains
the histograms of four adjacent cells. Multiple blocks of cells are created by
sliding through the image cell by cell, as illustrated in Figure 6. Each block then contains
the concatenation of four histograms, and in the end all blocks are concatenated
into one feature vector. Grouping the cells into blocks and applying this sliding-window
approach increases the robustness and improves the generated HOG feature [DaTr05].
Figure 6: Sliding through cells to create blocks (blocks I, II, III).
The dimensionality of the final feature vector consequently depends on the number of
blocks (here fifteen: five blocks horizontally and three blocks vertically), the number
of cells in each block (here four), and the number of histogram bins in each cell (here
nine). In the given example, the resulting feature length is 15 × 4 × 9 = 540.
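This arithmetic is easy to verify; the numbers below are those of the running example (4 × 6 cells, 2 × 2 cells per block, block stride of one cell):

```python
# number of cells in the image and configuration of the blocks
cells_v, cells_h = 4, 6      # vertical and horizontal cells
block_edge = 2               # each block spans 2 x 2 cells
bins_per_cell = 9            # unsigned gradient orientations

# a block stride of one cell yields (cells - block_edge + 1) positions per axis
blocks_v = cells_v - block_edge + 1   # 3 block rows
blocks_h = cells_h - block_edge + 1   # 5 block columns

feature_length = blocks_v * blocks_h * block_edge * block_edge * bins_per_cell
```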
3.2 Histograms of Sparse Codes (HSC)
This section describes an image descriptor which is related to HOG but uses sparse
codes instead of gradients. The feature is called Histograms of
Sparse Codes (HSC) and was introduced in [ReRa13]. The original work used HSC to
perform object detection, as was the case for HOG, and the authors reported better
detection results for HSC compared to HOG.
The HOG feature detects edges by analyzing gradients, whereas the HSC feature
detects edges by analyzing image patches. An image patch is a small image which can
be, for example, 3 × 3 or 8 × 8 pixels large. Both approaches use histograms: HOG
sums up gradients of different directions, HSC sums up the occurrences
of certain image patches. Learning these image patches is addressed in Section 3.2.2.
In this work, a modified version of HSC is used; it is described in the following
sections.
3.2.1 Patch Preprocessing
Assume two image patches, both containing a horizontal edge, but one darker and
the other brighter, as depicted in Figure 7a. When comparing these patches pixel-wise,
one could argue that most pixel values are very different from each other. The
horizontal edge is their only similarity, and in a pixel-wise comparison the impact of this
similarity might not be strong enough. The result of the comparison would be that
both images are dissimilar, and the fact that both patches contain a horizontal edge
would be missed.
This is why preprocessing the image patches is necessary to make the edges
more clearly visible. [ReRa13] calculates the mean pixel value of every patch and
subtracts it from the patch. This is also called centering and is depicted
in Figure 7b. Note that centering can introduce negative pixel values, which have to
be rescaled to valid values for visualization. This work, in addition to centering,
normalizes the patches, which leads to more uniform patches, as can be seen in Figure 7c.
It is important that this preprocessing stays the same for every extracted patch
in all stages: the learning of the dictionary, the training of the model, and
the prediction of test images. Skipping or modifying the preprocessing in only
one stage can result in undesired behavior and degraded results.
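The preprocessing can be sketched as follows; the two example patches are made-up values, chosen so that both contain the same horizontal edge at different brightness levels:

```python
import numpy as np

def preprocess_patch(patch, eps=1e-8):
    """Center a patch (subtract its mean) and normalize it to unit length."""
    p = patch.astype(float).ravel()
    p -= p.mean()                  # centering removes the brightness offset
    norm = np.linalg.norm(p)
    if norm > eps:                 # guard against flat, edge-free patches
        p /= norm
    return p

# same horizontal edge, different brightness (cf. Figure 7a)
dark = np.array([[10, 10, 10],
                 [40, 40, 40],
                 [40, 40, 40]])
bright = np.array([[170, 170, 170],
                   [230, 230, 230],
                   [230, 230, 230]])

a, b = preprocess_patch(dark), preprocess_patch(bright)
```

After centering and normalization the two patches coincide, so a pixel-wise comparison now recognizes the shared edge.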
3.2.2 Dictionary Learning
The first step when working with sparse codes is to learn a dictionary of visual words.
These visual words are the image patches described in the preceding section. The
dictionary should include those image patches which occur most often but also have a
certain degree of representational power to distinguish them from other patches in the
dictionary [ZhLi10].
[OlFi97] introduced a dictionary learning approach which uses stochastic gradient
descent to update the dictionary in each step. [EnAH99] learns a dictionary by
alternating between the optimization of the dictionary and the sparse codes. The
popular K-SVD algorithm [AhEB06] updates the sparse codes while the update of
the dictionary is being conducted. [MBPS09] propose an online approach which
learns dictionaries fast and is capable of updating the dictionary when new images are
seen.
Finding a suitable dictionary can be formulated as an optimization problem; this work
uses the formulation described in [MBPS09], see Equation (3). Given a dictionary
D, the error ‖x_i − Dα_i‖₂² between an image patch x_i and the reconstruction of this patch
from a subset of dictionary elements should be minimized. A detailed example of
patch reconstruction is given in Section 3.2.3.
\[
\min_{D \in \mathcal{C},\,\alpha} \; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \lVert x_i - D\alpha_i \rVert_2^2
\quad \text{subject to} \quad \lVert \alpha_i \rVert_0 \leq \lambda. \tag{3}
\]
α_i is a sparse vector and encodes which subset of dictionary elements
is used. If α_i^(j) is zero, the j-th element of the dictionary is not used. If α_i^(j)
is non-zero, the j-th element of the dictionary is used with the weight given by
α_i^(j). The ‖·‖₀ norm equals the number of non-zero entries of a vector. Thus,
the constraint ‖α_i‖₀ ≤ λ ensures that α_i does not contain more non-zero entries than
permitted by the sparsity level λ.
Table 4 gives an overview of the variables described above.
Table 4: Explanation of dictionary learning variables.

Variable   Explanation
D          Dictionary
C          Dictionary constraint (C = {D ∈ ℝ^(m×p) : ‖d_j‖₂² ≤ 1, ∀j})
x_i        Preprocessed i-th image patch (extracted from the original image)
‖α_i‖₀     Number of non-zero entries in the sparse vector α_i
λ          Sparsity level
This optimization problem can be solved by alternating between the computation of α
and D [ReRa13]: first, a random dictionary is assumed and the sparse codes which
minimize the objective are computed; afterwards, the dictionary is modified to check
whether a better value of the objective function can be found. This is done iteratively
for a predetermined number of iterations.
By solving the problem, the image patches with the smallest reconstruction error are
identified. The found image patches are thus the most frequently used patches in the
chosen subset of images. This subset should represent well all angles that occur
in the dataset. An exemplary dictionary with the 64 most frequent patches from a
subset of vehicle images is shown in Figure 8.
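The alternating scheme can be illustrated with a deliberately simplified NumPy sketch. It fixes the sparsity level to λ = 1, so each patch is coded by a single unit-norm atom, and it initializes the dictionary from the first patches; the thesis itself solves the full formulation of Equation (3) with SPAMS instead:

```python
import numpy as np

def learn_dictionary(X, n_atoms, n_iter=10):
    """Alternating minimization of Equation (3) for the special case
    ||alpha_i||_0 <= 1. X holds one preprocessed patch per column."""
    m, n = X.shape
    D = X[:, :n_atoms].astype(float).copy()      # init atoms from data
    D /= np.linalg.norm(D, axis=0)               # enforce ||d_j||_2 <= 1
    for _ in range(n_iter):
        # sparse coding step (D fixed): best single atom per patch
        corr = D.T @ X                           # (n_atoms, n)
        assign = np.abs(corr).argmax(axis=0)
        weights = corr[assign, np.arange(n)]
        # dictionary update step (codes fixed): least-squares direction
        for j in range(n_atoms):
            members = assign == j
            if members.any():
                d = X[:, members] @ weights[members]
                if np.linalg.norm(d) > 0:
                    D[:, j] = d / np.linalg.norm(d)
    return D

# toy patches: scaled copies of two orthogonal edge "templates"
rng = np.random.default_rng(0)
t1 = np.array([1.0, 1.0, -1.0, -1.0]) / 2.0      # horizontal edge
t2 = np.array([1.0, -1.0, 1.0, -1.0]) / 2.0      # vertical edge
amps = rng.uniform(0.5, 2.0, size=200)
X = np.column_stack([amps[i] * (t1 if i % 2 else t2) for i in range(200)])
D = learn_dictionary(X, n_atoms=2)
```

On this toy data the learned atoms recover the two templates (up to sign); with real patches and a larger λ one would call SPAMS.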
Figure 8: Learned dictionary of frequent patches.

3.2.3 Sparse Reconstruction
After a dictionary has been learned, any image patch can be reconstructed using
the dictionary elements. Clearly, the quality of the reconstruction depends on how
close the regarded image patch is to the learned elements. Most of the time, the result
is not an exact reconstruction but rather an approximation of the original patch. This
can again be formulated as an optimization problem [MBPS09], given in Equation (4).
\[
\min_{\alpha} \; \lVert x - D\alpha \rVert_2^2
\quad \text{subject to} \quad \lVert \alpha \rVert_1 \leq \lambda \tag{4}
\]
The notation differs slightly from the one used earlier: x denotes a single image
patch, because the problem has to be solved for every patch individually, and α is the
corresponding sparse code. Solving the problem means finding the dictionary elements
and assigned weights which minimize the reconstruction error. The reconstruction
error is calculated from the pixel-wise differences between the patch and its
reconstruction.
An example of patch reconstruction is shown in Figure 9. The original patch
can be reconstructed quite accurately with three other patches. The assigned weight
is given for every patch; a negative weight means that the pixel values are flipped,
so black pixels become white and vice versa.

Figure 9: Reconstruction with dictionary elements (the original patch is approximated
as a weighted sum of three dictionary patches with the weights 0.21, −0.26 and 0.54).
Sparsity is retained by the constraint ‖α‖₁ ≤ λ. Note that the ‖·‖₁ norm is used
here instead of the ‖·‖₀ norm employed for the dictionary learning. This
enables a more accurate reconstruction because instead of limiting the number of non-zero
entries, the sum of the absolute weights is limited. Still, this constraint leads
to a sparse vector α.
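As a toy illustration of Equation (4) (the dictionary, patch and support below are made up): once the subset of atoms to use is fixed, the optimal weights are an ordinary least-squares fit on that sub-dictionary, and enlarging the support can only decrease the reconstruction error.

```python
import numpy as np

# toy dictionary: three unit-norm atoms for flattened 2x2 patches
D = np.column_stack([
    np.array([1.0, 1.0, -1.0, -1.0]) / 2.0,   # horizontal edge
    np.array([1.0, -1.0, 1.0, -1.0]) / 2.0,   # vertical edge
    np.array([1.0, 1.0, 1.0, 1.0]) / 2.0,     # flat patch
])

x = np.array([1.0, 0.8, -0.6, -0.8])          # patch to reconstruct
support = [0, 2]                               # atoms the sparse code may use

# optimal weights on the chosen support: least squares on the sub-dictionary
w, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
alpha = np.zeros(D.shape[1])
alpha[support] = w

error = np.sum((x - D @ alpha) ** 2)           # the term ||x - D alpha||_2^2
```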
Both learning the dictionary and reconstructing the images can be computed with
SPAMS (SParse Modeling Software), which was introduced in [MBPS09] and is publicly
available. It provides efficiently implemented functions and is written in C++, with
interfaces to MATLAB, Python and R.
3.2.4 Feature Construction
The previous sections explained the necessary basics; this section describes how the
actual HSC feature is constructed. After learning the dictionary, the following
steps have to be conducted:
1. Separate the car image into cells, as shown in Figure 5a, which is equal to the
first step in creating the HOG feature.
2. Extract all image patches from a cell in a sliding window approach. The cell
is passed through pixel by pixel, as depicted in Figure 10, and thus all image
patches are extracted.
3. For every extracted patch, the sparse code and the corresponding weights are
calculated by solving the minimization problem in Equation (4).
4. Compute the absolute value for every sparse code. This is necessary because it
is not of importance whether an occurring edge is black or white. The important
fact is that there is an edge of a certain orientation and this is preserved by taking
the absolute value.
5. Sum up the absolute values of the sparse codes for each cell. As a result, we
obtain a vector fi for each cell i which contains the frequency of each dictionary
element in this cell. For instance, it reveals that the first dictionary element is
used ten times, or that the second element is only used twice. This also explains
the naming of the HSC (Histograms of Sparse Codes) feature, because histograms
are a typical way to visualize frequency distributions.
6. Normalize all fi's individually on a per-cell basis.
7. Concatenate the normalized fi's to obtain the final feature vector, see Figure 11b.
Assuming, for simplicity, that a cell consists of only two patches, two sparse codes
result from solving the reconstruction problem, as shown in Figure 11a. Then the
absolute values are computed and the vectors are summed column by column. The
summation result is the above-mentioned vector fi. Algorithm 1 summarizes the
described steps of the feature construction.
Figure 10: Sliding through the image to extract image patches.
|(0, …, 0, +0.2, −0.4, 0, …, 0)| + |(0, …, 0, +0.4, +0.1, 0, …, 0)| = (0, …, 0, 0.6, 0.5, 0, …, 0)

(a) Summing up the absolute values.

f₁/‖f₁‖₂ ⊕ ⋯ ⊕ fₙ/‖fₙ‖₂

(b) Concatenating the normalized features per cell.

Figure 11: Construction of the HSC feature.
Algorithm 1 HSC feature construction.
Require: dictionary
procedure createHSC(image)
    cellImages ← separateIntoCells(image)
    for each cellImage in cellImages do
        imagePatches ← extractAllImagePatches(cellImage)
        cellSum ← 0
        for each imagePatch in imagePatches do
            sparseCode ← solveReconstruction(imagePatch, dictionary)
            cellSum ← cellSum + |sparseCode|      ▷ Sum up the absolute sparse codes
        end for
        normCellSum ← cellSum / ‖cellSum‖₂
    end for
    hscFeature ← ⊕ normCellSum                    ▷ Concatenate all cell features
end procedure
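A direct Python transcription of Algorithm 1 might look as follows. Here `solve_reconstruction` is only a stand-in sparse coder (a least-squares code truncated to its largest coefficients); the thesis solves Equation (4) with SPAMS instead:

```python
import numpy as np

def solve_reconstruction(patch, D, sparsity=3):
    """Stand-in sparse coder: least-squares code truncated to `sparsity` entries."""
    alpha, *_ = np.linalg.lstsq(D, patch.ravel().astype(float), rcond=None)
    sparse = np.zeros_like(alpha)
    keep = np.argsort(np.abs(alpha))[-sparsity:]
    sparse[keep] = alpha[keep]
    return sparse

def create_hsc(image, D, cell_size, patch_size):
    n_atoms = D.shape[1]
    features = []
    H, W = image.shape
    for r in range(0, H - cell_size + 1, cell_size):        # separate into cells
        for c in range(0, W - cell_size + 1, cell_size):
            cell = image[r:r + cell_size, c:c + cell_size]
            cell_sum = np.zeros(n_atoms)
            # slide pixel by pixel to extract every patch of the cell
            for i in range(cell_size - patch_size + 1):
                for j in range(cell_size - patch_size + 1):
                    patch = cell[i:i + patch_size, j:j + patch_size]
                    cell_sum += np.abs(solve_reconstruction(patch, D))
            norm = np.linalg.norm(cell_sum)
            features.append(cell_sum / norm if norm > 0 else cell_sum)
    return np.concatenate(features)

# example: 8 atoms for flattened 3x3 patches, image of 4 x 6 cells of 6x6 pixels
rng = np.random.default_rng(0)
D = rng.standard_normal((9, 8))
D /= np.linalg.norm(D, axis=0)
feature = create_hsc(rng.standard_normal((24, 36)), D, cell_size=6, patch_size=3)
```

The resulting length, 4 × 6 cells × 8 atoms = 192, matches the dimensionality discussion in Section 3.2.5.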
3.2.5 Dimensionality of HSC
The dimensionality of the HSC feature depends on the amount of cells used, as well
as the amount of dictionary elements. This follows because for every cell the HSC
feature vector contains as many elements as the dictionary. Every entry in the feature
vector states how often each dictionary element is used and can consequently not be
smaller than the dictionary itself. When, for example, a dictionary with 64 elements,
four vertical and 6 horizontal cells are used, it results in a feature length of 1536, as
shown in Figure 12.
dictionary elements × vertical cells × horizontal cells = feature length  ⟹  64 × 4 × 6 = 1536

Figure 12: Factors deciding the feature length of HSC.
A comparable HOG configuration with the same number of cells would result in a
feature length of only 540. Section 5.2.1.2 investigates the effect of the dictionary
size on the prediction quality and computation time.
4 Estimation of Vehicle Orientations
In this section, different approaches for the estimation of vehicle orientations are
described, based on two different problem formulations. Modifications for predictors are
introduced which incorporate the particular structure of the problem. Furthermore,
issues arising from the imbalance of the data set are discussed and possible solutions
are given.
4.1 Problem Formulation
In Machine Learning, problems can be categorized according to the prediction value
they produce [Bish06]. If the prediction is a continuous value, it indicates a regression
problem. If a class is supposed to be predicted, it is a classification problem.
In the present case of the orientation estimation, the orientation angle of an observed
vehicle is supposed to be predicted. This angle is a continuous value in the range from
−π to π, thus indicating a regression problem. However, it was shown previously in
[OBTr15] that formulating the problem as a classification problem can be beneficial.
This section explains how to formulate the problem as a regression and as a classification
task. During the experiments with both formulations, it was found that
each formulation comes with certain advantages and shortcomings (see Section
5.2.2). To keep the advantages of both methods, an approach is introduced here which
combines the regression with the classification model.
4.1.1 Regression
As mentioned above, the prediction of orientation angles can be seen as a regression
task: the model predicts one value between −π and π. Table 5 shows some
exemplary angles and their assignment to orientations. A problem arising from
this is that angles for the left side can be either +π or −π. The angle values are
cyclical and do not, as would be usual for regression, lie on a number line (see Figure 13b).
The regression model does not know that −π and π are in fact the same angle, as depicted
in Figure 13a.
Table 5: Exemplary angles and orientations.
Angle (in radian) ±π −π/2 0 π/2
Orientation left side rear right side front
If there are two vehicle images, one with an orientation angle of 3 radians and
the other with an angle of −3 radians, the model assumes that these two images are very
different. In feature space, however, these two images are quite similar. This can
confuse the model and lead to bad predictions for left-side images. This
shortcoming is inherent to the formulation as a regression problem.
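The wrap-around can be made concrete with a small sketch (not code from the thesis): a plain regression loss sees the angles 3 rad and −3 rad as far apart, while their true angular distance is small.

```python
import math

def angular_distance(a, b):
    """Smallest angle between two angles in radians; result lies in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

naive = abs(3.0 - (-3.0))            # what plain regression "sees": 6.0
true = angular_distance(3.0, -3.0)   # actual difference: 2*pi - 6, about 0.28
```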
(a) Cyclical angles (0, ±π/2 and ±π arranged on a circle). (b) Continuous angles (−π to π on a number line).

Figure 13: Angle representations.
To solve the regression problem, Decision Trees and Support Vector Regression, which
were introduced in Section 2.1.3, can be utilized.
4.1.2 Multi-Class Classification
[OBTr15] reports that “multi-class SVM produced significantly better orientation estimation
compared to support vector regression”. In order to perform multi-class classification,
the problem needs to be formulated differently. As mentioned in [Barb10], a
continuous output can be discretized, and thus a corresponding classification problem
can be considered.
Instead of assuming continuous values for the orientation angles, the angles can be
divided into classes, each covering a distinct range of angles. For instance,
the angles can be divided into 16 classes, so that every class covers a range of
360°/16 = 22.5°. The division into classes, as well as exemplary images from each
of the 16 classes, is shown in Figure 14.
When one of these classes is predicted, the center value of the class is taken as the
prediction value. For example, if the fourth class is predicted, the center value, and
consequently the prediction value, is −π/2. This, however, has the shortcoming that
the number of distinct predictions is limited to the number of classes. It also means
that the number of very accurate predictions can decrease because of this limitation.
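The discretization and the mapping back to class centers can be sketched as follows; the 1-based class numbering and the convention that class 4 is centered at −π/2 follow Figure 14, while the exact bin boundaries are an assumption:

```python
import math

N_CLASSES = 16
BIN = 2 * math.pi / N_CLASSES            # 22.5 degrees per class

def angle_to_class(angle):
    """Map an angle in [-pi, pi] to the class (1..16) with the nearest center."""
    k = round(((angle + math.pi) % (2 * math.pi)) / BIN)
    return 16 if k == 0 else k

def class_center(k):
    """Center angle of class k; e.g. class 4 -> -pi/2, class 16 -> +/-pi."""
    return -math.pi + k * BIN
```

Note that both +π and −π land in the same class, which sidesteps the wrap-around problem of the regression formulation.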
(a) Division of the angles into 16 classes (with the orientations front, rear, left side and right side marked). (b) Exemplary images from each class.

Figure 14: Angles divided into 16 classes.
4.1.3 Combined Classification and Regression
Using regression has the advantage of being able to predict all values in a range, thus
enabling the model to predict all potentially occurring orientation angles. On the other
hand, as mentioned earlier, classification can deliver better overall results. This can
be explained by the fact that more image examples are available per class and thus
more variation can be learned for each class.
Combining regression and classification can lead to an overall improvement if one of
the following scenarios applies:
1. Regression or classification is working significantly better for a certain range of
angles, or
2. A relationship between the predictions of regression and classification can be
identified and exploited.
In the conducted experiments, it was found that the latter scenario applies. First, the
image to be predicted is passed to both the regression and the classification
model. Depending on the relation between the two predictions, either the prediction
of the regression or that of the classification is used as the final prediction. The approach is
depicted in Figure 15.
A simple rule was identified which helps decide whether to use the prediction
of the regression or of the classification. The rule, shown in Algorithm 2, exploits the
fact that when the predictions of the two models are close, it is better to use the
prediction of the regression model. If, however, the difference between the predictions
is large, it is beneficial to use the prediction of the classification model. Note that the
distance function in the algorithm returns the smallest angle between two angles.
Figure 15: Process of the combined method (the image is passed to the regression and
the classification model; their outputs regressionPrediction and classPrediction enter
the decision rule, which yields finalPrediction).
Algorithm 2 Decision rule for combination of regression and classification.
procedure combinePredictions
    Predict individually:
        regressionPrediction ← predictRegression(image)
        classPrediction ← predictClass(image)
    Decision:
        if distance(regressionPrediction, classPrediction) ≤ threshold then
            finalPrediction ← regressionPrediction
        else
            finalPrediction ← classPrediction
        end if
end procedure
The threshold which decides when to use the regression or the classification
model is subject to optimization and has to be determined experimentally.
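Algorithm 2 translates almost directly into Python; the circular distance function is an assumption matching the description above (“the smallest angle between two angles”):

```python
import math

def angular_distance(a, b):
    """Smallest angle between two angles in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def combine_predictions(regression_pred, class_pred, threshold):
    """Decision rule of Algorithm 2: trust the finer-grained regression when
    both models roughly agree, otherwise fall back to the classification."""
    if angular_distance(regression_pred, class_pred) <= threshold:
        return regression_pred
    return class_pred
```

For example, `combine_predictions(0.30, 0.39, threshold=0.2)` keeps the regression value, while `combine_predictions(2.0, -2.0, threshold=0.5)` falls back to the class prediction.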
4.2 Classifier Modifications
This section suggests modifications to the classifiers. The weighted voting approach
replaces majority voting and incorporates the underlying circular structure of the classes.
The second approach uses specialized classifiers which only predict a certain class under
a high degree of certainty. The third approach utilizes a second layer of classifiers
operating on probabilities instead of only the binary results of the one-vs-one classifiers.
4.2.1 Weighted Voting
Assuming one-vs-one classifiers are used, majority voting, explained in Section 2.1.2.2,
is one method to decide on the final prediction. Instead of this decision
rule, a different rule which incorporates the circular structure can be developed. The aim
of this rule is to reinforce the importance of classifiers for opposite angles.
In contrast, classifiers for neighboring classes can be considered less important.
Weights can be used to implement this rule: the prediction of a classifier for opposite
angles has a large weight, whereas a classifier for neighboring classes has a small weight.
Suppose that the whole range of angles is divided into six classes, as opposed to the
earlier example with sixteen classes. Then classes 1 and 4 are opposite, as are classes
2 and 5, and classes 3 and 6. Table 6 contains a weight matrix which can be applied
to the predictions of the one-vs-one classifiers. The classifiers for the above-mentioned
opposite classes have the largest weight, 5, whereas classifiers for neighboring classes
are weighted with 1. The classifiers in between have a weight of 3.
Table 6: Example for a weight matrix.

          1  2  3  4  5  6
Class 1   -  1  3  5  3  1
Class 2   -  -  1  3  5  3
Class 3   -  -  -  1  3  5
Class 4   -  -  -  -  1  3
Class 5   -  -  -  -  -  1
Class 6   -  -  -  -  -  -
Table 7 contains responses of the single one-vs-one classifiers. If majority voting is
applied to these responses, the final prediction is class 1, because four classifiers
predicted this class; class 4 is predicted only three times. In contrast, if weighted
voting is applied, the final prediction changes to class 4, because each time a one-vs-one
classifier predicts a class, its vote is multiplied by the weight taken from the weight
matrix. Class 4 is predicted three times with the weights 5, 3 and 3, resulting in a
weighted vote of 11, whereas class 1 is predicted four times with the assigned weights
1 + 3 + 3 + 1 = 8.
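The weighted vote from Tables 6 and 7 can be reproduced in a few lines; the responses dictionary below transcribes Table 7, and the weight function generates the circular weights of Table 6:

```python
K = 6  # number of classes

def weight(i, j):
    """Circular weight of the i-vs-j classifier: neighbors 1, opposite classes 5."""
    d = min(abs(i - j), K - abs(i - j))   # circular class distance in 1..3
    return {1: 1, 2: 3, 3: 5}[d]

# responses[(i, j)] = class predicted by the i-vs-j classifier (Table 7)
responses = {(1, 2): 1, (1, 3): 1, (1, 4): 4, (1, 5): 1, (1, 6): 1,
             (2, 3): 3, (2, 4): 4, (2, 5): 2, (2, 6): 2,
             (3, 4): 3, (3, 5): 5, (3, 6): 6,
             (4, 5): 5, (4, 6): 4,
             (5, 6): 6}

def vote(weighted):
    votes = dict.fromkeys(range(1, K + 1), 0)
    for (i, j), winner in responses.items():
        votes[winner] += weight(i, j) if weighted else 1
    return max(votes, key=votes.get)
```

Majority voting elects class 1 (four votes against three), while weighted voting elects class 4 (11 against 8).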
4.2.2 Specialized Classifiers
In Section 2.1.2.1, it was mentioned that a conservative predictor can be learned by
using misclassification costs. These costs have to be set such that a classifier
only predicts a certain class when it has a high degree of certainty. For this, it is
necessary to learn all possible one-vs-one classifiers: not only the 1-vs-2 classifier but
also the 2-vs-1 classifier. This results in k(k − 1) classifiers altogether, whereas
for the traditional one-vs-one approach only half as many classifiers are learned.
Table 7: Responses of the six-class problem.

          1  2  3  4  5  6
Class 1   -  1  1  4  1  1
Class 2   -  -  3  4  2  2
Class 3   -  -  -  3  5  6
Class 4   -  -  -  -  5  4
Class 5   -  -  -  -  -  6
Class 6   -  -  -  -  -  -
If the 1-vs-2 classifier predicts class 1, it is very certain that the data point belongs to
class 1. If it predicts class 2, however, the interpretation is that the classifier is unsure
whether the data point belongs to class 1 or 2. Put another way, when the 1-vs-2
classifier predicts class 1, the result is kept; if it predicts class 2, the result is
discarded. The classifier can hence be seen as specialized in predicting a single class
under high certainty.
Consequently, in a four-class scenario, there are three specialized classifiers for every
class which can tell whether a data point belongs to it or not. For calculating the
votes, this means that if an X-vs-Y classifier predicts X, the votes for X are increased by
one. If an X-vs-Y classifier predicts Y, the votes for X are not changed and the result is
discarded.
Table 8 shows exemplary responses of such specialized classifiers. The first row of
the table represents the responses of the 1-vs-Y classifiers, so only a response of 1
would count as a vote for class 1. This is, however, not the case, and the same holds
true for the 2-vs-Y classifiers. The 3-vs-Y classifiers predict 3 once, and the 4-vs-Y
classifiers predict 4 twice. Then the majority rule is applied, so that class 4 is the final
prediction, as it has two votes.
Table 8: Responses of specialized classifiers.

          1  2  3  4
Class 1   -  2  3  4
Class 2   1  -  3  4
Class 3   3  2  -  4
Class 4   4  2  4  -
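The vote counting for Table 8 can be sketched directly (the responses dictionary transcribes the table):

```python
# responses[(x, y)] = prediction of the specialized X-vs-Y classifier (Table 8)
responses = {(1, 2): 2, (1, 3): 3, (1, 4): 4,
             (2, 1): 1, (2, 3): 3, (2, 4): 4,
             (3, 1): 3, (3, 2): 2, (3, 4): 4,
             (4, 1): 4, (4, 2): 2, (4, 3): 4}

votes = dict.fromkeys(range(1, 5), 0)
for (x, y), prediction in responses.items():
    if prediction == x:     # only a confident "it is X" counts as a vote
        votes[x] += 1       # any other outcome is discarded
final_prediction = max(votes, key=votes.get)
```

Class 3 collects one vote and class 4 two, so class 4 is the final prediction, as described in the text.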
4.2.3 Second Layer Probability Classification
Normally, one-vs-one classifiers predict only one class, and no information on how likely
this prediction was is kept. However, some classification models, e.g. SVM, are
able to return probabilities which provide such information. This can help when
deciding on the final prediction in the next step: instead of applying the majority
rule, all these probabilities can be considered.
In the case of four classes, there are six X-vs-Y classifiers, and each of them returns
a probability value, as shown in Table 9. This value is positive if class X is more likely
and negative if class Y is more likely. The six probability values can then be considered
as a new feature vector. The resulting feature can be fed into an additional classifier,
which in effect forms a second layer of classifiers. Predictors with multiple layers are also
used in artificial neural networks [Fuku80], which often consist of multiple hidden layers.
Table 9: Probability predictions of first layer classifiers.

                 vs. class
           1      2      3      4
Class 1    -    −0.1   −0.5   −0.9
Class 2    -      -    +0.1   +0.1
Class 3    -      -      -    −0.9
Class 4    -      -      -      -
The example data is the same as in Table 8, but instead of a class the classifiers return
a probability. In the example, three classifiers predict class 2, yet the probability of each
of these predictions is, at 0.1, quite low. Class 4 is predicted only twice, but with much
larger probabilities of 0.9. Under the majority rule, class 2 would have been the final
prediction. A second layer of classifiers, however, can analyze the underlying
probabilities in more detail; a prediction of class 4 is therefore expected in this example.
The second layer again consists of one-vs-one classifiers, and the final output of the
second layer is obtained by applying the majority rule to its classifier results. The
necessary steps for working with two layers of classifiers are as follows:
• In the training stage:
1. Learn the first layer of one-vs-one classifiers with the training set.
2. Put the data points of the training set into the first layer classifiers and keep
their probability vectors for every data point.
3. Learn the second layer of one-vs-one classifiers with the generated probabil-
ity vectors.
• In the testing stage:
1. Put a data point of the test set into the first layer of classifiers and obtain
a probability vector.
2. Put the probability vector into the second layer of classifiers to obtain the
final prediction.
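The two-stage procedure can be sketched with scikit-learn (a hypothetical minimal setup; the thesis does not prescribe a particular library, and the random features merely stand in for image descriptors):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def pairwise_probability_vector(classifiers, x):
    """First layer: one signed value per class pair, positive if X is more likely."""
    return np.array([clf.predict_proba(x.reshape(1, -1))[0, 0] - 0.5
                     for clf in classifiers.values()]) * 2  # scaled to [-1, 1]

# --- training stage ---
np.random.seed(0)
X_train = np.random.rand(200, 10)          # placeholder features
y_train = np.random.randint(1, 5, 200)     # four classes, labels 1..4

# 1. learn the first layer of one-vs-one classifiers
first_layer = {}
for a, b in combinations([1, 2, 3, 4], 2):
    mask = np.isin(y_train, [a, b])
    first_layer[(a, b)] = SVC(probability=True).fit(X_train[mask], y_train[mask])

# 2. keep a probability vector for every training point
P_train = np.vstack([pairwise_probability_vector(first_layer, x) for x in X_train])

# 3. learn the second layer of one-vs-one classifiers on these vectors
second_layer = SVC(decision_function_shape='ovo').fit(P_train, y_train)

# --- testing stage ---
x_test = np.random.rand(10)
p_vec = pairwise_probability_vector(first_layer, x_test)
final_prediction = second_layer.predict(p_vec.reshape(1, -1))[0]
```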
Results for all of the above proposed methods are given in Section 5.2.3.
4.3 Class Imbalance
If, in a classification problem, a class contains a lot more data points than some other
class of the dataset, class imbalance is present. This is not something unusual and is
often the case in real-world scenarios.
For instance, when a car records images while driving through a city with a forward-facing
camera, the recorded images will contain more vehicles seen from the rear and the front
than from the left or right side. Also, depending on whether the country has left- or
right-hand traffic, the resulting distribution of recorded vehicle orientations will differ.
Figure 16 shows the number of images in the KITTI dataset (see Section 5.1.1) when
the orientation angles are divided into 16 classes as described earlier in Section 4.1.2.
Figure 16: Images per class.
Class 4 contains the highest number of vehicles because most of the recorded vehicles
are driving in front of the recording car or on neighboring lanes (classes 3 and 5).
Class 13 is the second largest class; it contains vehicles which pass the recording car
on the left side. In a country with left-hand traffic this peak would be expected in
class 11 instead. In right-hand traffic, however, class 11 is small because normally no
vehicles pass the recording car on the right side. The few images in this class are
mostly vehicles parked on the pavement facing the opposing direction. Furthermore,
classes 8 and 16 contain images of the right and the left side of vehicles, respectively.
The number of vehicles in these classes is also quite small; vehicles in these two classes
are typically only recorded when the recording car is waiting at traffic lights while
other vehicles are crossing the intersection.
Class imbalance can cause problems when learning a model. Two main problems can
arise from it:
1. The model is less likely to predict a small class because the large classes dominate
the construction of the decision boundaries.
2. The model does not learn enough variation due to the small number of examples.
There are different methods to deal with class imbalance, e.g., under- and over-sampling,
which are studied in [Japk00]. Additionally, when dealing with images, there is the
possibility of mirroring them. These methods are described and discussed in the following
sections. Issues arising from class imbalance can also be addressed algorithmically by
using cost-sensitive classifiers. This is not addressed here but can be found in the works
of [LiZh06] and [SKWW07].
4.3.1 Under-sampling Data Points
To obtain a more equally distributed dataset, data points from the large classes can be
under-sampled, i.e., a certain subset of data points is removed. An equally distributed
dataset has the advantage that the probability of predicting a class is not biased by the
existence of a large class [KuMa97]. Depending on the chosen prediction model, removing
data points can also reduce the time needed to learn the model. However, removing
examples comes at the cost of learning less variation because fewer examples are shown
to the model. The main question is which images can be removed without making the
prediction model worse than before:
1. Removing data points randomly:
Data points of the large classes can be removed randomly. This is easy to implement
and requires no further assumptions which could introduce bias.
The disadvantage of this approach, and of under-sampling methods in general, is
that variation in how data points can appear is lost. In order to learn a good model,
it is essential to show it a wide range of data points with a certain degree of variation.
Only then is the model able to generalize well to unseen data. In the extreme case
where a model is learned from only a handful of data points, it will most likely not
predict unseen data well because it is overfitting the training data [Hawk04].
2. Removing quasi-duplicates:
There are many images of cars from the rear in the dataset because the recording
car was sometimes driving behind a single car for a long time. This results in what
are here called quasi-duplicates: the images of the car are almost the same but not
exact duplicates. Almost identical images of cars can also occur in other classes,
e.g., while waiting at red lights, but this happens far less often.
Figure 17 shows two consecutive frames in which a car is driving in front of the
recording car. Enlarging the car reveals that there is almost no visible variation
between the enlarged vehicles. This can justify the removal of quasi-duplicates.
There is also a large number of images of the front sides of cars. As these cars
normally move towards the recording car, the appearance of the vehicle image and
its orientation angle change more rapidly, so these images do not have to be
considered quasi-duplicates.
To deal with the issue of quasi-duplicates, a method was implemented which keeps
only every tenth image of the rear of a car in sequential frames. This method
exploits the fact that a car which is in front of the recording car is most likely
still in front of it in the subsequent frame. To be able to cope with lane changes,
the threshold was set to the above-mentioned value of 10.
Figure 17: Quasi-duplicates of consecutive frames.
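Random under-sampling can be sketched as follows (a minimal illustration; the class labels are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, max_per_class):
    """Randomly keep at most max_per_class data points per class."""
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) > max_per_class:
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.extend(idx)
    keep = np.sort(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)                   # ten toy data points
y = np.array([4, 4, 4, 4, 4, 4, 13, 13, 11, 8])    # imbalanced classes
X_b, y_b = undersample(X, y, max_per_class=3)
# the large class 4 is reduced to 3 points; small classes stay untouched
```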
4.3.2 Over-sampling Data Points
There are different ways to over-sample data points for small classes. Two straightforward
methods for over-sampling are:
1. Duplicating data points:
A simple method to over-sample is to duplicate existing data points of small classes.
The success of this method depends on the utilized classifier. For instance, when
using the k-nearest-neighbor algorithm [CoHa67], a duplicate certainly has an impact
on the classification. If, however, a decision tree is used, a leaf may cover the same
region as without the duplicate, so the duplicate has no effect on the resulting
prediction.
Another obvious shortcoming of duplicating data points is that the model cannot
learn any additional variation from the duplicates. Quite the opposite can be the
case: the model can be overfitted by the mere replication of images.
2. Creating synthetic data points:
[CBHK02] introduces SMOTE (Synthetic Minority Over-sampling Technique), which
can overcome the problems of exact duplicates. The cited work shows that synthetic
data points lead to better decision boundaries, e.g. for decision trees, and that
duplicates do not help to the same extent. The synthetic data points are created by
introducing points with small variations of existing data points. Since this work
uses images, it is not obvious how a vehicle image itself could be modified in a
legitimate way. However, it is conceivable to synthetically modify the feature
vector created from the original image in the proposed way.
In [CBHK02] it was also shown that a combination of under-sampling the large class and
over-sampling the small class can improve certain classifiers. On the other hand,
[DrHo03] shows that for C4.5 [Quin93], an algorithm to learn decision trees,
under-sampling performs better than over-sampling.
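The interpolation idea behind SMOTE, applied to feature vectors as suggested above, can be sketched as follows (a simplified illustration, not the exact algorithm from [CBHK02]):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_points(X_minority, n_new, k=5):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest neighbors (simplified SMOTE)."""
    X = np.asarray(X_minority, dtype=float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # distances from point i to all other minority points
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]       # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                        # random position on the segment
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

X_min = rng.random((20, 6))       # e.g. 20 feature vectors of a small class
X_syn = synthetic_points(X_min, n_new=40)
```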
4.3.3 Mirroring Images
Another method to overcome class imbalance is to increase the number of images by
mirroring them. For a small class, the opposite class is chosen, e.g., the opposite class
of class 11 is class 13, and the opposite class of class 10 is class 14. The images from
the opposite class can then be mirrored by simply flipping the pixel columns
horizontally. Example images from a small class and an opposite class, as well as the
mirrored vehicle image, are shown in Figure 18.
(a) Small class. (b) Opposite class. (c) Mirrored image.
Figure 18: Image mirroring.
The image from the small class shows the untypical case of seeing the right front of a
vehicle, which happens only in exceptional situations in right-hand traffic. Mirroring
increases the number of image examples with natural-looking images. One could argue
that these images are not natural in the sense that the driver and the steering wheel
are on the wrong side. But as the utilized image descriptors do not depend directly on
this and are more abstract, the mirrored images can be considered equivalent to
original ones.
The main advantage of mirroring is that the prediction model can learn a lot of
variation from the new images. As the images are derived from natural ones, the model
learns real variation that can occur in the real world rather than random noise. The
number of flipped images might, however, have to be limited. This is in particular the
case when mirroring images from class 13 for the small class 11: the distribution of
images would change substantially, which can have the undesirable effects discussed
in Section 4.3.4.
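Mirroring an image and mapping its label to the opposite class can be sketched as follows; the class-mapping formula is an assumption derived from the pairs stated above (11 maps to 13, 10 maps to 14) and from the pure rear view, class 4, mapping to itself:

```python
import numpy as np

def mirror_image(img):
    """Flip an image horizontally by reversing the pixel columns."""
    return img[:, ::-1]

def mirror_class(c):
    """Opposite class under horizontal mirroring for the 16-class layout.
    The formula is an assumption derived from the stated pairs
    (11 <-> 13, 10 <-> 14); it keeps the rear-view class 4 fixed."""
    return (23 - c) % 16 + 1

img = np.arange(12).reshape(3, 4)     # toy 3 x 4 "image"
assert (mirror_image(mirror_image(img)) == img).all()
assert mirror_class(13) == 11 and mirror_class(10) == 14
assert mirror_class(4) == 4
```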
Table 10 summarizes the advantages and disadvantages of the methods explained in
the preceding sections.
Table 10: Characteristics of methods to balance classes.

Method         Under-sampling         Over-sampling           Mirroring
Advantage      No domination of       Increased impact of     Additional variation
               large classes          small classes
Disadvantage   Loss of variation      Little additional       Potentially too many
                                      variation               new data points
4.3.4 Changing the Value Distribution
All previously described methods share a shortcoming which has not been mentioned so
far. Normally, it is assumed that the distribution of the output values is the same
for the training and the test set. When the distribution is changed by over-/under-
sampling or by mirroring images, the model is more likely to reproduce this modified
distribution, even if the distribution of the test set is different.
This problem is investigated in the field of Domain Adaptation (see also [DaMa06])
but is not addressed in this work.
EXPERIMENTS 37
5 Experiments
This section gives a detailed description of the used dataset and how the results are
measured. The proposed features as well as the different approaches described in the
preceding sections are investigated and the results are reported and explained.
5.1 Datasets
Finding images of cars is not a problem; search engines can provide a plethora of
car images. Finding usefully labeled images, however, can be. The ImageNet
database2 [DDSL09] comprises many car images, but the annotations are limited to car
types and no orientation angles are given. To be able to predict vehicle orientations,
the dataset must provide annotations for these orientation angles. Some datasets which
include such annotations are:
1. KITTI Vision Benchmark Suite (Karlsruhe Institute of Technology and Univer-
sity of Toronto) [GeLU12];
2. NYC3DCars (Cornell University) [MaSn13];
3. Pascal3D+ dataset (Stanford University, 12 categories thereof one category with
cars) [XiMS14].
The experiments in this work use the very popular KITTI Vision dataset because it was
recorded by a car driving in normal traffic and is hence close to the later use case of
driver assistance systems. The other two datasets consist mostly of car pictures taken
by pedestrians from outside the driving lanes.
5.1.1 KITTI Vision Dataset
The datasets for the KITTI Vision Benchmark Suite were recorded by driving through
a German city and its surroundings. The car was equipped with the following sensors
(see also Figure 19):
• 2 color and 2 grayscale cameras;
• 1 Velodyne laser scanner;
• 1 GPS unit.
2 http://www.image-net.org/ [Accessed December 18, 2015]
Figure 19: Sensors on recording car [GeLU12].
From the recorded datasets, different challenges were created. When taking part in a
challenge, the results can be submitted to an evaluation server and are then documented
in a benchmark on the KITTI Vision project page.3 The following challenges are
available:
1. Stereo Evaluation 2012/2015;
2. Optical Flow Evaluation 2012/2015;
3. Scene Flow Evaluation 2015;
4. Visual Odometry/SLAM Evaluation 2012;
5. Object Detection Evaluation 2012;
6. Object Tracking Evaluation 2012;
7. Road/Lane Detection Evaluation 2012.
The Object Detection Evaluation 2012 consists of stereo images and point clouds from
the Velodyne scanner. This work, however, only uses forefield images from one of the
cameras, as shown in Figure 20. There are in total 7,500 images with around 22,000
cars on them. Images with heavy traffic can contain more than ten cars, but there are
also images without any cars. The annotations for the forefield images comprise:
• Type of object (car, truck, pedestrian, cyclist, . . . );
• Bounding boxes (where in the image is the vehicle);
• Orientation angle ∈ (−π, π];
• Occlusion (object is occluded by other objects);
3 http://www.cvlibs.net/datasets/kitti/index.php [Accessed December 18, 2015]
• Truncation (object is not completely inside the forefield image).
(a) City environment.
(b) Countryside environment.
Figure 20: Exemplary forefield images.
Depending on the degree of occlusion and truncation as well as the size of the car
image, the images are divided into three difficulty levels. The easy images are contained
in the moderate set, and the moderate images are contained in the hard set
(easy ⊂ moderate ⊂ hard). The accuracies obtained for the individual difficulty levels
are reported separately on the benchmark page. Table 11 gives a more detailed
explanation of the difficulty levels.
Table 11: Difficulties of the vehicle images.
Difficulty Bounding Box Height Occlusion Truncation
Easy ≥ 40 pixel Fully visible ≤ 15 %
Moderate ≥ 25 pixel Partly occluded ≤ 30 %
Hard ≥ 25 pixel Difficult to see ≤ 50 %
The dataset was recorded on sunny days and only during daytime. Hence, testing
approaches which can deal with different weather conditions is not possible with this
dataset. Still, it is a very popular dataset among researchers, and since its release in
2012 a lot of research on detection and orientation estimation has been conducted with
it [PSGS13], [OBTr14], [LiWZ14], [XCLS15], [OBTr15].
5.1.2 Evaluation Metrics
For every prediction task, an evaluation metric is necessary which quantifies the good-
ness of results. In [GeLU12], a metric called Average Orientation Similarity (AOS)
is introduced. The metric evaluates jointly the detection results and the orientation
estimation. If, however, only the orientation results shall be measured a simplified
version can be derived which can be taken from Equation (5).
m = (1 / |D|) Σi∈D f(∆i)    (5)
D is the image dataset and i is an image from this set. f(·) is the actual evaluation
function which quantifies how good or bad a prediction was; it depends on the deviation
∆i between the predicted and the actual value of the i-th image. To evaluate the
results, the value of f is calculated for every image of the dataset as a function of its
deviation. The evaluation metric is then obtained as the normalized sum of these values.
The maximal difference between a predicted angle and the ground truth angle is 180◦
(or π). This implies two basic requirements for the evaluation function f:
1. If the deviation equals 0 (perfect prediction), the value of the function should be 1.
2. If the deviation equals π (worst prediction), the value of the function should be 0.
5.1.2.1 Cosine Function. A modified cosine function was suggested by [GeLU12]
for the evaluation. It can be taken from Equation (6).
f(∆i) = (1 + cos(∆i)) / 2    (6)
This function fulfills the basic requirements stated above, but it does not penalize
small errors much, as can be seen in Figure 21a: while the error is still small, the
value of the function decreases only slowly. As is characteristic for cosine functions,
once the error grows larger the value of the function approaches 0 quite fast.
In some of the experiments described in the following sections, an undesirable effect
was visible. Assume two experiments are compared where, in the second one, the number
of perfectly predicted images decreased by 100 while the number of predictions with
small errors increased by 110; then the second experiment obtains a better evaluation
metric than the first one. It can thus be argued that already small errors are not
desirable and should be penalized accordingly.
Figure 21: Evaluation functions. (a) Modified cosine function. (b) Modified hyperbolic tangent function.
5.1.2.2 Hyperbolic Tangent Function. In order to penalize small errors to a
greater extent, this work suggests a modified hyperbolic tangent function (see Equation
(7)) which has the desired property.
f(∆i) = 1 − tanh(∆i)    (7)
For an increasing error, the value of the function decreases rapidly, as illustrated
in Figure 21b. This implies that small errors have a higher impact on the overall
performance; the evaluation function is more pessimistic than the cosine function.
A shortcoming of the tanh function, however, is that the second requirement is not
fulfilled exactly, because f(π) = 1 − tanh(π) = 0.0037 and not 0 as desired.
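Both evaluation functions, and the metric from Equation (5), can be sketched as:

```python
import math

def cosine_metric(dev):
    """Modified cosine evaluation function, Equation (6)."""
    return (1 + math.cos(dev)) / 2

def tanh_metric(dev):
    """Modified hyperbolic tangent evaluation function, Equation (7)."""
    return 1 - math.tanh(dev)

def evaluate(deviations, f):
    """Equation (5): normalized sum of the per-image scores."""
    return sum(f(d) for d in deviations) / len(deviations)

# both functions give 1 for a perfect prediction ...
print(cosine_metric(0.0), tanh_metric(0.0))     # 1.0 1.0
# ... but only the cosine variant reaches exactly 0 in the worst case
print(round(cosine_metric(math.pi), 4))         # 0.0
print(round(tanh_metric(math.pi), 4))           # 0.0037
```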
5.2 Orientation Estimation
The KITTI Vision dataset consists of two sets: a labeled training set of forefield
images with the previously mentioned annotations, and an unlabeled test set of forefield
images. As the test set is used to create the submission for the benchmark, it comes
without any annotations.
However, to evaluate the experiments for the orientation estimation, annotations are
necessary. Therefore, the KITTI training set is divided into ten subsets, or so-called
folds. It is ensured that each fold has almost the same number of forefield images.
The forefield images were recorded during independent drives on several days, but
within each drive quasi-duplicates exist (see Section 4.3.1). This makes it necessary
to divide the available images so that all images from one drive are contained in
exactly one of the folds. This is ensured by using the mapping information provided
in the development kit, which includes the assignment of images to drives.
The results reported in this section are obtained using 10-fold cross-validation
[Koha95]: nine folds are used to learn a model and the remaining fold is used for
evaluation. This procedure is repeated so that each fold has served as the evaluation
fold once. The overall quality is reported as the average of the qualities of the
single cross-validation runs.
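The drive-aware fold construction can be sketched with scikit-learn's GroupKFold (a hypothetical setup; the drive IDs stand in for the mapping from the development kit):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_images = 100
X = rng.random((n_images, 8))             # placeholder image features
y = rng.integers(1, 17, n_images)         # 16 orientation classes
drives = rng.integers(0, 20, n_images)    # hypothetical drive IDs

# all images of a given drive end up in exactly one fold
cv = GroupKFold(n_splits=10)
for train_idx, test_idx in cv.split(X, y, groups=drives):
    assert set(drives[train_idx]).isdisjoint(set(drives[test_idx]))
```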
After the forefield images are divided in the described way, the vehicles can be cropped
from the images using the provided bounding box annotations (see Figure 22). This work
focuses on cars, so only car images are cropped out; trucks, pedestrians, cyclists and
others are omitted.
Figure 22: Cropping cars using the provided annotations.
5.2.1 Histogram-based Features
In this section, the results for the HOG and HSC experiments are reported, as well as
a comparison of these two image descriptors. Here, the formulation as a classification
problem is taken as a basis. In Section 5.2.2, the results for regression and also the
combined method are given.
5.2.1.1 HOG Features. An important parameter for the creation of the HOG
features is the number of vertical and horizontal cells. If too few cells are chosen,
the feature is not descriptive enough. Too many cells, however, capture too much
information, which can be counterproductive because the model cannot identify the
relevant parts. It was found that a small number of cells, 4 vertical and 6 horizontal,
aggregates the image information strongly enough while keeping sufficient information
about the edge geometry of the car in the image. The cells are grouped into blocks of
2 x 2 cells.
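A simplified, dependency-free version of such a cell-based gradient histogram can be sketched as follows (block grouping and normalization are omitted; the cell counts follow the setting above):

```python
import numpy as np

def hog_features(img, n_cells=(4, 6), n_bins=9):
    """Simplified HOG: per-cell histograms of unsigned gradient orientations,
    weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # orientation in [0, pi)
    ch, cw = img.shape[0] // n_cells[0], img.shape[1] // n_cells[1]
    hist = np.zeros(n_cells + (n_bins,))
    for r in range(n_cells[0]):
        for c in range(n_cells[1]):
            m = mag[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            a = ang[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            bins = np.minimum((a / np.pi * n_bins).astype(int), n_bins - 1)
            for b in range(n_bins):
                hist[r, c, b] = m[bins == b].sum()
    return hist.ravel()

feat = hog_features(np.random.rand(32, 48))       # toy crop, 4 x 6 cells
```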
The results for the described parameter setting are given in Figure 23. The bars of the
diagram show how many images are predicted with a certain deviation. The deviation is
the smallest angle between the predicted pose p and the actual pose a (both
p, a ∈ (−π, π]), so the highest possible deviation is π. For instance, the first bar on
the left means that slightly fewer than 3,500 images are predicted with a deviation of
less than 0.17, which is approximately 10◦. These predictions are considered nearly
perfect. The second bar states how many predictions have an error between 0.17 and
0.34, and so on. The colors in the diagram indicate the class of the image example:
images from “−2.75” belong to class 1, and so on up to “3.14”, which contains images
from class 16 as depicted in Figure 14.
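The deviation, i.e. the smallest angle between two poses, can be computed as:

```python
import math

def deviation(p, a):
    """Smallest angle between predicted pose p and actual pose a,
    both in (-pi, pi]; the result lies in [0, pi]."""
    d = abs(p - a) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

print(deviation(0.1, -0.05))    # ~0.15
print(deviation(3.1, -3.1))     # ~0.083: wraps around instead of 6.2
```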
Figure 23: HOG results (cosine metric: 0.9490; tanh metric: 0.8383).
A. Analyzing Large Errors. Figure 23 reveals that some car images are predicted
with the maximal error of 3.14. These are cars which are predicted as being seen from
the exact opposite angle, which was also observed in [RHMH10]: for instance, an image
which is predicted as the rear of a vehicle but actually shows the front. A few examples
of images with a large error can be found in Figure 24. The annotation dev = 3.14
means that the deviation between prediction and ground truth is 3.14 ≈ π = 180◦,
i.e., the maximal deviation.
B. Neighborhood of Images. To further analyze why a certain image is predicted
with a large error, its closest neighbors in feature space were investigated. To determine
those neighbors, the distance between the features of two images can be measured either
with the l2-norm or with the cosine similarity, which is more suitable for
high-dimensional data.

Figure 24: Examples of badly predicted images (each annotated with its deviation dev).

Figure 25 shows the nine closest neighbors of one of the badly predicted cars, which is
shown in the top-left corner.
Neighbors from the same actual class as the badly predicted car are annotated in red;
neighbors from the same (wrongly) predicted class are annotated in green. Looking at
the nearest neighbors, it becomes apparent that there are more neighbors from the
wrongly predicted class than from the actual class, so it is not surprising that the
prediction is wrong. Moreover, comparing the neighbors from the different classes with
each other shows why these images are close in feature space: the overall orientations
of their edges are quite similar, and only this is captured by the HOG features.
To overcome this issue, it is conceivable to combine the HOG feature with a blob
detector [Lind98] which finds tail or head lights in an image. For instance, if tail
lights are detected, the prediction of classes with front images is blocked and only
classes with rear images can be predicted. This approach would, however, not work for
side images and is not investigated further in this work.
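Retrieving the nearest neighbors with the cosine similarity can be sketched as follows (illustrative only; random vectors stand in for the HOG features):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_neighbors(query, features, k=9):
    """Indices of the k feature vectors most similar to query (cosine similarity)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    similarity = f @ q                      # cosine similarity to every feature
    return np.argsort(similarity)[::-1][:k]

features = rng.random((1000, 540))          # e.g. HOG features of 1000 car crops
query = features[42]
neighbors = nearest_neighbors(query, features)
assert neighbors[0] == 42                   # the query itself is its closest match
```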
C. Image Preprocessing. Some of the cars in Figure 24 are over- or underexposed,
which could justify preprocessing the images. Images 3 and 11 are overexposed due to
too much sunlight; images 8 and 16 are underexposed because the cars are located in
shadow. The HOG feature then cannot capture the edges because there is not enough
contrast in the image.
Figure 25: Neighborhood of a badly predicted image (actual class 11, predicted class 3). Of the nine nearest neighbors, two have actual class 11 and seven have actual class 3.
To address this, a simple method was implemented which increases the contrast in a
different range depending on whether an over- or underexposed image was found.
However, the overall quality worsened when all images were preprocessed this way.
5.2.1.2 HSC Features. To make the results comparable, the same number of cells
was used to calculate the HSC features. The result diagram for HSC is shown in Figure
26. It reveals that 79 more images could be predicted nearly perfectly with HSC
compared to HOG. Even more importantly, the number of images with a large prediction
error was reduced by 74. A more detailed comparison is given in Section 5.2.1.3.
Figure 26: HSC results with 256 dictionary elements (cosine metric: 0.9656; tanh metric: 0.8533).
Dimensionality of HSC Features. A problem of the HSC feature is its high
dimensionality (see Section 3.2.5), which leads to long computation times, especially
for model training but also for prediction. Reducing the dimensions with Principal
Component Analysis (PCA) [Pear01], [Hote33] or the faster Probabilistic PCA [TiBi99]
was attempted, but as computing the principal components and rebuilding the features
is computationally expensive itself, these methods did not solve the problem of faster
training and prediction.
As the number of cells should stay the same as for HOG, the dimensionality of HSC can
only be reduced via the number of dictionary elements. The effect of dictionaries with
a varying number of elements on computation time and prediction quality is shown in
Figure 27.
Figure 27: Dictionary length vs. accuracy vs. time (cosine metric 0.955, 0.964, 0.964 and 0.966 for dictionary lengths 25, 64, 100 and 256, respectively).
With a dictionary of 256 elements, the cross-validation takes 60 minutes and the
cosine metric is 0.966. When the number of dictionary elements is reduced, a nearly
linear reduction of the computation time can be observed (red curve), while the
prediction quality remains almost constant down to 64 dictionary elements (blue
curve). In summary, reducing the dictionary length reduces the computation time
considerably at the cost of only a small decrease in quality.
When real-time computation is necessary, as would be the case for a driver assistance
system, the long computation time of the HSC feature with 256 elements is a
shortcoming; a smaller dictionary would be more appropriate in this scenario.
5.2.1.3 Comparison of HOG and HSC. In the preceding section it was mentioned
that HSC increases the number of perfectly predicted cars and reduces the number of
predictions with a large error. Table 12 quantifies these findings. Besides the
evaluation metrics and the computation time of the cross-validation process, the
percentage of predictions with a certain error is reported. A deviation between
prediction and ground truth of less than 0.33 is considered small, a deviation between
0.33 and 2.81 medium, and a deviation larger than 2.81 large.
Table 12: Comparison of HOG and HSC.

                 HOG       HSC (64)   HSC (256)
cosine metric    0.949     0.964      0.966
tanh metric      0.838     0.850      0.853
small errors     93.5 %    94.7 %     95.0 %
medium errors    2.3 %     2.5 %      2.3 %
large errors     4.2 %     2.8 %      2.7 %
comp. time       3.3 min   10.0 min   59.2 min
Two different HSC features are compared, one with 256 dictionary elements and one
with only 64. Both deliver better results than the HOG feature, though computing HOG
features is faster than either HSC variant.
5.2.2 Regression vs. Classification
As described in Section 4.1, the estimation of orientations can be formulated as a
regression or a classification task. This section gives results for both approaches as
well as for the proposed combined method. For the results reported here, HOG features
were used; however, the HSC features behave analogously.
5.2.2.1 Regression Results. First, results for two different regression models are
given: a regression tree as described in [BFSO84] and SVR as in [VaVa98].
A. Regression Tree. The tree is pruned to avoid an overly complex model: branches
are cut off if the information gain is below a certain threshold. The mean squared
error is used as the split criterion. Modifying this criterion to incorporate the
circular structure of angles is conceivable and could further improve this regression
approach.
Figure 28a shows the results of the regression tree. Most predictions are very good
with only a small error, yet another peak of badly predicted images is visible, and a
small number of wrongly predicted images is spread over the whole range of errors in
between.
Figure 28: Comparison of regression results. (a) Regression tree (cosine metric: 0.8364; tanh metric: 0.7258). (b) Support Vector Regression (cosine metric: 0.8299; tanh metric: 0.4993).
B. SVM Regression. For the SVR predictions, the LIBSVM library [ChLi11] is used with a linear kernel and the parameter configuration C = 1, ε = 0.1. The results of the SVR predictions are shown in Figure 28b. They differ markedly from the results obtained so far: the number of nearly perfectly predicted images is much smaller, but there are also far fewer predictions with a large error than for the regression tree. The two peaks visible for the regression tree are not present with SVR; instead, the number of predictions declines steadily across the range of errors.
When comparing these two methods based on the evaluation metrics described in Section 5.1.2, the values of the cosine metric are close, 0.836 versus 0.830. However, it can be argued that a model which predicts very well in most cases and fails only in a few is better than a model which makes erroneous predictions most of the time. This is revealed by the hyperbolic tangent metric, where the difference between the tree (0.726) and the SVR (0.499) is rather large. A detailed overview is given at the end of Section 5.2.2.3.
5.2.2.2 Classification Results. The range of orientation angles is divided into 16 classes of equal size, as was also done in [GeLU12] and [YBAL14]. Each class thus covers an angular range of 360°/16 = 22.5°. Using this classification approach means that only the 16 distinct class centers can be predicted. The maximal deviation between any possible orientation angle and the nearest class center is 22.5°/2 = 11.25°. The number of classes is set to 16 because this maximal deviation can be considered acceptable.
Dividing into even more classes could have the undesirable effect that there will be
classes with even fewer or no data points at all.
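The discretization described above can be sketched as follows; the angle convention [-π, π) is an assumption made for illustration:

```python
import math

N_CLASSES = 16
CLASS_WIDTH = 2 * math.pi / N_CLASSES   # 22.5 degrees in radians

def angle_to_class(angle):
    """Map an angle in [-pi, pi) to one of 16 equally sized classes."""
    return int((angle + math.pi) // CLASS_WIDTH) % N_CLASSES

def class_center(label):
    """Return the center angle of a class, again in [-pi, pi)."""
    return -math.pi + (label + 0.5) * CLASS_WIDTH

# Predicting only class centers caps the error at half a class width.
max_dev = CLASS_WIDTH / 2                # 11.25 degrees
```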
SVM Classification. For this multi-class classification problem, one-vs-one classifiers are used and the majority rule is applied to the individual predictions. Again, the LIBSVM library [ChLi11] is used to conduct the necessary computations. A linear kernel is found to work best. This can be explained by the fact that the feature vector is quite large, so a more complex kernel, such as a radial basis function (RBF) or polynomial kernel, is not capable of separating the data points any better. The cost parameter C of the linear kernel is subject to optimization and is tuned by a reduced grid search (see Figure 29).
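The majority rule over one-vs-one classifiers can be sketched independently of the SVM itself; `pairwise` below is a hypothetical stand-in for the trained binary classifiers:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, pairwise):
    """Majority vote over one-vs-one classifiers.

    pairwise[(i, j)] is a function returning the winning class
    (i or j) for sample x; every binary decision casts one vote."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[pairwise[(i, j)](x)] += 1
    return votes.most_common(1)[0][0]

# Toy setup with 3 classes where class 1 wins both of its duels.
clf = {(0, 1): lambda x: 1, (0, 2): lambda x: 0, (1, 2): lambda x: 1}
pred = one_vs_one_predict(None, [0, 1, 2], clf)
```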
[Figure: cosine metric (0.90–0.95) plotted against the cost parameter (0–2).]
Figure 29: Tuning the cost parameter for the linear kernel.
The results of the predictions are shown in Figure 30b. As in the regression case, there is a peak of nearly perfectly predicted images, but also another peak of very badly predicted images. The main difference is that in the classification case there are not as many incorrectly classified examples spread across the whole range of deviations. Also, the number of images in the second bin is, at about 1,200 compared to 400 for regression, considerably larger.
5.2.2.3 Combined Classification and Regression. As mentioned in Section 4.1.3, a combination of classification and regression may be beneficial if a certain relationship between these two approaches can be detected. For that purpose, a result figure with a smaller bin size for the errors is given in Figure 30. The results are the same; the smaller bin size only enables a more detailed view of the predictions.
The regression provides near-perfect predictions (deviation < 0.08) for around 2,500 images, whereas classification predicts only about 1,500 images perfectly (see Figure 30a). This can be explained by the fact that for the classification predictions only the 16
[Figure: histograms of the prediction error with a smaller bin size (number of images per deviation bin).]
(a) Regression results: Regression Tree (cosine metric: 0.8364, tanh metric: 0.7258).
(b) Classification results: SVM Classification (cosine metric: 0.9490, tanh metric: 0.8383).
Figure 30: Comparison of regression and classification.
class centers are available. Hence, some images cannot be predicted more accurately, because their angles lie between these centers and will always carry some error. For classification, the second and third bins in the figure, which consist of predictions with an acceptable error (see Figure 30b), are considerably larger than the corresponding bins for regression. Conversely, across the remaining error ranges, regression produces many badly predicted images whereas classification produces comparatively few.
To sum up, regression predicts very well most of the time; however, when it fails, the error can be anything from small to large. Classification predicts well in most cases, but when it fails, the error is most likely large. Based on this observation, the previously described method was developed, which incorporates the advantages of both approaches. The threshold θ, which decides when to use which model, was optimized; the best results were obtained by setting θ = 0.25.
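The exact combination rule is defined in Section 4.1.3 and not reproduced in this excerpt; the sketch below shows one plausible reading of it, in which the fine-grained regression output is trusted whenever it lies within θ of the predicted class center (the function names are illustrative):

```python
import math

def circ_dist(a, b):
    """Angular distance wrapped into [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def combined_predict(x, regress, classify, theta=0.25):
    """Hypothetical combination rule: keep the regression output when
    it agrees with the coarser but more robust classification result
    to within theta, otherwise fall back to the class center."""
    r = regress(x)
    c = classify(x)          # assumed to return a class-center angle
    return r if circ_dist(r, c) <= theta else c
```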
Figure 31 shows that combining the two approaches is beneficial. Firstly, the number of perfect predictions increased; even more images are now predicted perfectly than with regression alone. Secondly, the erroneous predictions spread over the whole spectrum of errors were reduced, because in those cases the classification model takes over. However, the number of badly predicted images was not reduced, because this is a shortcoming of both prediction approaches.

Another point that can be taken from the figures is that, while there is clearly a strong improvement, the cosine metric barely changes: from 0.949 when only applying SVM classification to 0.950 with the combined method. In the hyperbolic tangent metric, the change is more visible, from 0.838 to 0.870.
Table 13 gives an overview of the tested approaches and methods. SVR has the smallest
percentage of large errors but is exceeded by the other methods in all other categories.
[Figure: histogram of the prediction error for the combined method (number of images per deviation bin). Regression Tree + SVM Classification (cosine metric: 0.9504, tanh metric: 0.8698).]
Figure 31: Results of combined classification and regression.
The regression tree is the fastest method when it comes to learning the model and predicting. Comparing only the error percentages of the SVM classification and the combined method, one could infer that SVM classification is better; however, these percentages depend on where the boundaries are set. The evaluation metrics report better results for the combined method because they take into account that it has the largest number of perfect predictions (deviation < 0.08). The computation time of the combined method is understandably longer: it amounts to the sum of the computation times for the regression tree and the SVM classification, plus the overhead for deciding which model to use.
Table 13: Comparison of regression and classification methods.

                    Regr. Tree   SVR       SVM Classif.   Combined
    cosine metric   0.836        0.830     0.949          0.950
    tanh metric     0.726        0.499     0.838          0.870
    small errors    73.3 %       32.4 %    93.5 %         92.8 %
    medium errors   17.2 %       66.7 %    2.3 %          3.0 %
    large errors    9.5 %        0.9 %     4.2 %          4.2 %
    comp. time      0.7 min      7.7 min   3.3 min        4.3 min
5.2.3 Results for Classifier Modifications

This section reports the results for the classifier modifications introduced in Section 4.2. The first approach incorporates the circular structure of the classes by using weighted voting. The second method uses specialized classifiers which are trained to predict a class only under high certainty. The third approach uses a second layer of classifiers operating on probabilities from the one-vs-one classifiers
instead of doing a majority vote. Table 14 lists the results of the three modified
approaches and compares them to the application of the simple majority vote.
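As an illustration of the first modification, weighted voting over the circular class structure might look as follows; the exact weighting scheme of Section 4.2 is not reproduced in this excerpt, so the neighbour-crediting rule below is a hypothetical example:

```python
N = 16  # number of orientation classes

def weighted_vote(winners, n=N, decay=0.5):
    """Hypothetical circular weighted voting: each pairwise winner also
    credits its immediate neighbours on the class circle with a smaller
    weight, so near-misses between adjacent classes are less harmful."""
    scores = [0.0] * n
    for w in winners:
        scores[w] += 1.0
        scores[(w - 1) % n] += decay   # neighbouring classes on the
        scores[(w + 1) % n] += decay   # circle share in the vote
    return max(range(n), key=scores.__getitem__)
```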
Table 14: Comparison of classifier modifications.

                    Majority V.   Weighted V.   Spec. Classif.   Prob. Classif.
    cosine metric   0.9490        0.9489        0.9488           0.9211
    tanh metric     0.8383        0.8379        0.8381           0.7921
    small errors    93.5 %        93.5 %        93.5 %           85.5 %
    medium errors   2.3 %         2.3 %         2.3 %            8.7 %
    large errors    4.2 %         4.2 %         4.2 %            5.8 %
    comp. time      3.3 min       11.3 min      22.1 min         15.8 min
Weighted voting and the specialized classifiers produce nearly the same results as majority voting; the difference between the three methods is only visible in the fourth decimal place of the metrics. Weighted voting nevertheless has a considerably longer computation time than majority voting. This is because this method was implemented in MATLAB, whereas the majority voting is already implemented in the LIBSVM library, which is based on C++. The method with the specialized classifiers also has a long runtime because twice as many classifiers need to be trained. The results produced by the approach with a second layer of classifiers are not as good as those of the other three methods. Its runtime is also high because probabilities have to be calculated in the first layer and a second layer of classifiers has to be learned.
To sum up, despite majority voting being a simple rule, it produces slightly better results than all of the other methods. Incorporating information about the circular structure of angles, or extracting a more detailed view in the form of probabilities, did not lead to a quantifiable improvement of the overall results in this particular case.
5.3 Joint Object Detection and Orientation Estimation

Until now, the experiments were conducted using the cropped car images. To evaluate how well the developed orientation estimator works, it has to be compared to other models. This was done by participating in the so-called “Object Detection and Orientation Estimation Evaluation” provided by the KITTI Vision developers. For the evaluation, the test set, which comes without annotations, is used. The bounding box annotations and the orientation angles need to be created and then uploaded to the evaluation server. The server executes the evaluation automatically and reports the
results by comparing the submitted annotations with the ground truth which is not
publicly available.
To be able to submit results, a joint task has to be accomplished: firstly, an object detector needs to find all cars in the forefield images; secondly, the orientation angle of every found car has to be predicted by the orientation estimator. The process is depicted in Figure 32.
[Figure: an object detection model finds the cars in the scene; an orientation estimation model then predicts an angle for each of them (e.g. angle 1 = 1.96, angle 2 = 1.57, angle 3 = −1.57).]
Figure 32: Joint detection and orientation estimation.
The Average Orientation Similarity (AOS) reported by the evaluation server incorporates not only the quality of the orientation estimation but also the quality of the detection. The detection quality is an upper bound on the achievable score, so the AOS can never be higher than the detection metric. This means that a poor vehicle detector combined with an excellent orientation estimator will still result in a poor evaluation score; to achieve good results, both a good orientation estimator and a good detector are necessary.
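The per-detection orientation similarity underlying the AOS, as defined for the KITTI benchmark in [GeLU12], maps an angular error Δθ to (1 + cos Δθ)/2:

```python
import math

def orientation_similarity(delta):
    """Per-detection orientation similarity of the KITTI benchmark:
    1 for a perfect angle, 0 for the opposite direction."""
    return (1.0 + math.cos(delta)) / 2.0

# A false-positive detection contributes a similarity of 0 regardless
# of the orientation estimator, which is why the AOS is bounded above
# by the plain detection score.
```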
The approaches for this combined challenge can be differentiated into two categories:
1. Learning a joint model which detects a vehicle and, at the same time, predicts its orientation. This can be beneficial in that information generated while looking for a car can also be used for the orientation estimation.
2. A two-step approach where the detection is done first and the orientation esti-
mation model is built on top as the second step.
As this work focuses on orientation estimation, the second approach was chosen, where the orientation estimation works independently of the detection. First, an object detector was trained with the help of publicly available code from [OBTr15]. To obtain comparable results, in a second step, the exact detection results provided by the researchers of [OBTr15] were utilized.
5.3.1 Processing Pipeline
To create the submission results, the object detector first has to be trained, or alternatively a pre-trained model can be used. For the orientation estimator, the best model determined by cross-validation is chosen (see Section 5.2). The next step is to detect cars in the KITTI Vision test set with the object detector, which returns a bounding box and a detection score. The detection score expresses the confidence of the detection and is used to create the recall/precision curve as well as to calculate the evaluation score. By means of the bounding boxes, the cars are cropped from the test images and the feature vector (HOG or HSC) for each cropped image is computed, as was done for the training images. The vector is handed to the orientation estimator, which returns the orientation angle of the car. In summary, the processing pipeline looks as follows:
1. Train (or load a pre-trained) object detector,
2. Train orientation estimator,
3. For each forefield image in the test set:
(a) Detect cars in the forefield image (return bounding box and detection score),
(b) Crop cars from the forefield image,
(c) For each cropped car:
i. Generate feature vector,
ii. Predict the orientation of the car (return orientation angle).
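The steps above can be sketched as a small driver function; `detect_cars`, `compute_features` and `estimate_orientation` are hypothetical stand-ins for the detector, the HOG/HSC descriptor, and the orientation estimator:

```python
import numpy as np

def image_crop(image, box):
    """Cut a (x1, y1, x2, y2) bounding box out of an image array."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def process_test_set(images, detect_cars, compute_features,
                     estimate_orientation):
    """Hypothetical pipeline driver: detect, crop, describe, estimate."""
    results = []
    for image in images:
        for box, score in detect_cars(image):   # (bounding box, score)
            crop = image_crop(image, box)
            features = compute_features(crop)   # HOG or HSC vector
            angle = estimate_orientation(features)
            results.append((box, score, angle))
    return results

# Toy run with stand-in callables: one dummy image, one fixed detection.
toy = process_test_set(
    [np.zeros((10, 10))],
    detect_cars=lambda im: [((2, 2, 6, 6), 0.9)],
    compute_features=lambda crop: crop.shape,
    estimate_orientation=lambda f: 1.57,
)
```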
5.3.2 Submission Results
As mentioned earlier, two different detectors were used. Since the AOS depends on the object detection results, the results of two different detectors cannot be compared to each other; only submissions which used the same detector can be compared reliably. The AOS is reported for the three difficulty levels (easy, moderate, hard); however, the final ranking is based only on the score of the moderate case.
5.3.2.1 Trained Object Detector. The first two submissions used a detector which was trained as described in [OBTr15]. The orientation estimator uses HSC features with a dictionary of 64 elements. Quasi-duplicates of rear-view images were removed from the cropped car images. The prediction was done using multi-class SVM classification with one-vs-one classifiers and majority voting. In the training step of the orientation estimator for submission (A), only images from the “easy” subset were used. This subset contains neither truncated nor occluded cars, and the vehicle images must have a height of more than 40 pixels. For submission (B), partly truncated and partly occluded cars as well as cars with a minimal height of 25 pixels were also used. The results of both submissions are given in Table 15.
Table 15: Results of trained detector.

    Submission                     Moderate   Easy      Hard
    (A) Trained on easy set        68.82 %    76.36 %   54.87 %
    (B) Trained on moderate set    70.41 %    76.53 %   56.63 %
As the final ranking considers only the results for the moderate case, only this case is described here; for completeness, the results for the easy and hard cases are also reported in the tables. The results improved by 1.59 % when the training of the orientation estimator included the images from the moderate set, as was the case in submission (B). This means that exposing the model to more variation, in the sense that a car can also be partly occluded or truncated, led to an improvement in the overall results. Thus, the orientation estimator is less dependent on a perfect car image than before.
5.3.2.2 Provided Detection Results. As mentioned above, the exact detection results of [OBTr15] for the KITTI Vision test set were provided to make a comparison possible. Four submissions (C)–(F) were made using these detection results. The setup for submission (C) was the same as in submission (B), with the difference that instead of applying only classification, the combined classification and regression approach was used. For submission (D), mirrored images for small classes were also used. Submission (E) uses no mirrored images, and the removal of quasi-duplicates was also omitted. The final submission (F) used class weights to deal with the class imbalance algorithmically. The results of the submissions are reported in Table 16.
Table 16: Results with provided detection.

    Submission                          Moderate   Easy      Hard
    (C) Removal of quasi-duplicates     73.77 %    82.97 %   58.14 %
    (D) As (C) + mirrored images        73.67 %    83.00 %   57.99 %
    (E) No removal, no mirroring        73.95 %    83.07 %   58.29 %
    (F) As (E) + class weights          73.80 %    82.93 %   58.18 %
The improvement from submission (B) to submission (C) by 3.36 % is mainly caused by the improved detection results; as mentioned earlier, results obtained with different detectors are not directly comparable. However, it is likely that the combined classification and regression used in (C) also led to a small overall improvement. Adding mirrored images in submission (D) did not improve the overall results; this may be because the distribution of orientation angles differs between the training set (with mirrored images) and the test set (no mirrored images). For submission (E), no quasi-duplicates were removed and no mirrored images were added, so the distribution of orientation angles is the same in the training and the test set. This submission provided the best overall results. The introduction of class weights in submission (F), which makes it more likely to predict small classes, did not improve the overall results.
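One common way to derive such class weights (e.g. for LIBSVM's per-class weight option) is inverse class frequency; whether the thesis used exactly this scheme is not stated, so the sketch below is illustrative:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its share of the training
    data, so rare orientation classes are penalised more heavily when
    misclassified."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Class 1 is three times rarer than class 0, so it gets 3x the weight.
w = inverse_frequency_weights([0, 0, 0, 1])
```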
To conclude the experiments section, Table 17 compares the orientation estimator of [OBTr15] with the best submission produced in this work. Note that the two estimators use the same detection results and can therefore be compared directly. Other orientation estimators are not listed here but can be found on the web page of the challenge.4
Table 17: Comparison of orientation estimators.

    Rank   Submission                 Moderate   Easy      Hard      Runtime
    6      [OBTr14]                   74.42 %    83.41 %   58.83 %   0.7 s
    7      (E) Best own submission    73.95 %    83.07 %   58.29 %   5.5 s
The results of the competing model slightly exceed the best submission of this work by 0.47 %. The runtime, which is the average processing time per forefield image, is higher for the own submission. As the experiments were implemented in MATLAB, this does not come as a surprise; by switching to, for example, C++, a considerable acceleration of the processing could be achieved.

4http://www.cvlibs.net/datasets/kitti/eval_object.php [Accessed December 18, 2015]
CONCLUSION & FUTURE WORK 59
6 Conclusion & Future Work
In this work, a vision-based model was developed which is able to estimate orientations
of vehicles. The model can be plugged onto an object detector and only requires the
bounding box which contains the detected vehicle. Besides the bounding box no further
information is necessary to successfully estimate the vehicle's orientation. Thus, the model can be used together with any object detector, as was demonstrated by taking part in a prediction challenge. Good results were achieved without needing any additional information from the detector.
Extensive experiments with different prediction methods were conducted and reported. Multi-class classification with an SVM produced very good overall results. The suggested classifier modifications were not able to outperform the simple one-vs-one classifiers with majority voting. Furthermore, it was shown that a regression tree predicts many vehicles with a small error, but its overall results are not as good as those of SVM classification. A new approach was developed which improved the results by combining the advantages of SVM classification and a regression tree.
Two different image descriptors were used; both implicitly detect edges and aggregate them into histograms. It was shown that the HSC features perform better than the HOG features, especially when it comes to reducing large errors. However, HOG features are faster to compute than HSC features. To further speed up the computation of HSC features, the implementation can be parallelized and ported to a compiled language like C++.
Preliminary experiments for difficult lighting conditions were conducted. A further improvement of the results could be achieved by preprocessing over- and underexposed images. The preprocessing should ensure that all edges are clearly visible, as this would enable a better detection of edges and therefore make the utilized features more descriptive.
To reduce the number of large errors, future work could include the detection of tail- or headlights on a vehicle. This additional information could be used to rule out certain angles; for example, when taillights are detected, no front-view angle may be predicted. It should be noted that in a real-world scenario, redundant systems track the orientations of vehicles and sequential information is used. An erroneous prediction from the vision-based orientation estimator can then be detected and overruled by these other systems when their information deviates.
This work uses a dataset which consists only of daytime images. During the day, the edges of vehicles can be captured quite well in most cases. At night, a camera which captures only visible light will most likely not capture the edges in the same way. Therefore, a night-vision camera which records light in the infrared or ultraviolet spectrum could be of help. As no such datasets are available at the moment, an investigation of this is still pending.
REFERENCES xiii
References
[AhEB06] Michael Aharon, Michael Elad and Alfred Bruckstein. K-SVD: An Algo-
rithm for Designing Overcomplete Dictionaries for Sparse Representation.
Signal Processing, IEEE Transactions on, 54(11), 2006, pp. 4311–4322.
[Barb10] David Barber. Bayesian Reasoning And Machine Learning. Cambridge
University Press. 2010.
[BETV08] Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool. Speeded-
Up Robust Features (SURF). Computer Vision and Image Understanding,
110(3), 2008, pp. 346–359.
[BFSO84] Leo Breiman, Jerome Friedman, Charles J Stone and Richard A Olshen.
Classification and regression trees. CRC press, 1984.
[Bish06] Christopher Bishop. Pattern Recognition and Machine Learning. Informa-
tion Science and Statistics. Springer-Verlag, New York. 1 Edition, 2006.
[CBHK02] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, 2002, pp. 321–357.
[ChLi11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM : A Library for Support
Vector Machines. ACM Transactions on Intelligent Systems and Technology
(TIST), Vol. 2, 2011, pp. 1–39.
[CKZB15] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin
Ma, Sanja Fidler and Raquel Urtasun. 3D Object Proposals for Accurate
Object Class Detection. In NIPS, 2015.
[CoHa67] Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classifica-
tion. IEEE Transactions on Information Theory, 13(1), 1967, pp. 21–27.
[CoVa95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. In Machine
learning, Vol. 20. Springer, 1995, pp. 273–297.
[DaMa06] Hal Daume III and Daniel Marcu. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, Vol. 26, 2006, pp. 101–126.
[DARP04] DARPA (Defense Advanced Research Projects Agency). Final
Data from DARPA Grand Challenge. http://archive.darpa.
mil/grandchallenge04/media/final_data.pdf [Accessed December, 18
2015], 2004.
[DaTr05] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human
Detection. 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05), Vol. 1, 2005, pp. 886–893.
[DDSL09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei.
Imagenet: A large-scale hierarchical image database. Computer Vision
and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009,
pp. 248–255.
[DrHo03] Chris Drummond and R.C. Holte. C4.5, class imbalance, and cost sensitiv-
ity: why under-sampling beats over-sampling. Workshop on Learning from
Imbalanced Datasets II, 2003, pp. 1–8.
[Elka01] Charles Elkan. The foundations of cost-sensitive learning. International
joint conference on artificial intelligence, 2001.
[EnAH99] Kjersti Engan, Sven Ole Aase and John Hakon Husoy. Method of optimal
directions for frame design. 1999 IEEE International Conference on Acous-
tics, Speech, and Signal Processing. Proceedings. ICASSP99, Vol. 5, 1999,
pp. 2443–2446.
[FGMR10] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ra-
manan. Object detection with discriminatively trained part-based models.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9),
2010, pp. 1627–1645.
[FrSc97] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization
of On-Line Learning and an Application to Boosting. Journal of Computer
and System Sciences, 55(1), 1997, pp. 119–139.
[FSZC99] Wei Fan, Sj Stolfo, J Zhang and Pk Chan. AdaCost: Misclassification Cost-
Sensitive Boosting. ICML ’99 Proceedings of the Sixteenth International
Conference on Machine Learning, 1999, pp. 97–105.
[Fuku80] Kunihiko Fukushima. Neocognitron: A self-organizing neural network
model for a mechanism of pattern recognition unaffected by shift in po-
sition. Biological Cybernetics, 36(4), 1980, pp. 193–202.
[GeLU12] Andreas Geiger, Philip Lenz and Raquel Urtasun. Are we ready for Au-
tonomous Driving? The KITTI Vision Benchmark Suite. IEEE Conference
on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
[GeWU11] Andreas Geiger, Christian Wojek and Raquel Urtasun. Joint 3D Estimation
of Objects and Scene Layout. Advances in Neural Information Processing
Systems (NIPS), 2011, pp. 1–9.
[Girs15] Ross Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, to appear
in ICCV 2015, 2015.
[GLOH11] Michael Gabb, Otto Lohlein, Matthias Oberlander and Gunther Heide-
mann. Efficient monocular vehicle orientation estimation using a tree-based
classifier. 2011 IEEE Intelligent Vehicles Symposium (IV), 2011, pp. 308–
313.
[Hars15] Alexander Hars. Key players in the driverless market. http://www.
driverless-future.com/?page_id=155 [Accessed December 18, 2015],
2015.
[HaTF09] Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements
of Statistical Learning. Springer Series in Statistics. Springer, Berlin. 2
Edition, 2009.
[Hawk04] Douglas M. Hawkins. The Problem of Overfitting. Journal of Chemical
Information and Computer Sciences, 44(1), 2004, pp. 1–12.
[HiOT06] Geoffrey E. Hinton, Simon Osindero and Yee Whye Teh. A fast learning
algorithm for deep belief nets. Neural computation, 18(7), 2006, pp. 1527–
54.
[Hote33] Harold Hotelling. Analysis of a complex of statistical variables into principal
components. In Journal of educational psychology, Vol. 24. Warwick & York,
1933, p. 417.
[HsLi02] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass
support vector machines. Neural Networks, IEEE Transactions on, 13(2),
2002, pp. 415–425.
[Hsu15] Jeremy Hsu. Autonomous Car Sets Record in Mexico. http://spectrum.
ieee.org/cars-that-think/transportation/self-driving/
driverless-car-sets-autonomous-driving-record-in-mexico [Ac-
cessed December 18, 2015], 2015.
[ITSA14] Intelligent Transportation Society of America ITSA. Accelerating Sustain-
ability: Demonstrating the Benefits of Transportation Technology. 2014.
[Japk00] Nathalie Japkowicz. The Class Imbalance Problem: Significance and Strate-
gies. Proceedings of the 2000 International Conference on Artificial Intelli-
gence, 2000, pp. 111–117.
[KoAl01] Aleksander Kolcz and Joshua Alspector. SVM-based Filtering of E-mail
Spam with Content-specific Misclassification Costs. Proceedings of the
TextDM’01 Workshop on Text Mining - held at the 2001 IEEE Interna-
tional Conference on Data Mining, 2001, pp. 1–14.
[Koha95] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (IJCAI), Vol. 14, 1995, pp. 1137–1145.
[KrSH12] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Clas-
sification with Deep Convolutional Neural Networks. Advances In Neural
Information Processing Systems, 2012, pp. 1–9.
[KuMa97] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced
training sets: one-sided selection. In ICML, Vol. 97, 1997, pp. 179–186.
[LaSu97] Louisa Lam and Ching Y. Suen. Application of majority voting to pattern
recognition: an analysis of its behavior and performance. IEEE Transac-
tions on Systems Man and Cybernetics Part A Systems and Humans, 27(5),
1997, pp. 553–568.
[Lind98] Tony Lindeberg. Feature Detection with Automatic Scale Selection. International Journal of Computer Vision, 30(2), 1998, pp. 79–116.
[LiWZ14] Bo Li, Tianfu Wu and Song-Chun Zhu. Integrating context and occlusion
for car detection by hierarchical and-or model. In Computer Vision–ECCV
2014. Springer, 2014, pp. 652–667.
[LiZh05] Yi Liu and Yuan F. Zheng. One-against-all multi-class SVM classifica-
tion using reliability measures. In Neural Networks, 2005. IJCNN ’05.
Proceedings. 2005 IEEE International Joint Conference on, Vol. 2, 2005,
pp. 849–854.
[LiZh06] Xu-Ying Liu and Zhi-Hua Zhou. The Influence of Class Imbalance on Cost-
Sensitive Learning: An Empirical Study. Sixth International Conference
on Data Mining (ICDM’06), 2006, pp. 970–974.
[Lowe99] David G. Lowe. Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, 1999, pp. 1150–1157.
[MaRS08] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press. 2008.
[MaSn13] Kevin Matzen and Noah Snavely. NYC3DCars: A Dataset of 3D Vehicles
in Geographic Context. 2013 IEEE International Conference on Computer
Vision, 2013, pp. 761–768.
[MBPS09] Julien Mairal, Francis Bach, Jean Ponce and Guillermo Sapiro. Online
dictionary learning for sparse coding. Proceedings of the 26th International
Conference on Machine Learning, 2009, pp. 1–8.
[MoPV12] Douglas C Montgomery, Elizabeth A Peck and G Geoffrey Vining. Intro-
duction to Linear Regression Analysis, Vol. 821. John Wiley & Sons. 2012.
[NPSB10] Jesus Nuevo, Ignacio Parra, Jonas Sjoberg and Luis M. Bergasa. Estimating
surrounding vehicles’ pose using computer vision. 13th International IEEE
Conference on Intelligent Transportation Systems, 2010, pp. 1863–1868.
[OBTr14] Eshed Ohn-Bar and Mohan M. Trivedi. Fast and Robust Object Detection
Using Visual Subcategories. 2014 IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2014, pp. 179–184.
[OBTr15] Eshed Ohn-Bar and Mohan M. Trivedi. Learning to Detect Vehicles by
Clustering Appearance Patterns. In Intelligent Transportation Systems,
IEEE Transactions on, Vol. 16, 2015, pp. 2511–2521.
[OlFi97] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete
basis set: A strategy employed by V1? Vision Research, 37(23), 1997,
pp. 3311–3325.
[Pear01] Karl Pearson. On lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal
of Science, 2(11), 1901, pp. 559–572.
[PSGS13] Bojan Pepikj, Michael Stark, Peter Gehler and Bernt Schiele. Occlusion
Patterns for Object Class Detection. 2013 IEEE Conference on Computer
Vision and Pattern Recognition, 2013, pp. 3286–3293.
[PSGS15] Bojan Pepikj, Michael Stark, Peter Gehler and Bernt Schiele. Multi-view
and 3D Deformable Part Models. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 37(11), 2015, pp. 2232–2245.
[Quin93] J. Ross Quinlan. C4.5: programs for machine learning. Elsevier, 1993.
[ReRa13] Xiaofeng Ren and Deva Ramanan. Histograms of Sparse Codes for Ob-
ject Detection. 2013 IEEE Conference on Computer Vision and Pattern
Recognition, 2013, pp. 3246–3253.
[RHMH10] Paul E. Rybski, Daniel Huber, Daniel D. Morris and Regis Hoffman. Visual
classification of coarse vehicle orientation using histogram of oriented gra-
dients features. IEEE Intelligent Vehicles Symposium, Proceedings, 2010,
pp. 921–928.
[Sing15] Santokh Singh. Critical reasons for crashes investigated in the National
Motor Vehicle Crash Causation Survey. National Highway Traffic Safety
Administration, (February), 2015, pp. 1–2.
[SiTr13] Sayanan Sivaraman and Mohan Manubhai Trivedi. Looking at Vehicles on
the Road: A Survey of Vision-Based Vehicle Detection, Tracking, and Be-
havior Analysis. IEEE Transactions on Intelligent Transportation Systems,
14(4), 2013, pp. 1773–1795.
[SKWW07] Yanmin Sun, Mohamed S. Kamel, Andrew K.C. Wong and Yang Wang.
Cost-sensitive boosting for classification of imbalanced data. Pattern Recog-
nition, 40(12), 2007, pp. 3358–3378.
[SLJS15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Ra-
binovich. Going Deeper with Convolutions. Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[SmSc04] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector re-
gression. In Statistics and Computing, Vol. 14. Springer, 2004, pp. 199–222.
[SmVi08] Alex Smola and S. V. N. Vishwanathan. Introduction to Machine Learning.
Cambridge University Press. 2008.
[TiBi99] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal
component analysis. Journal of the Royal Statistical Society, 61(3), 1999,
pp. 611–622.
[Ting02] Kai Ming Ting. An instance-weighting method to induce cost-sensitive
trees. IEEE Transactions on Knowledge and Data Engineering, 14(3), 2002,
pp. 659–665.
[TrKa98] Anne M. Treisman and Nancy G. Kanwisher. Perceiving visually presented
objects: Recognition, awareness, and modularity. Current Opinion in Neu-
robiology, 8(2), 1998, pp. 218–226.
[Vapn63] Vladimir Vapnik. Pattern recognition using generalized portrait method.
In Automation and remote control, Vol. 24, 1963, pp. 774–780.
[VaVa98] Vladimir N. Vapnik. Statistical Learning Theory, Vol. 1. Wiley, New York,
1998.
[ViJo01] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of
simple features. Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2001, pp. 511–
518.
[WaHY09] Xiaoyu Wang, Tony X. Han and Shuicheng Yan. An HOG-LBP human
detector with partial occlusion handling. Computer Vision, 2009 IEEE
12th International Conference on, 2009, pp. 32–39.
[WiDe11] Rob G. J. Wijnhoven and Peter H. N. De With. Unsupervised sub-
categorization for object detection: Finding cars from a driving vehicle.
Proceedings of the IEEE International Conference on Computer Vision,
2011, pp. 2077–2083.
[Worl15] World Health Organization. Global Status Report on Road Safety. 2015.
[XCLS15] Yu Xiang, Wongun Choi, Yuanqing Lin and Silvio Savarese. Data-Driven
3D Voxel Patterns for Object Category Recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2015,
pp. 1903–1911.
[XiMS14] Yu Xiang, Roozbeh Mottaghi and Silvio Savarese. Beyond PASCAL: A
benchmark for 3D object detection in the wild. IEEE Winter Conference
on Applications of Computer Vision, 2014, pp. 75–82.
[YBAL14] J. Javier Yebes, Luis M. Bergasa, Roberto Arroyo and Alberto Lazaro.
Supervised learning and evaluation of KITTI’s cars detector with DPM.
IEEE Intelligent Vehicles Symposium, Proceedings, 2014, pp. 768–773.
[YeBG15] J. Javier Yebes, Luis M. Bergasa and Miguel Ángel García-Garrido. Visual Object
Recognition with 3D-Aware Features in KITTI Urban Scenes. Sensors,
15(4), 2015, pp. 9228–9250.
[YTAS11] Quan Yuan, Ashwin Thangali, Vitaly Ablavsky and Stan Sclaroff. Learn-
ing a family of detectors via multiplicative kernels. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 33(3), 2011, pp. 514–530.
[ZhLi10] Qiang Zhang and Baoxin Li. Discriminative K-SVD for dictionary learning
in face recognition. Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2010, pp. 2691–2698.
[ZYCA06] Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng and Shai Avidan. Fast
Human Detection Using a Cascade of Histograms of Oriented Gradients.
Computer Vision and Pattern Recognition, 2006 IEEE Computer Society
Conference on, Vol. 2, 2006, pp. 1491–1498.