Institut für Angewandte Informatik
und Formale Beschreibungsverfahren

Estimation of Vehicle Orientations using
Histogram-based Image Descriptors

Master's thesis
by
Steffen Kirres
at the Fakultät für Wirtschaftswissenschaften
in the degree program Informationswirtschaft

submitted on January 14, 2016 to the
Institut für Angewandte Informatik
und Formale Beschreibungsverfahren
of the Karlsruher Institut für Technologie

Reviewer: Prof. Dr. Rudi Studer
Advisor: Prof. Dr. Brahim Chaib-draa

KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association
Declaration of Authorship

I hereby truthfully declare that I have written this thesis and all of its parts independently, that I have fully and precisely indicated all aids used, and that I have marked everything that was taken from the work of others, whether unchanged or modified.

Karlsruhe, January 14, 2016                                    Steffen Kirres
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Organization

2 Related Work
  2.1 Machine Learning
    2.1.1 Learning Styles
    2.1.2 Prediction Tasks
    2.1.3 Algorithms
  2.2 Driver Assistance Systems
  2.3 Orientations of Vehicles
  2.4 Survey on Orientation Estimators for Vehicles

3 Histogram-based Image Descriptors
  3.1 Histograms of Oriented Gradients
    3.1.1 Image Gradients
    3.1.2 Histogram Calculation
  3.2 Histograms of Sparse Codes (HSC)
    3.2.1 Patch Preprocessing
    3.2.2 Dictionary Learning
    3.2.3 Sparse Reconstruction
    3.2.4 Feature Construction
    3.2.5 Dimensionality of HSC

4 Estimation of Vehicle Orientations
  4.1 Problem Formulation
    4.1.1 Regression
    4.1.2 Multi-Class Classification
    4.1.3 Combined Classification and Regression
  4.2 Classifier Modifications
    4.2.1 Weighted Voting
    4.2.2 Specialized Classifiers
    4.2.3 Second Layer Probability Classification
  4.3 Class Imbalance
    4.3.1 Under-sampling Data Points
    4.3.2 Over-sampling Data Points
    4.3.3 Mirroring Images
    4.3.4 Changing the Value Distribution

5 Experiments
  5.1 Datasets
    5.1.1 KITTI Vision Dataset
    5.1.2 Evaluation Metrics
  5.2 Orientation Estimation
    5.2.1 Histogram-based Features
    5.2.2 Regression vs. Classification
    5.2.3 Results for Classifier Modifications
  5.3 Joint Object Detection and Orientation Estimation
    5.3.1 Processing Pipeline
    5.3.2 Submission Results

6 Conclusion & Future Work

References
List of Figures

1 Detected vehicles from forefield image.
2 Pascal3D vehicle pose definition.
3 Gradient visualization.
4 Example image with associated histogram of gradients.
5 Car image with associated HOG per cell.
6 Sliding through cells to create blocks.
7 Patch preprocessing steps.
8 Learned dictionary of frequent patches.
9 Reconstruction with dictionary elements.
10 Sliding through the image to extract image patches.
11 Construction of the HSC feature.
12 Factors deciding the feature length of HSC.
13 Angle representations.
14 Angles divided into 16 classes.
15 Process of combined method.
16 Images per class.
17 Quasi-duplicates of consecutive frames.
18 Image mirroring.
19 Sensors on recording car [GeLU12].
20 Exemplary forefield images.
21 Evaluation functions.
22 Cropping cars using the provided annotations.
23 HOG results.
24 Examples of badly predicted images.
25 Neighborhood of badly predicted image.
26 HSC results with 256 dictionary elements.
27 Dictionary length vs. accuracy vs. time.
28 Comparison of regression results.
29 Tuning the cost parameter for the linear kernel.
30 Comparison of regression and classification.
31 Results of combined classification and regression.
32 Joint detection and orientation estimation.
List of Tables

1 Confusion matrix.
2 Responses of one-vs-one classifiers.
3 Multi-Class approaches.
4 Explanation of dictionary learning variables.
5 Exemplary angles and orientations.
6 Example for weight matrix.
7 Responses of six-class problem.
8 Responses of specialized classifiers.
9 Probability predictions of first layer classifiers.
10 Characteristics of methods to balance classes.
11 Difficulties of the vehicle images.
12 Comparison of HOG and HSC.
13 Comparison of regression and classification methods.
14 Comparison of classifier modifications.
15 Results of trained detector.
16 Results with provided detection.
17 Comparison of orientation estimators.
1 Introduction
Researchers have worked on the development of autonomous cars for many years. One of the first larger events in this field was the DARPA Grand Challenge in 2004 [DARP04]. None of the teams finished the race, but since then a lot of progress has been made. Nowadays, many researchers from academia, traditional car makers and IT companies are working on autonomous cars [Hars15]. Recently, success stories from various researchers drew public interest to these topics (for instance [Hsu15]). On the one hand, autonomous cars driving interconnected with other autonomous cars can benefit the environment because they could lead to a more constant flow of traffic, thereby preventing jams and reducing carbon emissions [ITSA14]. On the other hand, a main cause of accidents is human failure [Sing15]. The reasons for failure range from fatigue and alcohol consumption to distraction caused by the use of electronic devices, which poses a growing problem [Worl15]. Autonomous cars do not get tired or lose focus and can hence be valuable in reducing the number of accidents.
An autonomous vehicle typically consists of a multitude of subsystems and is equipped with different sensors such as cameras or radars. Each subsystem should be able to work independently and deliver reliable information. To ensure a fail-safe system – even in case of a failure of one subsystem – the information from the individual subsystems should be fused together. The redundant information can thus enable a reliably working system.
1.1 Motivation
One of these subsystems must be aware of the surrounding cars at any given time. Not only is the current position of importance; the system should also be able to predict the positions of the surrounding cars in the next moment. Knowing where surrounding cars will be located is useful information not only in the context of autonomous cars. If the technology is embedded in a braking assistance system, it can inform the human driver of potentially dangerous situations.

Nowadays, braking assistants detect other vehicles through distance sensors. These sensors, however, are directed to the front only and warn the driver in case a vehicle is too close in the direct forefield of the car. Vehicles which are not yet directly in front of another vehicle but are about to cross this vehicle's driving lane are not detected. For instance, a car which is changing lanes in the immediate forefield might be detected too late by a conventional braking assistant, potentially resulting in an accident.

A braking assistant which can predict the future position of a vehicle is able to warn the driver a split second earlier, giving the human driver or the autonomous system enough time to hit the brakes in order to prevent a crash. Employing such a braking assistant can reduce the risks which arise from unforeseen lane changes. This kind of braking assistant needs to incorporate the orientation as well as the velocity of and distance to the surrounding vehicles.
1.2 Objectives
The goal of this thesis is to develop a model which can reliably predict the orientation of a vehicle. This model is vision-based and uses only color images. In this context, different image descriptors and prediction approaches shall be investigated, compared and improved.

Experiments on an existing data set shall show that the orientation estimation works not only for cropped car images but also in conjunction with any independent object detector. Such a detector only has to inform the orientation estimator about the location of a vehicle in a recorded image.
1.3 Thesis Organization
Section 2 gives an overview of the subject and includes a survey on previous work on the estimation of vehicle orientations. In Section 3, the fundamentals of the two image descriptors which are employed throughout this thesis are explained. Different approaches for predicting vehicle orientations are given in Section 4. Problems arising from the used data set as well as modifications of existing classification approaches are described in detail in that section as well. Section 5 reports the extensive experiments which were conducted for this work and also gives a detailed description of the data set. Section 6 concludes the work with a summary of the results and some remarks on future work.
2 Related Work
This section gives a brief overview of several related topics. Firstly, the terminology and different methods in machine learning are introduced. Secondly, it is discussed which information on surrounding cars is useful for driver assistance systems. Furthermore, two ways of describing the orientation of a vehicle are given and a survey on orientation estimators for vehicles is conducted.
2.1 Machine Learning
As this work uses machine learning techniques, this section briefly describes the different learning styles, tasks and approaches related to this topic. The focus is only on those parts which are used in the remainder of this work.
2.1.1 Learning Styles
In machine learning, problems can be differentiated according to the style of learning which is used to solve a certain problem. The differentiation can be made according to whether a human is involved in the learning process or not.
2.1.1.1 Supervised Learning. In this learning style, a supervisor defines classes and labels of training data, which is why it is called supervised [MaRS08]. With the accordingly labeled training data, a model shall be learned which can make accurate predictions [Barb10]. It is of particular interest that this model predicts unseen data well.
The typical prediction tasks which belong to this style are classification and regression
(see Section 2.1.2).
2.1.1.2 Unsupervised Learning. A different learning style, which focuses on finding an accurate and compact description of data, is unsupervised learning [Barb10]. For this, no labeled data is used, nor is it necessary, because the goal is to find out how the data is organized or clustered [HaTF09].

According to [MaRS08], the most common form of unsupervised learning is clustering. In this task, areas with an accumulation of data points need to be identified, so that the points can be grouped into clusters.
2.1.2 Prediction Tasks
Another way to differentiate is to consider what kind of task is solved to reach a certain
goal. In this work, a model has to be learned which can predict a value given an unseen
data point. Therefore only classification and regression tasks are described here. The
main difference between these two tasks lies in the output they produce.
2.1.2.1 Classification. In a classification task, the learned model predicts a discrete value or a class [Barb10], and the training examples are labeled with such a discrete value or class. If the labels only contain two distinct classes, it is referred to as binary classification [SmVi08]. The prediction then only answers whether a data point belongs to one class or the other. If more than two classes exist, it is referred to as multi-class classification [SmVi08] (see Section 2.1.2.2).

The prediction of a classifier can be either correct or incorrect, but the incorrect predictions can further be differentiated into false positives and false negatives. This is typically illustrated in a confusion matrix (see Table 1). In the depicted example, a false positive occurs when a data point actually belongs to class B but is predicted as an instance of class A. In the case that a data point belongs to class A but is predicted as class B, it is called a false negative. The correct predictions are differentiated into true positives and true negatives.
Table 1: Confusion matrix.

                            Truth
                            Class A            Class B
  Prediction   Class A      true positives     false positives
               Class B      false negatives    true negatives
This differentiation is necessary because in some cases a false negative can be a lot more severe than a false positive. This can be explained by means of an earthquake warning system. The outcome of predicting "no earthquake" when in fact an earthquake occurred is a lot worse than predicting an earthquake when in reality there has not been one. The first case is a false negative, the second one a false positive. When a false negative occurs, people would not take the necessary precautions, which could have fatal consequences for them. False positives do not have this kind of bad outcome in this particular scenario.

It is possible to associate costs with false negatives and false positives when learning a classifier. This is referred to as misclassification cost [Elka01] and has been studied for different models: [FSZC99] for AdaBoost, [KoAl01] for SVM and [Ting02] for decision trees. In the earthquake scenario, the cost associated with false negatives should be a lot higher than for false positives. This results in a more conservative classifier which is more likely to predict false positives; however, the number of false negatives is reduced, ideally to zero. This property of conservatism is used in Section 4.2.2.
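The expected-cost reasoning behind such asymmetric costs can be made concrete with a small sketch (a minimal illustration, not from this thesis; the cost values are hypothetical):

```python
def expected_cost_decision(p_event, cost_fn, cost_fp):
    """Return True (raise the alarm) iff the expected cost of staying
    silent exceeds the expected cost of a (possibly false) alarm.

    p_event: estimated probability that the event (e.g. an earthquake) occurs.
    cost_fn: cost of a false negative (missed event).
    cost_fp: cost of a false positive (needless alarm).
    """
    # Expected cost of predicting "no event": p_event * cost_fn
    # Expected cost of predicting "event":    (1 - p_event) * cost_fp
    return p_event * cost_fn > (1 - p_event) * cost_fp

# With symmetric costs the usual 0.5 threshold is recovered; if a false
# negative is 100x as costly, even a 5% risk already triggers the alarm.
print(expected_cost_decision(0.05, cost_fn=1, cost_fp=1))    # False
print(expected_cost_decision(0.05, cost_fn=100, cost_fp=1))  # True
```

The comparison is equivalent to thresholding the probability at cost_fp / (cost_fp + cost_fn), which is exactly how misclassification costs make a classifier more conservative.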
2.1.2.2 Multi-Class Classification. In the presence of more than two classes, the aforementioned classification task can be extended by using multiple binary classifiers. Multi-class classification can be achieved by methods like one-vs-one or one-vs-rest. Some notes on these methods follow in this section; a detailed investigation of these methods is not conducted in this work but can be found in [HsLi02].
A. The one-vs-rest Approach. In this approach, all data points are divided into two sets: one set contains only the data points from a single class, while the other contains the data points from all remaining classes [Bish06]. These two sets are considered as the new two classes. Then, a classifier is learned which is able to differentiate between these two classes, and the response of the classifier is a confidence value. This value tells how certain the classifier is that the predicted class was seen.

This division is repeated so that every class is once the single-class set. Assuming that the classes 1, 2 and 3 are present, three classifiers have to be learned: 1-vs-(2 and 3), 2-vs-(1 and 3) and 3-vs-(1 and 2). The final prediction can be made by using the prediction with the highest confidence value.
A problem which can arise from this approach is that when grouping classes together, the amount of data points in the grouped class can be a lot larger than the amount of data points in the single class. This results in class imbalance, which is also addressed in Section 4.3. Furthermore, the confidence values of the individual classifiers are scaled differently and are not directly comparable, which can be a drawback. [LiZh05] introduces a method which can overcome this issue.
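The label construction and the highest-confidence decision rule can be sketched as follows (a minimal illustration with made-up confidence values, not code from this thesis):

```python
def one_vs_rest_labels(y, positive_class):
    """Relabel a multi-class label list for one binary one-vs-rest
    classifier: +1 for the chosen class, -1 for all remaining classes."""
    return [1 if label == positive_class else -1 for label in y]

def one_vs_rest_predict(confidences):
    """Pick the class whose one-vs-rest classifier is most confident.
    confidences: dict mapping class -> confidence value."""
    return max(confidences, key=confidences.get)

# Three classes -> three binary relabelings; here the one for class 2:
y = [1, 2, 3, 2, 1]
print(one_vs_rest_labels(y, 2))  # [-1, 1, -1, 1, -1]

# Made-up confidence values of the three classifiers for one test point:
print(one_vs_rest_predict({1: 0.2, 2: 0.9, 3: -0.1}))  # 2
```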
B. The one-vs-one Approach. For this approach, only a subset of the complete dataset is considered in the learning stage. This subset contains all data points from only two classes, so it can be treated as a binary classification problem. A classifier is learned which can discriminate between these two classes. This procedure is repeated for all possible combinations of class pairs. As a consequence, for the one-vs-one approach k(k − 1)/2 classifiers have to be learned, where k is the amount of different classes. Assuming that four classes exist, thus k = 4, it follows that 6 classifiers need to be learned.
When a prediction has to be made with the one-vs-one approach, the new data point is handed to every single classifier created earlier. The responses of the classifiers can be taken from Table 2; for instance, the 1-vs-2 classifier responds that an example of class 2 has been detected. The final decision, which class to use, can be made by the majority rule described in the following section.
Table 2: Responses of one-vs-one classifiers.

                  Class
              1     2     3     4
  Class  1    -     2     3     4
         2    -     -     2     2
         3    -     -     -     4
         4    -     -     -     -
C. Majority Voting. This is a popular approach, and not only in democratic procedures. Even though it is a simple rule, it has been shown that good results can be achieved with it [LaSu97]. The rule says that the class which has accumulated the most votes is chosen as the final prediction. In the example given in Table 2, class 2 would be chosen as the final prediction because it has three votes. Class 1 has no votes, class 3 has only one vote and class 4 has two votes.
In the case that two or more classes have the same amount of votes, the class with
the smallest index can be taken. Instead of applying the smallest index rule, it is also
conceivable to choose one class among the classes with the most votes at random.
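The majority rule with the smallest-index tie-break can be sketched as follows, using the classifier responses from Table 2 as votes (a minimal illustration, not code from this thesis):

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(votes):
    """votes: class labels returned by the k(k-1)/2 pairwise classifiers.
    Applies majority voting; ties are broken by the smallest class index."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)

# k = 4 classes -> 6 pairwise classifiers:
print(len(list(combinations([1, 2, 3, 4], 2))))  # 6

# Responses from Table 2: (1,2)->2, (1,3)->3, (1,4)->4, (2,3)->2, (2,4)->2, (3,4)->4
print(one_vs_one_predict([2, 3, 4, 2, 2, 4]))  # 2

# Tie between classes 1 and 3 -> the smallest index wins:
print(one_vs_one_predict([1, 3, 1, 3, 2, 4]))  # 1
```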
Table 3 gives an overview of the explained approaches for Multi-Class Classification
problems.
Table 3: Multi-Class approaches.

  Approach       Classifiers     Decision Rule
  One-vs-rest    k               Highest confidence value
  One-vs-one     k(k − 1)/2      Majority rule plus smallest index
2.1.2.3 Regression. The goal of regression is to predict a continuous value [SmVi08], whereas classification predicts a discrete or categorical value. The regression model is learned based on labeled training data, so it is also a supervised technique. The learned regression model takes a vector as input and outputs one value. This value is the prediction of the model.
Different approaches exist to learn such a model. In statistics, a very common way to determine a regression function is to fit a linear regression model with the least squares method [MoPV12]. Other popular approaches in the area of machine learning are described in the following section.
2.1.3 Algorithms
For classification and regression tasks, a plethora of different approaches and algorithms exists. Here, only those which are used in the experimental part of this work are briefly introduced.
2.1.3.1 Support Vector Machines (SVM). This method dates back to work from the sixties, described in [Vapn63]. With an SVM, a decision boundary, or hyperplane, can be learned which separates data points. To ensure that this boundary generalizes well to unseen data, [Vapn63] suggests that the margin between the boundary and the closest data points shall be maximized.
The problem formulation for a binary classification problem as described in [CoVa95] can be taken from Formula (1). The x_i are the feature vectors, the y_i contain the class assignments, and the ξ_i are slack variables. Slack variables are necessary to make the problem solvable when the dataset is not linearly separable, which is often the case. ξ_i corresponds to the violation of the i-th constraint.
\[
\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \dots, l.
\end{aligned}
\tag{1}
\]
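An intuition for Formula (1) can be gained by minimizing the equivalent hinge-loss objective 1/2‖w‖² + C Σ max(0, 1 − y_i(wᵀx_i + b)) with stochastic subgradient descent. The following sketch does exactly that on a toy dataset; it is an illustration of the objective only, not the solver used in this thesis, and the learning rate and epoch count are arbitrary choices:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Stochastic subgradient descent on the primal SVM objective
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w.x_i + b))."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point violates the margin: hinge term is active
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # only the regularizer contributes
                w = [wj - lr * wj for wj in w]
    return w, b

# Toy linearly separable data in 2D
X = [[2.0, 2.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -2.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
print([1 if s > 0 else -1 for s in scores])  # [1, 1, -1, -1]
```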
2.1.3.2 Support Vector Regression (SVR). As an extension of Support Vector Machines (SVM), SVR is used for solving regression problems. [VaVa98] introduced a version of SVR which can be taken from Formula (2). The notation is similar to the one used before for SVM; however, for SVR the z_i contain the target values, which are continuous.
\[
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^* \\
\text{subject to} \quad & w^T \phi(x_i) + b - z_i \le \varepsilon + \xi_i, \\
& z_i - w^T \phi(x_i) - b \le \varepsilon + \xi_i^*, \\
& \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \dots, l.
\end{aligned}
\tag{2}
\]
The goal of the minimization problem consists of two parts. One is the minimization of the sum of constraint violations, which is multiplied by a regularization constant C. The second goal is to minimize the quadratic norm w^T w, which, together with the other constraints, corresponds to maximizing the smallest distance to the hyperplane. Solving the above stated problem should find a function, or the support vectors which span the function. All the points should lie in the ε-area above or underneath the function [SmVi08]. This is expressed by the first two constraints. The main difference between SVM and SVR is that for SVM a function is learned which separates two classes, whereas for SVR a function is learned which, in the best case, contains all the data points in its ε-area [SmSc04].
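Analogously to the SVM case, Formula (2) can be approached by stochastic subgradient descent on the ε-insensitive loss 1/2‖w‖² + C Σ max(0, |wᵀx_i + b − z_i| − ε). The following toy example fits a linear function; again an illustration only, with arbitrarily chosen hyperparameters:

```python
def train_linear_svr(X, z, C=10.0, eps=0.1, lr=0.002, epochs=800):
    """Stochastic subgradient descent on the primal SVR objective
    0.5*||w||^2 + C * sum_i max(0, |w.x_i + b - z_i| - eps)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, zi in zip(X, z):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - zi
            # Subgradient of the epsilon-insensitive loss: zero inside
            # the eps-tube, +/-1 outside of it.
            g = 1.0 if err > eps else (-1.0 if err < -eps else 0.0)
            w = [wj - lr * (wj + C * g * xj) for wj, xj in zip(w, xi)]
            b -= lr * C * g
    return w, b

# Fit the noiseless relation z = 2 * x
X = [[0.0], [1.0], [2.0], [3.0]]
z = [0.0, 2.0, 4.0, 6.0]
w, b = train_linear_svr(X, z)
prediction = w[0] * 2.0 + b
print(abs(prediction - 4.0) < 0.5)  # True
```

Note that, in contrast to the SVM sketch, nothing pushes the points away from the function; the subgradient only becomes active once a point leaves the ε-tube.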
2.1.3.3 Decision Trees. This method is an easy and intuitive way to determine decisions or to group and separate data. To determine decisions, the tree consists of nodes with conditions. Depending on whether a condition is fulfilled, a different branch has to be followed until a leaf is reached. The leaf contains the final decision.

Trees can also be used to separate data. The data points are split in a way that the two resulting datasets have a high intra-class similarity and a low inter-class similarity. This means that instances within one resulting dataset should be similar to each other, while instances from different datasets should be very different from each other. In each step of the process of building a decision tree, the splitting attribute and value have to be chosen which ensure these conditions best.
Trees were introduced for classification and regression tasks by [BFSO84]. In this section, the focus is on regression trees, which are able to predict values from a certain range. However, the model can only predict values which were induced by at least one example in the training stage. Values with no exemplary data points are not predicted by the model. Thus, the predictions cannot include every single continuous value in a certain range and are limited to a discrete yet large subset of numbers in the range.
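This limitation can be illustrated with a depth-one regression tree (a stump) fitted by exhaustive search over split points: the leaf predictions are the means of the training targets falling into each leaf, so the model can only ever output those values (a minimal sketch, not the tree learner used later in this work):

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D data by minimizing the
    summed squared error of the two leaf means."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    candidates = sorted(set(xs))
    thresholds = [(a + c) / 2 for a, c in zip(candidates, candidates[1:])]
    best = None
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

predict = fit_stump([1.0, 2.0, 8.0, 9.0], [1.0, 1.2, 7.8, 8.0])
# Only the two leaf means 1.1 and 7.9 can ever be predicted:
print(round(predict(0.0), 2), round(predict(5.0), 2), round(predict(100.0), 2))
```

A deeper tree has more leaves and hence a larger set of possible outputs, but the set always stays discrete.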
2.2 Driver Assistance Systems
For autonomously driving cars or driver assistance systems, it is of great importance that all information on surrounding vehicles is available. Not only the current position but also the future position of these vehicles constitutes important information. To predict the future position of a surrounding vehicle, it is necessary to know, firstly, where the vehicle is currently located; secondly, in which direction the vehicle is heading; and thirdly, how fast the vehicle is driving.
1. The research on object detection addresses the first issue and answers the question
where a vehicle is located at the current time. Exemplary results of an object
detector with the bounding boxes of found vehicles can be taken from Figure 1.
2. How the driving direction of a vehicle can be determined depends on what kind of data is used. If sequential data is available, object tracking or optical flow approaches can help solve the problem. For this, an object is followed in successive frames, and from the differences between the sequential frames the velocity and the driving direction can be inferred. If, however, no such sequential data is available, the direction has to be determined using only a single frame. Determining the orientation of a certain vehicle in a single frame is referred to as vision-based orientation estimation.
3. For the third issue, the determination of vehicle velocities, again approaches from
the field of object tracking and optical flow can be used. Sequential images are
mandatory for these vision-based approaches. However, velocities can also be
measured using other techniques, for example, by employing a LIDAR (Light
Detection and Ranging) system which can measure distances and velocities of
the surrounding cars.
Figure 1: Detected vehicles from forefield image.
This work focuses on orientation estimation; therefore only approaches related to this area are described in this section. A detailed survey on approaches in the areas of object detection, tracking and behavior analysis can be found in [SiTr13].
2.3 Orientations of Vehicles
The orientation of a vehicle in an image can be described in different ways. A more detailed description is referred to as the vehicle pose, whereas a simpler definition is referred to as the vehicle orientation. The definitions are as follows:
1. In [XiMS14], the pose is described as a combination of three values: the azimuth, the elevation and the distance (see Figure 2)1. The azimuth value implies from which side the car is seen. The elevation angle indicates whether the vehicle image was taken from the top or the bottom. The distance value tells how far away the vehicle was from the recording camera.

2. When it is only of interest which side of a car is seen, only one value, the observation angle, is necessary. The observation angle is equivalent to the azimuth in the pose definition. For the prediction challenge described in Section 5.3, only the observation angle needs to be predicted, and most of the related work described below also uses this one-value definition of the vehicle orientation.
Figure 2: Pascal3D vehicle pose definition.
2.4 Survey on Orientation Estimators for Vehicles
In a lot of existing approaches, the estimation of vehicle orientations is solved simultaneously with the detection of the vehicle. HOG [DaTr05], Haar-like features [ViJo01] and Gabor filters are often used together with an SVM [Vapn63] or AdaBoost [FrSc97] for the detection and therefore also for the orientation estimation. Different approaches related to the estimation of vehicle orientations are covered below.
In [RHMH10], a LIDAR is used to detect a vehicle candidate; HOG features are then computed for the candidate. The orientation is determined with a multi-class SVM, but only eight distinct viewpoints are used. In [NPSB10], the authors detect the lights of the surrounding vehicles. By tracking the changes in the geometry of the lights, the orientation of a vehicle can be estimated. As tracking is involved, sequential frames from a video are necessary.
1http://cvgl.stanford.edu/projects/pascal3d.html [Accessed December 18, 2015]
[GLOH11] produces two bounding boxes for each detected vehicle. The outer bounding box comprises the whole extent of the vehicle; the inner bounding box only includes the corresponding rear or front section, e.g. from the left to the right taillight. By analyzing the relative positions of the two bounding boxes, the vehicle orientation can be estimated using a tree-based classifier. [GeWU11] infers a scene layout and estimates orientations by aligning tracklets with detected lanes. However, this approach only works for moving cars.
[WiDe11] learns detectors for eight distinct viewpoints by using HOG features together with a simple linear classifier. The detector with the highest detection score determines the orientation of the vehicle. [YTAS11] performs object detection and orientation estimation simultaneously by training a family of detectors using multiplicative kernels, HOG features and SVM classification.
[YBAL14] uses a discriminatively trained part-based model (DPM) [FGMR10]. For this model, parts of vehicles are detected, and if certain parts occur together in a certain way, a vehicle detection can be derived. The observation angles are grouped into 16 different viewpoints, and for each viewpoint a set of parts is learned. The parts are described using HOG features. In [YeBG15], the DPM was extended to also incorporate 3D information by using stereo images.
A DPM with HOG features is used in [PSGS13], which also considers occlusion of
vehicles. Occlusion patterns are detected, and depending on which patterns are found,
the detection of occluded vehicles is improved. The work is extended in
[PSGS15], where a 3D-DPM model is used for the detection, but only eight different
viewpoints are taken into account for the orientation estimation. Again, HOG features
are used to describe the parts. Occlusion is also considered in [LiWZ14], where the
configurations of parts and components are modeled by an AND-OR structure with
varying viewpoint-occlusion patterns.
In [OBTr14], a clustering on color and gradient features is performed, and detectors for
subcategories of varying orientations, occlusions and truncations are learned. The
detectors are trained using AdaBoost. This work was extended in [OBTr15], where
the detection scores of the single detectors are used as a feature to determine the
orientation.
[XCLS15] creates exemplar 3D models which vary in observation angle, occlusion
and truncation. Then a clustering is conducted for every exemplar model,
resulting in a set of images per exemplar model. The images from a cluster
are used to train a detector which is then able to predict typical patterns seen in the
cluster. As a detector is created for every exemplar model, multiple detectors exist. If a
detector triggers a detection, the degree of occlusion and truncation is predicted in
addition to the orientation angle.
Recently, Deep Learning approaches [HiOT06], [KrSH12], [SLJS15] have performed very
well on computer vision tasks. An advantage of such deep networks is that no hand-engineered
features are necessary; the features are learned by the network itself. However,
for the estimation of vehicle orientations, little work has been published yet.
[CKZB15] learns the location of an object and its orientation simultaneously using
the Fast Region-based CNN [Girs15].
In most of the above-mentioned work, the orientation is directly inferred from the
detection, and orientation estimation is not considered as an independent
problem. The advantage of solving the orientation estimation independently is that
it can be plugged onto any object detector, which ensures modularity. If,
for example, a better detector is developed, no changes to the orientation estimator
need to be made; the new detector can simply be used together with the original
orientation estimator. The orientation estimator developed in this work requires, apart
from the bounding box of the detected vehicle, no prior information from the detector
to estimate the vehicle orientation.
3 Histogram-based Image Descriptors
Understanding an image comprises the tasks of object detection and orientation estimation,
as explained in the previous chapter. If a machine is to be taught to execute these
tasks automatically, the image needs to be represented in a machine-understandable
way.
An image is typically represented as a 3-dimensional matrix of pixel values in the
RGB color space. Concatenating these matrices into a vector would not only result in
a large vector, but this vector would also be ill-suited for the prediction task.
It is necessary to describe images in an abstract way for a machine to understand.
This representation should ideally capture all the necessary information and should
be discriminative enough to distinguish images of different kinds. The representation
of an image as a vector is called an image descriptor or, more generally, a feature vector. This
feature can be engineered (Feature Engineering) or automatically generated (Feature
Learning).
Studies have shown that the human brain detects objects by recognizing shapes and
edges [TrKa98]. The same approach is followed when teaching a machine to understand
an image. Thus, a plethora of different feature descriptors exist which make use of
edges. Some descriptors are keypoint-based, like SIFT [Lowe99] and SURF
[BETV08]. In this work, the focus is on descriptors which aggregate the number of
edges from different orientations. They are described in the following subsections.
3.1 Histograms of Oriented Gradients
Histograms of Oriented Gradients (HOG) [DaTr05] were introduced for the detection
of humans but also became popular for the detection of other objects. Further work
was conducted to extend and improve the performance of HOG in [ZYCA06] and
[WaHY09].
3.1.1 Image Gradients
Image gradients measure the change in intensity in a specific direction. For every
pixel in an image, it can be calculated in which direction the intensity, and consequently
the color, changes the fastest. Supposing a gray-scale image, the gradient points in
the direction in which the image gets brighter fastest. In Figure 3, the gradients are
visualized by red arrows. In the image, the pixel values increase towards the
middle; thus the gradients point towards the middle plane from both sides.
Figure 3: Gradient visualization.
If two neighboring pixels have almost the same gray value, the gradient is small. If the
neighboring pixels are black and white, the gradient is large. Implicitly, this means that
image gradients are able to capture the edges contained in an image.
3.1.1.1 Oriented Gradients. Gradients come with a direction; the gradients in
Figure 4a point either to the bottom or to the top. However, it was found in [DaTr05]
that using the direction does not necessarily improve the results. Rather, the
plane of the gradient is of interest. For example, the gradients pointing to the bottom
and to the top both lie in the vertical plane and can be aggregated. If the exact directions
are used, one speaks of signed gradients. When the gradients are summed
up per plane, they are referred to as unsigned gradients.
Furthermore, it was found that choosing nine of these planes or orientations performs
best. This means that the histogram consists of nine bins and each bin covers a
range of gradient orientations. Each orientation range is 180°/9 = 20° wide.
(a) Stripe image. (b) Histogram of gradients.
Figure 4: Example image with associated histogram of gradients.
3.1.2 Histogram Calculation
By summing up all gradients of a specific orientation, a histogram is obtained. Each
bin of the histogram contains the number of gradients of this orientation; e.g., using
nine orientations results in a histogram with nine bins. In the case of signed gradients,
the number of gradients pointing to the left, to the right, etc. is computed. In the case of
unsigned gradients, the number of horizontal, vertical and diagonal gradients is summed
up.
Considering the stripe image from above, all gradients are in the vertical orientation.
So, only the bin which contains the gradients of the vertical orientation has a non-zero
value. The resulting histogram is shown in Figure 4b; the fifth bin contains
all the gradients.
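The computation of such a histogram can be sketched with NumPy; the stripe image and the bin layout below are illustrative assumptions, not the exact implementation used in this work:

```python
import numpy as np

def unsigned_gradient_histogram(gray, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    # unsigned gradients: fold all directions into [0, 180) degrees
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / bins                          # 20 degrees per bin
    idx = np.minimum((angle / bin_width).astype(int), bins - 1)
    hist = np.zeros(bins)
    np.add.at(hist, idx.ravel(), magnitude.ravel())
    return hist

# stripe image: intensity changes only vertically, so every gradient
# lies in the vertical plane
stripe = np.tile(np.linspace(0, 255, 8)[:, None], (1, 8))
hist = unsigned_gradient_histogram(stripe)
```

For the stripe image, only the fifth bin (index 4, covering 80°–100°) is non-zero, consistent with Figure 4b.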
Obviously, a single histogram with nine orientations is not enough to describe
an image entirely, because such a descriptor would lack discriminative power: the
resulting histogram would be quite similar for any natural image. To overcome this
problem, the image is separated into cells.
(a) Separation into cells. (b) Histograms of each cell.
(c) Alternative representation as roses.
Figure 5: Car image with associated HOG per cell.
The car image was divided into 4 × 6 = 24 cells (see Figure 5a). For every cell, a histogram
of oriented gradients is computed, resulting in the 24 histograms depicted in Figure
5b. For HOG, an alternative rose-like representation of the histograms is often used;
it makes it easier to recognize the predominant gradient plane, as shown in Figure 5c.
Additionally, the cell histograms are grouped into blocks; for example, one block contains
the histograms of four adjacent cells. Multiple blocks of cells are created by
sliding through the image cell by cell, as illustrated in Figure 6. Each block then contains
the concatenation of four histograms, and in the end all blocks are concatenated
into one feature vector. Grouping the cells into blocks and applying this sliding-window
approach increases the robustness and improves the generated HOG feature [DaTr05].
Figure 6: Sliding through cells to create blocks (blocks I, II, III).
The dimensionality of the final feature vector consequently depends on the number of
blocks (here fifteen: five blocks horizontally and three blocks vertically), the number
of cells in each block (here four), and the number of histogram bins in each cell (here
nine). In the given example, the resulting feature length is 15 × 4 × 9 = 540.
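This arithmetic is easy to verify; the numbers below are those of the running example (4 × 6 cells, 2 × 2 cells per block, block stride of one cell):

```python
# number of cells in the image and configuration of the blocks
cells_v, cells_h = 4, 6      # vertical and horizontal cells
block_edge = 2               # each block spans 2 x 2 cells
bins_per_cell = 9            # unsigned gradient orientations

# a block stride of one cell yields (cells - block_edge + 1) positions per axis
blocks_v = cells_v - block_edge + 1   # 3 block rows
blocks_h = cells_h - block_edge + 1   # 5 block columns

feature_length = blocks_v * blocks_h * block_edge * block_edge * bins_per_cell
```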
3.2 Histograms of Sparse Codes (HSC)
This section describes an image descriptor which is related to HOG but uses sparse
codes instead of gradients. The feature is called Histograms of
Sparse Codes (HSC) and was introduced in [ReRa13]. The original work used HSC to
perform object detection, as was the case for HOG, and the authors reported better
detection results for HSC compared to HOG.
The HOG feature detects edges by analyzing gradients, whereas the HSC feature
detects edges by analyzing image patches. An image patch is a small image which can
be, for example, 3 × 3 or 8 × 8 pixels large. Both approaches use histograms: HOG
sums up gradients of different directions, HSC sums up the occurrences
of certain image patches. Learning these image patches is addressed in Section 3.2.2.
In this work, a modified version of HSC is used; it is described in the following
sections.
3.2.1 Patch Preprocessing
Assume two image patches, both containing a horizontal edge, but one darker and
the other brighter, as depicted in Figure 7a. When comparing these patches pixel-wise,
one could argue that most pixel values are very different from each other. The
horizontal edge is their only similarity, and in a pixel-wise comparison the impact of this
similarity might not be strong enough. The result of the comparison would be that
both images are dissimilar, and the fact that both patches contain a horizontal edge
would be missed.
This is why preprocessing the image patches is necessary to make the edges
more clearly visible. [ReRa13] calculates the mean pixel value of every patch and
subtracts it from the patch. This is also called centering and is depicted
in Figure 7b. Note that centering can introduce negative pixel values, which have to
be rescaled to valid values for visualization. This work, in addition to centering,
normalizes the patches, which leads to more uniform patches, as can be seen in Figure 7c.
It is important that this preprocessing stays the same for every extracted patch
in all stages: the learning of the dictionary, the training of the model, and
the prediction of test images. Skipping or modifying the preprocessing in only
one stage can result in undesired behavior and degraded results.
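The preprocessing can be sketched as follows; the two example patches are made-up values, chosen so that both contain the same horizontal edge at different brightness levels:

```python
import numpy as np

def preprocess_patch(patch, eps=1e-8):
    """Center a patch (subtract its mean) and normalize it to unit length."""
    p = patch.astype(float).ravel()
    p -= p.mean()                  # centering removes the brightness offset
    norm = np.linalg.norm(p)
    if norm > eps:                 # guard against flat, edge-free patches
        p /= norm
    return p

# same horizontal edge, different brightness (cf. Figure 7a)
dark = np.array([[10, 10, 10],
                 [40, 40, 40],
                 [40, 40, 40]])
bright = np.array([[170, 170, 170],
                   [230, 230, 230],
                   [230, 230, 230]])

a, b = preprocess_patch(dark), preprocess_patch(bright)
```

After centering and normalization the two patches coincide, so a pixel-wise comparison now recognizes the shared edge.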
3.2.2 Dictionary Learning
The first step when working with sparse codes is to learn a dictionary of visual words.
These visual words are the image patches described in the preceding section. The
dictionary should include those image patches which occur most often but also have a
certain degree of representational power to distinguish them from other patches in the
dictionary [ZhLi10].
[OlFi97] introduced a dictionary learning approach which uses stochastic gradient
descent to update the dictionary in each step. [EnAH99] learns a dictionary by
alternating between the optimization of the dictionary and the sparse codes. The
popular K-SVD algorithm [AhEB06] updates the sparse codes while the update of
the dictionary is being conducted. [MBPS09] propose an online approach which
learns dictionaries fast and is capable of updating the dictionary when new images are
seen.
Finding a suitable dictionary can be formulated as an optimization problem; this work
uses the formulation described in [MBPS09], see Equation (3). Given a dictionary
D, the error ‖x_i − Dα_i‖₂² between an image patch x_i and the reconstruction of this patch
from a subset of dictionary elements should be minimized. A detailed example of
patch reconstruction is given in Section 3.2.3.
\[
\min_{D \in \mathcal{C},\,\alpha} \; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \lVert x_i - D\alpha_i \rVert_2^2
\quad \text{subject to} \quad \lVert \alpha_i \rVert_0 \leq \lambda. \tag{3}
\]
α_i is a sparse vector and encodes which subset of dictionary elements
is used. If α_i^(j) is zero, the j-th element of the dictionary is not used. If α_i^(j)
is non-zero, the j-th element of the dictionary is used with the weight given by
α_i^(j). The ‖·‖₀ norm equals the number of non-zero entries of a vector. Thus,
the constraint ‖α_i‖₀ ≤ λ ensures that α_i does not contain more non-zero entries than
permitted by the sparsity level λ.
Table 4 gives an overview of the variables described above.
Table 4: Explanation of dictionary learning variables.

Variable   Explanation
D          Dictionary
C          Dictionary constraint (C = {D ∈ ℝ^(m×p) : ‖d_j‖₂² ≤ 1, ∀j})
x_i        Preprocessed i-th image patch (extracted from the original image)
‖α_i‖₀     Number of non-zero entries in the sparse vector α_i
λ          Sparsity level
This optimization problem can be solved by alternating between the computation of α
and D [ReRa13]: first, a random dictionary is assumed and the sparse codes which
minimize the objective are computed; afterwards, the dictionary is modified to check
whether a better value of the objective function can be found. This is done iteratively
for a predetermined number of iterations.
By solving the problem, the image patches with the smallest reconstruction error are
identified. The found image patches are thus the most frequently used patches in the
chosen subset of images. This subset should represent well all angles that occur
in the dataset. An exemplary dictionary with the 64 most frequent patches from a
subset of vehicle images is shown in Figure 8.
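The alternating scheme can be illustrated with a deliberately simplified NumPy sketch. It fixes the sparsity level to λ = 1, so each patch is coded by a single unit-norm atom, and it initializes the dictionary from the first patches; the thesis itself solves the full formulation of Equation (3) with SPAMS instead:

```python
import numpy as np

def learn_dictionary(X, n_atoms, n_iter=10):
    """Alternating minimization of Equation (3) for the special case
    ||alpha_i||_0 <= 1. X holds one preprocessed patch per column."""
    m, n = X.shape
    D = X[:, :n_atoms].astype(float).copy()      # init atoms from data
    D /= np.linalg.norm(D, axis=0)               # enforce ||d_j||_2 <= 1
    for _ in range(n_iter):
        # sparse coding step (D fixed): best single atom per patch
        corr = D.T @ X                           # (n_atoms, n)
        assign = np.abs(corr).argmax(axis=0)
        weights = corr[assign, np.arange(n)]
        # dictionary update step (codes fixed): least-squares direction
        for j in range(n_atoms):
            members = assign == j
            if members.any():
                d = X[:, members] @ weights[members]
                if np.linalg.norm(d) > 0:
                    D[:, j] = d / np.linalg.norm(d)
    return D

# toy patches: scaled copies of two orthogonal edge "templates"
rng = np.random.default_rng(0)
t1 = np.array([1.0, 1.0, -1.0, -1.0]) / 2.0      # horizontal edge
t2 = np.array([1.0, -1.0, 1.0, -1.0]) / 2.0      # vertical edge
amps = rng.uniform(0.5, 2.0, size=200)
X = np.column_stack([amps[i] * (t1 if i % 2 else t2) for i in range(200)])
D = learn_dictionary(X, n_atoms=2)
```

On this toy data the learned atoms recover the two templates (up to sign); with real patches and a larger λ one would call SPAMS.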
Figure 8: Learned dictionary of frequent patches.

3.2.3 Sparse Reconstruction
After a dictionary has been learned, any image patch can be reconstructed using
the dictionary elements. Clearly, the quality of the reconstruction depends on how
close the regarded image patch is to the learned elements. Most of the time, the result
is not an exact reconstruction but rather an approximation of the original patch. This
can again be formulated as an optimization problem [MBPS09], given in Equation (4).
\[
\min_{\alpha} \; \lVert x - D\alpha \rVert_2^2
\quad \text{subject to} \quad \lVert \alpha \rVert_1 \leq \lambda \tag{4}
\]
The notation differs slightly from the one used earlier: x denotes a single image
patch, because the problem has to be solved for every patch individually, and α is the
corresponding sparse code. Solving the problem means finding the dictionary elements
and assigned weights which minimize the reconstruction error. The reconstruction
error is calculated from the pixel-wise differences between the patch and its
reconstruction.
An example of patch reconstruction is shown in Figure 9. The original patch
can be reconstructed quite accurately with three other patches. The assigned weight
is given for every patch; a negative weight means that the pixel values are flipped,
so black pixels become white and vice versa.

Figure 9: Reconstruction with dictionary elements (the original patch is approximated
as a weighted sum of three dictionary patches with the weights 0.21, −0.26 and 0.54).
Sparsity is retained by the constraint ‖α‖₁ ≤ λ. Note that the ‖·‖₁ norm is used
here instead of the ‖·‖₀ norm employed for the dictionary learning. This
enables a more accurate reconstruction because instead of limiting the number of non-zero
entries, the sum of the absolute weights is limited. Still, this constraint leads
to a sparse vector α.
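As a toy illustration of Equation (4) (the dictionary, patch and support below are made up): once the subset of atoms to use is fixed, the optimal weights are an ordinary least-squares fit on that sub-dictionary, and enlarging the support can only decrease the reconstruction error.

```python
import numpy as np

# toy dictionary: three unit-norm atoms for flattened 2x2 patches
D = np.column_stack([
    np.array([1.0, 1.0, -1.0, -1.0]) / 2.0,   # horizontal edge
    np.array([1.0, -1.0, 1.0, -1.0]) / 2.0,   # vertical edge
    np.array([1.0, 1.0, 1.0, 1.0]) / 2.0,     # flat patch
])

x = np.array([1.0, 0.8, -0.6, -0.8])          # patch to reconstruct
support = [0, 2]                               # atoms the sparse code may use

# optimal weights on the chosen support: least squares on the sub-dictionary
w, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
alpha = np.zeros(D.shape[1])
alpha[support] = w

error = np.sum((x - D @ alpha) ** 2)           # the term ||x - D alpha||_2^2
```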
Both learning the dictionary and reconstructing the images can be computed with
SPAMS (SParse Modeling Software), which was introduced in [MBPS09] and is publicly
available. It provides efficiently implemented functions and is written in C++, with
interfaces to MATLAB, Python and R.
3.2.4 Feature Construction
The previous sections explained the necessary basics; this section describes how the
actual HSC feature is constructed. After learning the dictionary, the following
steps have to be conducted:
1. Separate the car image into cells, as shown in Figure 5a, which is equal to the
first step in creating the HOG feature.
2. Extract all image patches from a cell in a sliding window approach. The cell
is passed through pixel by pixel, as depicted in Figure 10, and thus all image
patches are extracted.
3. For every extracted patch, the sparse code and the corresponding weights are
calculated by solving the minimization problem in Equation (4).
4. Compute the absolute value for every sparse code. This is necessary because it
is not of importance whether an occurring edge is black or white. The important
fact is that there is an edge of a certain orientation and this is preserved by taking
the absolute value.
5. Sum up the absolute values of the sparse codes for each cell. As a result, we
obtain a vector fi for each cell i which contains the frequency of each dictionary
element in this cell. For instance, it reveals that the first dictionary element is
used ten times, or that the second element is only used twice. This also explains
the naming of the HSC (Histograms of Sparse Codes) feature, because histograms
are a typical way to visualize frequency distributions.
6. Normalize all fi's individually on a per-cell basis.
7. Concatenate the normalized fi's to obtain the final feature vector, see Figure 11b.
Assuming, for simplicity, that a cell consists of only two patches, two sparse codes
result from solving the reconstruction problem, as shown in Figure 11a. Then the
absolute values are computed and the vectors are summed column by column. The
summation result is the above-mentioned vector fi. Algorithm 1 summarizes the
described steps of the feature construction.
Figure 10: Sliding through the image to extract image patches.
|(0, …, 0, +0.2, −0.4, 0, …, 0)| + |(0, …, 0, +0.4, +0.1, 0, …, 0)| = (0, …, 0, 0.6, 0.5, 0, …, 0)

(a) Summing up the absolute values.

f₁/‖f₁‖₂ ⊕ ⋯ ⊕ fₙ/‖fₙ‖₂

(b) Concatenating the normalized features per cell.

Figure 11: Construction of the HSC feature.
Algorithm 1 HSC feature construction.
Require: dictionary
procedure createHSC(image)
    cellImages ← separateIntoCells(image)
    for each cellImage in cellImages do
        imagePatches ← extractAllImagePatches(cellImage)
        cellSum ← 0
        for each imagePatch in imagePatches do
            sparseCode ← solveReconstruction(imagePatch, dictionary)
            cellSum ← cellSum + |sparseCode|      ▷ Sum up the absolute sparse codes
        end for
        normCellSum ← cellSum / ‖cellSum‖₂
    end for
    hscFeature ← ⊕ normCellSum                    ▷ Concatenate all cell features
end procedure
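A direct Python transcription of Algorithm 1 might look as follows. Here `solve_reconstruction` is only a stand-in sparse coder (a least-squares code truncated to its largest coefficients); the thesis solves Equation (4) with SPAMS instead:

```python
import numpy as np

def solve_reconstruction(patch, D, sparsity=3):
    """Stand-in sparse coder: least-squares code truncated to `sparsity` entries."""
    alpha, *_ = np.linalg.lstsq(D, patch.ravel().astype(float), rcond=None)
    sparse = np.zeros_like(alpha)
    keep = np.argsort(np.abs(alpha))[-sparsity:]
    sparse[keep] = alpha[keep]
    return sparse

def create_hsc(image, D, cell_size, patch_size):
    n_atoms = D.shape[1]
    features = []
    H, W = image.shape
    for r in range(0, H - cell_size + 1, cell_size):        # separate into cells
        for c in range(0, W - cell_size + 1, cell_size):
            cell = image[r:r + cell_size, c:c + cell_size]
            cell_sum = np.zeros(n_atoms)
            # slide pixel by pixel to extract every patch of the cell
            for i in range(cell_size - patch_size + 1):
                for j in range(cell_size - patch_size + 1):
                    patch = cell[i:i + patch_size, j:j + patch_size]
                    cell_sum += np.abs(solve_reconstruction(patch, D))
            norm = np.linalg.norm(cell_sum)
            features.append(cell_sum / norm if norm > 0 else cell_sum)
    return np.concatenate(features)

# example: 8 atoms for flattened 3x3 patches, image of 4 x 6 cells of 6x6 pixels
rng = np.random.default_rng(0)
D = rng.standard_normal((9, 8))
D /= np.linalg.norm(D, axis=0)
feature = create_hsc(rng.standard_normal((24, 36)), D, cell_size=6, patch_size=3)
```

The resulting length, 4 × 6 cells × 8 atoms = 192, matches the dimensionality discussion in Section 3.2.5.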
3.2.5 Dimensionality of HSC
The dimensionality of the HSC feature depends on the amount of cells used, as well
as the amount of dictionary elements. This follows because for every cell the HSC
feature vector contains as many elements as the dictionary. Every entry in the feature
vector states how often each dictionary element is used and can consequently not be
smaller than the dictionary itself. When, for example, a dictionary with 64 elements,
four vertical and 6 horizontal cells are used, it results in a feature length of 1536, as
shown in Figure 12.
dictionary elements × vertical cells × horizontal cells = feature length  ⟹  64 × 4 × 6 = 1536

Figure 12: Factors deciding the feature length of HSC.
A comparable HOG configuration with the same number of cells would result in a
feature length of only 540. Section 5.2.1.2 investigates the effect of the dictionary
size on the prediction quality and computation time.
4 Estimation of Vehicle Orientations
In this section, different approaches for the estimation of vehicle orientations are
described, based on two different problem formulations. Modifications for predictors are
introduced which incorporate the particular structure of the problem. Furthermore,
issues arising from the imbalance of the data set are discussed and possible solutions
are given.
4.1 Problem Formulation
In Machine Learning, problems can be categorized according to the prediction value
they produce [Bish06]. If the prediction is a continuous value, it indicates a regression
problem. If a class is supposed to be predicted, it is a classification problem.
In the present case of the orientation estimation, the orientation angle of an observed
vehicle is supposed to be predicted. This angle is a continuous value in the range from
−π to π, thus indicating a regression problem. However, it was shown previously in
[OBTr15] that formulating the problem as a classification problem can be beneficial.
This section explains how to formulate the problem as a regression and as a classification
task. During the experiments with both formulations, it was found that
each formulation comes with certain advantages and shortcomings (see Section
5.2.2). To keep the advantages of both methods, an approach is introduced here which
combines the regression with the classification model.
4.1.1 Regression
As mentioned above, the prediction of orientation angles can be seen as a regression
task: the model predicts one value between −π and π. Table 5 shows some
exemplary angles and their assignment to orientations. A problem arising from
this is that angles for the left side can be either +π or −π. The angle values are
cyclical and do not, as would be usual for regression, lie on a number line (see Figure 13b).
The regression model does not know that −π and π are in fact the same angle, as depicted
in Figure 13a.
Table 5: Exemplary angles and orientations.
Angle (in radian) ±π −π/2 0 π/2
Orientation left side rear right side front
If there are two vehicle images, one with an orientation angle of 3 radians and
the other with an angle of −3 radians, the model assumes that these two images are very
different. In feature space, however, these two images are quite similar. This can
confuse the model and lead to bad predictions for left-side images. This
shortcoming is inherent to the formulation as a regression problem.
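The wrap-around can be made concrete with a small sketch (not code from the thesis): a plain regression loss sees the angles 3 rad and −3 rad as far apart, while their true angular distance is small.

```python
import math

def angular_distance(a, b):
    """Smallest angle between two angles in radians; result lies in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

naive = abs(3.0 - (-3.0))            # what plain regression "sees": 6.0
true = angular_distance(3.0, -3.0)   # actual difference: 2*pi - 6, about 0.28
```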
(a) Cyclical angles (0, ±π/2 and ±π arranged on a circle). (b) Continuous angles (−π to π on a number line).

Figure 13: Angle representations.
To solve the regression problem, Decision Trees and Support Vector Regression, which
were introduced in Section 2.1.3, can be utilized.
4.1.2 Multi-Class Classification
[OBTr15] reports that “multi-class SVM produced significantly better orientation estimation
compared to support vector regression”. In order to perform multi-class classification,
the problem needs to be formulated differently. As mentioned in [Barb10], a
continuous output can be discretized, and thus a corresponding classification problem
can be considered.
Instead of assuming continuous values for the orientation angles, the angles can be
divided into classes, each covering a distinct range of angles. For instance,
the angles can be divided into 16 classes, so that every class covers a range of
360°/16 = 22.5°. The division into classes, as well as exemplary images from each
of the 16 classes, is shown in Figure 14.
When one of these classes is predicted, the center value of the class is taken as the
prediction value. For example, if the fourth class is predicted, the center value, and
consequently the prediction value, is −π/2. This, however, has the shortcoming that
the number of distinct predictions is limited to the number of classes. It also means
that the number of very accurate predictions can decrease because of this limitation.
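The discretization and the mapping back to class centers can be sketched as follows; the 1-based class numbering and the convention that class 4 is centered at −π/2 follow Figure 14, while the exact bin boundaries are an assumption:

```python
import math

N_CLASSES = 16
BIN = 2 * math.pi / N_CLASSES            # 22.5 degrees per class

def angle_to_class(angle):
    """Map an angle in [-pi, pi] to the class (1..16) with the nearest center."""
    k = round(((angle + math.pi) % (2 * math.pi)) / BIN)
    return 16 if k == 0 else k

def class_center(k):
    """Center angle of class k; e.g. class 4 -> -pi/2, class 16 -> +/-pi."""
    return -math.pi + k * BIN
```

Note that both +π and −π land in the same class, which sidesteps the wrap-around problem of the regression formulation.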
(a) Division of the angles into 16 classes (with the orientations front, rear, left side and right side marked). (b) Exemplary images from each class.

Figure 14: Angles divided into 16 classes.
4.1.3 Combined Classification and Regression
Using regression has the advantage of being able to predict all values in a range, thus
enabling the model to predict all potentially occurring orientation angles. On the other
hand, as mentioned earlier, classification can deliver better overall results. This can
be explained by the fact that more image examples are available per class and thus
more variation can be learned for each class.
Combining regression and classification can lead to an overall improvement if one of
the following scenarios applies:
1. Regression or classification is working significantly better for a certain range of
angles, or
2. A relationship between the predictions of regression and classification can be
identified and exploited.
In the conducted experiments, it was found that the latter scenario applies. First, the
image to be predicted is passed to both the regression and the classification
model. Depending on the relation between the two predictions, either the prediction
of the regression or that of the classification is used as the final prediction. The approach is
depicted in Figure 15.
A simple rule was identified which helps decide whether to use the prediction
of the regression or of the classification. The rule, shown in Algorithm 2, exploits the
fact that when the predictions of the two models are close, it is better to use the
prediction of the regression model. If, however, the difference between the predictions
is large, it is beneficial to use the prediction of the classification model. Note that the
distance function in the algorithm returns the smallest angle between two angles.
Figure 15: Process of the combined method (the image is passed to the regression and
the classification model; their outputs regressionPrediction and classPrediction enter
the decision rule, which yields finalPrediction).
Algorithm 2 Decision rule for combination of regression and classification.
procedure combinePredictions
    Predict individually:
        regressionPrediction ← predictRegression(image)
        classPrediction ← predictClass(image)
    Decision:
        if distance(regressionPrediction, classPrediction) ≤ threshold then
            finalPrediction ← regressionPrediction
        else
            finalPrediction ← classPrediction
        end if
end procedure
The threshold which decides when to use the regression or the classification
model is subject to optimization and has to be determined experimentally.
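Algorithm 2 translates almost directly into Python; the circular distance function is an assumption matching the description above (“the smallest angle between two angles”):

```python
import math

def angular_distance(a, b):
    """Smallest angle between two angles in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def combine_predictions(regression_pred, class_pred, threshold):
    """Decision rule of Algorithm 2: trust the finer-grained regression when
    both models roughly agree, otherwise fall back to the classification."""
    if angular_distance(regression_pred, class_pred) <= threshold:
        return regression_pred
    return class_pred
```

For example, `combine_predictions(0.30, 0.39, threshold=0.2)` keeps the regression value, while `combine_predictions(2.0, -2.0, threshold=0.5)` falls back to the class prediction.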
4.2 Classifier Modifications
This section suggests modifications to the classifiers. The weighted voting approach
replaces majority voting and incorporates the underlying circular structure of the classes.
The second approach uses specialized classifiers which only predict a certain class under
a high degree of certainty. The third approach utilizes a second layer of classifiers
operating on probabilities instead of only the binary results of the one-vs-one classifiers.
4.2.1 Weighted Voting
Assuming one-vs-one classifiers are used, majority voting, explained in Section 2.1.2.2,
is one method to decide on the final prediction. Instead of this decision
rule, a different rule which incorporates the circular structure can be developed. The aim
of this rule is to reinforce the importance of classifiers for opposite angles.
In contrast, classifiers for neighboring classes can be considered less important.
Weights can be used to implement this rule: the prediction of a classifier for opposite
angles has a large weight, whereas a classifier for neighboring classes has a small weight.
Suppose that the whole range of angles is divided into six classes, as opposed to the
earlier example with sixteen classes. Then classes 1 and 4 are opposite, as are classes
2 and 5, and classes 3 and 6. Table 6 contains a weight matrix which can be applied
to the predictions of the one-vs-one classifiers. The classifiers for the above-mentioned
opposite classes have the largest weight, 5, whereas classifiers for neighboring classes
are weighted with 1. The classifiers in between have a weight of 3.
Table 6: Example for a weight matrix.

          1  2  3  4  5  6
Class 1   -  1  3  5  3  1
Class 2   -  -  1  3  5  3
Class 3   -  -  -  1  3  5
Class 4   -  -  -  -  1  3
Class 5   -  -  -  -  -  1
Class 6   -  -  -  -  -  -
Table 7 contains responses of the single one-vs-one classifiers. If majority voting is
applied to these responses, the final prediction is class 1, because four classifiers
predicted this class; class 4 is predicted only three times. In contrast, if weighted
voting is applied, the final prediction changes to class 4, because each time a one-vs-one
classifier predicts a class, its vote is multiplied by the weight taken from the weight
matrix. Class 4 is predicted three times with the weights 5, 3 and 3, resulting in a
weighted vote of 11, whereas class 1 is predicted four times with the assigned weights
1 + 3 + 3 + 1 = 8.
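The weighted vote from Tables 6 and 7 can be reproduced in a few lines; the responses dictionary below transcribes Table 7, and the weight function generates the circular weights of Table 6:

```python
K = 6  # number of classes

def weight(i, j):
    """Circular weight of the i-vs-j classifier: neighbors 1, opposite classes 5."""
    d = min(abs(i - j), K - abs(i - j))   # circular class distance in 1..3
    return {1: 1, 2: 3, 3: 5}[d]

# responses[(i, j)] = class predicted by the i-vs-j classifier (Table 7)
responses = {(1, 2): 1, (1, 3): 1, (1, 4): 4, (1, 5): 1, (1, 6): 1,
             (2, 3): 3, (2, 4): 4, (2, 5): 2, (2, 6): 2,
             (3, 4): 3, (3, 5): 5, (3, 6): 6,
             (4, 5): 5, (4, 6): 4,
             (5, 6): 6}

def vote(weighted):
    votes = dict.fromkeys(range(1, K + 1), 0)
    for (i, j), winner in responses.items():
        votes[winner] += weight(i, j) if weighted else 1
    return max(votes, key=votes.get)
```

Majority voting elects class 1 (four votes against three), while weighted voting elects class 4 (11 against 8).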
4.2.2 Specialized Classifiers
In Section 2.1.2.1, it was mentioned that a conservative predictor can be learned by
using misclassification costs. These costs have to be set such that a classifier
only predicts a certain class when it has a high degree of certainty. For this, it is
necessary to learn all possible one-vs-one classifiers: not only the 1-vs-2 classifier but
also the 2-vs-1 classifier. This results in k(k − 1) classifiers altogether, whereas
for the traditional one-vs-one approach only half as many classifiers are learned.
Table 7: Responses of the six-class problem.

          1  2  3  4  5  6
Class 1   -  1  1  4  1  1
Class 2   -  -  3  4  2  2
Class 3   -  -  -  3  5  6
Class 4   -  -  -  -  5  4
Class 5   -  -  -  -  -  6
Class 6   -  -  -  -  -  -
If the 1-vs-2 classifier predicts class 1, it is very certain that the data point belongs to
class 1. If it predicts class 2, however, the interpretation is that the classifier is unsure
whether the data point belongs to class 1 or 2. Put another way, when the 1-vs-2
classifier predicts class 1, the result is kept; if it predicts class 2, the result is
discarded. The classifier can hence be seen as specialized in predicting a single class
under high certainty.
Consequently, in a four-class scenario, there are three specialized classifiers for every
class which can tell whether a data point belongs to it or not. For calculating the
votes, this means that if an X-vs-Y classifier predicts X, the votes for X are increased by
one. If an X-vs-Y classifier predicts Y, the votes for X are not changed and the result is
discarded.
Table 8 shows exemplary responses of such specialized classifiers. The first row of
the table represents the responses of the 1-vs-Y classifiers, so only a response of 1
would count as a vote for class 1. This is, however, not the case, and the same holds
true for the 2-vs-Y classifiers. The 3-vs-Y classifiers predict 3 once, and the 4-vs-Y
classifiers predict 4 twice. Then the majority rule is applied, so that class 4 is the final
prediction, as it has two votes.
Table 8: Responses of specialized classifiers.

          1  2  3  4
Class 1   -  2  3  4
Class 2   1  -  3  4
Class 3   3  2  -  4
Class 4   4  2  4  -
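The vote counting for Table 8 can be sketched directly (the responses dictionary transcribes the table):

```python
# responses[(x, y)] = prediction of the specialized X-vs-Y classifier (Table 8)
responses = {(1, 2): 2, (1, 3): 3, (1, 4): 4,
             (2, 1): 1, (2, 3): 3, (2, 4): 4,
             (3, 1): 3, (3, 2): 2, (3, 4): 4,
             (4, 1): 4, (4, 2): 2, (4, 3): 4}

votes = dict.fromkeys(range(1, 5), 0)
for (x, y), prediction in responses.items():
    if prediction == x:     # only a confident "it is X" counts as a vote
        votes[x] += 1       # any other outcome is discarded
final_prediction = max(votes, key=votes.get)
```

Class 3 collects one vote and class 4 two, so class 4 is the final prediction, as described in the text.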
4.2.3 Second Layer Probability Classification
Normally, one-vs-one classifiers predict only one class, and no information on how likely
this prediction was is kept. However, some classification models, e.g. SVM, are
able to return probabilities which provide such information. This can help when
deciding on the final prediction in the next step: instead of applying the majority
rule, all these probabilities can be considered.
In the case of four classes, there are six X-vs-Y classifiers, and each of them returns
a probability value, as shown in Table 9. This value is positive if class X is more likely
and negative if class Y is more likely. The six probability values can then be considered
as a new feature vector. The resulting feature can be fed into an additional classifier,
which in effect forms a second layer of classifiers. Predictors with multiple layers are also
used in artificial neural networks [Fuku80], which often consist of multiple hidden layers.
Table 9: Probability predictions of first layer classifiers.

                 vs. class
           1      2      3      4
Class 1    -    −0.1   −0.5   −0.9
Class 2    -      -    +0.1   +0.1
Class 3    -      -      -    −0.9
Class 4    -      -      -      -
The example data is the same as in Table 8, but instead of a class the classifiers return
a probability. In the example, three classifiers predict class 2, yet the probability of each
of these predictions is, at 0.1, quite low. Class 4 is predicted only twice, but with much
larger probabilities of 0.9. Under the majority rule, class 2 would have been the final
prediction. A second layer of classifiers, however, can analyze the underlying
probabilities in more detail; a prediction of class 4 is therefore expected in this example.
The second layer again consists of one-vs-one classifiers, and the final output of the
second layer is obtained by applying the majority rule to its classifier results. The
necessary steps for working with two layers of classifiers are as follows:
• In the training stage:
1. Learn the first layer of one-vs-one classifiers with the training set.
2. Put the data points of the training set into the first layer classifiers and keep
their probability vectors for every data point.
3. Learn the second layer of one-vs-one classifiers with the generated probabil-
ity vectors.
• In the testing stage:
1. Put a data point of the test set into the first layer of classifiers and obtain
a probability vector.
2. Put the probability vector into the second layer of classifiers to obtain the
final prediction.
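The two-stage procedure can be sketched with scikit-learn (a hypothetical minimal setup; the thesis does not prescribe a particular library, and the random features merely stand in for image descriptors):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def pairwise_probability_vector(classifiers, x):
    """First layer: one signed value per class pair, positive if X is more likely."""
    return np.array([clf.predict_proba(x.reshape(1, -1))[0, 0] - 0.5
                     for clf in classifiers.values()]) * 2  # scaled to [-1, 1]

# --- training stage ---
np.random.seed(0)
X_train = np.random.rand(200, 10)          # placeholder features
y_train = np.random.randint(1, 5, 200)     # four classes, labels 1..4

# 1. learn the first layer of one-vs-one classifiers
first_layer = {}
for a, b in combinations([1, 2, 3, 4], 2):
    mask = np.isin(y_train, [a, b])
    first_layer[(a, b)] = SVC(probability=True).fit(X_train[mask], y_train[mask])

# 2. keep a probability vector for every training point
P_train = np.vstack([pairwise_probability_vector(first_layer, x) for x in X_train])

# 3. learn the second layer of one-vs-one classifiers on these vectors
second_layer = SVC(decision_function_shape='ovo').fit(P_train, y_train)

# --- testing stage ---
x_test = np.random.rand(10)
p_vec = pairwise_probability_vector(first_layer, x_test)
final_prediction = second_layer.predict(p_vec.reshape(1, -1))[0]
```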
Results for all of the above proposed methods are given in Section 5.2.3.
4.3 Class Imbalance
If, in a classification problem, a class contains a lot more data points than some other
class of the dataset, class imbalance is present. This is not something unusual and is
often the case in real-world scenarios.
For instance, when a car records images while driving through a city with a forward-facing
camera, the recorded images will contain more vehicles seen from the rear and the front
than from the left or right side. Also, depending on whether the country has left- or
right-hand traffic, the resulting distribution of recorded vehicle orientations will differ.
Figure 16 shows the number of images in the KITTI dataset (see Section 5.1.1) when
the orientation angles are divided into 16 classes as described earlier in Section 4.1.2.
Figure 16: Images per class.
Class 4 contains the highest number of vehicles because most of the recorded vehicles
are driving in front of the recording car or on neighboring lanes (classes 3 and 5).
Class 13 is the second largest class; it contains vehicles which pass the recording car
on the left side. In a country with left-hand traffic this peak would be expected in
class 11 instead. In right-hand traffic, however, class 11 is small because normally no
vehicles pass the recording car on the right side. The few images in this class are
mostly vehicles parked on the pavement facing the opposing direction. Furthermore,
classes 8 and 16 contain images of the right and the left side of vehicles, respectively.
The number of vehicles in these classes is also quite small; vehicles in these two classes
are typically only recorded when the recording car is waiting at traffic lights while
other vehicles are crossing the intersection.
Class imbalance can cause problems when learning a model. Two main problems can
arise from it:
1. The model is less likely to predict a small class because the large classes dominate
the construction of the decision boundaries.
2. The model does not learn enough variation due to the small number of examples.
There are different methods to deal with class imbalance, e.g., under- and over-sampling,
which are studied in [Japk00]. Additionally, when dealing with images, there is the
possibility of mirroring them. These methods are described and discussed in the following
sections. Issues arising from class imbalance can also be addressed algorithmically by
using cost-sensitive classifiers. This is not addressed here but can be found in the works
of [LiZh06] and [SKWW07].
4.3.1 Under-sampling Data Points
To obtain a more equally distributed dataset, data points from the large classes can be
under-sampled, i.e., a certain subset of data points is removed. An equally distributed
dataset has the advantage that the probability of predicting a class is not biased by the
existence of a large class [KuMa97]. Depending on the chosen prediction model, removing
data points can also reduce the time needed to learn the model. However, removing
examples comes at the cost of learning less variation because fewer examples are shown
to the model. The main question is which images can be removed without making the
prediction model worse than before:
1. Removing data points randomly:
Data points of the large classes can be removed randomly. This is easy to implement
and requires no further assumptions which could introduce bias.
The disadvantage of this approach, and of under-sampling methods in general, is
that variation in how data points can appear is lost. In order to learn a good model,
it is essential to show it a wide range of data points with a certain degree of variation.
Only then is the model able to generalize well to unseen data. In the extreme case
where a model is learned from only a handful of data points, it will most likely not
predict unseen data well because it is overfitting the training data [Hawk04].
2. Removing quasi-duplicates:
There are many images of cars from the rear in the dataset because the recording
car was sometimes driving behind a single car for a long time. This results in what
are here called quasi-duplicates: the images of the car are almost the same but not
exact duplicates. Almost identical images of cars can also occur in other classes,
e.g., while waiting at red lights, but this happens far less often.
Figure 17 shows two consecutive frames in which a car is driving in front of the
recording car. Enlarging the car reveals that there is almost no visible variation
between the enlarged vehicles. This can justify the removal of quasi-duplicates.
There is also a large number of images of the front sides of cars. As these cars
normally move towards the recording car, the appearance of the vehicle image and
its orientation angle change more rapidly, so these images do not have to be
considered quasi-duplicates.
To deal with the issue of quasi-duplicates, a method was implemented which keeps
only every tenth image of the rear of a car in sequential frames. This method
exploits the fact that a car which is in front of the recording car is most likely
still in front of it in the subsequent frame. To be able to cope with lane changes,
the threshold was set to the above-mentioned value of 10.
Figure 17: Quasi-duplicates of consecutive frames.
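Random under-sampling can be sketched as follows (a minimal illustration; the class labels are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, max_per_class):
    """Randomly keep at most max_per_class data points per class."""
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) > max_per_class:
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.extend(idx)
    keep = np.sort(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)                   # ten toy data points
y = np.array([4, 4, 4, 4, 4, 4, 13, 13, 11, 8])    # imbalanced classes
X_b, y_b = undersample(X, y, max_per_class=3)
# the large class 4 is reduced to 3 points; small classes stay untouched
```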
4.3.2 Over-sampling Data Points
There are different ways to over-sample data points for small classes. Two straightforward
methods for over-sampling are:
1. Duplicating data points:
A simple method to over-sample is to duplicate existing data points of small classes.
The success of this method depends on the utilized classifier. For instance, when
using the k-nearest-neighbor algorithm [CoHa67], a duplicate certainly has an impact
on the classification. If, however, a decision tree is used, a leaf may cover the same
region as without the duplicate, so the duplicate has no effect on the resulting
prediction.
Another obvious shortcoming of duplicating data points is that the model cannot
learn any additional variation from the duplicates. Quite the opposite can be the
case: the model can be overfitted by the mere replication of images.
2. Creating synthetic data points:
[CBHK02] introduces SMOTE (Synthetic Minority Over-sampling Technique), which
can overcome the problems of exact duplicates. The cited work shows that synthetic
data points lead to better decision boundaries, e.g. for decision trees, and that
duplicates do not help to the same extent. The synthetic data points are created by
introducing points with small variations of existing data points. Since this work
uses images, it is not obvious how a vehicle image itself could be modified in a
legitimate way. However, it is conceivable to synthetically modify the feature
vector created from the original image in the proposed way.
In [CBHK02] it was also shown that a combination of under-sampling the large class and
over-sampling the small class can improve certain classifiers. On the other hand,
[DrHo03] shows that for C4.5 [Quin93], an algorithm to learn decision trees,
under-sampling performs better than over-sampling.
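The interpolation idea behind SMOTE, applied to feature vectors as suggested above, can be sketched as follows (a simplified illustration, not the exact algorithm from [CBHK02]):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_points(X_minority, n_new, k=5):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest neighbors (simplified SMOTE)."""
    X = np.asarray(X_minority, dtype=float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # distances from point i to all other minority points
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]       # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                        # random position on the segment
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

X_min = rng.random((20, 6))       # e.g. 20 feature vectors of a small class
X_syn = synthetic_points(X_min, n_new=40)
```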
4.3.3 Mirroring Images
Another method to overcome class imbalance is to increase the number of images by
mirroring them. For a small class, the opposite class is chosen, e.g., the opposite class
of class 11 is class 13, and the opposite class of class 10 is class 14. The images from
the opposite class can then be mirrored by simply flipping the pixel columns
horizontally. Example images from a small class and an opposite class, as well as the
mirrored vehicle image, are shown in Figure 18.
(a) Small class. (b) Opposite class. (c) Mirrored image.
Figure 18: Image mirroring.
The image from the small class shows the untypical case of seeing the right front of a
vehicle, which happens only in exceptional situations in right-hand traffic. Mirroring
increases the number of image examples with natural-looking images. One could argue
that these images are not natural in the sense that the driver and the steering wheel
are on the wrong side. But as the utilized image descriptors do not depend directly on
this and are more abstract, the mirrored images can be considered equivalent to
original ones.
The main advantage of mirroring is that the prediction model can learn a lot of
variation from the new images. As the images are derived from natural ones, the model
learns real variation that can occur in the real world rather than random noise. The
number of flipped images might, however, have to be limited. This is in particular the
case when mirroring images from class 13 for the small class 11: the distribution of
images would change substantially, which can have the undesirable effects discussed
in Section 4.3.4.
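Mirroring an image and mapping its label to the opposite class can be sketched as follows; the class-mapping formula is an assumption derived from the pairs stated above (11 maps to 13, 10 maps to 14) and from the pure rear view, class 4, mapping to itself:

```python
import numpy as np

def mirror_image(img):
    """Flip an image horizontally by reversing the pixel columns."""
    return img[:, ::-1]

def mirror_class(c):
    """Opposite class under horizontal mirroring for the 16-class layout.
    The formula is an assumption derived from the stated pairs
    (11 <-> 13, 10 <-> 14); it keeps the rear-view class 4 fixed."""
    return (23 - c) % 16 + 1

img = np.arange(12).reshape(3, 4)     # toy 3 x 4 "image"
assert (mirror_image(mirror_image(img)) == img).all()
assert mirror_class(13) == 11 and mirror_class(10) == 14
assert mirror_class(4) == 4
```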
Table 10 summarizes the advantages and disadvantages of the methods explained in
the preceding sections.
Table 10: Characteristics of methods to balance classes.

Method         Under-sampling         Over-sampling           Mirroring
Advantage      No domination of       Increased impact of     Additional variation
               large classes          small classes
Disadvantage   Loss of variation      Little additional       Potentially too many
                                      variation               new data points
4.3.4 Changing the Value Distribution
All previously described methods share a shortcoming which has not been mentioned so
far. Normally, it is assumed that the distribution of the output values is the same
for the training and the test set. When the distribution is changed by over-/under-
sampling or by mirroring images, the model is more likely to reproduce this modified
distribution, even if the distribution of the test set is different.
This problem is investigated in the field of Domain Adaptation (see also [DaMa06])
but is not addressed in this work.
EXPERIMENTS 37
5 Experiments
This section gives a detailed description of the used dataset and how the results are
measured. The proposed features as well as the different approaches described in the
preceding sections are investigated and the results are reported and explained.
5.1 Datasets
Finding images of cars is not a problem; search engines can provide a plethora of
car images. Finding usefully labeled images, however, can be. The ImageNet
database2 [DDSL09] comprises many car images, but the annotations are limited to car
types and no orientation angles are given. To be able to predict vehicle orientations,
the dataset must provide annotations for these orientation angles. Some datasets which
include such annotations are:
1. KITTI Vision Benchmark Suite (Karlsruhe Institute of Technology and Univer-
sity of Toronto) [GeLU12];
2. NYC3DCars (Cornell University) [MaSn13];
3. Pascal3D+ dataset (Stanford University, 12 categories thereof one category with
cars) [XiMS14].
The experiments in this work use the very popular KITTI Vision dataset because it was
recorded by a car driving in normal traffic and is hence close to the later use case of
driver assistance systems. The other two datasets consist mostly of car pictures taken
by pedestrians from outside the driving lanes.
5.1.1 KITTI Vision Dataset
The datasets for the KITTI Vision Benchmark Suite were recorded by driving through
a German city and its surroundings. The car was equipped with the following sensors
(see also Figure 19):
• 2 color and 2 grayscale cameras;
• 1 Velodyne laser scanner;
• 1 GPS unit.
2 http://www.image-net.org/ [Accessed December 18, 2015]
Figure 19: Sensors on recording car [GeLU12].
From the recorded datasets, different challenges were created. When taking part in a
challenge, the results can be submitted to an evaluation server and are then documented
in a benchmark on the KITTI Vision project page.3 The following challenges are
available:
1. Stereo Evaluation 2012/2015;
2. Optical Flow Evaluation 2012/2015;
3. Scene Flow Evaluation 2015;
4. Visual Odometry/SLAM Evaluation 2012;
5. Object Detection Evaluation 2012;
6. Object Tracking Evaluation 2012;
7. Road/Lane Detection Evaluation 2012.
The Object Detection Evaluation 2012 consists of stereo images and point clouds from
the Velodyne scanner. This work, however, only uses forefield images from one of the
cameras, as shown in Figure 20. There are in total 7,500 images with around 22,000
cars on them. Images with heavy traffic can contain more than ten cars, but there are
also images without any cars. The annotations for the forefield images comprise:
• Type of object (car, truck, pedestrian, cyclist, . . . );
• Bounding boxes (where in the image is the vehicle);
• Orientation angle ∈ (−π, π];
• Occlusion (object is occluded by other objects);
3 http://www.cvlibs.net/datasets/kitti/index.php [Accessed December 18, 2015]
• Truncation (object is not completely inside the forefield image).
(a) City environment.
(b) Countryside environment.
Figure 20: Exemplary forefield images.
Depending on the degree of occlusion and truncation as well as the size of the car
image, the images are divided into three difficulty levels. The easy images are contained
in the moderate set, and the moderate images are contained in the hard set
(easy ⊂ moderate ⊂ hard). The accuracies obtained for the individual difficulty levels
are reported separately on the benchmark page. Table 11 gives a more detailed
explanation of the difficulty levels.
Table 11: Difficulties of the vehicle images.
Difficulty Bounding Box Height Occlusion Truncation
Easy ≥ 40 pixel Fully visible ≤ 15 %
Moderate ≥ 25 pixel Partly occluded ≤ 30 %
Hard ≥ 25 pixel Difficult to see ≤ 50 %
The dataset was recorded on sunny days and only during daytime. Hence, testing
approaches which can deal with different weather conditions is not possible with this
dataset. Still, it is a very popular dataset among researchers, and since its release in
2012 a lot of research on detection and orientation estimation has been conducted with
it [PSGS13], [OBTr14], [LiWZ14], [XCLS15], [OBTr15].
5.1.2 Evaluation Metrics
For every prediction task, an evaluation metric is necessary which quantifies the good-
ness of results. In [GeLU12], a metric called Average Orientation Similarity (AOS)
is introduced. The metric evaluates jointly the detection results and the orientation
estimation. If, however, only the orientation results shall be measured a simplified
version can be derived which can be taken from Equation (5).
m = (1 / |D|) Σi∈D f(∆i)    (5)
D is the image dataset and i is an image from this set. f(·) is the actual evaluation
function which quantifies how good or bad a prediction was; it depends on the deviation
∆i between the predicted and the actual value of the i-th image. To evaluate the
results, the value of f is calculated for every image of the dataset as a function of its
deviation. The evaluation metric is then obtained as the normalized sum of these values.
The maximal difference between a predicted angle and the ground truth angle is 180◦
(or π). This implies two basic requirements for the evaluation function f:
1. If the deviation equals 0 (perfect prediction), the value of the function should be 1.
2. If the deviation equals π (worst prediction), the value of the function should be 0.
5.1.2.1 Cosine Function. A modified cosine function was suggested by [GeLU12]
for the evaluation. It can be taken from Equation (6).
f(∆i) = (1 + cos(∆i)) / 2    (6)
This function fulfills the basic requirements stated above, but it does not penalize
small errors much, as can be seen in Figure 21a: while the error is still small, the
value of the function decreases only slowly. As is characteristic for cosine functions,
once the error grows larger the value of the function approaches 0 quite fast.
In some of the experiments described in the following sections, an undesirable effect
was visible. Assume two experiments are compared where, in the second one, the number
of perfectly predicted images decreased by 100 while the number of predictions with
small errors increased by 110; then the second experiment obtains a better evaluation
metric than the first one. It can thus be argued that already small errors are not
desirable and should be penalized accordingly.
Figure 21: Evaluation functions. (a) Modified cosine function. (b) Modified hyperbolic tangent function.
5.1.2.2 Hyperbolic Tangent Function. In order to penalize small errors to a
greater extent, this work suggests a modified hyperbolic tangent function (see Equation
(7)) which has the desired property.
f(∆i) = 1 − tanh(∆i)    (7)
For an increasing error, the value of the function decreases rapidly, as illustrated
in Figure 21b. This implies that small errors have a higher impact on the overall
performance; the evaluation function is more pessimistic than the cosine function.
A shortcoming of the tanh function, however, is that the second requirement is not
fulfilled exactly, because f(π) = 1 − tanh(π) = 0.0037 and not 0 as desired.
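Both evaluation functions, and the metric from Equation (5), can be sketched as:

```python
import math

def cosine_metric(dev):
    """Modified cosine evaluation function, Equation (6)."""
    return (1 + math.cos(dev)) / 2

def tanh_metric(dev):
    """Modified hyperbolic tangent evaluation function, Equation (7)."""
    return 1 - math.tanh(dev)

def evaluate(deviations, f):
    """Equation (5): normalized sum of the per-image scores."""
    return sum(f(d) for d in deviations) / len(deviations)

# both functions give 1 for a perfect prediction ...
print(cosine_metric(0.0), tanh_metric(0.0))     # 1.0 1.0
# ... but only the cosine variant reaches exactly 0 in the worst case
print(round(cosine_metric(math.pi), 4))         # 0.0
print(round(tanh_metric(math.pi), 4))           # 0.0037
```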
5.2 Orientation Estimation
The KITTI Vision dataset consists of two sets: a labeled training set of forefield
images with the previously mentioned annotations, and an unlabeled test set of forefield
images. As the test set is used to create the submission for the benchmark, it comes
without any annotations.
However, to evaluate the experiments for the orientation estimation, annotations are
necessary. Therefore, the KITTI training set is divided into ten subsets, or so-called
folds. It is ensured that each fold has almost the same number of forefield images.
The forefield images were recorded during independent drives on several days, but
within each drive quasi-duplicates exist (see Section 4.3.1). This makes it necessary
to divide the available images so that all images from one drive are contained in
exactly one of the folds. This is ensured by using the mapping information provided
in the development kit, which includes the assignment of images to drives.
The results reported in this section are obtained using 10-fold cross-validation
[Koha95]: nine folds are used to learn a model and the remaining fold is used for
evaluation. This procedure is repeated so that each fold has served as the evaluation
fold once. The overall quality is reported as the average of the qualities of the
single cross-validation runs.
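The drive-aware fold construction can be sketched with scikit-learn's GroupKFold (a hypothetical setup; the drive IDs stand in for the mapping from the development kit):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_images = 100
X = rng.random((n_images, 8))             # placeholder image features
y = rng.integers(1, 17, n_images)         # 16 orientation classes
drives = rng.integers(0, 20, n_images)    # hypothetical drive IDs

# all images of a given drive end up in exactly one fold
cv = GroupKFold(n_splits=10)
for train_idx, test_idx in cv.split(X, y, groups=drives):
    assert set(drives[train_idx]).isdisjoint(set(drives[test_idx]))
```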
After the forefield images are divided in the described way, the vehicles can be cropped
from the images using the provided bounding box annotations (see Figure 22). This work
focuses on cars, so only car images are cropped out; trucks, pedestrians, cyclists and
others are omitted.
Figure 22: Cropping cars using the provided annotations.
5.2.1 Histogram-based Features
In this section, the results for the HOG and HSC experiments are reported, as well as
a comparison of these two image descriptors. Here, the formulation as a classification
problem is taken as a basis. In Section 5.2.2, the results for regression and also the
combined method are given.
5.2.1.1 HOG Features. An important parameter for the creation of the HOG
features is the number of vertical and horizontal cells. If too few cells are chosen,
the feature is not descriptive enough. Too many cells, however, capture too much
information, which can be counterproductive because the model cannot identify the
relevant parts. It was found that a small number of cells, 4 vertical and 6 horizontal,
aggregates the image information strongly enough while keeping sufficient information
about the edge geometry of the car in the image. The cells are grouped into blocks of
2 x 2 cells.
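A simplified, dependency-free version of such a cell-based gradient histogram can be sketched as follows (block grouping and normalization are omitted; the cell counts follow the setting above):

```python
import numpy as np

def hog_features(img, n_cells=(4, 6), n_bins=9):
    """Simplified HOG: per-cell histograms of unsigned gradient orientations,
    weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # orientation in [0, pi)
    ch, cw = img.shape[0] // n_cells[0], img.shape[1] // n_cells[1]
    hist = np.zeros(n_cells + (n_bins,))
    for r in range(n_cells[0]):
        for c in range(n_cells[1]):
            m = mag[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            a = ang[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            bins = np.minimum((a / np.pi * n_bins).astype(int), n_bins - 1)
            for b in range(n_bins):
                hist[r, c, b] = m[bins == b].sum()
    return hist.ravel()

feat = hog_features(np.random.rand(32, 48))       # toy crop, 4 x 6 cells
```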
The results for the described parameter setting are given in Figure 23. The bars of the
diagram show how many images are predicted with a certain deviation. The deviation is
the smallest angle between the predicted pose p and the actual pose a (both
p, a ∈ (−π, π]), so the highest possible deviation is π. For instance, the first bar on
the left means that slightly fewer than 3,500 images are predicted with a deviation of
less than 0.17, which is approximately 10◦. These predictions are considered nearly
perfect. The second bar states how many predictions have an error between 0.17 and
0.34, and so on. The colors in the diagram indicate the class of the image example:
images from “−2.75” belong to class 1, and so on up to “3.14”, which contains images
from class 16 as depicted in Figure 14.
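The deviation, i.e. the smallest angle between two poses, can be computed as:

```python
import math

def deviation(p, a):
    """Smallest angle between predicted pose p and actual pose a,
    both in (-pi, pi]; the result lies in [0, pi]."""
    d = abs(p - a) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

print(deviation(0.1, -0.05))    # ~0.15
print(deviation(3.1, -3.1))     # ~0.083: wraps around instead of 6.2
```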
Figure 23: HOG results (cosine metric: 0.9490; tanh metric: 0.8383).
A. Analyzing Large Errors. Figure 23 reveals that some car images are predicted
with the maximal error of 3.14. These are cars which are predicted as being seen from
the exact opposite angle, which was also observed in [RHMH10]: for instance, an image
which is predicted as the rear of a vehicle but actually shows the front. A few examples
of images with a large error can be found in Figure 24. The annotation dev = 3.14
means that the deviation between prediction and ground truth is 3.14 ≈ π = 180◦,
i.e., the maximal deviation.
B. Neighborhood of Images. To further analyze why a certain image is predicted
with a large error, its closest neighbors in feature space were investigated. To determine
those neighbors, the distance between the features of two images can be measured either
with the l2-norm or with the cosine similarity, which is more suitable for
high-dimensional data.

Figure 24: Examples of badly predicted images (each annotated with its deviation dev).

Figure 25 shows the nine closest neighbors of one of the badly predicted cars, which is
shown in the top-left corner.
Neighbors from the same actual class as the badly predicted car are annotated in red;
neighbors from the same (wrongly) predicted class are annotated in green. Looking at
the nearest neighbors, it becomes apparent that there are more neighbors from the
wrongly predicted class than from the actual class, so it is not surprising that the
prediction is wrong. Moreover, comparing the neighbors from the different classes with
each other shows why these images are close in feature space: the overall orientations
of their edges are quite similar, and only this is captured by the HOG features.
To overcome this issue, it is conceivable to combine the HOG feature with a blob
detector [Lind98] which finds tail or head lights in an image. For instance, if tail
lights are detected, the prediction of classes with front images is blocked and only
classes with rear images can be predicted. This approach would, however, not work for
side images and is not investigated further in this work.
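Retrieving the nearest neighbors with the cosine similarity can be sketched as follows (illustrative only; random vectors stand in for the HOG features):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_neighbors(query, features, k=9):
    """Indices of the k feature vectors most similar to query (cosine similarity)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    similarity = f @ q                      # cosine similarity to every feature
    return np.argsort(similarity)[::-1][:k]

features = rng.random((1000, 540))          # e.g. HOG features of 1000 car crops
query = features[42]
neighbors = nearest_neighbors(query, features)
assert neighbors[0] == 42                   # the query itself is its closest match
```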
C. Image Preprocessing. Some of the cars in Figure 24 are over- or underexposed,
which could justify preprocessing the images. Images 3 and 11 are overexposed due to
too much sunlight; images 8 and 16 are underexposed because the cars are located in
shadow. The HOG feature then cannot capture the edges because there is not enough
contrast in the image.
Figure 25: Neighborhood of a badly predicted image (actual class 11, predicted class 3). Of the nine nearest neighbors, two have actual class 11 and seven have actual class 3.
To address this, a simple method was implemented which increases the contrast in a
different range depending on whether an over- or underexposed image was found.
However, the overall quality worsened when all images were preprocessed this way.
5.2.1.2 HSC Features. To make the results comparable, the same number of cells
was used to calculate the HSC features. The result diagram for HSC is shown in Figure
26. It reveals that 79 more images could be predicted nearly perfectly with HSC
compared to HOG. Even more importantly, the number of images with a large prediction
error was reduced by 74. A more detailed comparison is given in Section 5.2.1.3.
Figure 26: HSC results with 256 dictionary elements (cosine metric: 0.9656; tanh metric: 0.8533).
Dimensionality of HSC Features. A problem of the HSC feature is its high
dimensionality (see Section 3.2.5), which leads to long computation times, especially
for model training but also for prediction. Reducing the dimensions with Principal
Component Analysis (PCA) [Pear01], [Hote33] or the faster Probabilistic PCA [TiBi99]
was attempted, but as computing the principal components and rebuilding the features
is computationally expensive itself, these methods did not solve the problem of faster
training and prediction.
As the number of cells should stay the same as for HOG, the dimensionality of HSC can
only be reduced via the number of dictionary elements. The effect of dictionaries with
a varying number of elements on computation time and prediction quality is shown in
Figure 27.
Figure 27: Dictionary length vs. accuracy vs. time (cosine metric 0.955, 0.964, 0.964 and 0.966 for dictionary lengths 25, 64, 100 and 256, respectively).
With a dictionary of 256 elements, the cross-validation takes 60 minutes and the
cosine metric is 0.966. When the number of dictionary elements is reduced, a nearly
linear reduction of the computation time can be observed (red curve), while the
prediction quality remains almost constant down to 64 dictionary elements (blue
curve). In summary, reducing the dictionary length reduces the computation time
considerably at the cost of only a small decrease in quality.
When real-time computation is necessary, as would be the case for a driver assistance
system, the long computation time of the HSC feature with 256 elements is a
shortcoming; a smaller dictionary would be more appropriate in this scenario.
5.2.1.3 Comparison of HOG and HSC. In the preceding section it was mentioned
that HSC increases the number of perfectly predicted cars and reduces the number of
predictions with a large error. Table 12 quantifies these findings. Besides the
evaluation metrics and the computation time of the cross-validation process, the
percentage of predictions with a certain error is reported. A deviation between
prediction and ground truth of less than 0.33 is considered small, a deviation between
0.33 and 2.81 medium, and a deviation larger than 2.81 large.
Table 12: Comparison of HOG and HSC.

                 HOG       HSC (64)   HSC (256)
cosine metric    0.949     0.964      0.966
tanh metric      0.838     0.850      0.853
small errors     93.5 %    94.7 %     95.0 %
medium errors    2.3 %     2.5 %      2.3 %
large errors     4.2 %     2.8 %      2.7 %
comp. time       3.3 min   10.0 min   59.2 min
Two different HSC features are compared, one with 256 dictionary elements and one
with only 64. Both deliver better results than the HOG feature, though computing HOG
features is faster than either HSC variant.
5.2.2 Regression vs. Classification
As described in Section 4.1, the estimation of orientations can be formulated as a
regression or a classification task. This section gives results for both approaches as
well as for the proposed combined method. For the results reported here, HOG features
were used; however, the HSC features behave analogously.
5.2.2.1 Regression Results. First, results for two different regression models are
given: a regression tree as described in [BFSO84] and SVR as in [VaVa98].
A. Regression Tree. The tree is pruned to avoid an overly complex model: branches
are cut off if the information gain is below a certain threshold. The mean squared
error is used as the split criterion. Modifying this criterion to incorporate the
circular structure of angles is conceivable and could further improve this regression
approach.
Figure 28a shows the results of the regression tree. Most predictions are very good
with only a small error, yet another peak of badly predicted images is visible, and a
small number of wrongly predicted images is spread over the whole range of errors in
between.
Figure 28: Comparison of regression results. (a) Regression tree (cosine metric: 0.8364; tanh metric: 0.7258). (b) Support Vector Regression (cosine metric: 0.8299; tanh metric: 0.4993).
B. SVM Regression. For the SVR predictions, the LIBSVM library [ChLi11] is used with a linear kernel and the parameter configuration C = 1, ε = 0.1. The results of the SVR predictions are shown in Figure 28b. They differ markedly from the results obtained so far: the number of nearly perfectly predicted images is much smaller, but there are also far fewer predictions with a large error than for the regression tree. The two peaks visible for the regression tree are not present with SVR; instead, the number of predictions declines steadily across the range of errors.
When comparing these two methods based on the evaluation metrics described in Section 5.1.2, the values of the cosine metric are close, 0.836 versus 0.830. However, it can be argued that a model which predicts very well in most cases and fails only in a few is better than a model which makes erroneous predictions most of the time. This is revealed by the hyperbolic tangent metric, where the difference between the tree (0.726) and the SVR (0.499) is rather large. A detailed overview is given at the end of Section 5.2.2.3.
5.2.2.2 Classification Results. The range of orientation angles is divided into 16 classes of equal size, as was also done in [GeLU12] and [YBAL14]. Each class thus covers an angular range of 360°/16 = 22.5°. Using this classification approach means that only the 16 distinct class centers can be predicted. The maximal deviation between any possible orientation angle and the nearest class center is 22.5°/2 = 11.25°. The number of classes is set to 16 because this maximal deviation can be considered acceptable.
Dividing into even more classes could have the undesirable effect that there will be
classes with even fewer or no data points at all.
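The discretization described above can be sketched as follows; the angle convention [-π, π) is an assumption made for illustration:

```python
import math

N_CLASSES = 16
CLASS_WIDTH = 2 * math.pi / N_CLASSES   # 22.5 degrees in radians

def angle_to_class(angle):
    """Map an angle in [-pi, pi) to one of 16 equally sized classes."""
    return int((angle + math.pi) // CLASS_WIDTH) % N_CLASSES

def class_center(label):
    """Return the center angle of a class, again in [-pi, pi)."""
    return -math.pi + (label + 0.5) * CLASS_WIDTH

# Predicting only class centers caps the error at half a class width.
max_dev = CLASS_WIDTH / 2                # 11.25 degrees
```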
SVM Classification. For this multi-class classification problem, one-vs-one classifiers are used and the majority rule is applied to the individual predictions. Again, the LIBSVM library [ChLi11] is used to conduct the necessary computations. A linear kernel is found to work best. This can be explained by the fact that the feature vector is quite large, so a more complex kernel, such as a radial basis function (RBF) or polynomial kernel, is not capable of separating the data points any better. The cost parameter C of the linear kernel is subject to optimization and is tuned by a reduced grid search (see Figure 29).
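The majority rule over one-vs-one classifiers can be sketched independently of the SVM itself; `pairwise` below is a hypothetical stand-in for the trained binary classifiers:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, pairwise):
    """Majority vote over one-vs-one classifiers.

    pairwise[(i, j)] is a function returning the winning class
    (i or j) for sample x; every binary decision casts one vote."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[pairwise[(i, j)](x)] += 1
    return votes.most_common(1)[0][0]

# Toy setup with 3 classes where class 1 wins both of its duels.
clf = {(0, 1): lambda x: 1, (0, 2): lambda x: 0, (1, 2): lambda x: 1}
pred = one_vs_one_predict(None, [0, 1, 2], clf)
```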
[Figure: cosine metric (0.90–0.95) plotted against the cost parameter (0–2).]
Figure 29: Tuning the cost parameter for the linear kernel.
The results of the predictions are shown in Figure 30b. As in the regression case, there is a peak of nearly perfectly predicted images, but also another peak of very badly predicted images. The main difference is that in the classification case there are not as many incorrectly classified examples spread across the whole range of deviations. Also, the number of images in the second bin is, at about 1,200 compared to 400 for regression, considerably larger.
5.2.2.3 Combined Classification and Regression. As mentioned in Section 4.1.3, a combination of classification and regression may be beneficial if a certain relationship between these two approaches can be detected. For that purpose, a result figure with a smaller bin size for the errors is given in Figure 30. The results are the same; the smaller bin size only enables a more detailed view of the predictions.
The regression provides near-perfect predictions (deviation < 0.08) for around 2,500 images, whereas classification predicts only about 1,500 images perfectly (see Figure 30a). This can be explained by the fact that for the classification predictions only the 16
[Figure: histograms of the prediction error with a smaller bin size (number of images per deviation bin).]
(a) Regression results: Regression Tree (cosine metric: 0.8364, tanh metric: 0.7258).
(b) Classification results: SVM Classification (cosine metric: 0.9490, tanh metric: 0.8383).
Figure 30: Comparison of regression and classification.
class centers are available. Hence, some images cannot be predicted more accurately, because their angles lie between these centers and will always carry some error. For classification, the second and third bins in the figure, which consist of predictions with an acceptable error (see Figure 30b), are considerably larger than the corresponding bins for regression. Conversely, across the remaining error ranges, regression produces many badly predicted images whereas classification produces comparatively few.
To sum up, regression predicts very well most of the time; however, when it fails, the error can be anything from small to large. Classification predicts well in most cases, but when it fails, the error is most likely large. Based on this observation, the previously described method was developed, which incorporates the advantages of both approaches. The threshold θ, which decides when to use which model, was optimized; the best results were obtained by setting θ = 0.25.
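The exact combination rule is defined in Section 4.1.3 and not reproduced in this excerpt; the sketch below shows one plausible reading of it, in which the fine-grained regression output is trusted whenever it lies within θ of the predicted class center (the function names are illustrative):

```python
import math

def circ_dist(a, b):
    """Angular distance wrapped into [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def combined_predict(x, regress, classify, theta=0.25):
    """Hypothetical combination rule: keep the regression output when
    it agrees with the coarser but more robust classification result
    to within theta, otherwise fall back to the class center."""
    r = regress(x)
    c = classify(x)          # assumed to return a class-center angle
    return r if circ_dist(r, c) <= theta else c
```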
Figure 31 shows that combining the two approaches is beneficial. Firstly, the number of perfect predictions increased; even more images are now predicted perfectly than with regression alone. Secondly, the erroneous predictions spread over the whole spectrum of errors were reduced, because in those cases the classification model takes over. However, the number of badly predicted images was not reduced, because this is a shortcoming of both prediction approaches.

Another point that can be taken from the figures is that, while there is clearly a strong improvement, the cosine metric barely changes: from 0.949 when only applying SVM classification to 0.950 with the combined method. In the hyperbolic tangent metric, the change is more visible, from 0.838 to 0.870.
Table 13 gives an overview of the tested approaches and methods. SVR has the smallest
percentage of large errors but is exceeded by the other methods in all other categories.
[Figure: histogram of the prediction error for the combined method (number of images per deviation bin). Regression Tree + SVM Classification (cosine metric: 0.9504, tanh metric: 0.8698).]
Figure 31: Results of combined classification and regression.
The regression tree is the fastest method when it comes to learning the model and predicting. Comparing only the error percentages of the SVM classification and the combined method, one could infer that SVM classification is better; however, these percentages depend on where the boundaries are set. The evaluation metrics report better results for the combined method because they take into account that it has the largest number of perfect predictions (deviation < 0.08). The computation time of the combined method is understandably longer: it amounts to the sum of the computation times for the regression tree and the SVM classification, plus the overhead for deciding which model to use.
Table 13: Comparison of regression and classification methods.

                    Regr. Tree   SVR       SVM Classif.   Combined
    cosine metric   0.836        0.830     0.949          0.950
    tanh metric     0.726        0.499     0.838          0.870
    small errors    73.3 %       32.4 %    93.5 %         92.8 %
    medium errors   17.2 %       66.7 %    2.3 %          3.0 %
    large errors    9.5 %        0.9 %     4.2 %          4.2 %
    comp. time      0.7 min      7.7 min   3.3 min        4.3 min
5.2.3 Results for Classifier Modifications

This section reports the results for the classifier modifications introduced in Section 4.2. The first approach incorporates the circular structure of the classes by using weighted voting. The second method uses specialized classifiers which are trained to predict a class only under high certainty. The third approach uses a second layer of classifiers operating on probabilities from the one-vs-one classifiers
instead of doing a majority vote. Table 14 lists the results of the three modified
approaches and compares them to the application of the simple majority vote.
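As an illustration of the first modification, weighted voting over the circular class structure might look as follows; the exact weighting scheme of Section 4.2 is not reproduced in this excerpt, so the neighbour-crediting rule below is a hypothetical example:

```python
N = 16  # number of orientation classes

def weighted_vote(winners, n=N, decay=0.5):
    """Hypothetical circular weighted voting: each pairwise winner also
    credits its immediate neighbours on the class circle with a smaller
    weight, so near-misses between adjacent classes are less harmful."""
    scores = [0.0] * n
    for w in winners:
        scores[w] += 1.0
        scores[(w - 1) % n] += decay   # neighbouring classes on the
        scores[(w + 1) % n] += decay   # circle share in the vote
    return max(range(n), key=scores.__getitem__)
```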
Table 14: Comparison of classifier modifications.

                    Majority V.   Weighted V.   Spec. Classif.   Prob. Classif.
    cosine metric   0.9490        0.9489        0.9488           0.9211
    tanh metric     0.8383        0.8379        0.8381           0.7921
    small errors    93.5 %        93.5 %        93.5 %           85.5 %
    medium errors   2.3 %         2.3 %         2.3 %            8.7 %
    large errors    4.2 %         4.2 %         4.2 %            5.8 %
    comp. time      3.3 min       11.3 min      22.1 min         15.8 min
Weighted voting and the specialized classifiers produce nearly the same results as majority voting; the difference between the three methods is only visible in the fourth decimal place of the metrics. Weighted voting nevertheless has a considerably longer computation time than majority voting. This is because this method was implemented in MATLAB, whereas the majority voting is already implemented in the LIBSVM library, which is based on C++. The method with the specialized classifiers also has a long runtime because twice as many classifiers need to be trained. The results produced by the approach with a second layer of classifiers are not as good as those of the other three methods. Its runtime is also high because probabilities have to be calculated in the first layer and a second layer of classifiers has to be learned.
To sum up, despite majority voting being a simple rule, it produces slightly better results than all of the other methods. Incorporating information about the circular structure of angles, or extracting a more detailed view in the form of probabilities, did not lead to a quantifiable improvement of the overall results in this particular case.
5.3 Joint Object Detection and Orientation Estimation

Until now, the experiments were conducted using the cropped car images. To evaluate how well the developed orientation estimator works, it has to be compared to other models. This was done by participating in the so-called “Object Detection and Orientation Estimation Evaluation” provided by the KITTI Vision developers. For the evaluation, the test set, which comes without annotations, is used. The bounding box annotations and the orientation angles need to be created and then uploaded to the evaluation server. The server executes the evaluation automatically and reports the
results by comparing the submitted annotations with the ground truth which is not
publicly available.
To be able to submit results, a joint task has to be accomplished: firstly, an object detector needs to find all cars in the forefield images; secondly, the orientation angle of every found car has to be predicted by the orientation estimator. The process is depicted in Figure 32.
[Figure: an object detection model finds the cars in the scene; an orientation estimation model then predicts an angle for each of them (e.g. angle 1 = 1.96, angle 2 = 1.57, angle 3 = −1.57).]
Figure 32: Joint detection and orientation estimation.
The Average Orientation Similarity (AOS) reported by the evaluation server incorporates not only the quality of the orientation estimation but also the quality of the detection. The detection quality is an upper bound on the achievable score, so the AOS can never be higher than the detection metric. This means that a poor vehicle detector combined with an excellent orientation estimator will still result in a poor evaluation score; to achieve good results, both a good orientation estimator and a good detector are necessary.
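The per-detection orientation similarity underlying the AOS, as defined for the KITTI benchmark in [GeLU12], maps an angular error Δθ to (1 + cos Δθ)/2:

```python
import math

def orientation_similarity(delta):
    """Per-detection orientation similarity of the KITTI benchmark:
    1 for a perfect angle, 0 for the opposite direction."""
    return (1.0 + math.cos(delta)) / 2.0

# A false-positive detection contributes a similarity of 0 regardless
# of the orientation estimator, which is why the AOS is bounded above
# by the plain detection score.
```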
The approaches for this combined challenge can be differentiated into two categories:
1. Learning a joint model which detects a vehicle and, at the same time, predicts its orientation. This can be beneficial in that information generated while looking for a car can also be used for the orientation estimation.
2. A two-step approach where the detection is done first and the orientation esti-
mation model is built on top as the second step.
As this work focuses on orientation estimation, the second approach was chosen, where the orientation estimation works independently of the detection. First, an object detector was trained with the help of publicly available code from [OBTr15]. To obtain comparable results, in a second step, the exact detection results provided by the researchers of [OBTr15] were utilized.
5.3.1 Processing Pipeline
To create the submission results, the object detector first has to be trained, or alternatively a pre-trained model can be used. For the orientation estimator, the best model determined by cross-validation is chosen (see Section 5.2). The next step is to detect cars in the KITTI Vision test set with the object detector, which returns a bounding box and a detection score. The detection score expresses the confidence of the detection and is used to create the recall/precision curve as well as to calculate the evaluation score. By means of the bounding boxes, the cars are cropped from the test images and the feature vector (HOG or HSC) for each cropped image is computed, as was done for the training images. The vector is handed to the orientation estimator, which returns the orientation angle of the car. In summary, the processing pipeline looks as follows:
1. Train (or load a pre-trained) object detector,
2. Train orientation estimator,
3. For each forefield image in the test set:
(a) Detect cars in the forefield image (return bounding box and detection score),
(b) Crop cars from the forefield image,
(c) For each cropped car:
i. Generate feature vector,
ii. Predict the orientation of the car (return orientation angle).
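The steps above can be sketched as a small driver function; `detect_cars`, `compute_features` and `estimate_orientation` are hypothetical stand-ins for the detector, the HOG/HSC descriptor, and the orientation estimator:

```python
import numpy as np

def image_crop(image, box):
    """Cut a (x1, y1, x2, y2) bounding box out of an image array."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def process_test_set(images, detect_cars, compute_features,
                     estimate_orientation):
    """Hypothetical pipeline driver: detect, crop, describe, estimate."""
    results = []
    for image in images:
        for box, score in detect_cars(image):   # (bounding box, score)
            crop = image_crop(image, box)
            features = compute_features(crop)   # HOG or HSC vector
            angle = estimate_orientation(features)
            results.append((box, score, angle))
    return results

# Toy run with stand-in callables: one dummy image, one fixed detection.
toy = process_test_set(
    [np.zeros((10, 10))],
    detect_cars=lambda im: [((2, 2, 6, 6), 0.9)],
    compute_features=lambda crop: crop.shape,
    estimate_orientation=lambda f: 1.57,
)
```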
5.3.2 Submission Results
As mentioned earlier, two different detectors were used. Since the AOS depends on the object detection results, the results of two different detectors cannot be compared to each other; only submissions which used the same detector can be compared reliably. The AOS is reported for the three difficulty levels (easy, moderate, hard); however, the final ranking is based only on the score of the moderate case.
5.3.2.1 Trained Object Detector. The first two submissions used a detector which was trained as described in [OBTr15]. The orientation estimator uses HSC features with a dictionary of 64 elements. Quasi-duplicates of rear-view images were removed from the cropped car images. The prediction was done using multi-class SVM classification with one-vs-one classifiers and majority voting. In the training step of the orientation estimator for submission (A), only images from the “easy” subset were used. This subset contains neither truncated nor occluded cars, and the vehicle images must have a height of more than 40 pixels. For submission (B), partly truncated and partly occluded cars as well as cars with a minimal height of 25 pixels were also used. The results of both submissions are given in Table 15.
Table 15: Results of trained detector.

    Submission                     Moderate   Easy      Hard
    (A) Trained on easy set        68.82 %    76.36 %   54.87 %
    (B) Trained on moderate set    70.41 %    76.53 %   56.63 %
As the final ranking considers only the results for the moderate case, only this case is described here; for completeness, the results for the easy and hard cases are also reported in the tables. The results improved by 1.59 % when the training of the orientation estimator included the images from the moderate set, as was the case in submission (B). This means that exposing the model to more variation, in the sense that a car can also be partly occluded or truncated, led to an improvement in the overall results. Thus, the orientation estimator is less dependent on a perfect car image than before.
5.3.2.2 Provided Detection Results. As mentioned above, the exact detection results of [OBTr15] for the KITTI Vision test set were provided to make a comparison possible. Four submissions (C)–(F) were made using these detection results. The setup for submission (C) was the same as in submission (B), with the difference that instead of applying only classification, the combined classification and regression approach was used. For submission (D), mirrored images for small classes were also used. Submission (E) uses no mirrored images, and the removal of quasi-duplicates was also omitted. The final submission (F) used class weights to deal with the class imbalance algorithmically. The results of the submissions are reported in Table 16.
Table 16: Results with provided detection.

    Submission                          Moderate   Easy      Hard
    (C) Removal of quasi-duplicates     73.77 %    82.97 %   58.14 %
    (D) As (C) + mirrored images        73.67 %    83.00 %   57.99 %
    (E) No removal, no mirroring        73.95 %    83.07 %   58.29 %
    (F) As (E) + class weights          73.80 %    82.93 %   58.18 %
The improvement from submission (B) to submission (C) by 3.36 % is mainly caused by the improved detection results; as mentioned earlier, results obtained with different detectors are not directly comparable. However, it is likely that the combined classification and regression used in (C) also led to a small overall improvement. Adding mirrored images in submission (D) did not improve the overall results; this may be because the distribution of orientation angles differs between the training set (with mirrored images) and the test set (no mirrored images). For submission (E), no quasi-duplicates were removed and no mirrored images were added, so the distribution of orientation angles is the same in the training and the test set. This submission provided the best overall results. The introduction of class weights in submission (F), which makes it more likely to predict small classes, did not improve the overall results.
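One common way to derive such class weights (e.g. for LIBSVM's per-class weight option) is inverse class frequency; whether the thesis used exactly this scheme is not stated, so the sketch below is illustrative:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its share of the training
    data, so rare orientation classes are penalised more heavily when
    misclassified."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Class 1 is three times rarer than class 0, so it gets 3x the weight.
w = inverse_frequency_weights([0, 0, 0, 1])
```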
To conclude the experiments section, Table 17 compares the orientation estimator of [OBTr15] with the best submission produced in this work. Note that the two estimators use the same detection results and can therefore be compared directly. Other orientation estimators are not listed here but can be found on the web page of the challenge.4
Table 17: Comparison of orientation estimators.

    Rank   Submission                 Moderate   Easy      Hard      Runtime
    6      [OBTr14]                   74.42 %    83.41 %   58.83 %   0.7 s
    7      (E) Best own submission    73.95 %    83.07 %   58.29 %   5.5 s
The results of the competing model slightly exceed the best submission of this work by 0.47 %. The runtime, which is the average processing time per forefield image, is higher for the own submission. As the experiments were implemented in MATLAB, this does not come as a surprise; by switching to, for example, C++, a considerable acceleration of the processing could be achieved.

4http://www.cvlibs.net/datasets/kitti/eval_object.php [Accessed December 18, 2015]
CONCLUSION & FUTURE WORK 59
6 Conclusion & Future Work
In this work, a vision-based model was developed which is able to estimate orientations
of vehicles. The model can be plugged onto an object detector and only requires the
bounding box which contains the detected vehicle. Besides the bounding box no further
information is necessary to successfully estimate the vehicle's orientation. Thus, the model can be used together with any object detector, as was demonstrated by taking part in a prediction challenge. Good results were achieved without needing any additional information from the detector.
Extensive experiments with different prediction methods were conducted and reported. Multi-class classification with an SVM produced very good overall results. The suggested classifier modifications were not able to outperform the simple one-vs-one classifiers with majority voting. Furthermore, it was shown that a regression tree predicts many vehicles with a small error, but its overall results are not as good as those of SVM classification. A new approach was developed which improved the results by combining the advantages of SVM classification and a regression tree.
Two different image descriptors were used; both implicitly detect edges and aggregate them into histograms. It was shown that the HSC features perform better than the HOG features, especially when it comes to reducing large errors. However, HOG features are faster to compute than HSC features. To further speed up the computation of HSC features, the implementation can be parallelized and ported to a compiled language like C++.
Preliminary experiments for difficult lighting conditions were conducted. A further improvement of the results could be achieved by preprocessing over- and underexposed images. The preprocessing should ensure that all edges are clearly visible, as this would enable a better detection of edges and therefore make the utilized features more descriptive.
To reduce the number of large errors, future work could include the detection of tail- or headlights on a vehicle. This additional information could be used to rule out certain angles; for example, when taillights are detected, no front-view angle may be predicted. It should be noted that in a real-world scenario, redundant systems track the orientations of vehicles and sequential information is used. An erroneous prediction from the vision-based orientation estimator can then be detected and overruled by these other systems when their information deviates.
This work uses a dataset which consists only of daytime images. During the day, the edges of vehicles can be captured quite well in most cases. At night, a camera which captures only visible light will most likely not capture the edges in the same way. Therefore, a night-vision camera which records light in the infrared or ultraviolet spectrum could be of help. As no such datasets are available at the moment, an investigation of this is still pending.
REFERENCES xiii
References
[AhEB06] Michael Aharon, Michael Elad and Alfred Bruckstein. K-SVD: An Algo-
rithm for Designing Overcomplete Dictionaries for Sparse Representation.
Signal Processing, IEEE Transactions on, 54(11), 2006, pp. 4311–4322.
[Barb10] David Barber. Bayesian Reasoning And Machine Learning. Cambridge
University Press. 2010.
[BETV08] Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool. Speeded-
Up Robust Features (SURF). Computer Vision and Image Understanding,
110(3), 2008, pp. 346–359.
[BFSO84] Leo Breiman, Jerome Friedman, Charles J Stone and Richard A Olshen.
Classification and regression trees. CRC press, 1984.
[Bish06] Christopher Bishop. Pattern Recognition and Machine Learning. Informa-
tion Science and Statistics. Springer-Verlag, New York. 1 Edition, 2006.
[CBHK02] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, 2002, pp. 321–357.
[ChLi11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM : A Library for Support
Vector Machines. ACM Transactions on Intelligent Systems and Technology
(TIST), Vol. 2, 2011, pp. 1–39.
[CKZB15] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin
Ma, Sanja Fidler and Raquel Urtasun. 3D Object Proposals for Accurate
Object Class Detection. In NIPS, 2015.
[CoHa67] Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classifica-
tion. IEEE Transactions on Information Theory, 13(1), 1967, pp. 21–27.
[CoVa95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. In Machine
learning, Vol. 20. Springer, 1995, pp. 273–297.
[DaMa06] Hal Daume III and Daniel Marcu. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, Vol. 26, 2006, pp. 101–126.
[DARP04] DARPA (Defense Advanced Research Projects Agency). Final
Data from DARPA Grand Challenge. http://archive.darpa.
mil/grandchallenge04/media/final_data.pdf [Accessed December, 18
2015], 2004.
[DaTr05] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human
Detection. 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05), Vol. 1, 2005, pp. 886–893.
[DDSL09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei.
Imagenet: A large-scale hierarchical image database. Computer Vision
and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009,
pp. 248–255.
[DrHo03] Chris Drummond and R.C. Holte. C4.5, class imbalance, and cost sensitiv-
ity: why under-sampling beats over-sampling. Workshop on Learning from
Imbalanced Datasets II, 2003, pp. 1–8.
[Elka01] Charles Elkan. The foundations of cost-sensitive learning. International
joint conference on artificial intelligence, 2001.
[EnAH99] Kjersti Engan, Sven Ole Aase and John Hakon Husoy. Method of optimal
directions for frame design. 1999 IEEE International Conference on Acous-
tics, Speech, and Signal Processing. Proceedings. ICASSP99, Vol. 5, 1999,
pp. 2443–2446.
[FGMR10] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ra-
manan. Object detection with discriminatively trained part-based models.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9),
2010, pp. 1627–1645.
[FrSc97] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization
of On-Line Learning and an Application to Boosting. Journal of Computer
and System Sciences, 55(1), 1997, pp. 119–139.
[FSZC99] Wei Fan, Sj Stolfo, J Zhang and Pk Chan. AdaCost: Misclassification Cost-
Sensitive Boosting. ICML ’99 Proceedings of the Sixteenth International
Conference on Machine Learning, 1999, pp. 97–105.
[Fuku80] Kunihiko Fukushima. Neocognitron: A self-organizing neural network
model for a mechanism of pattern recognition unaffected by shift in po-
sition. Biological Cybernetics, 36(4), 1980, pp. 193–202.
[GeLU12] Andreas Geiger, Philip Lenz and Raquel Urtasun. Are we ready for Au-
tonomous Driving? The KITTI Vision Benchmark Suite. IEEE Conference
on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
[GeWU11] Andreas Geiger, Christian Wojek and Raquel Urtasun. Joint 3D Estimation
of Objects and Scene Layout. Advances in Neural Information Processing
Systems (NIPS), 2011, pp. 1–9.
[Girs15] Ross Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, to appear
in ICCV 2015, 2015.
[GLOH11] Michael Gabb, Otto Lohlein, Matthias Oberlander and Gunther Heide-
mann. Efficient monocular vehicle orientation estimation using a tree-based
classifier. 2011 IEEE Intelligent Vehicles Symposium (IV), 2011, pp. 308–
313.
[Hars15] Alexander Hars. Key players in the driverless market. http://www.
driverless-future.com/?page_id=155 [Accessed December 18, 2015],
2015.
[HaTF09] Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements
of Statistical Learning. Springer Series in Statistics. Springer, Berlin. 2
Edition, 2009.
[Hawk04] Douglas M. Hawkins. The Problem of Overfitting. Journal of Chemical
Information and Computer Sciences, 44(1), 2004, pp. 1–12.
[HiOT06] Geoffrey E. Hinton, Simon Osindero and Yee Whye Teh. A fast learning
algorithm for deep belief nets. Neural computation, 18(7), 2006, pp. 1527–
54.
[Hote33] Harold Hotelling. Analysis of a complex of statistical variables into principal
components. In Journal of educational psychology, Vol. 24. Warwick & York,
1933, p. 417.
[HsLi02] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass
support vector machines. Neural Networks, IEEE Transactions on, 13(2),
2002, pp. 415–425.
[Hsu15] Jeremy Hsu. Autonomous Car Sets Record in Mexico. http://spectrum.
ieee.org/cars-that-think/transportation/self-driving/
driverless-car-sets-autonomous-driving-record-in-mexico [Ac-
cessed December 18, 2015], 2015.
[ITSA14] Intelligent Transportation Society of America ITSA. Accelerating Sustain-
ability: Demonstrating the Benefits of Transportation Technology. 2014.
[Japk00] Nathalie Japkowicz. The Class Imbalance Problem: Significance and Strate-
gies. Proceedings of the 2000 International Conference on Artificial Intelli-
gence, 2000, pp. 111–117.
[KoAl01] Aleksander Kolcz and Joshua Alspector. SVM-based Filtering of E-mail
Spam with Content-specific Misclassification Costs. Proceedings of the
TextDM’01 Workshop on Text Mining - held at the 2001 IEEE Interna-
tional Conference on Data Mining, 2001, pp. 1–14.
[Koha95] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (IJCAI), Vol. 14, 1995, pp. 1137–1145.
[KrSH12] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Clas-
sification with Deep Convolutional Neural Networks. Advances In Neural
Information Processing Systems, 2012, pp. 1–9.
[KuMa97] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced
training sets: one-sided selection. In ICML, Vol. 97, 1997, pp. 179–186.
[LaSu97] Louisa Lam and Ching Y. Suen. Application of majority voting to pattern
recognition: an analysis of its behavior and performance. IEEE Transac-
tions on Systems Man and Cybernetics Part A Systems and Humans, 27(5),
1997, pp. 553–568.
[Lind98] Tony Lindeberg. Feature Detection with Automatic Scale Selection. International Journal of Computer Vision, 30(2), 1998, pp. 79–116.
[LiWZ14] Bo Li, Tianfu Wu and Song-Chun Zhu. Integrating context and occlusion
for car detection by hierarchical and-or model. In Computer Vision–ECCV
2014. Springer, 2014, pp. 652–667.
[LiZh05] Yi Liu and Yuan F. Zheng. One-against-all multi-class SVM classifica-
tion using reliability measures. In Neural Networks, 2005. IJCNN ’05.
Proceedings. 2005 IEEE International Joint Conference on, Vol. 2, 2005,
pp. 849–854.
[LiZh06] Xu-Ying Liu and Zhi-Hua Zhou. The Influence of Class Imbalance on Cost-
Sensitive Learning: An Empirical Study. Sixth International Conference
on Data Mining (ICDM’06), 2006, pp. 970–974.
[Lowe99] David G. Lowe. Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, 1999, pp. 1150–1157.
[MaRS08] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press. 2008.
[MaSn13] Kevin Matzen and Noah Snavely. NYC3DCars: A Dataset of 3D Vehicles
in Geographic Context. 2013 IEEE International Conference on Computer
Vision, 2013, pp. 761–768.
[MBPS09] Julien Mairal, Francis Bach, Jean Ponce and Guillermo Sapiro. Online
dictionary learning for sparse coding. Proceedings of the 26th International
Conference on Machine Learning, 2009, pp. 1–8.
[MoPV12] Douglas C Montgomery, Elizabeth A Peck and G Geoffrey Vining. Intro-
duction to Linear Regression Analysis, Vol. 821. John Wiley & Sons. 2012.
[NPSB10] Jesus Nuevo, Ignacio Parra, Jonas Sjoberg and Luis M. Bergasa. Estimating
surrounding vehicles’ pose using computer vision. 13th International IEEE
Conference on Intelligent Transportation Systems, 2010, pp. 1863–1868.
[OBTr14] Eshed Ohn-Bar and Mohan M. Trivedi. Fast and Robust Object Detection
Using Visual Subcategories. 2014 IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2014, pp. 179–184.
[OBTr15] Eshed Ohn-Bar and Mohan M. Trivedi. Learning to Detect Vehicles by
Clustering Appearance Patterns. In Intelligent Transportation Systems,
IEEE Transactions on, Vol. 16, 2015, pp. 2511–2521.
[OlFi97] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete
basis set: A strategy employed by V1? Vision Research, 37(23), 1997,
pp. 3311–3325.
[Pear01] Karl Pearson. On lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal
of Science, 2(11), 1901, pp. 559–572.
[PSGS13] Bojan Pepikj, Michael Stark, Peter Gehler and Bernt Schiele. Occlusion
Patterns for Object Class Detection. 2013 IEEE Conference on Computer
Vision and Pattern Recognition, 2013, pp. 3286–3293.
[PSGS15] Bojan Pepikj, Michael Stark, Peter Gehler and Bernt Schiele. Multi-view
and 3D Deformable Part Models. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 37(11), 2015, pp. 2232–2245.
[Quin93] J. Ross Quinlan. C4.5: programs for machine learning. Elsevier, 1993.
[ReRa13] Xiaofeng Ren and Deva Ramanan. Histograms of Sparse Codes for Ob-
ject Detection. 2013 IEEE Conference on Computer Vision and Pattern
Recognition, 2013, pp. 3246–3253.
[RHMH10] Paul E. Rybski, Daniel Huber, Daniel D. Morris and Regis Hoffman. Visual
classification of coarse vehicle orientation using histogram of oriented gra-
dients features. IEEE Intelligent Vehicles Symposium, Proceedings, 2010,
pp. 921–928.
[Sing15] Santokh Singh. Critical reasons for crashes investigated in the National
Motor Vehicle Crash Causation Survey. National Highway Traffic Safety
Administration, (February), 2015, pp. 1–2.
[SiTr13] Sayanan Sivaraman and Mohan Manubhai Trivedi. Looking at Vehicles on
the Road: A Survey of Vision-Based Vehicle Detection, Tracking, and Be-
havior Analysis. IEEE Transactions on Intelligent Transportation Systems,
14(4), 2013, pp. 1773–1795.
[SKWW07] Yanmin Sun, Mohamed S. Kamel, Andrew K.C. Wong and Yang Wang.
Cost-sensitive boosting for classification of imbalanced data. Pattern Recog-
nition, 40(12), 2007, pp. 3358–3378.
[SLJS15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Ra-
binovich. Going Deeper with Convolutions. Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[SmSc04] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector re-
gression. In Statistics and Computing, Vol. 14. Springer, 2004, pp. 199–222.
[SmVi08] Alex Smola and S. V. N. Vishwanathan. Introduction to Machine Learning.
Cambridge University Press. 2008.
[TiBi99] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal
component analysis. Journal of the Royal Statistical Society, 61(3), 1999,
pp. 611–622.
[Ting02] Kai Ming Ting. An instance-weighting method to induce cost-sensitive
trees. IEEE Transactions on Knowledge and Data Engineering, 14(3), 2002,
pp. 659–665.
[TrKa98] Anne M. Treisman and Nancy G. Kanwisher. Perceiving visually presented
objects: Recognition, awareness, and modularity. Current Opinion in Neu-
robiology, 8(2), 1998, pp. 218–226.
[Vapn63] Vladimir Vapnik. Pattern recognition using generalized portrait method.
In Automation and remote control, Vol. 24, 1963, pp. 774–780.
[VaVa98] Vladimir N. Vapnik. Statistical Learning Theory, Vol. 1. Wiley, New York,
1998.
[ViJo01] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of
simple features. Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2001, pp. 511–
518.
[WaHY09] Xiaoyu Wang, Tony X. Han and Shuicheng Yan. An HOG-LBP human
detector with partial occlusion handling. Computer Vision, 2009 IEEE
12th International Conference on, 2009, pp. 32–39.
[WiDe11] Rob G. J. Wijnhoven and Peter H. N. De With. Unsupervised sub-
categorization for object detection: Finding cars from a driving vehicle.
Proceedings of the IEEE International Conference on Computer Vision,
2011, pp. 2077–2083.
[Worl15] World Health Organization. Global Status Report on Road Safety. 2015.
[XCLS15] Yu Xiang, Wongun Choi, Yuanqing Lin and Silvio Savarese. Data-Driven
3D Voxel Patterns for Object Category Recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2015,
pp. 1903–1911.
[XiMS14] Yu Xiang, Roozbeh Mottaghi and Silvio Savarese. Beyond PASCAL: A
benchmark for 3D object detection in the wild. IEEE Winter Conference
on Applications of Computer Vision, 2014, pp. 75–82.
[YBAL14] J. Javier Yebes, Luis M. Bergasa, Roberto Arroyo and Alberto Lazaro.
Supervised learning and evaluation of KITTI’s cars detector with DPM.
IEEE Intelligent Vehicles Symposium, Proceedings, 2014, pp. 768–773.
[YeBG15] J. Javier Yebes, Luis M. Bergasa and Miguel Ángel García-Garrido. Visual Object
Recognition with 3D-Aware Features in KITTI Urban Scenes. Sensors,
15(4), 2015, pp. 9228–9250.
[YTAS11] Quan Yuan, Ashwin Thangali, Vitaly Ablavsky and Stan Sclaroff. Learn-
ing a family of detectors via multiplicative kernels. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 33(3), 2011, pp. 514–530.
[ZhLi10] Qiang Zhang and Baoxin Li. Discriminative K-SVD for dictionary learning
in face recognition. Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2010, pp. 2691–2698.
[ZYCA06] Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng and Shai Avidan. Fast
Human Detection Using a Cascade of Histograms of Oriented Gradients.
Computer Vision and Pattern Recognition, 2006 IEEE Computer Society
Conference on, Vol. 2, 2006, pp. 1491–1498.