
Noname manuscript No. (will be inserted by the editor)

Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction

Ľubor Ladický · Paul Sturgess · Chris Russell · Sunando Sengupta · Yalin Bastanlar · William Clocksin · Philip H. S. Torr

the date of receipt and acceptance should be inserted later

Abstract The problems of dense stereo reconstruction and object class segmentation can both be formulated as Random Field labeling problems, in which every pixel in the image is assigned a label corresponding to either its disparity, or an object class such as road or building. While these two problems are mutually informative, no attempt has been made to jointly optimize their labelings. In this work we provide a flexible framework, configured via cross-validation, that unifies the two problems, and demonstrate that, by resolving ambiguities that would be present in real world data if the two problems were considered separately, joint optimization of the two problems substantially improves performance. To evaluate our method, we augment the Leuven data set¹, which is a stereo video shot from a car driving around the streets of Leuven, with 70 hand labeled object class and disparity maps. We hope that the release of these annotations will stimulate further work in the challenging domain of street-view analysis. Complete source code is publicly available².

Ľubor Ladický
University of Oxford, Oxford, UK
E-mail: [email protected]

Paul Sturgess · Sunando Sengupta · P. H. S. Torr
Oxford Brookes University, Oxford, UK

Paul Sturgess
E-mail: [email protected]

Sunando Sengupta
E-mail: [email protected]

P. H. S. Torr
E-mail: [email protected]

Chris Russell
Queen Mary College, University of London, London, UK
E-mail: [email protected]

Yalin Bastanlar
Izmir Institute of Technology, Izmir, Turkey
E-mail: [email protected]

William Clocksin
University of Hertfordshire, Hatfield, UK
E-mail: [email protected]

Keywords: Object Class Segmentation, Dense Stereo Reconstruction, Random Fields.

1 Introduction

Many tasks require both object class and depth labeling. For an agent to interact with the world, it must be capable of recognizing both objects and their physical location. For example, camera based driverless cars must be capable of differentiating between road and other classes, and of recognizing where the road ends. Similarly, several companies (e.g. Yotta, 2011) wish to provide local authorities with automatic annotation of assets (such as street lights, drains or road signs). In order to provide this service, assets must be identified, localized in 3D space, and an estimate of their quality made.

¹ These annotations have been made available for download at http://cms.brookes.ac.uk/research/visiongroup/files/Leuven.zip
² Available at http://cms.brookes.ac.uk/staff/PhilipTorr/ale.htm

This work is supported by EPSRC research grants, HMGCC, and the IST Programme of the European Community under the PASCAL2 Network of Excellence, IST-2007-216886. P. H. S. Torr is in receipt of a Royal Society Wolfson Research Merit Award. Chris Russell was partially funded by the European Research Council under the ERC Starting Grant agreement 204871-HUMANIS.


The problems of object class segmentation (Shotton et al, 2006; Ladicky et al, 2009), which assigns an object label such as road or building to every pixel in the image, and dense stereo reconstruction, in which every pixel within an image is labeled with a disparity (Scharstein and Szeliski, 2002), are well suited to being solved jointly. Both approaches formulate the problem of providing a correct labeling of an image as one of Maximum a Posteriori (MAP) estimation over a Random Field (RF), which is typically a Potts or truncated linear model. Thus both may use graph cut based move making algorithms, such as α-expansion (Boykov et al, 2001), to solve the labeling problem. These problems should be solved jointly, as a correct labeling of object class can help depth labeling, and stereo reconstruction can improve object labeling. Indeed, it opens up the possibility for the generic stereo priors used previously to be enriched by information about the shape of specific objects. For instance, object class boundaries are more likely to occur at a sudden transition in depth and vice versa, while the height of a point above the ground plane is an extremely informative cue regarding its object class label: e.g. road or sidewalk lie in the ground plane, pixels taking labels pedestrian or car must lie at a constrained height above the ground plane, and pixels taking label sky must occur at an infinite depth (zero disparity) from the camera. Figure 1 shows our model, which explicitly captures these properties.

Object recognition provides substantial information about the 3D location of points in the image. This has been exploited in recent work on single view reconstruction (Hoiem et al, 2005; Ramalingam et al, 2008; Gould et al, 2009; Liu et al, 2010), in which a plausible pop-up planar model of a scene is reconstructed from a single monocular image using object recognition and prior information regarding the location of objects in typically photographed scenes. Such approaches only estimate depth from object class, assuming the object class is known. As object recognition is itself a problem full of ambiguity, and often requires knowledge of 3D, such a two-stage process must, in many cases, be suboptimal.

Other works have taken the converse approach of using 3D information to infer object class: Hoiem et al (2006) showed how knowledge of the camera viewpoint and the typical 3D location of objects can be used to improve object detection, while Leibe et al (2007) employed Structure-from-Motion (SfM) techniques to aid the tracking and detection of moving objects. However, neither object detection nor the 3D reconstruction obtained gave a dense labeling of every pixel in the image, and the final results in tracking and detection were not used to refine the SfM results. The CamVid (Brostow et al, 2008) data set provides sparse SfM cues, which have been used by several object class segmentation approaches (Brostow et al, 2008; Sturgess et al, 2009) to generate pixel based image labelings. In these, the object class segmentation was not used to refine the 3D structure.

Previous works have attempted to simultaneously solve the problems of object class detection and 3D reconstruction. Hoiem et al (2007) fitted a 3D model to specific objects, such as buses or cars, within an image by simultaneously estimating 3D location, orientation and object class, while Dick et al (2004) fitted a 3D model of a building to a set of images by simultaneously estimating a wire-frame model and the location of assets such as windows or columns. In both of these papers the 3D models are intended to be plausible rather than accurate, and the models are incomplete: they do not provide location or class estimates for every pixel.

None of the discussed works perform joint inference to obtain dense stereo reconstruction and object class segmentation. In this work, we demonstrate that these problems are mutually informative, and benefit from being solved jointly. We consider the problem of scene reconstruction in an urban area (Leibe et al, 2007). These scenes contain object classes such as road, car and sky that vary in their 3D locations. Compared to typical stereo data sets, which are usually produced in controlled environments, stereo reconstruction on this real world data is noticeably more challenging due to large homogeneous regions and problems with photo-consistency. We efficiently solve the problem of joint estimation of object class and depth using modified variants of α-expansion (Boykov et al, 2001) and range move algorithms (Kumar et al, 2011).

No real world data sets are publicly available that contain both per pixel object class and dense stereo data. In order to evaluate our method, we augmented the data set of Leibe et al (2007) by creating hand labeled object class and disparity maps for 70 images. These annotations have been made available for download¹. Our experimental evaluation demonstrates that joint optimization of dense stereo reconstruction and object class segmentation leads to a substantial improvement in the accuracy of the final results.

The structure of the paper is as follows: in section 2 we give the generic formulation of RFs for dense image labeling, and describe how they can be applied to the problems of object class segmentation and dense stereo reconstruction. Section 3 describes the formulation allowing for the joint optimization of these two problems, while section 4 shows how the optimization can be performed efficiently. The data set is described in section 5, and experimental validation follows in section 6.


Fig. 1 Graphical model of our joint RF. The system takes a left (A) and right (B) image from a stereo pair that has been rectified. Our formulation captures the dependencies between the object class segmentation problem (E, §2.1) and the dense stereo reconstruction problem (F, §2.2) by defining a joint energy on the recognition and disparity labels, on both the unary/pixel (blue) and pairwise/edge (green) variables of both problems. The unary potentials of the joint problem encode the fact that different objects have different height distributions (G, §3.1), learned from our training set containing hand labeled disparities (§5). The pairwise potentials encode that object class boundaries and sudden changes in disparity are likely to occur together, but could also encode different shape smoothness priors for different types of object. The combined optimization results in an approximate object class segmentation (C) and dense stereo reconstruction (D). See §3 and §4 for a full treatment of our model and §6 for further results. Best viewed in color.

2 Overview of the Random Field Formulations

Our joint optimization consists of two parts, object class segmentation and dense stereo reconstruction. Before we formulate our approach, we give an overview of the typically used random field (RF) formulation for both problems and introduce the notation used in section 3. Both problems have previously been defined as a dense RF where the set of random variables Z = {Z_1, Z_2, ..., Z_N} corresponds to the set of all image pixels i ∈ V = {1, 2, ..., N}. A clique c ∈ C is a set of random variables Z_c ⊆ Z. Any possible assignment of labels to the random variables will be called a labeling and denoted by z; similarly, we use z_c to denote the labeling of a clique. Each z_i ∈ L, where L is the set of labels. Figure 1 E & F depict this lattice structure as a blue dotted grid; the variables Z_i are shown as blue circles.

The random field formulation can be seen as a structured classifier minimizing the cost of the labeling z:

z^* = \arg\min_z E(z) = \arg\min_z \sum_{c \in C} \psi_c(z_c),   (1)

where C is the set of all cliques. The term ψ_c(z_c) is known as the potential function of the clique c ⊂ V, where z_c = {z_i : i ∈ c}. Potential functions typically take the form:

\psi_c(z_c) = w_c \cdot \Phi_c(z_c),   (2)

where Φ_c(z_c) is a cost function vector containing a cost for each configuration of z_c, and w_c is a weight vector weighting the importance of each cost function. The weight vectors are constant across all cliques for each type of potential (unary, pairwise, ...). Even though there exists an underlying probabilistic distribution corresponding to any RF, the state-of-the-art algorithms for learning the potential functions ψ_c(·) are typically trained discriminatively, and the final classifier does not have any real probabilistic interpretation. Thus, all the weights and parameters are either hand-tuned on the validation set or trained using a discriminative max-margin method (Taskar et al, 2004; Tsochantaridis et al, 2005; Alahari et al, 2010). Probabilistic interpretations, whilst theoretically well grounded, are hard to achieve in practice, as the probabilistic distributions are exceptionally difficult to model.
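To make the formulation of eqs. (1)-(2) concrete, here is a minimal Python sketch (ours, not the authors' code) that evaluates the energy of a labeling on a 4-connected grid with unary and Potts pairwise cliques; all function and variable names are illustrative.

```python
import numpy as np

def rf_energy(labeling, unary, pairwise):
    """Energy of a labeling on a 4-connected grid RF:
    E(z) = sum_i psi_i(z_i) + sum_{(i,j)} psi_ij(z_i, z_j)."""
    h, w = labeling.shape
    # Unary term: cost of each pixel's chosen label.
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labeling].sum()
    # Pairwise terms over horizontal and vertical neighbor pairs.
    e += sum(pairwise(labeling[r, c], labeling[r, c + 1])
             for r in range(h) for c in range(w - 1))
    e += sum(pairwise(labeling[r, c], labeling[r + 1, c])
             for r in range(h - 1) for c in range(w))
    return e

# Toy usage: 3 labels with a Potts pairwise cost.
h, w, k = 4, 5, 3
rng = np.random.default_rng(0)
unary = rng.random((h, w, k))             # psi_i(l) for every pixel and label
potts = lambda a, b: 0.0 if a == b else 1.0
z = unary.argmin(axis=2)                  # independent per-pixel minimum
print(rf_energy(z, unary, potts))
```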

2.1 Object Class Segmentation using an RF

We follow (Shotton et al, 2006; Kohli et al, 2008; Ladicky et al, 2009) in formulating the problem of object class segmentation as finding a minimal cost labeling of an RF defined over a set of random variables X = {X_1, ..., X_N}, each taking a state from the label space L = {l_1, l_2, ..., l_k}. Each label l_j indicates a different object class such as car, road, building or sky. These energies take the form:

E^O(x) = \sum_{i \in V} \psi^O_i(x_i) + \sum_{i \in V, j \in N_i} \psi^O_{ij}(x_i, x_j) + \sum_{c \in C} \psi^O_c(x_c).   (3)

Fig. 2 An illustration of how 3D information can be reconstructed from a stereo camera rig. Also shown is the relation between disparity (the movement of a point between the pair of images) and height, once the ground plane is known.

The unary potential ψ^O_i of the RF describes the cost of a single pixel taking a particular label. We followed the approach of (Ladicky et al, 2009; Sturgess et al, 2009), where the unary cost for a pixel taking a certain label is produced by a boosted classifier (Torralba et al, 2004) built on shape filters (Shotton et al, 2006) and multiple feature responses (Ladicky et al, 2009); we refer the reader to (Shotton et al, 2006; Ladicky et al, 2009) for more details. The pairwise terms ψ^O_{ij} encourage similar neighboring pixels in the image to take the same label. These potentials are shown in figure 1 E as blue circles and green squares respectively. ψ^O_{ij}(x_i, x_j) takes the form of a contrast sensitive Potts model:

\psi^O_{ij}(x_i, x_j) = \begin{cases} 0 & \text{if } x_i = x_j, \\ g(i, j) & \text{otherwise}, \end{cases}   (4)

where the function g(i, j) is an edge feature based on the difference in colors of neighboring pixels (Boykov and Jolly, 2001), typically defined as:

g(i, j) = \theta_p + \theta_v \exp(-\theta_\beta \|I_i - I_j\|_2^2),   (5)

where I_i and I_j are the color vectors of pixels i and j respectively, and θ_p, θ_v, θ_β ≥ 0 are model parameters learned from training data. We refer the interested reader to (Boykov and Jolly, 2001; Rother et al, 2004; Shotton et al, 2006) for more details.
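The contrast sensitive edge cost of eq. (5) is straightforward to vectorize. The sketch below is an illustration rather than the authors' implementation; the parameter defaults are arbitrary.

```python
import numpy as np

def contrast_sensitive_potts(img, theta_p=1.0, theta_v=2.0, theta_beta=0.5):
    """Edge costs g(i, j) of eq. (5) for all horizontal neighbor pairs:
    g = theta_p + theta_v * exp(-theta_beta * ||I_i - I_j||^2).
    The cost is paid only when the neighboring labels differ (eq. (4))."""
    diff = img[:, 1:, :] - img[:, :-1, :]        # color difference I_i - I_j
    sq = (diff ** 2).sum(axis=2)                 # squared L2 norm per edge
    return theta_p + theta_v * np.exp(-theta_beta * sq)

img = np.random.default_rng(1).random((4, 5, 3))  # toy RGB image in [0, 1]
g_horizontal = contrast_sensitive_potts(img)
print(g_horizontal.shape)                         # (4, 4): one cost per edge
```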

The higher order terms ψ^O_c(x_c) describe potentials defined over cliques containing more than two pixels. In our work we follow (Ladicky et al, 2009) and use their hierarchical potentials based upon histograms of features, evaluated on segments obtained by unsupervised segmentation methods (Comaniciu and Meer, 2002; Shi and Malik, 2000). This significantly improves the results of the object class segmentation method. Nearly all current RF based object class segmentation methods (Rabinovich et al, 2007; Batra et al, 2008) can be represented within this formulation via different choices for the higher order cliques (Ladicky et al, 2009; Russell et al, 2010), and can be included in the framework.

2.2 Dense Stereo Reconstruction using an RF

We use the energy formulation of (Boykov et al, 2001; Scharstein and Szeliski, 2002) for the dense stereo reconstruction part of our joint formulation. They formulated the problem as one of finding a minimal cost labeling of an RF defined over a set of random variables Y = {Y_1, ..., Y_N}, where each variable Y_i takes a state from the label space D = {d_1, d_2, ..., d_m} corresponding to a set of disparities. The energy can be written as:

E^D(y) = \sum_{i \in V} \psi^D_i(y_i) + \sum_{i \in V, j \in N_i} \psi^D_{ij}(y_i, y_j).   (6)

The unary potential ψ^D_i(y_i) of the RF is defined as a measure of the color agreement of a pixel with its corresponding pixel i in the other image of the stereo pair, given a choice of disparity y_i. The pairwise terms ψ^D_{ij} encourage neighboring pixels in the image to have similar disparities.


Fig. 3 An illustration of how 3D information can be reconstructed from a monocular sequence. Details of the conversion of the monocular 3D reconstruction problem into standard stereo reconstruction are given in §2.3.

The cost is a function of the distance between disparity labels:

\psi^D_{ij}(y_i, y_j) = f(|y_i - y_j|),   (7)

where f(·) usually takes the form of a truncated linear function f(y) = min(k_1 y, k_2), where k_1, k_2 ≥ 0 are the slope and truncation respectively. The unary (blue circles) and pairwise (green squares) potentials are shown in figure 1 F. Note that the disparity of a pixel is directly related to the depth of the corresponding 3D point (see figure 2). To partially resolve ambiguities in disparities for low textured objects, a Gaussian filter is applied to the unary potentials.

2.3 Monocular Video Reconstruction

With minor modification, the formulation of §2.2 can also be applied to monocular video sequences, by performing stereo reconstruction over adjacent frames of the video sequence (see figure 3). Under the simplifying assumption that the scene remains static, the formulation remains the same. However, without a fixed baseline between the camera positions in adjacent frames, the estimation of disparities, and the mapping of disparities to depths, is more complex.

We first pre-process the data by performing SIFT matching (Lowe, 2004) over adjacent frames, before using RANSAC (Fischler and Bolles, 1981; Torr and Murray, 1997) to simultaneously estimate the fundamental matrix and a corresponding set of inliers from these matches. The fundamental matrix gives us both the epipoles³ and the epipolar lines, and this allows us to solve the stereo correspondence efficiently by searching along corresponding epipolar lines for a match. Given two images 1 and 2, we write x, x′ for a pair of matched points in images 1 and 2 respectively, and use e, e′ for the epipoles present in each image. The disparity d is estimated as:

d = \big|\, |e - x| - |e' - x'| \,\big|.   (8)

³ The epipoles typically lie within the image, as the camera points in the direction of motion.

Note that we compute the disparity between pixels in a particular frame and those in its previous frame. As the camera moves forward into the image, this guarantees that every unoccluded pixel can be matched. Matching pixels from the current frame against the next would mean that pixels near the edge of the image could not be matched. As with standard stereo reconstruction, the unary potential of a particular choice of disparity, or equivalently a match between two pixels, is defined as the pixel difference in RGB space between them.
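Eq. (8) reduces to two point-to-epipole distances. A small sketch, with hypothetical matched points and epipoles in pixel coordinates:

```python
import numpy as np

def monocular_disparity(x, x_prime, e, e_prime):
    """Disparity of eq. (8): d = | |e - x| - |e' - x'| |, where x, x' are a
    matched point pair and e, e' are the epipoles in the two frames."""
    return abs(np.linalg.norm(e - x) - np.linalg.norm(e_prime - x_prime))

x, x_prime = np.array([120.0, 80.0]), np.array([125.0, 83.0])
e, e_prime = np.array([158.0, 128.0]), np.array([158.0, 128.0])
print(monocular_disparity(x, x_prime, e, e_prime))
```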

Converting Monocular Disparity to Stereo Disparity

Unlike conventional stereo, disparities in our video sequence are not simply inversely proportional to distance, but also depend on other variables. There are two reasons for this:

– Firstly, the distance traveled between frames by the camera varies with the speed of the vehicle; this implies that the baseline varies from frame to frame.
– Secondly, when the epipole lies in the image, the camera cannot be approximated as orthographic. The effective baseline, which we define as the component of the baseline normal to the ray, varies substantially from pixel to pixel within an image.

We will describe how disparities in the monocular sequence correspond to distances, and use this to map them into standard form stereo disparities. This allows us to reuse the joint potentials learned for the stereo case, and to directly evaluate both approaches by comparing against the same ground truth.

We define a ray λr as the set of all values taken by a 3D unit vector r multiplied by a scalar λ ∈ ℝ. We define the baseline B_f as the 3D distance traveled by the camera between a pair of frames f and f + 1⁴. We let θ be the angle between B_f and r. We then define e, the epipole, as the intersection point of the baseline and the image plane, and x as the point in the image that the ray λr passes through. Given a disparity d of a point on the ray, the distance s of that point from the camera is:

s = K\,|B_f - (B_f \cdot r)\,r| / d = K |B_f| \sqrt{1 - \cos^2\theta} / d = K |B_f| \, |\sin\theta| / d,   (9)

⁴ This value is a part of the standard Leuven data set (see §5) and does not require estimating in our application (see §6).


where K is a constant based on the internal properties of the camera.

Noting that |e − x| ∝ tan θ, i.e. γ|e − x| = tan θ for some value γ, and that |\sin\theta| = \sqrt{\tan^2\theta / (1 + \tan^2\theta)}, we have

s = K |B_f| \sqrt{\frac{\gamma^2 (e - x)^2}{1 + \gamma^2 (e - x)^2}} \Big/ d.   (10)

Solving for s for a conventional stereo pair gives the related equation

s = K |B'| / d',   (11)

where K is the same constant based on intrinsic camera parameters, |B'| is the distance between the pair of cameras, assumed to be constant and orthogonal to the field of view of both cameras, and d' is the stereo disparity. Matching the two equations and eliminating s, we have

d' = \frac{|B'|}{|B_f|} \; \frac{d}{\sqrt{\frac{\gamma^2 (e - x)^2}{1 + \gamma^2 (e - x)^2}}}.   (12)

If the movement of the camera is very close to a translation orthogonal to the image plane, γ is sufficiently small and the disparity can be approximated by:

d' \approx \frac{|B'| \, d}{|B_f| \, \gamma \, |e - x|}.   (13)

Given this relationship, unary potentials defined over the monocular disparity d can be mapped to unary potentials over the conventional stereo disparity d'. This allows standard stereo reconstruction on monocular sequences to be performed as in section 2.2, and joint object class and 3D reconstruction from monocular sequences to be performed as described in the following section.
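The mapping of eqs. (12)-(13) can be written directly. The sketch below uses illustrative names and toy values: the 1.5 m stereo baseline is from §5, while the per-frame baseline and γ are invented for the example.

```python
import numpy as np

def mono_to_stereo_disparity(d, x, e, B_f, B_prime, gamma):
    """Map a monocular disparity d at image point x to the equivalent
    conventional stereo disparity d' using eq. (12)."""
    r2 = gamma ** 2 * np.sum((e - x) ** 2)
    effective = np.sqrt(r2 / (1.0 + r2))      # the |sin(theta)| term
    return (B_prime / B_f) * d / effective

def mono_to_stereo_disparity_approx(d, x, e, B_f, B_prime, gamma):
    """Small-gamma approximation of eq. (13)."""
    return B_prime * d / (B_f * gamma * np.linalg.norm(e - x))

x, e = np.array([40.0, 30.0]), np.array([158.0, 128.0])
print(mono_to_stereo_disparity(3.0, x, e, B_f=0.4, B_prime=1.5, gamma=1e-3))
print(mono_to_stereo_disparity_approx(3.0, x, e, B_f=0.4, B_prime=1.5, gamma=1e-3))
```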

3 Joint Formulation of Object Class Labeling and Stereo Reconstruction

We formulate simultaneous object class segmentation and dense stereo reconstruction as an energy minimization of a dense labeling z over the image. Each random variable Z_i = [X_i, Y_i]⁵ takes a label z_i = [x_i, y_i] from the product space of object class and disparity labels L × D, corresponding to the variable Z_i taking object label x_i and disparity y_i. In general the energy of the RF for joint estimation can be written as:

E(z) = \sum_{i \in V} \psi^J_i(z_i) + \sum_{i \in V, j \in N_i} \psi^J_{ij}(z_i, z_j) + \sum_{c \in C} \psi^J_c(z_c),   (14)

⁵ [X_i, Y_i] is the ordered pair of elements X_i and Y_i.

where the terms ψ^J_i, ψ^J_{ij} and ψ^J_c are sums of the previously mentioned terms ψ^O_i and ψ^D_i, ψ^O_{ij} and ψ^D_{ij}, and ψ^O_c and ψ^D_c respectively, plus some terms ψ^C_i, ψ^C_{ij}, ψ^C_c which govern interactions between X and Y. However, in our case E^D(y) (see §2.2) does not contain higher order terms ψ^D_c, and the joint energy is defined as:

E(z) = \sum_{i \in V} \psi^J_i(z_i) + \sum_{i \in V, j \in N_i} \psi^J_{ij}(z_i, z_j) + \sum_{c \in C} \psi^O_c(x_c).   (15)

If the interaction terms ψ^C_i, ψ^C_{ij} are both zero, then the problems x and y are independent of one another; the energy would decompose into E(z) = E^O(x) + E^D(y), and the two sub-problems could each be solved separately. However, in many real world data sets, such as the one we describe in §5, this is not the case, and we would like to model the unary and pairwise interaction terms so that a joint estimation may be performed.

3.1 Joint Unary Potentials

In order for the unary potentials of the object class segmentation and dense stereo reconstruction parts of our formulation to interact, we need to define some function that relates X and Y in a meaningful way. We could use depth and objects directly, as certain objects may appear more frequently at certain depths in some scenarios. In road scenes we could build statistics relative to an overhead view, where the positioning of objects in the ground plane may be informative: we expect that buildings will lie at the edges of the ground plane, and that sidewalk will tend to lie between building and road, which would occupy the central portion of the ground plane. Building statistics with regard to the real world positioning of objects gives a stable and meaningful cue that is invariant to the camera position. However, models such as this require a substantial amount of data to avoid over-fitting.

In this paper we need to model these interactions with limited data. We do this by restricting our unary interaction potential to modeling only the observed fact that certain objects occupy a particular range of real world heights. After calibration we are able to obtain the height above the ground plane via the relation:

h(y_i, i) = h_c + \frac{(y_h - y_i)\, b}{d},   (16)

where h_c is the camera height, y_h is the level of the horizon in the rectified image pair, y_i is the height of the i-th pixel in the image, b is the baseline between the stereo pair of cameras and d is the disparity. This relationship is modeled by estimating the a priori cost of pixel i taking label z_i = [x_i, y_i] by

\psi^C_i([x_i, y_i]) = -\log(H(h(y_i, i) \mid x_i)),   (17)

where

H(h \mid l) = \frac{\sum_{i \in T} \delta(x_i = l)\,\delta(h(y_i, i) = h)}{\sum_{i \in T} \delta(x_i = l)}   (18)

is a histogram based measure of the naive probability that a pixel taking label l has height h in the training set T. The combined unary potential for the joint RF is:

\psi^J_i([x_i, y_i]) = w^u_O \psi^O_i(x_i) + w^u_D \psi^D_i(y_i) + w^u_C \psi^C_i(x_i, y_i),   (19)

where ψ^O_i and ψ^D_i are the previously discussed costs of pixel i being a member of object class x_i or taking disparity y_i given the image, and w^u_O, w^u_D and w^u_C are weights. Figure 1 G gives a graphical representation of this type of interaction, shown as a blue line linking the unary potentials (blue circles) of x and y via a distribution of object heights.
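A minimal sketch of the height-based interaction of eqs. (16)-(18) follows. The paper overloads y_i, so here the disparity and the pixel's image row are passed as separate arguments; the binning, smoothing constant, and all names are our illustrative choices.

```python
import numpy as np

def pixel_height(disparity, row, h_c, y_h, b):
    """Eq. (16): height above the ground plane, h = h_c + (y_h - row) * b / d.
    (The paper writes both the image row and the disparity label as y_i.)"""
    return h_c + (y_h - row) * b / max(disparity, 1e-6)

def height_histograms(train_labels, train_heights, n_classes, bins):
    """Eq. (18): per-class normalized histograms H(h | l) of training heights."""
    hists = []
    for l in range(n_classes):
        counts, _ = np.histogram(train_heights[train_labels == l], bins=bins)
        hists.append((counts + 1e-6) / (counts.sum() + 1e-6))  # avoid log(0)
    return np.stack(hists)

def joint_unary_cost(label, height, hists, bins):
    """Eq. (17): psi^C_i = -log H(h | l)."""
    k = np.clip(np.digitize(height, bins) - 1, 0, hists.shape[1] - 1)
    return -np.log(hists[label, k])

# Toy training data: three classes whose heights cluster at 0 m, 1 m and 2 m.
rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=2000)
heights = rng.normal(loc=labels.astype(float), scale=0.3)
bins = np.linspace(-1.0, 3.0, 33)
H = height_histograms(labels, heights, n_classes=3, bins=bins)
h = pixel_height(12.0, row=112, h_c=1.0, y_h=120, b=1.5)  # h = 2.0 m
print(joint_unary_cost(2, h, H, bins))
```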

3.2 Joint Pairwise Interactions

Pairwise potentials enforce the local consistency of object class and disparity labels between neighboring pixels. The consistency of object class and disparity are not fully independent: an object class boundary is more likely to occur where the disparities of two neighboring pixels significantly differ. To take this information into account, we chose tractable pairwise potentials of the form:

\psi^J_{ij}([x_i, y_i], [x_j, y_j]) = w^p_O \psi^O_{ij}(x_i, x_j) + w^p_D \psi^D_{ij}(y_i, y_j) + w^p_C \psi^O_{ij}(x_i, x_j)\,\psi^D_{ij}(y_i, y_j),   (20)

where w^p_O, w^p_D > 0 and w^p_C are the weights of the pairwise potential. Figure 1 shows this linkage as a green line between a pairwise potential (green box) of each part.

4 Inference for the Joint RF

Optimization of the energy E(z) is challenging. Each random variable takes a label from the set L × D; consequently, in the experiments we consider (see §5), each variable has 700 possible states. As each image contains 316 × 256 random variables, there are 700^{316×256} possible solutions to consider. Rather than attempting to solve this problem exactly, we use graph cut based move making algorithms to find an approximate solution.

Graph cut based move making algorithms start from an initial solution and proceed by making a series of moves, or changes, each of which leads to a solution of lower energy. The algorithm is said to converge when no lower energy solution can be found. In the problem of object class labeling, the move making algorithm α-expansion can be applied to pairwise (Boykov et al, 2001) and higher order potentials (Kohli et al, 2007, 2008; Ladicky et al, 2009) and often achieves the best results, while in dense stereo reconstruction the truncated convex priors (see §2.2) mean that better solutions are found using range moves (Kumar et al, 2011) than with α-expansion.

In object class segmentation, α-expansion moves allow any random variable X_i to either retain its current label x_i or transition to the label α. More formally, given a current solution x, the α-expansion algorithm searches through the space X_α of size 2^N, where N is the number of random variables, to find the optimal solution, where

X_\alpha = \{ x' \in L^N : x'_i = x_i \text{ or } x'_i = \alpha \}.   (21)

In dense stereo reconstruction, a range expansion move, defined over an ordered space of labels, allows any random variable Y_i to either retain its current label y_i or take any label l ∈ [l_a, l_a + r]. That is to say, given a current solution y, a range move searches through the space Y_l of size (r + 1)^N, which we define as:

Y_l = \{ y' \in D^N : y'_i = y_i \text{ or } y'_i \in [l, l + r] \}.   (22)

A single iteration of α-expansion is completed when one expansion move for each l ∈ L has been performed. Similarly, a single iteration of range moves is completed when |D| − r moves have been performed.

4.1 Projected Moves

Under the assumption that the energy E(z) is a metric (as in object class segmentation, see §2.1) or a semi-metric (Boykov et al, 2001) (as for the costs of §2.2 and §3) over the label space L × D, either α-expansion or αβ-swap respectively can be used to minimize the energy. A single iteration of α-expansion would require O(|L||D|) graph cuts to be computed, while αβ-swap requires O(|L|²|D|²), resulting in slow convergence. In this sub-section we show that graph cut based moves can be applied to a simplified, or projected, form of the problem that requires only O(|L| + |D|) graph cuts per iteration, resulting in faster convergence and better solutions. The new moves we propose are based upon a piecewise optimization that improves in turn first the object class labeling and then the depth.


We call a move space projected if one of the components of z, i.e. x or y, remains constant for all considered moves. Alternating between moves in the projected space of x or of y can be seen as a form of hill climbing optimization in which each component is individually optimized. Consequently, moves applied in the projected space are guaranteed not to increase the joint energy, and must converge to a local optimum.

We will now show that for energy (15), projected α-expansion moves in the object class label space and range moves in the disparity label space are of the standard form, and can be optimized by existing graph cut constructions. We note that finding the optimal range move or α-expansion with graph cuts requires that the pairwise and higher order terms are constrained to a particular form. This constraint allows the moves to be represented as a pairwise submodular energy that can be efficiently solved using graph cuts (Boykov and Kolmogorov, 2004); however, neither the choice of unary potentials nor scaling the pairwise or higher order potentials by a non-negative amount λ ≥ 0 affects whether the move is representable as a pairwise submodular cost.

4.2 Expansion moves in the object class label space

For our joint optimization of disparity and object classes, we propose a new move in the projected object class label space. We allow each pixel taking label z_i = [x_i, y_i] to either keep its current label or take a new label [α, y_i]. Formally, given a current solution z = [x, y], the algorithm searches through the space Z_α of size 2^N, which we define as:

Z_\alpha = \{ z' \in (L \times D)^N : z'_i = [x'_i, y_i] \text{ and } (x'_i = x_i \text{ or } x'_i = \alpha) \}.   (23)

One iteration of the algorithm involves making moves for all α in L successively, in some order. As discussed earlier, the values of the unary potential do not affect the sub-modularity of the move. For the joint pairwise potentials (20), under the assumption that y is fixed, we have:

\psi^J_{ij}([x_i, y_i], [x_j, y_j]) = (w^p_O + w^p_C \psi^D_{ij}(y_i, y_j))\, \psi^O_{ij}(x_i, x_j) + w^p_D \psi^D_{ij}(y_i, y_j) = \lambda_{ij} \psi^O_{ij}(x_i, x_j) + k_{ij}.   (24)

The constant k_{ij} does not affect the choice of optimal move and can safely be ignored. If λ_{ij} = w^p_O + w^p_C ψ^D_{ij}(y_i, y_j) ≥ 0 for all y_i, y_j, the projection of the pairwise potential is a Potts model and standard α-expansion moves can be applied. For w^p_O ≥ 0 this property holds if w^p_O + w^p_C k_2 ≥ 0, where k_2 is defined as in §2.2. In practice we use a variant of α-expansion suitable for higher order energies (Russell et al, 2010).
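A small sketch of the projection in eq. (24) and the sufficient submodularity condition stated above; the helper names are ours:

```python
def project_onto_object_labels(psi_d_ij, w_o, w_d, w_c):
    """With y fixed, eq. (24) rewrites the joint pairwise term as
    lambda_ij * psi^O_ij + k_ij, i.e. a scaled Potts potential."""
    lam = w_o + w_c * psi_d_ij      # coefficient of the Potts term
    k = w_d * psi_d_ij              # constant offset, ignored by the move
    return lam, k

def expansion_is_submodular(w_o, w_c, k2):
    """Sufficient condition from the text: lambda_ij >= 0 on every edge,
    i.e. w_o >= 0 and w_o + w_c * k2 >= 0, with k2 the truncation of f."""
    return w_o >= 0 and w_o + w_c * k2 >= 0

print(expansion_is_submodular(w_o=1.0, w_c=-0.15, k2=5.0))   # True
print(expansion_is_submodular(w_o=1.0, w_c=-0.25, k2=5.0))   # False
```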

4.3 Range moves in the disparity label space

For our joint optimization of disparity and object classes, we propose a new move in the projected disparity label space. Each pixel taking label z_i = [x_i, y_i] can either keep its current label or take a new label [x_i, y'_i] with y'_i in the range [l_a, l_b]. To formalize this, given a current solution z = [x, y], the algorithm searches through the space Z_l of size (2 + r)^N, which we define as:

Z_l = \{ z' \in (L \times D)^N : z'_i = [x_i, y'_i] \text{ and } (y'_i = y_i \text{ or } y'_i \in [l, l + r]) \}.   (25)

As with the moves in the object class label space, the values of the unary potential do not affect the sub-modularity of this move. Under the assumption that x is fixed, we can write our joint pairwise potentials (20) as:

\psi^J_{ij}([x_i, y_i], [x_j, y_j]) = (w^p_D + w^p_C \psi^O_{ij}(x_i, x_j))\, \psi^D_{ij}(y_i, y_j) + w^p_O \psi^O_{ij}(x_i, x_j) = \lambda_{ij} \psi^D_{ij}(y_i, y_j) + k_{ij}.   (26)

Again, the constant k_{ij} can safely be ignored, and if λ_{ij} = w^p_D + w^p_C ψ^O_{ij}(x_i, x_j) ≥ 0 for all x_i, x_j, the projection of the pairwise potential is truncated linear and standard range expansion moves can be applied. This property holds if w^p_D + w^p_C(θ_p + θ_v) ≥ 0, where θ_p and θ_v are the weights of the Potts pairwise potential (see §2.1).

5 Data set

We augment a subset of the Leuven stereo data set⁶ of (Leibe et al, 2007) with object class segmentation and disparity annotations. The Leuven data set was chosen as it provides image pairs from two cameras, 150 cm apart from each other, mounted on top of a moving vehicle in a public urban setting. In comparison with other data sets, the larger distance between the two cameras allows better depth resolution, while the real world nature of the data set allows us to confirm our statistical model's validity. However, the data set does not contain the object class or disparity annotations we require to learn and quantitatively evaluate the effectiveness of our approach.

To augment the data set, all image pairs were rectified and cropped to 316 × 256; then a subset of 70 non-consecutive frames was selected for human annotation. The annotation procedure consisted of two parts. Firstly, we manually labeled each pixel in every image with one of 7 object classes: Building, Sky, Car, Road, Person, Bike and Sidewalk. An 8th label, Void, is given to pixels that do not obviously belong to one of these classes. Secondly, disparity maps were generated by manually matching corresponding planar polygons, some examples of which are shown in figure 5 A, B and D.

⁶ This data set is available for download at http://www.vision.ee.ethz.ch/~bleibe/cvpr07/datasets.html

Fig. 4 Quantitative comparison of the performance of disparity RFs. We can clearly see that our joint approach (§3, Proposed Method) outperforms standard dense stereo approaches based on the Potts model (Kolmogorov and Zabih, 2001) (Potts Baseline), the truncated linear model described in §2.2 (LT Baseline), and the truncated linear model with Gaussian filtered unary potentials (LT Filtered). The correct pixel ratio is the proportion of pixels which satisfy |d_i − d^g_i| ≤ δ, where d_i is the disparity label of the i-th pixel, d^g_i is the corresponding ground truth label and δ is the allowed error. See §6 for discussion.

We believe our augmented subset of the Leuven stereo data set to be the first publicly available data set that contains both object class segmentation and dense stereo reconstruction ground truth for real world data. This data differs from commonly used stereo matching sets like the Middlebury data set (Scharstein and Szeliski, 2002), as it contains challenging large regions that are homogeneous in color and texture, such as sky and building, and suffers from poor photo-consistency due to lens flares in the cameras, specular reflections from windows and inconsistent luminance between the left and right cameras. It should also be noted that it differs from the CamVid database (Brostow et al, 2008) in two important ways: CamVid is a monocular sequence, and its 3D information comes in the form of a set of sparse 3D points with outliers⁷. These differences give rise to a challenging new data set that is suitable for training and evaluating models for dense stereo reconstruction, 2D and 3D scene understanding, and joint approaches such as ours.

⁷ The outlier rejection step was not performed on the 3D point cloud in order to exploit large re-projection errors as cues for moving objects. See (Brostow et al, 2008) for more details.

6 Experiments

For training and evaluation of our method we split the data set (§5) into three sequences: sequence 1, frames 0-447; sequence 2, frames 512-800; sequence 3, frames 875-1174. Augmented frames from sequences 1 and 3 were selected for training and validation, and sequence 2 for testing. All void pixels are ignored. Due to the insufficient size of the data set, the class Person is also set to void, and the parameters for the object class domain were chosen to be the same as in Ladicky et al (2009). The depth domain and joint parameters were learnt on the training set in the same manner as in Ladicky et al (2009). The performance on the training set in the depth domain is not significantly better than on the test set, so this approach does not lead to over-fitting of the parameters.

We quantitatively evaluate the object class segmentation by measuring the percentage of correctly predicted labels over non-void pixels in the test sequence. The dense stereo reconstruction performance is quantified by measuring the number of pixels which satisfy |d_i − d^g_i| ≤ δ, where d_i is the label of the i-th pixel, d^g_i is the corresponding ground truth label and δ is the allowed error. We increment δ from 0 (exact) to 20 (within 20 disparities), giving a clear picture of the performance. The total number of disparities used for evaluation is 100.
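The disparity metric just described is easy to reproduce; a sketch with synthetic predictions, purely for illustration:

```python
import numpy as np

def correct_pixel_ratio(pred, gt, delta, void_mask=None):
    """Proportion of (non-void) pixels with |d_i - d^g_i| <= delta."""
    valid = np.ones_like(gt, dtype=bool) if void_mask is None else ~void_mask
    return np.mean(np.abs(pred[valid] - gt[valid]) <= delta)

# Sweep delta from 0 (exact) to 20 disparities, as in figure 4.
rng = np.random.default_rng(2)
gt = rng.integers(0, 100, size=(256, 316))       # fake ground truth labels
pred = gt + rng.integers(-3, 4, size=gt.shape)   # fake noisy predictions
curve = [correct_pixel_ratio(pred, gt, d) for d in range(21)]
print(curve[0], curve[5])
```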

6.1 Object Class Segmentation

The object class segmentation RF as defined in §2.1 performed extremely well on the data set, better than we had expected, with 95.7% of predicted pixel labels agreeing with the ground truth. Qualitatively, we found that the performance is stable over the entire test sequence, including those images without ground truth. A quantitative comparison of the stand-alone and joint methods is given in table 1.

6.2 Dense Stereo Reconstruction

The Potts (Kolmogorov and Zabih, 2001) and truncated linear (LT) baseline dense stereo reconstruction models described in §2.2 performed relatively well for large δ, considering the difficulty of the data; they are plotted in figure 4 as 'Potts Baseline' and 'LT Baseline'. We found that on our data set a significant improvement was gained by smoothing the unary potentials with a Gaussian blur⁸ before incorporating the potential in the RF framework with the truncated linear model, as can be seen in figure 4 ('LT Filtered'). For qualitative results see figure 5 E.

⁸ This is a form of robust matching measure; see §3.1 of (Scharstein and Szeliski, 2002) for further examples.

Fig. 5 Qualitative object class and disparity results for the Leuven data set. (A) Original image. (B) Object class segmentation ground truth. (C) Proposed method object class segmentation result. (D) Dense stereo reconstruction ground truth. (E) Proposed method dense stereo reconstruction result. (F) Stand-alone dense stereo reconstruction result (LT). Best viewed in color.

Table 1 Quantitative results for object class segmentation of the stand-alone and joint approaches: pixel accuracy (%) for different object classes. The 'Global' measure corresponds to the total proportion of pixels labeled correctly; per class accuracy corresponds to the recall measure commonly used for this task (Shotton et al, 2006; Sturgess et al, 2009; Ladicky et al, 2009). Minor improvements were achieved for the smaller classes that had fewer pixels present in the data set. We expect the difference would be larger for harder data sets.

                  Global  Building   Sky    Car   Road  Sidewalk  Bike
  Stand alone      95.7     96.7    99.8   93.5   99.0    60.2    59.3
  Joint approach   95.8     96.7    99.8   94.0   98.9    60.6    59.5

6.3 Joint Approach

Our joint approach, defined in sections 3 and 4, consistently outperformed the best stand-alone dense stereo reconstruction, as can be seen in figure 4. The improvement in object class segmentation was less dramatic, with 95.8% of predicted pixel labels agreeing with the ground truth. We expect to see a more significant improvement on more challenging data sets, and the creation of an improved data set is part of our future work. Qualitative results can be seen in figure 5 C and E.

6.4 Monocular Reconstruction

Reconstruction from a monocular sequence is substantially harder than the corresponding stereo problem. Not only does it suffer from the same problems of varying illumination and homogeneous regions, but the effective baseline is substantially shorter, making it much harder to recover 3D information with any degree of accuracy, particularly in the region around the epipole (see §2.3 and figure 7). Despite this, plausible 3D reconstruction is still possible, particularly when performing joint inference over object class and disparity simultaneously; quantitative results can be seen in figure 6.


Fig. 6 Quantitative comparison of the performance of disparity RFs on monocular sequences. As with the stereo pair, we can clearly see that our joint approach (§3, Proposed Method) outperforms the stand-alone approaches with the baseline Potts model (Kolmogorov and Zabih, 2001) (Potts Baseline), truncated linear potentials (§2.2, LT Baseline) and truncated linear potentials with Gaussian filtered unary potentials (LT Filtered). The correct pixel ratio is the proportion of pixels which satisfy |d_i − d^g_i| ≤ δ, where d_i is the disparity label of the i-th pixel, d^g_i is the corresponding ground truth label and δ is the allowed error. See §6.4 for discussion, and figure 5 to compare against conventional stereo.

Note that the joint optimization of monocular disparity and object class outperforms the pre-existing methods (LT Baseline and Potts Baseline) applied to conventional two camera stereo data, and is comparable to the two camera results of LT Filtered. Qualitative results can be seen in figure 7. As expected, these show that the quality of reconstruction improves with the distance from the epipole. Consequently, one of the regions most successfully reconstructed is marked as void in the two camera disparity maps, as it is not in the field of view of both cameras. This suggests that the numeric evaluation of figure 6 may be overly pessimistic.

Fig. 7 Monocular results. (A) Original image. (B) Object class segmentation ground truth. (C) Proposed method object class segmentation result. (D) Dense stereo reconstruction ground truth. (E) Proposed method dense stereo reconstruction result. (F) Stand-alone dense stereo reconstruction result (LT). The quality of reconstruction improves with the distance from the epipole. Best viewed in color.

7 Conclusion

Traditionally, the prior in stereo has been fixed to some standard tractable model, such as truncated linear on disparities. Within this work we open up the intriguing possibility that the prior on shape should take into account the type of scene and objects we are looking at. To do this, we provided a new formulation of the problems, a new inference method for solving this formulation, and a new data set for the evaluation of our work. Evaluation of our work shows a dramatic improvement in stereo reconstruction compared to existing approaches. We believe a statistically significant gain can also be achieved for object class segmentation, but this would require a more challenging data set. This paper has proposed a formulation in which distributions of height maps for each object class in road scenes are used; one might also easily extend this idea to the unsupervised case with online learning, and this extension is investigated in (Bleyer et al, 2011). The method can be generalized to any other scenes where mutual information between 3D location and object label is present and can be learnt using discriminative methods. Furthermore, it allows the incorporation of other cues commonly used in RFs, such as object-class dependent pairwise potentials (Batra et al, 2008), occlusions (Kolmogorov and Zabih, 2001) or second order smoothness priors (Woodford et al, 2008) in the depth domain. This work puts us one step closer to achieving complete scene understanding, and provides strong experimental evidence that the joint labeling of different problems can bring substantial gains.

References

Alahari K, Russell C, Torr PHS (2010) Efficient piecewise learning for conditional random fields. In: Conference on Computer Vision and Pattern Recognition

Batra D, Sukthankar R, Tsuhan C (2008) Learning class-specific affinities for image labelling. In: Conference on Computer Vision and Pattern Recognition

Bleyer M, Rother C, Kohli P, Scharstein D, Sinha S (2011) Object stereo - joint stereo matching and object segmentation. In: Conference on Computer Vision and Pattern Recognition

Boykov Y, Jolly M (2001) Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: International Conference on Computer Vision

Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. Transactions on Pattern Analysis and Machine Intelligence

Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts. Transactions on Pattern Analysis and Machine Intelligence

Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision

Comaniciu D, Meer P (2002) Mean shift: A robust approach toward feature space analysis. Transactions on Pattern Analysis and Machine Intelligence

Dick AR, Torr PHS, Cipolla R (2004) Modelling and interpretation of architecture from several images. International Journal of Computer Vision

Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM

Gould S, Fulton R, Koller D (2009) Decomposing a scene into geometric and semantically consistent regions. In: International Conference on Computer Vision

Hoiem D, Efros A, Hebert M (2005) Automatic photo pop-up. ACM Transactions on Graphics

Hoiem D, Efros A, Hebert M (2006) Putting objects in perspective. In: Conference on Computer Vision and Pattern Recognition

Hoiem D, Rother C, Winn JM (2007) 3D layout CRF for multi-view object class recognition and segmentation. In: Conference on Computer Vision and Pattern Recognition

Kohli P, Kumar M, Torr PHS (2007) P³ and beyond: Solving energies with higher order cliques. In: Conference on Computer Vision and Pattern Recognition

Kohli P, Ladicky L, Torr PHS (2008) Robust higher order potentials for enforcing label consistency. In: Conference on Computer Vision and Pattern Recognition

Kolmogorov V, Zabih R (2001) Computing visual correspondence with occlusions via graph cuts. In: International Conference on Computer Vision

Kumar MP, Veksler O, Torr PHS (2011) Improved moves for truncated convex models. Journal of Machine Learning Research

Ladicky L, Russell C, Kohli P, Torr PHS (2009) Associative hierarchical CRFs for object class image segmentation. In: International Conference on Computer Vision

Leibe B, Cornelis N, Cornelis K, Gool LV (2007) Dynamic 3D scene analysis from a moving vehicle. In: Conference on Computer Vision and Pattern Recognition

Liu B, Gould S, Koller D (2010) Single image depth estimation from predicted semantic labels. In: Conference on Computer Vision and Pattern Recognition

Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision

Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie S (2007) Objects in context. In: International Conference on Computer Vision

Ramalingam S, Kohli P, Alahari K, Torr PHS (2008) Exact inference in multi-label CRFs with higher order cliques. In: Conference on Computer Vision and Pattern Recognition

Rother C, Kolmogorov V, Blake A (2004) GrabCut: interactive foreground extraction using iterated graph cuts. In: SIGGRAPH

Russell C, Ladicky L, Kohli P, Torr PHS (2010) Exact and approximate inference in associative hierarchical networks using graph cuts. In: Uncertainty in Artificial Intelligence

Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision

Shi J, Malik J (2000) Normalized cuts and image segmentation. Transactions on Pattern Analysis and Machine Intelligence

Shotton J, Winn J, Rother C, Criminisi A (2006) TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: European Conference on Computer Vision

Sturgess P, Alahari K, Ladicky L, Torr PHS (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference

Taskar B, Chatalbashev V, Koller D (2004) Learning associative Markov networks. In: International Conference on Machine Learning

Torr PHS, Murray DW (1997) The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision

Torralba A, Murphy K, Freeman W (2004) Sharing features: efficient boosting procedures for multiclass object detection. In: Conference on Computer Vision and Pattern Recognition

Tsochantaridis I, Joachims T, Hofmann T, Altun Y (2005) Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research

Woodford O, Torr PHS, Reid I, Fitzgibbon A (2008) Global stereo reconstruction under second order smoothness priors. In: Conference on Computer Vision and Pattern Recognition

Yotta (2011) Yotta DCL horizons. http://www.yottadcl.com/horizons/