Thursday 1 May 2014

Pedestrian Detection: Why Dalal and Triggs are the godfathers of today's computer vision family!

Detecting objects in an image has long been a hot topic among computer vision enthusiasts. What initially began as the task of detecting a single object in an image has today grown into large-scale competitions that use millions of images to train classifiers capable of detecting over a hundred categories of objects in a single image. For example, ILSVRC2014 (the ImageNet Large Scale Visual Recognition Challenge) dares competitors to detect up to 200 object categories in a single image.

Lowe's SIFT (Scale Invariant Feature Transform) was one of the earliest attempts at matching objects in an unknown image against a training image. SIFT, although still an excellent method for object matching, fails when the object of interest suffers from in-class variation. An alternative was suggested by Dalal and Triggs in their seminal research work on human detection, "Histograms of Oriented Gradients for Human Detection". The original paper can be found here.

The paper describes an algorithm that handles variation in human posture, clothing color, and viewing angle while detecting human figures in an image. Put simply, the algorithm can identify humans (or any other object class) irrespective of posture and color variation. Here I explain the implementation in detail.

Creating the HOG feature descriptor

The authors compute weighted histograms of gradient orientations over small spatial neighborhoods, gather these neighboring histograms into overlapping local blocks, and contrast-normalize them.

Following are the steps: 

a) Compute centered horizontal and vertical gradients with no smoothing.
b) Compute the gradient orientation and magnitude at each pixel. 

  • For color images, pick at each pixel the color channel with the highest gradient magnitude.

c) For a 64x128 image,

  • Divide the image into 16x16 blocks with 50% overlap (7x15 = 105 blocks in total).
  • Each block consists of 2x2 cells, each of size 8x8 pixels.

d) Quantize the gradient orientations into 9 bins (over 0°-180°, i.e. unsigned gradients)

  • The vote is the gradient magnitude.
  • Interpolate votes tri-linearly between neighboring bin centers (in x, y, and orientation).
  • Votes can also be weighted with a Gaussian to downweight pixels near the edges of the block.
e) Concatenate the block histograms into a single feature vector (feature dimension: 105x4x9 = 3,780); see the code sketch below.
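
To make steps (a) to (e) concrete, here is a minimal Python sketch using OpenCV. The default parameters of cv2.HOGDescriptor match the ones listed above (64x128 window, 16x16 blocks at an 8x8 stride, 8x8 cells, 9 bins). The filename is a placeholder, and the sketch converts to grayscale for brevity where the paper instead keeps the strongest color channel per pixel.

import cv2
import numpy as np

# Steps (a) and (b): centered [-1, 0, 1] gradients with no smoothing,
# then per-pixel magnitude and orientation.
img = cv2.resize(cv2.imread("person.png"), (64, 128))  # placeholder crop; OpenCV sizes are (width, height)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
gx = cv2.filter2D(gray, cv2.CV_32F, np.array([[-1, 0, 1]], np.float32))
gy = cv2.filter2D(gray, cv2.CV_32F, np.array([[-1], [0], [1]], np.float32))
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)

# Steps (c) to (e): the default HOGDescriptor reproduces the block/cell/bin
# layout above, so the descriptor has exactly 105 x 4 x 9 = 3,780 entries.
hog = cv2.HOGDescriptor()
descriptor = hog.compute(img)
print(descriptor.size)  # 3780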


The entire technique was summarized nicely in a lecture by Dr. Mubarak Shah (Professor, University of Central Florida).



Training Methodology

We construct a linear SVM classifier using positive images (containing human figures) and negative images (containing no human figures) from the INRIA dataset. All the images (positive and negative) were resized to 128x64 pixels, and a HOG feature descriptor was computed for each of them. These descriptors, together with their labels, were then fed to the classifier, which was trained by supervised learning. 
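
A hedged sketch of this training stage follows, assuming the INRIA crops sit in two local folders (the paths and the C value are placeholders; the authors trained with the SVMLight package, for which scikit-learn's LinearSVC stands in here):

import glob
import cv2
import numpy as np
from sklearn.svm import LinearSVC

hog = cv2.HOGDescriptor()  # default Dalal-Triggs parameters, 3780-dim output

def hog_features(paths):
    # One HOG descriptor per 128x64 training crop.
    feats = []
    for path in paths:
        img = cv2.resize(cv2.imread(path), (64, 128))
        feats.append(hog.compute(img).ravel())
    return np.array(feats)

# Placeholder paths; point these at the INRIA positive/negative crops.
pos_paths = sorted(glob.glob("INRIA/train/pos/*.png"))
neg_paths = sorted(glob.glob("INRIA/train/neg/*.png"))

X = np.vstack([hog_features(pos_paths), hog_features(neg_paths)])
y = np.hstack([np.ones(len(pos_paths)), np.zeros(len(neg_paths))])

clf = LinearSVC(C=0.01)  # soft-margin linear SVM; C is an assumed setting
clf.fit(X, y)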


Choosing the Training Dataset 

The INRIA dataset (webpage link) was constructed to contain 1800 pedestrian images taken in diverse environments and lighting conditions, covering a large range of poses and backgrounds. It is much more challenging than the MIT pedestrian dataset used in earlier work.

For training, 1208 positive images of humans, each of size 128x64, were taken, all cropped from a varied set of photos.


Similarly, 1218 negative images, containing no human figures, were taken.


Sliding Window Approach

The image is scanned at all scales and positions. Windows are initially extracted at the lowest scale, i.e. 128x64 pixels, and the scale is then increased each time by a ratio of 1.05. A HOG descriptor is computed for the part of the image inside each detection window and fed into the classifier.
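
OpenCV bundles this pyramid-plus-sliding-window loop, together with a people detector pre-trained on the INRIA data, into a single call. A minimal sketch follows; the filename is a placeholder, and the winStride/padding values are common choices rather than anything prescribed above.

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.png")  # placeholder test image
# scale=1.05 is the pyramid ratio described above; the window slides
# in 8-pixel steps at every pyramid level.
rects, weights = hog.detectMultiScale(img, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)

for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.png", img)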


Results 

Some results obtained after non-maximum suppression of the detected windows:

Once the required dataset is provided, the above algorithm can also be used to detect objects of interest other than human figures (e.g. cars and motorbikes). The algorithm handles in-class variation while delivering efficient performance. The HOG descriptor proposed by Dalal and Triggs remains today at the frontier of object recognition systems.