Monday 18 January 2016

Load Testing Project Oxford's API



Last year at Build 2015, Microsoft announced a set of pre-trained computer vision REST APIs, along with a .NET SDK, for writing applications that use AI. The APIs were introduced to make sense of massive amounts of multimedia data: images, videos and voice.

Let's consider the example of an Android app that detects and labels images in your phone gallery based on the friends present in a particular image. To build such an app with client-side face detection and recognition, you would need to write complex computer vision code: detect the face (probably by implementing a Haar cascade in OpenCV, which by the way only detects frontal faces, or by looking into the flandmark library, which has its own complications), align the face, and then train a classifier for recognizing the faces. This is tiresome for a developer who might not have the time, expertise or infrastructure to write these methods.


Instead, Microsoft's Project Oxford APIs reduce face detection, verification, grouping and identification to a single call. The client simply uploads the image and the detection results are returned.

using (Stream imageFileStream = File.OpenRead(imageFilePath))
{
    var faces = await faceServiceClient.DetectAsync(imageFileStream);    
}  

Great documentation on how to begin using the APIs in C# and Android can be found on the official website: Face API Documentation

  
The catch?
You need a constant internet connection to upload the images whenever you use the APIs. Further, an Azure subscription is required, and you will have to pay if you exceed the transaction limit.

In this blog, I am going to explain how to make multiple calls for face detection and how to load test the APIs for performance. You should first register on the official website using your Live ID, which allows you up to 30,000 transactions per month, sufficient for testing purposes. Before performance testing the APIs, I would advise everyone to code the detection method themselves as explained in the official documentation. The code is really easy! You can also check out my code on GitHub: ProjectOxfordLoadTest

To load test the APIs, I collected images from publicly available datasets such as the BioID database, the Caltech Frontal Face Database and the ORL face database. Following is a simple unit test for testing the API.

[TestMethod]
public async Task TestFaceDetection()
{
    String filePath = "D:\Face Detection Databases\CaltechFaces\image.jpg";

    //Detect faces in the selected image
    FaceRectangle[] faceRects = await UploadAndDetectFaces(filePath);

    Assert.IsTrue(faceRects.Length > 0); //All images contain detectable frontal faces
}

private async Task<FaceRectangle[]> UploadAndDetectFaces(string imageFilePath)
{
    try
    {
        using (Stream imageFileStream = File.OpenRead(imageFilePath))
        {
            var faces = await faceServiceClient.DetectAsync(imageFileStream);
            var faceRects = faces.Select(face => face.FaceRectangle);
            return faceRects.ToArray();
        }
    }
    catch (Exception)
    {
        return new FaceRectangle[0];
    }
}

Comprehensive documentation for creating a Web Performance and Load Test project in Visual Studio is available on MSDN, which can help beginners set up the test environment and add a load test.

The load test is created to check how the API performs under stress from parallel calls. The Load Test project (described above) automatically generates parallel load for testing, and the number of concurrent users can be varied in the load test properties. I chose images at random from the database folder so that every parallel call tests a different image, as follows:

[TestMethod]
public async Task TestFaceDetection()
{
    //Extract all image files' location from a given folder
    String searchFolder = @"D:\Face Detection Databases\CaltechFaces";
    var filters = new String[] { "jpg", "jpeg", "png", "gif", "tiff", "bmp" };
    var files = GetFilesFromDirectory(searchFolder, filters, true);
    int numberOfFiles = files.Length;

    //Return a random image location 
    Random rnd = new Random();
    int randomImage = rnd.Next(0, numberOfFiles);
    string filePath = files[randomImage];

    //Detect faces in the selected image
    FaceRectangle[] faceRects = await UploadAndDetectFaces(filePath);
 
    Assert.IsTrue(faceRects.Length > 0); //All images contain detectable frontal faces
}

The above test method, however, reports the time taken to run the entire TestFaceDetection() method, so we define a timing context around the UploadAndDetectFaces(string) call in order to measure only the time taken by the face detection API. A TestContext property is included in the test class as described below:

[TestClass]
public class UnitTest1
{
    public TestContext TestContext
    {
        get{ return context; }
        set{ context = value; }
    }
    private TestContext context;
    
    [TestMethod]
    public async Task TestFaceDetection()
    {
        //TestContext is injected by the test framework through the property above
        
        //Extract all image files' location from a given folder
        String searchFolder = @"D:\Face Detection Databases\CaltechFaces";
        var filters = new String[] { "jpg", "jpeg", "png", "gif", "tiff", "bmp" };
        var files = GetFilesFromDirectory(searchFolder, filters, true);
        int numberOfFiles = files.Length;
        
        //Return a random image location 
        Random rnd = new Random();
        int randomImage = rnd.Next(0, numberOfFiles);
        string filePath = files[randomImage];
        
        if (context.Properties.Contains("$LoadTestUserContext")) //Begin timing load test
        {
            context.BeginTimer("MyTimerFaceDetection");
        }
        
        //Detect faces in the selected image
        FaceRectangle[] faceRects = await UploadAndDetectFaces(filePath);
        
        if (context.Properties.Contains("$LoadTestUserContext")) //End timing load test
        {
            context.EndTimer("MyTimerFaceDetection");
        }
        
        Assert.IsTrue(faceRects.Length > 0); //All images contain detectable frontal faces
    }
}

Once you run your load test, you can vary the parallel load in its properties. This helps you evaluate how the API performs under stress and how many tests are completed per second with your current network bandwidth. The time taken to return the detected faces depends on the time taken to upload the image, and hence the performance figures depend on the internet bandwidth.

If you have been able to successfully create and run the load test, you will be able to visualize the performance graphs of the API. You can also visit my GitHub repo for the project: ProjectOxfordLoadTest.

For a constant load of just one user and a test time of 3:00 minutes, a total of 184 calls were made to the server with an average response time of 0.93 seconds. This response time was for images from the ORL database, which are hardly 10 KB each. 95% of the calls were answered within 1.84 seconds!
Performance graph using images from ORL Database (~ 10 KB)
Further, for the images in the BioID database, which are approximately 65 KB each, only 141 calls were made in 3:00 minutes. This yields an average response time of 1.14 seconds, which amounts to about 0.87 FPS.
Performance graph using images from BioID Database (~ 65 KB)
Thus, while Project Oxford's Face APIs definitely appear to be a good option if you need to detect faces, head pose or emotion, they are certainly not suitable for real-time processing. The exposed APIs do implement the current state of the art in identification, tracking and recognition, and they would be my first choice if I had to build a computer vision app that can benefit from server-side processing.

Thursday 22 October 2015

A blink detection technique using combination of eye detection cascades

During my recent internship at Microsoft Research, I worked on a blink detection system that could detect the blinks of a person while they are working on their computer. A working webcam connection was of course required for this. For those who are beginning with OpenCV, the documentation on face and eye detection can be found here: OpenCV (C++): Face & Eye Detection

Several initial ideas were explored that may help in detecting a person's blink:

1. We can detect the face using the Viola-Jones method and subtract consecutive frames in the input video stream. Blinks can be highlighted using this method only when the user does not move their head at all; a slight change in head pose can completely ruin the results, so this method is only suitable when no head motion is present. In fact, similar techniques [Method 1, Method 2] have already been proposed for people with severe disabilities such as ALS.

2. Detect both the face and the eyes using the Viola-Jones method. Subtracting consecutive frames of only the eye regions in the video, followed by thresholding, can detect motion changes when a person blinks. However, this also produces an output when the user moves the iris without blinking. Hence, this method was also rejected.

3. We can threshold the eye regions and then calculate the number of pixels in the segmented image. More segmented pixels might indicate an open eye, whereas a low count can signify a closed eye. This is because the dark iris in the open eye can significantly increase the count of thresholded pixels. Such a technique is prone to changes in ambient light and dark shadows. 

4. An adaptive threshold based on the cumulative histogram of the eye region always yields a segmented image irrespective of whether the eye is open or closed, and hence it was not used either.

5. Using pre-learnt cascades
The problem of blink detection can be broken down into the simpler problem of detecting whether the eye is open or closed in each frame, using the pre-learnt Haar cascades available in OpenCV. It was crucial to know which cascades detect open eyes and which detect closed eyes. After an extensive literature survey, it was found that haarcascade_eye_tree_eyeglasses.xml can detect eyes in both open and closed states, while haarcascade_lefteye_2splits.xml and haarcascade_righteye_2splits.xml detect eyes only when they are open.

It is advisable to run the eye detector only in the region where the eye may be present. Thus, eye detection always follows face detection and is almost never performed on its own. Separate detectors are initialized for the left and right eye using their respective cascades: the left eye detector is run on the right half of the detected face and the right eye detector on the left half. Although face detection is performed on a size-reduced image, eye detection is performed on the full-resolution image because the eye region is small.
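As a rough illustration of this setup (not the original project code; the detector names, scale factor and parameters are my own assumptions), the following sketch detects the largest face on a half-size frame and then runs each eye cascade only on the corresponding half of the full-resolution face region.

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>
using namespace cv;
using namespace std;

// Detect the largest face on a size-reduced image, then search for each eye
// only inside the matching half of the full-resolution face region.
void detectEyeRegions(const Mat& frame, CascadeClassifier& faceDetector,
                      CascadeClassifier& leftEyeDetector, CascadeClassifier& rightEyeDetector)
{
    Mat gray, small;
    cvtColor(frame, gray, COLOR_BGR2GRAY);
    resize(gray, small, Size(), 0.5, 0.5);   // face detection on a half-size image

    vector<Rect> faces;
    faceDetector.detectMultiScale(small, faces, 1.1, 3, CASCADE_FIND_BIGGEST_OBJECT);
    if (faces.empty()) return;

    // Scale the face rectangle back to the full-resolution image (clipped to its bounds)
    Rect face = Rect(faces[0].x * 2, faces[0].y * 2, faces[0].width * 2, faces[0].height * 2)
                & Rect(0, 0, gray.cols, gray.rows);

    // The person's left eye appears in the right half of the face region, and vice versa
    Rect topLeftOfFace(face.x, face.y, face.width / 2, face.height / 2);
    Rect topRightOfFace(face.x + face.width / 2, face.y, face.width / 2, face.height / 2);

    vector<Rect> leftEye, rightEye;
    leftEyeDetector.detectMultiScale(gray(topRightOfFace), leftEye, 1.1, 2, CASCADE_FIND_BIGGEST_OBJECT);
    rightEyeDetector.detectMultiScale(gray(topLeftOfFace), rightEye, 1.1, 2, CASCADE_FIND_BIGGEST_OBJECT);
    // leftEye / rightEye rectangles are relative to their half-face ROIs
}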

Following are some of the eye detection cascades that are available:

Cascade Classifier                  | Reliability | Speed   | Eyes Found     | Glasses
haarcascade_mcs_lefteye.xml         | 80%         | 18 msec | Open or closed | No
haarcascade_lefteye_2splits.xml     | 60%         | 7 msec  | Open or closed | No
haarcascade_eye.xml                 | 40%         | 5 msec  | Open only      | No
haarcascade_eye_tree_eyeglasses.xml | 15%         | 10 msec | Open only      | Yes

We begin by scanning the left side of the face with the haarcascade_eye_tree_eyeglasses.xml cascade, which outputs an eye location irrespective of whether the eye is open or closed. The left side of the face is then scanned again using haarcascade_lefteye_2splits.xml, which detects only open eyes. Thus,

a) If the eye is detected using both the detectors, it implies that the eye is open.
b) If the eye is detected by just the initial detector, it implies that the eye is closed.
c) If neither detector finds the eye, it may imply that the eye was not found at all, probably due to occlusions, dark shadows or specular reflections on eyeglasses.

Following is a sample of eye detection code written in C++ using the OpenCV library, which detects whether the eye is open or closed. Based on changes in the state of the eye, blinks can be detected.

vector<Rect> eyesRight = storeLeftEyePos(topRightOfFace); //Detect Open or Closed eyes

if (eyesRight.size() > 0)
{            
       // Now look for open eyes only
       vector<Rect> eyesRightNew = storeLeftEyePos_open(topRightOfFace);

       if (eyesRightNew.size() > 0) //Eye is open
       {              
              //..
       }
       else //Eye is closed
       {              
              //..
       }
}

// Method for detecting open and closed eyes in right half of face
vector<Rect> storeLeftEyePos(Mat rightFaceImage)
{
       vector<Rect> eyes;
       leftEyeDetector.detectMultiScale(
                                        rightFaceImage, 
                                        eyes,
                                        1.1,
                                        2,
                                        CASCADE_FIND_BIGGEST_OBJECT, 
                                        Size(0, 0)
                                        );

       return eyes;
}

// Method for detecting open eyes in right half of face
vector<Rect> storeLeftEyePos_open(Mat rightFaceImage)
{
       vector<Rect> eyes;
       leftEyeDetector_open.detectMultiScale(
                                             rightFaceImage, 
                                             eyes,
                                             1.1,
                                             2,
                                             CASCADE_FIND_BIGGEST_OBJECT, 
                                             Size(0, 0)
                                             );
      
       return eyes;
}

//Declaring and loading the cascades
CascadeClassifier leftEyeDetector;       //detects open or closed eyes
CascadeClassifier leftEyeDetector_open;  //detects open eyes only
string leftEyeCascadeFilename = 
"C:\\opencv\\sources\\data\\haarcascades_cuda\\haarcascade_lefteye_2splits.xml";
leftEyeDetector.load(leftEyeCascadeFilename);

string leftEye_open_CascadeFilename =
"C:\\opencv\\sources\\data\\haarcascades_cuda\\haarcascade_eye_tree_eyeglasses.xml";
leftEyeDetector_open.load(leftEye_open_CascadeFilename);

The methodology was tested on the Talking Face Video, a 200-second video recording of a person engaged in a conversation. The following images demonstrate the effectiveness of the method described above: a green box implies that an open eye has been detected, while a red bounding box is drawn when a closed eye is detected.

    


The complete video can be seen here: 



In the future, using a method that outputs the percentage of eye closure (PERCLOS) for each eye could yield an even better estimate of the state of the eye.

Sunday 3 August 2014

An Appearance-Based Technique for Detecting Obstacles

A majority of upcoming smart cars employ safety techniques that detect pedestrians, obstacles and vehicles in their path. For example, the car detection algorithm employed by Vicomtech (Video Link) tracks the cars in front and estimates the time to collision. Beyond cars, an obstacle detection system can be incorporated into smartphones and used by the blind for safely navigating an indoor environment.

During my recent internship at Soliton Technologies, Bangalore, known for its work on smart cameras and factory automation using vision systems, I worked on developing a system for detecting obstacles in the video input from a monocular camera and for segmenting the walkable floor region. The technique may be used for assisting the blind or for moving robots indoors.

We concentrated mainly on using color cues to find the floor region. Once the floor region is detected, we simply highlight all other regions in the image as obstacles. Our technique was inspired by the research work of Iwan Ulrich and Illah Nourbakhsh in their paper "Appearance-Based Obstacle Detection with Monocular Color Vision" (Paper Link).

We begin by assuming that the bottom-most region of the image contains floor, as shown. The region inside the red rectangle is taken as the reference region containing the floor.

Figure 1: Marked rectangle (in red) is assumed to contain floor region.
In the next step, we calculate the histogram of this region and, for every pixel in the entire image, we find the probability of that pixel being similar to the pixels in the red rectangular region (based on Bayes' theorem). Interested readers can look into the "backprojection" method implemented in OpenCV.

Finally, all pixels dissimilar in color to the pixels in the rectangular region are highlighted as obstacles.
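A minimal sketch of this backprojection pipeline with OpenCV's calcHist and calcBackProject is given below; the reference rectangle, histogram bins, channels and threshold are illustrative assumptions, not the values used in the project.

#include <opencv2/imgproc.hpp>
#include <opencv2/core.hpp>
using namespace cv;

// Label pixels that are unlikely to share the reference region's color as obstacles.
Mat detectObstacles(const Mat& frameBGR)
{
    // Reference floor region: a rectangle near the bottom of the frame
    Rect floorRect(frameBGR.cols / 4, frameBGR.rows * 3 / 4,
                   frameBGR.cols / 2, frameBGR.rows - frameBGR.rows * 3 / 4);
    Mat floorPatch = frameBGR(floorRect);

    // Histogram of the reference region (here over the blue and green channels; illustrative)
    int channels[] = {0, 1};
    int histSize[] = {32, 32};
    float range[] = {0, 256};
    const float* ranges[] = {range, range};
    Mat hist;
    calcHist(&floorPatch, 1, channels, Mat(), hist, 2, histSize, ranges);
    normalize(hist, hist, 0, 255, NORM_MINMAX);

    // Back-project: per-pixel likelihood of belonging to the floor histogram
    Mat backProj;
    calcBackProject(&frameBGR, 1, channels, hist, backProj, ranges);

    // Low likelihood of being floor -> obstacle
    Mat obstacleMask;
    threshold(backProj, obstacleMask, 50, 255, THRESH_BINARY_INV);
    return obstacleMask;
}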

The first approach was based on comparing the RGB values of the pixels. Pixels having high probability of belonging to the histogram of the RGB image inside the rectangular region were considered as non-obstacles.

Figure 2: The pixels which did not match the pixels inside the rectangular region are highlighted in red
To get a cleaner result, we modify the algorithm to consider a larger non-obstacle area for a better histogram estimate.

Figure 3: Improved detection results on the same video frame
This technique still suffered from some major drawbacks: floor regions affected by specular reflections (due to light shining off the surface) and floor regions containing shadows are still highlighted as obstacles.

Hence, we investigated the same algorithm using the HSV model, as it can handle specular reflections better. The technique is similar to the Mean-Shift algorithm used for tracking moving objects.

Figure 4: Obstacles highlighted using HSV model
The HSV model handles cases of specular reflection and shadows well, but fails at many other locations. In particular, regions with darker intensities are not detected as obstacles.

We switched back to the RGB model, found regions of specular reflection (based on intensity and gradient magnitude) and marked them as non-obstacles, while regions of shadow were blurred to soften their effect. A technique for storing and retrieving previous histograms was also implemented for better performance. A sketch of the specular-reflection test is given below, followed by results on various indoor and outdoor videos:
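This sketch marks bright, low-gradient pixels as specular highlights; the grayscale intensity and gradient-magnitude thresholds are illustrative assumptions, not the values used in the project.

#include <opencv2/imgproc.hpp>
using namespace cv;

// Mark bright, low-texture pixels as specular highlights so that they are
// not flagged as obstacles.
Mat specularMask(const Mat& frameBGR)
{
    Mat gray, gx, gy, gradMag;
    cvtColor(frameBGR, gray, COLOR_BGR2GRAY);
    Sobel(gray, gx, CV_32F, 1, 0);
    Sobel(gray, gy, CV_32F, 0, 1);
    magnitude(gx, gy, gradMag);

    Mat brightMask = gray > 230;                 // high intensity
    Mat flatMask;
    compare(gradMag, 30.0, flatMask, CMP_LT);    // low gradient magnitude
    Mat mask;
    bitwise_and(brightMask, flatMask, mask);     // specular highlight candidates
    return mask;
}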

Figure 5 a): Results in outdoor setting
Figure 5 b): Results in Indoor setting
Figure 5 c): Results in Indoor setting

A segmentation-based technique was also tried but was rejected owing to its huge computation time. The algorithm is currently being converted into a viable product that can be used for navigation with only a video input and no other sensors.

My sincere advice for people looking for work in the field of computer vision is to get in touch with the folks at Soliton Technologies for its awesome start-up-like culture and great office environment.

Friday 4 July 2014

Renovating Glucometers

A great chunk of today's biological research focuses on improving existing technologies, for example using image processing to count cells in fluorescently stained leaf stems, or finding efficient computational methods for studying molecular, structural or cellular biology. Biologists today are also going the smart way by using smartphones as a computing platform. EyeNetra (Link: www.eyenetra.com), a handheld device that integrates with a smartphone and provides vision correction technology to the masses, is a great example.

Similar research is being conducted to replace traditional glucometers with handheld devices, wherein a simple image capture of the blood-impregnated strip deduces the glucose content of the blood.

Microfluidic paper, when saturated with blood, yields varying intensities of color, where the intensity of the color developed is inversely proportional to the glucose content of the blood. Most of us will be familiar with the Accu-Chek strips used by diabetic patients to test their blood glucose levels. In the particular case of Accu-Chek strips, a greenish color develops when the strip is saturated with human blood. The luminosity of the developed color is higher for lower glucose content (simply said, the higher the glucose concentration, the darker the color developed).

Following is the plot obtained for glucose concentration vs. the color developed on the Accu-Chek strips:

We found that glucose concentration correlated better with the luminosity of the developed color than with various other combinations of the red, green and blue components.

We imagined an app that could capture an image of the strip and find the luminosity of the circular region on it. A person could thus find his or her blood glucose level without a traditional glucometer. This smartphone-based app could also suggest remedies and digitally transmit the result, along with the phone's location, to a central database, which can be used to estimate demographics of people with abnormal glucose levels.

A technique was required for segmenting the circular disk on the Accu-Chek strips. We used snakuscules for their ability to segment circular contours. Following is the sequence in which a snakuscule captures the circular disk on the strip.


A patient capturing the strip's image may be in different environments and under varied ambient lighting conditions. Although the color produced on the strip will be similar for similar glucose levels, it may be perceived differently due to the presence of external ambient lighting. It was thus necessary to normalize the colors using a color constancy algorithm.

The Von Kries coefficient law for color constancy gave the following results, where two different strips with the same glucose content but under different ambient lighting were converted into images with similar colors.

Figure: RGB correction. The Von Kries coefficient law can also be used to add illumination to dark images; figures (b) and (d) are obtained from figures (a) and (c) respectively.
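The Von Kries correction itself is a per-channel diagonal scaling. Below is a minimal sketch assuming a gray-world estimate of the illuminant (the post does not state how the illuminant was actually estimated):

#include <opencv2/core.hpp>
#include <vector>
using namespace cv;

// Von Kries-style correction: scale each channel so that the estimated
// illuminant maps to a neutral reference white (gray-world assumption).
Mat vonKriesCorrect(const Mat& srcBGR)
{
    Scalar illuminant = mean(srcBGR);   // per-channel illuminant estimate
    double target = (illuminant[0] + illuminant[1] + illuminant[2]) / 3.0;

    std::vector<Mat> channels;
    split(srcBGR, channels);
    for (int c = 0; c < 3; ++c)
        channels[c].convertTo(channels[c], CV_8U, target / (illuminant[c] + 1e-6));

    Mat corrected;
    merge(channels, corrected);
    return corrected;
}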
After normalizing the illumination and segmenting the disk, we calculated the luminosity of the developed color by averaging the luminosity of all individual pixels inside the disk. A final curve was fitted for glucose concentration vs. luminosity (complete data not shown).
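For the averaging step, here is a small sketch assuming the snakuscule returns the disk's centre and radius, and taking the HLS lightness channel as the luminosity measure (the exact definition of luminosity used in the project is not given in the post):

#include <opencv2/imgproc.hpp>
using namespace cv;

// Average luminosity of the pixels inside the segmented circular disk.
// Luminosity is taken here as the HLS lightness channel, as an assumption.
double meanDiskLuminosity(const Mat& stripBGR, Point centre, int radius)
{
    Mat mask = Mat::zeros(stripBGR.size(), CV_8U);
    circle(mask, centre, radius, Scalar(255), FILLED);   // disk found by the snakuscule

    Mat hls;
    cvtColor(stripBGR, hls, COLOR_BGR2HLS);               // channel 1 of HLS is lightness
    return mean(hls, mask)[1];
}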


The above strategy can be used to find the glucose concentration for any strip, given the relationship between glucose level and the color developed upon blood impregnation. A smartphone app built on such an algorithm could greatly benefit third-world countries, where low-cost, portable devices can reach the masses. With more than 347 million people suffering from diabetes worldwide, such technology can be used to obtain quick and reliable results.


Thursday 1 May 2014

Pedestrian Detection: Why Dalal and Triggs are the godfathers of today's computer vision family!

Detecting objects in an image has always been a hot topic among computer vision enthusiasts. What initially began as the task of detecting a single object in an image has today extended to large-scale competitions that use millions of images to train classifiers that can detect more than a hundred categories of objects in a single image. For example, the ILSVRC2014 (ImageNet Large Scale Visual Recognition Challenge) dares competitors to detect up to 200 object categories in a single image.

Lowe's SIFT (Scale Invariant Feature Transform) was one of the earliest attempts at matching objects in an unknown image to a training image. SIFT, although still a strong method for object matching, fails when the object of interest exhibits in-class variation. An alternative was suggested by Dalal and Triggs in their seminal research work on human detection: "Histograms of Oriented Gradients for Human Detection". The original paper can be found here.

The paper describes an algorithm that can handle variations in human posture, differently colored clothing and viewing angle while detecting human figures in an image. Put simply, the algorithm can identify humans (or any other object) irrespective of posture and color variation. Here I explain the implementation in detail.

Creating the HOG feature descriptor

The authors compute weighted histograms of gradient orientations over small spatial neighborhoods, gather these neighboring histograms into local groups and contrast normalize them. 

Following are the steps: 

a) Compute centered horizontal and vertical gradients with no smoothing.
b) Compute gradient orientation and magnitudes. 

  • For a color image, pick the color channel with the highest gradient magnitude for each pixel.

c) For a 64x128 image,

  • Divide the image into 16x16 blocks with 50% overlap (7x15 = 105 blocks in total).
  • Each block consists of 2x2 cells, with each cell being 8x8 pixels.

d) Quantize the gradient orientation into 9 bins

  • The vote is the gradient magnitude
  • Interpolate votes tri-linearly between neighbouring bin centres
  • The vote can also be weighted by a Gaussian to downweight the pixels near the edge of the block
e) Concatenate the histograms (feature dimension: 105 x 4 x 9 = 3,780); a short OpenCV sketch of this descriptor follows below.
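OpenCV ships a HOGDescriptor implementation with exactly these window, block, cell and bin settings; here is a quick sketch (the sample image path is hypothetical) to verify the 3,780-dimensional descriptor for a 64x128 window:

#include <opencv2/objdetect.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include <vector>
using namespace cv;

int main()
{
    // 64x128 window, 16x16 blocks with 8x8 stride (50% overlap), 8x8 cells, 9 bins
    HOGDescriptor hog(Size(64, 128), Size(16, 16), Size(8, 8), Size(8, 8), 9);

    Mat window = imread("person_64x128.png", IMREAD_GRAYSCALE);  // hypothetical 64x128 crop
    std::vector<float> descriptor;
    hog.compute(window, descriptor);

    std::cout << descriptor.size() << std::endl;  // prints 3780 = 105 blocks x 4 cells x 9 bins
    return 0;
}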


The entire technique was summarized nicely in a lecture by Dr. Mubarak Shah (Professor, University of Central Florida).



Training Methodology

We construct an SVM classifier using positive images (containing human figures) and negative images (no human figures) from the INRIA dataset. All the images (positive and negative) were resized to 128x64 pixels and a HOG feature descriptor was computed for each one. The descriptors were fed into the classifier, which was trained using supervised learning.
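A compact sketch of this training step using OpenCV's ml module follows (OpenCV 3.x API; the row-per-image layout, label convention and C value are my assumptions, not taken from the post):

#include <opencv2/ml.hpp>
using namespace cv;

// Train a linear SVM on HOG descriptors of 64x128 positive/negative crops.
// 'samples' holds one 3,780-float row per image (CV_32F); 'labels' holds +1 / -1 per row (CV_32S).
Ptr<ml::SVM> trainHogSvm(const Mat& samples, const Mat& labels)
{
    Ptr<ml::SVM> svm = ml::SVM::create();
    svm->setType(ml::SVM::C_SVC);
    svm->setKernel(ml::SVM::LINEAR);
    svm->setC(0.01);                             // illustrative soft-margin constant
    svm->train(samples, ml::ROW_SAMPLE, labels);
    return svm;
}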


Choosing the Training Dataset 

The INRIA dataset (webpage link) was constructed to contain 1,800 pedestrian images in diverse environments and lighting conditions, with a large range of poses and backgrounds. The INRIA dataset is much more challenging than the initially used MIT pedestrian dataset.

For training, 1,208 positive images of humans of size 128x64 were taken, all cropped from a varied set of photos.


Similarly, 1,218 negative images containing no human figures were taken.


Sliding Window Approach

The image is scanned at all scales and positions. Initially, windows are extracted at the lowest scale, i.e. 128x64, and the scale is then increased each time by a ratio of 1.05. A HOG descriptor is computed for the image region inside each detection window and fed into the classifier.
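OpenCV wraps this multi-scale scan, together with Dalal and Triggs' pre-trained people detector, in detectMultiScale. A minimal sketch with the 1.05 scale ratio mentioned above (the stride and padding values are illustrative):

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>
using namespace cv;

// Scan the image at all positions and scales (scale ratio 1.05) and draw the hits.
void detectPedestrians(Mat& image)
{
    HOGDescriptor hog;
    hog.setSVMDetector(HOGDescriptor::getDefaultPeopleDetector());

    std::vector<Rect> found;
    hog.detectMultiScale(image, found, 0 /*hit threshold*/, Size(8, 8) /*win stride*/,
                         Size(32, 32) /*padding*/, 1.05 /*scale*/, 2 /*final threshold*/);

    for (const Rect& r : found)
        rectangle(image, r, Scalar(0, 255, 0), 2);
}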


Results 

Some results obtained after non-maximal suppression of the detected windows:







Given a suitable dataset, the above algorithm can also be used for detecting objects of interest other than human figures (e.g. cars and motorbikes). The algorithm handles in-class variation while remaining efficient. The HOG descriptor suggested by Dalal and Triggs remains at the frontier of object recognition systems.


Tuesday 11 March 2014

Snakuscules

I recently got familiar with a methodology for contour segmentation used in biomedical image processing. Researchers at the Biomedical Imaging Group of École polytechnique fédérale de Lausanne (EPFL) worked on segmenting approximately circular regions in images using the concept of snakes, known as the snakuscule.

A snakuscule is a simple active contour that preys upon bright blobs in an image. Here is my implementation:

Snakuscule enveloping a bright blob in an image
Such active contours move under the influence of energy gradients. For every snakuscule, the energy difference between the outer adjoining annulus and the inner disk is calculated, and the contour moves in the direction of decreasing energy difference, which makes it converge onto bright blobs.


Energy to be minimized can be given by the equation:
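Since the equation appears as an image in the original post, here is a hedged reconstruction consistent with the description above, with I(x, y) the image intensity, D the inner disk and A the outer annulus:

E = \int_{A} I(x,y)\,dx\,dy \;-\; \int_{D} I(x,y)\,dx\,dy

E decreases as the inner disk covers brighter pixels and the surrounding annulus covers darker ones, which is what drives the contour towards bright blobs.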


To minimize its energy and thereby detect brighter blobs, the snakuscule can move in any of the four directions or vary its radius. Out of these six possible actions, the snakuscule selects the one that maximizes the decrease in energy.

We normalize the energy function by dividing the energies of the outer annulus and the inner disk by their respective areas.

Normalized form:
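Again reconstructed as a sketch, with |A| and |D| denoting the areas of the annulus and the disk:

E_{norm} = \frac{1}{|A|}\int_{A} I(x,y)\,dx\,dy \;-\; \frac{1}{|D|}\int_{D} I(x,y)\,dx\,dy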

A more rigorous implementation can be seen in the original paper on snakuscules by Philippe Thévenaz and Michael Unser. 

Matlab codes for the same can be found at: github.com/sanyamgarg93/Snakuscules