The article below presents the results of one of the parts of my master thesis. The purpose of the thesis was to research methods that are applicable for move tracking and gesture recognition and a try of building a working system. Firstly the idea of the study was to apply gathered knowledge for sign-language translation. Unfortunately the lack of time (4 months) and hardware limitation allowed only for proof of concept’s implementation. Although the system is good enough for human-computer interaction.
Unfortunately the real world is to complicated to be described well within a computer application. A common technique while implementing complex systems is to make assumptions that simplify the real problem to a simpler one. Also in this case I made some simplification that limited users behavior but allowed me to build a working system within reasonable time. Therefore the main assumption I did:
– Only one object moves at time.
– We consider only moves in 2D plane.
– Constant source of light.
– Camera doesn’t move.
The first assumption explains by itself. Just to be precise, having just one moving objects removes the necessity of deciding what is the object that actually needs to be traced, or tracing numbers of objects. The second one tells that depth is not considered. A very important assumptions was the last one. As the segmentation was colored based and I was working only with a very color sensitive web-camera I couldn’t allow on working with sun-light. Even little changes of the light angle or saturation changes the color perceived by the camera too much.
The assumptions are just some abstractions that are used mainly for simplifying the image-processing part and they could be chosen freely as long as they don’t violate the system specification. The main point of them is to skip some of the problems that are not the crucial part of the system, and concentrate on what really matters.
The method I decided to use for segmentation uses the colors histogram approach. Before tracing starts, there is one extra preprocessing step (you may check it on the video), where I select the area to take a sample of colors to trace. The color value appearing more often in the sample has larger value in the histogram. Then I propagate back the histogram to the image, but this time in the grayscale. The bigger value of the color, the more white it becomes on the output image (the other window on the left). It is worth mentioning that in this case of original image I used HSV model rather than RGB.
Picture 1. Click to see the movie presenting segmentation.
The second step of the segmentation is to abandon elements that are not interesting for the application. Even if we made an assumption that there is only one object moving at time, it’s almost impossible to keep your head not moving at all. It introduces some problems as it has similar color histogram as hands and may confusing the tracing algorithm. Moreover using complex head/face detection approaches using boosting approaches didn’t make sense as they are computationally expensive and I simply didn’t need them. Nevertheless the head was not the only problem. There could be some other object with matching color histogram in the area. Luckily the solution described below solved the issue.
What was done in this case, based on the observation that heads moves way slower than hands (in the best case doesn’t move at all). Thus the consecutive frames are compared and pixels that don’t change much are abandoned. The comparison is done by substracting pixels coming from different frames and comparing them to the threshold defined on the base of experiments.
The tracing algorithm has to met some requirements that are coming directly from the nature of the moves: they don’t have to be continues and linear. This makes the tracing problem a little bit more complicated as in case of instant change of the direction, the tracing algorithm has to notice it quickly and not to loose the followed object. Another thing is that the tracing object might be occluded by some other objects. In our case this problem occurs when we move hand above head. We have eliminated this problem partially in the previous step, but it was not sufficient – we managed not to trace the head, but there were difficulties of losing the hand when it appeared over the head.
Picture 2. Click to see the movie presenting tracing.
After the SLR (Systematic Literature Review) as the first part of the thesis, I decided to apply two strategies described in the literature: Meanshift algorithm for tracing and Kalman Filter for predicting the position of the hand in the next frame.
Meanshift algorithm has been applied since 1998 in tracking problems successfully. The formal basis of the method was described by Fukunaga and Comaniciu at . However the main idea might be described as follows:
1. Choose the window
a) set the initial position of the window
b) choose the density function type
c) choose the shape of the window
d) choose the size of the window.
2. Count the center of mass.
3. Move the window to the center of mass.
4. Move back to point 2.
As we can expect I associated the back-propagation method with the Meanshift algorithm – the window is moved to the point that is the center of the window, that suits the color histogram the most.
Kalman Filter was firstly described in 1960. The precise explanation of the method might be found in Welsh and Bishop  paper. The point of the approach is to keep the information about previous measures and maximize the a-posteriori probability of the position’s prediction. But instead of keeping a long list of previous measures we are creating a model that is iteratively updated.
Picture 3. The shape recognition step. Click to watch the movie.
The representation of a gesture is a sequence of directions and shapes. A direction is represented using the idea taken from the compass rose with the eight principal winds, except the winds are vectors outgoing from the center. Every direction has assigned a number from 1 to 8 and describes the direction of the last move.
The shape is one of the predefined shapes of the hand. The system gathers the contours of the hand and uses a SVM (Support Vector Machine) classifier to obtain related label. The contours were represented using Freeman Chain Codes.
Finally a gesture could look like a sequence of pairs [1,’A’], [1, ‘B’] , [2, ‘B’], [, ‘STOP’]. Notice that when the ‘STOP’ shape is met, the direction doesn’t matter.
Hidden Markov Model has been applied for gesture recognition. It was trained using some set of predefined actions that were then translated to a language word. Every sequence of movements was ending with a special move meaning “end of gesture”. It simplified the processing a lot, but it is not a very good practice and “don’t try it at home” 😉 The results might be seen on the last video.
The training was split into two parts. Firstly I was learning SVM just for the shape recognition. When it was done, I approached the second step of learning the HMMs using outputs from SVMto describe the shape.
Even if the system I implemented was very limited, it proved the concepts of the features choice and the tracing algorithms as strong enough for a human-computer interaction. Nevertheless the segmentation method has to be improved as the current one is a home-made solution and the assumptions it takes are too strong. Secondly the language processing shouldn’t require the “stop word” and should be done fully dynamically.
. K. Fukunaga, “Introduction to Statistical Pattern Recognition”, Boston, Academic Press, 1990
. G. Welsh, G. Bishop “An introduction to Kalman filter” (Technical Report TR95-041), University of North Carolina, Chapel Hill, NC, 1995
. D. Comaniciu and P. Meer, “Mean shift analysis and applications”, IEEE Internation Conference on Computer Vision (vol 2, p. 1197), 1999.