Feature Article
Simultaneous Detection and Tracking: A Cognitively Inspired Approach
Air Force Research Laboratory
Hanscom Air Force Base
Efficient target tracking is one of the key capabilities necessary for the successful performance of most core military functions. Target tracking refers to the estimation of the speed and position of an object—such as a person, a vehicle or a ship—over time using one or several sensors. The ideal type of sensor depends on the environment, and it is often beneficial to use several sensor types to extract maximum information. This introduces an additional computational burden of integrating information from multiple and diverse sensors.
Additional problems involved in target tracking include the detection and classification or identification of potential targets, a task known as track initiation. Detection is complicated by the fact that objects are rarely encountered alone in the environment. They almost always appear in some context consisting of other possible targets and irrelevant objects, referred to as clutter, which can be confused with valid targets. Often the information contained in a single sensor scan is not sufficient to distinguish a target from clutter, which makes track initiation challenging. Multiscan detection can help resolve ambiguities, but at the cost of additional computation.
The main reason to be concerned about the available computational resources is the data association problem. Every piece of sensor data originates from one or sometimes several objects in the environment, and the number of possible mappings between the data and the objects grows exponentially with the number of objects and the size of the data. The more sensors used for tracking, the more data associations need to be made. In the case of the multiscan approach, the data collected during different time intervals have to be associated as well.
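To give a feel for this growth, the short sketch below counts the possible mappings under a simplifying assumption that is not stated in the article: each data point is assigned to exactly one model. The function name `count_associations` is hypothetical and used only for illustration.

```python
def count_associations(num_points: int, num_models: int) -> int:
    """Count the ways to assign each data point to exactly one model.

    Assumption for illustration: one model per data point, so the
    count is num_models raised to the power num_points.
    """
    return num_models ** num_points


# Three models (for example two tracks plus clutter) and a growing data set.
for num_points in (10, 20, 40):
    print(num_points, count_associations(num_points, 3))
```

Even in this simplified setting, doubling the amount of data squares the number of candidate associations, which is why exhaustive evaluation quickly becomes impractical.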
Humans are confronted with these challenging problems on a regular basis. People are able to separate objects of interest from clutter and track them over extended time periods, and they do it in real time. The exact cognitive mechanisms are being actively researched both theoretically and experimentally, and many theories of cognition have been developed over the decades.
A cognitively based framework, Neural Modeling Fields, developed at the Air Force Research Laboratory, addresses the computational complexity of data association. At the heart of the framework lies a cognitively inspired process of transitioning from vague to crisp associations, which is referred to as dynamic logic.
The essence of the dynamic logic approach is in combining existing knowledge about potential targets with an adaptive mechanism capable of finding a quick match between the models representing this knowledge and the data coming from the sensors. This idea is deeply rooted in psychology and neurophysiology.
The main components of the dynamic logic framework are the input data and the parametric models of various objects that are expected to be observed.
Suppose that there are N data points coming from all of the sensors and all of the time frames. Clutter and track models are defined to match the inputs, and a measure of similarity between each data point and each model is introduced. The models depend on parameters, such as initial position, velocity and acceleration, which are determined by the algorithm. The total similarity is given by Equation 1 (shown on Page 18), where N is the total number of data points, H is the total number of models, and l(n|h) is the similarity between data point n and model h.
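Equation 1 is not reproduced in this text. A form consistent with the definitions of N, H and l(n|h) given above, offered here as an assumed reconstruction rather than the author's exact expression, is a product over data points of a sum over models:

```latex
L = \prod_{n=1}^{N} \sum_{h=1}^{H} l(n \mid h),
\qquad
\ln L = \sum_{n=1}^{N} \ln \sum_{h=1}^{H} l(n \mid h)
```

The logarithmic form on the right is the one usually evaluated in practice, because a product of many small similarities underflows quickly in floating-point arithmetic.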
If the function l(n|h) is formulated in probabilistic terms, the total similarity can be interpreted as the total conditional likelihood of the data given the models, making the framework similar to the finite mixture models in statistics. The maximization of the total similarity with respect to the model parameters provides the best match between the data and the models.
Direct maximization still faces the computational complexity of data association. This can be overcome by imposing a gradual transition from vague coupling between the data and the models to more distinct coupling. The initial model parameters are set to values that give similarities, and consequently association weights, of roughly the same order of magnitude between all the models and all the data points. As the algorithm executes, the association weights become larger or smaller as the model parameters adjust to better match the data.
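The effect of vagueness on the association weights can be seen in a small numerical sketch. It assumes, for illustration only, one-dimensional Gaussian similarities that share a single width `sigma`, and association weights computed as similarities normalized across models. The helper names `gaussian` and `association_weights` are hypothetical.

```python
import math


def gaussian(x, mean, sigma):
    """One-dimensional Gaussian probability density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def association_weights(x, means, sigma):
    """Similarity of x to each model, normalized so the weights sum to one."""
    sims = [gaussian(x, m, sigma) for m in means]
    total = sum(sims)
    return [s / total for s in sims]


means = [0.0, 10.0]   # two hypothetical models
x = 2.0               # one data point, closer to the first model

vague = association_weights(x, means, sigma=50.0)  # wide models: vague coupling
crisp = association_weights(x, means, sigma=1.0)   # narrow models: crisp coupling
print("vague:", vague)
print("crisp:", crisp)
```

With a large width, both models claim the point almost equally, so no association is committed to early. With a small width, nearly all of the weight goes to the nearer model.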
The algorithm iteration number is denoted by I, the set of parameters of model h by S_h, and the association weight between model h and data element n by f(h|n). In Step 1 of the algorithm, the model parameters are initialized: S_h^I = S_h^0, for h = 1...H. Steps 2, 3 and 4 are repeated until the convergence criteria are satisfied.
In Step 2, the association weights between the data and the models are computed using Equation 2, for h = 1...H. In Step 3, the model parameters are estimated using Equation 3. Step 3 is a form of gradient ascent with all of the derivatives weighted by the association weights from Step 2. Finally, the model-specific vagueness parameters S_h^{DL,I+1} are computed in Step 4. These are the model parameters that prevent the algorithm from converging to local maxima by imposing the gradual vague-to-crisp transition.
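The four steps can be sketched on a one-dimensional toy problem. Equations 2 and 3 are not reproduced in this text, so everything below is an assumption made for illustration: track models are Gaussians that share one width `sigma` (standing in for the vagueness parameter), clutter is uniform on the interval [lo, hi], association weights are normalized similarities, the parameter update is a gradient step on the log-similarity with an adaptive step size, and the vagueness shrinks geometrically. The function name `dynamic_logic_1d` and all default values are hypothetical.

```python
import math
import random


def gaussian(x, mean, sigma):
    """One-dimensional Gaussian probability density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def dynamic_logic_1d(data, init_means, lo, hi, sigma0=20.0, sigma_min=1.0,
                     decay=0.9, step=1.0, max_iter=200, tol=1e-6):
    """Toy vague-to-crisp estimation of track means in uniform clutter."""
    means = list(init_means)      # Step 1: initialize the model parameters
    sigma = sigma0                # start with wide, vague models
    clutter = 1.0 / (hi - lo)     # uniform clutter similarity (assumed)
    for _ in range(max_iter):
        # Step 2: association weights f(h|n) as normalized similarities.
        weights = []
        for x in data:
            sims = [gaussian(x, m, sigma) for m in means] + [clutter]
            total = sum(sims)
            weights.append([s / total for s in sims])
        # Step 3: weighted gradient step on each mean. The gradient of the
        # Gaussian log-similarity is (x - m) / sigma**2; scaling the step by
        # sigma**2 / mass gives the update below (step=1 moves the mean to
        # the weighted average of the data).
        new_means = []
        for h, m in enumerate(means):
            mass = sum(w[h] for w in weights)
            pull = sum(w[h] * (x - m) for w, x in zip(weights, data))
            new_means.append(m + step * pull / mass if mass > 0 else m)
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        # Step 4: reduce the vagueness (assumed geometric schedule).
        at_floor = sigma <= sigma_min
        sigma = max(sigma_min, decay * sigma)
        if at_floor and shift < tol:
            break
    return means, sigma


# Usage: two clusters of track points plus uniformly scattered clutter.
random.seed(0)
data = ([random.gauss(20.0, 1.0) for _ in range(30)]
        + [random.gauss(80.0, 1.0) for _ in range(30)]
        + [random.uniform(0.0, 100.0) for _ in range(20)])
means, sigma = dynamic_logic_1d(data, [40.0, 60.0], 0.0, 100.0)
print("estimated means:", means, "final sigma:", sigma)
```

The design point to notice is that the weights are never forced to 0 or 1. Each model starts out weakly attached to every data point, and the attachment sharpens only as the models narrow.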
Application to Video Detection
Practical application of the dynamic logic algorithm consists of specifying the exact form of l(n|h) for each model type. In the case of optical sensors, the target model has to describe the shape, color and texture of the object as well as its motion. The clutter model needs to adequately describe the background.
The more prior knowledge about the expected targets the models contain, the better the algorithm performs. At the same time, the models must contain adaptable parameters that can be determined only by matching the models to the actual sensor inputs. This balance between prior knowledge and adaptive learning is the key to successful operation of the algorithm.
The image at top left shows an unprocessed video frame, which was taken at 0.07 seconds. The three-image sequence shows the first, second and fifth iteration of the algorithm as it processes this frame, and the final image shows the detection of two fast-moving fish in the aquarium.
The operation of the algorithm can be tested using video sequences. In the following example, the task is to detect all fast-moving fish in an aquarium. The video sequence consists of seven frames with a size of 320 by 240 pixels. The frame rate is 15 frames per second.
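The scale of this data set follows from simple arithmetic, sketched below. The only assumption added here is that the first frame is taken at time zero. Under that assumption the second frame falls at about 0.07 seconds.

```python
FRAME_RATE_HZ = 15          # frames per second, from the text
NUM_FRAMES = 7              # frames in the sequence, from the text
WIDTH, HEIGHT = 320, 240    # frame size in pixels, from the text

# Assumption: the first frame is captured at t = 0.
timestamps = [k / FRAME_RATE_HZ for k in range(NUM_FRAMES)]
pixels_per_frame = WIDTH * HEIGHT
total_pixels = pixels_per_frame * NUM_FRAMES
duration = timestamps[-1]

print("frame times (s):", [round(t, 2) for t in timestamps])
print("pixels per frame:", pixels_per_frame)
print("pixels in the whole sequence:", total_pixels)
print("time spanned (s):", round(duration, 2))
```

If every pixel of every frame is treated as a data point, the algorithm has more than half a million inputs to associate with the models over a span of less than half a second.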
The camera recorded grayscale images for these experiments. In addition to moving fish, the images also contain many plants and rocks that can be mistaken for a fish—especially when a single frame is considered. The camera is handheld, resulting in jitter and induced motion of the objects from frame to frame.
The target model is expressed as the product of three similarity functions for the object's shape, features and motion, as shown in Equation 4.
Each of the submodels is expressed using the Gaussian probability density function. The model parameters include the initial position, velocity, size and brightness of the target. The clutter is modeled by the uniform probability density function.
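A product-of-Gaussians target model of this kind might look like the sketch below. Equation 4 is not reproduced in this text, so the split between the three factors is an assumption: the motion term compares a blob center with a constant-velocity prediction, the shape term compares blob size with target size, and the feature term compares brightness. The classes `Blob` and `TrackModel`, the width parameters and all numerical values are hypothetical.

```python
import math
from dataclasses import dataclass


def gaussian(x, mean, sigma):
    """One-dimensional Gaussian probability density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


@dataclass
class Blob:
    x: float            # blob center, pixels
    y: float
    t: float            # frame time, seconds
    size: float         # blob scale, pixels
    brightness: float   # grayscale level


@dataclass
class TrackModel:
    x0: float           # initial position, pixels
    y0: float
    vx: float           # velocity, pixels per second
    vy: float
    size: float
    brightness: float
    sigma_pos: float = 5.0          # assumed submodel widths
    sigma_size: float = 3.0
    sigma_brightness: float = 20.0


def target_similarity(blob, model):
    """Product of shape, feature and motion similarities (all Gaussian)."""
    px = model.x0 + model.vx * blob.t   # constant-velocity prediction
    py = model.y0 + model.vy * blob.t
    motion = (gaussian(blob.x, px, model.sigma_pos)
              * gaussian(blob.y, py, model.sigma_pos))
    shape = gaussian(blob.size, model.size, model.sigma_size)
    features = gaussian(blob.brightness, model.brightness,
                        model.sigma_brightness)
    return shape * features * motion


def clutter_similarity(width, height, size_range, brightness_range):
    """Uniform density over the whole observation space."""
    return 1.0 / (width * height * size_range * brightness_range)


model = TrackModel(x0=100.0, y0=120.0, vx=150.0, vy=0.0,
                   size=10.0, brightness=200.0)
on_track = target_similarity(Blob(130.0, 120.0, 0.2, 10.0, 200.0), model)
off_track = target_similarity(Blob(30.0, 200.0, 0.2, 10.0, 200.0), model)
clutter = clutter_similarity(320, 240, 20.0, 256.0)
print(on_track, off_track, clutter)
```

A blob that lies on the predicted path scores well above the uniform clutter level, while a blob far from the path scores well below it, so the association weights would send the first to the track model and the second to clutter.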
The detection is split into four stages. During the first stage, the algorithm identifies blobs in each image frame using the scale-space approach. These blobs serve as input to the second stage, in which the algorithm identifies possible tracks formed by the blob centers. During the third stage, possible tracks are identified based on the complete image information, using the results of the second stage as initial conditions; this stage also includes pruning and merging of track models. During the fourth and final stage, the less likely tracks identified in the third stage are discarded using a threshold. The algorithm converges after only 10 iterations during the second and the third stages, successfully detecting the fast-moving targets. Targets moving along nonlinear trajectories can be detected by introducing new model types.
Since the time of Hermann von Helmholtz, it has been understood that the eye itself is not a very precise instrument. Helmholtz suggested that object recognition and tracking are performed using previous experience stored in the brain. This means that in addition to the bottom-up signals coming from the retina, the brain uses top-down signals from higher brain areas to guide the process of recognition.
In the case of the dynamic logic algorithm, the accumulation and improvement of prior knowledge translates into more precise parametric models. The strength of this approach is in the universal applicability of the core algorithm, regardless of the particular application. The implementation of the algorithm allows quick replacement of model types without changing other parts of the computer code, leading to rapid application development.
For a full list of references, please contact Roman Ilin at firstname.lastname@example.org.
Dr. Roman Ilin is a research scientist at the Air Force Research Laboratory, Hanscom Air Force Base. He received his doctorate in computer science from the University of Memphis in 2008. His current research interests include automatic situation assessment, target tracking and characterization, multisensor data fusion, optimal control, reinforcement learning and artificial neural networks.