TrackFlow - How Normalizing Flows can fuse multi-modal information for Multi-Object Tracking

Introduction

Researchers in computer vision keep exploring new ways to improve tracking accuracy and robustness. One of the traditional methodologies, known as “tracking-by-detection,” has recently seen renewed interest thanks to its simplicity and the strong priors it offers. The approach has proven effective while sparing researchers the complexities and challenges associated with “tracking-by-attention” techniques.

In our paper titled “TrackFlow: Multi-Object Tracking with Normalizing Flows,” we delve into the world of multi-object tracking with a particular focus on multi-modal scenarios. In these scenarios, trackers must make sense of heterogeneous information, including 2D motion cues, visual appearance, pose estimates, and even rough 3D information. The critical challenge is to compute a comprehensive cost that effectively merges these diverse data sources to make informed tracking decisions.

The Complexity of Multi-Modal Tracking

Multi-modal tracking presents a unique set of challenges. Unlike traditional tracking scenarios where a single source of information may suffice, multi-modal tracking demands a more sophisticated approach. This paper addresses a case study in which a rough estimate of 3D information is available alongside other traditional metrics like Intersection over Union (IoU). The goal is to merge these sources of information intelligently.

Challenges of Existing Approaches

Existing works [1, 2] rely on either simple rules or complex heuristics to balance the contribution of each cost in multi-modal tracking, typically as a weighted sum of two costs \(d^{(1)}\) and \(d^{(2)}\) with a mixing weight \(\lambda\):

\begin{equation} c_{i,j}=\lambda d^{(1)}(i,j)+(1-\lambda)d^{(2)}(i,j). \end{equation}

While these methods can yield promising results, they suffer from two significant limitations:

  1. Hyperparameter Tuning: They often require careful tuning of tailored hyperparameters on a hold-out dataset. This tuning process can be time-consuming and may not generalize well to different tracking scenarios.

  2. Assumption of Independence: These approaches assume that the costs derived from different sources are independent, which does not hold in practice. In complex multi-modal tracking scenarios, the relationships between data sources are often intricate and interdependent.
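
As a point of reference, the weighted-sum rule in the equation above can be sketched in a few lines of Python. The function names `fuse_costs` and `greedy_choice`, the toy cost matrices, and the tested values of `lam` are illustrative assumptions and are not taken from [1, 2].

```python
def fuse_costs(d1, d2, lam):
    """Blend two cost matrices: c[i][j] = lam * d1[i][j] + (1 - lam) * d2[i][j]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(d1, d2)
    ]


def greedy_choice(cost):
    """For each track (row), return the index of the cheapest detection (column)."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


# Toy example: 2 tracks x 2 detections, with made-up numbers.
iou_cost = [[0.1, 0.9], [0.8, 0.2]]    # e.g. 1 - IoU
depth_cost = [[0.7, 0.3], [0.4, 0.6]]  # e.g. a normalized depth gap

# Sweep the mixing weight and record which detection each track prefers.
choices = {
    lam: greedy_choice(fuse_costs(iou_cost, depth_cost, lam))
    for lam in (0.0, 0.5, 1.0)
}
for lam, picks in choices.items():
    print(f"lambda={lam}: preferred detection per track = {picks}")
```

With these toy numbers, the preferred detection for both tracks flips between `lam = 0.0` and `lam = 0.5`, which illustrates why the weight has to be tuned for each scenario.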

A Probabilistic Approach to Multi-Modal Tracking

To address these challenges, we propose a novel probabilistic framework that treats the cost of a candidate association as the negative log-likelihood generated by a deep density estimator. This estimator is trained to model the conditional joint probability distribution of correct associations.

This approach leverages the power of probabilistic learning to capture the intricate relationships between diverse sources of information, allowing for a more nuanced and context-aware assessment of association costs. By using negative log-likelihood as a metric, the method avoids the need for explicit hyperparameter tuning and is more adaptive to various tracking scenarios.

Formulation. Given a track \(T\), a candidate detection \(D\), and the resulting displacements \(\Delta_{p}\), \(\Delta_{w,h}\), and \(\Delta_{d}\), we define the fusing cost \(\Phi(T, D)\) as the negative log-likelihood: \(\Phi(T, D) = -\log \mathcal{P}_{\theta}(D \in T \mid T).\)

We apply Maximum Likelihood Estimation (MLE) and learn a deep generative model \(f([\Delta_{p}, \Delta_{w,h}, \Delta_{d}] \mid T, \theta)\) whose parameters \(\theta\) are trained to maximize the likelihood of correct associations.
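
Written out, the maximum-likelihood objective takes the usual form shown below. The set \(\mathcal{A}\) of correct track-detection pairs used for training and the shorthand \(\Delta\) are notation introduced here for illustration only.

\begin{equation} \theta^{*} = \arg\min_{\theta} \; -\sum_{(T, D) \in \mathcal{A}} \log f(\Delta \mid T, \theta), \qquad \Delta = [\Delta_{p}, \Delta_{w,h}, \Delta_{d}]. \end{equation}

Minimizing this sum is equivalent to maximizing the likelihood of the observed correct associations, so the trained model assigns them a low cost \(\Phi(T, D)\).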

The design of \(f(\cdot \mid T, \theta)\) derives from normalizing flow models, which create an invertible mapping between a tractable base distribution and an arbitrarily complex one (see the figure above).
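
To make the invertible-mapping idea concrete, here is a minimal sketch of a single conditional affine flow layer with a standard normal base distribution. It is a simplification for illustration, not the architecture used in the paper. The names `conditional_affine_nll`, `shift`, and `log_scale` are hypothetical, and the fixed `shift` and `log_scale` values stand in for the outputs of a network conditioned on the track \(T\).

```python
import math


def standard_normal_logpdf(z):
    """Log-density of a factorized standard normal base distribution."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)


def conditional_affine_nll(delta, shift, log_scale):
    """Negative log-likelihood of `delta` under a one-layer affine flow.

    Map from data to base space: z = (delta - shift) * exp(-log_scale).
    Change of variables: log p(delta) = log N(z; 0, I) - sum(log_scale).
    In a conditional flow, `shift` and `log_scale` would be predicted from T.
    """
    z = [(d - m) * math.exp(-s) for d, m, s in zip(delta, shift, log_scale)]
    log_det = -sum(log_scale)  # log |det dz / d(delta)|
    return -(standard_normal_logpdf(z) + log_det)


# Hypothetical conditioning output: displacements are expected near zero.
shift = [0.0, 0.0, 0.0]
log_scale = [0.0, 0.0, 0.0]

# Made-up displacement vectors [delta_p, delta_wh, delta_d], one value each.
cost_plausible = conditional_affine_nll([0.1, 0.0, 0.2], shift, log_scale)
cost_implausible = conditional_affine_nll([3.0, -2.0, 4.0], shift, log_scale)
print(cost_plausible < cost_implausible)
```

The returned value plays the role of the fusing cost \(\Phi(T, D)\): displacements that are likely under the learned density get a low cost. A practical flow would stack several such invertible layers with learned, track-dependent parameters.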

Experimental Validation

The effectiveness of this probabilistic approach was rigorously tested through experiments conducted on both simulated and real-world tracking benchmarks. The results consistently demonstrated significant improvements in the performance of several tracking-by-detection algorithms.

Conclusion

In the world of multi-object tracking, where data comes in various forms and sources, TrackFlow brings a refreshing perspective. By adopting a probabilistic framework that accounts for the interdependencies among different sources of information, we have showcased a promising way forward in improving tracking accuracy and robustness.

This research addresses the limitations of existing approaches and opens up new possibilities for developing more adaptive and versatile multi-modal tracking systems. As computer vision continues to play a crucial role in various applications, from autonomous vehicles to surveillance systems, the insights from this paper have the potential to significantly impact the field, paving the way for more reliable and efficient multi-object tracking solutions.

References

[1] N. Wojke et al. Simple online and realtime tracking with a deep association metric. In ICIP, 2017.

[2] J. Rajasegaran et al. Tracking people by predicting 3D appearance, location and pose. In CVPR, 2022.