...

Visual Odometry Using the KITTI Vision Dataset

Python | Visual Odometry | Computer Vision | Feature Detection | Autonomous Navigation

Project Overview

In this project, I built a stereo visual odometry system for the KITTI dataset, implemented in Python, that processes synchronized stereo image pairs to estimate a vehicle's trajectory. The main goal was to understand and apply the principles of visual odometry while keeping the system robust and accurate enough for real-world driving sequences.

Raw input image from one of the cameras

Data Loading

The visual odometry pipeline begins by loading data from the KITTI dataset through the Dataset_Handler class, which organizes the sequence data and prepares it for further processing (a minimal sketch of such a loader follows the list). Three key components are accessed and used:

  • Calibration Parameters: These include the projection matrices for both the left and right cameras (P0, P1, P2, P3). These matrices are essential for converting 3D world coordinates into 2D image coordinates, which is fundamental to the stereo vision system.
  • Ground Truth Poses: The true vehicle poses for each frame, which are later used to evaluate the accuracy of the visual odometry algorithm.
  • Image Sequences: Synchronized image sequences from the left and right cameras are loaded. These sequences are the primary input for the subsequent stages of the visual odometry pipeline.
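
As a point of reference, here is a minimal sketch of what such a handler might look like. It assumes the standard KITTI odometry directory layout (a calib.txt file, a poses file, and image_0/image_1 folders); the class and attribute names here are illustrative rather than the exact ones used in the project.

```python
import os
import cv2
import numpy as np

class DatasetHandler:
    """Illustrative KITTI odometry loader (names and layout assumed)."""

    def __init__(self, sequence_dir, poses_file):
        # calib.txt stores each projection matrix as "P0: <12 values>".
        with open(os.path.join(sequence_dir, 'calib.txt')) as f:
            for line in f:
                name, *values = line.split()
                if name.startswith('P'):
                    matrix = np.array(values, dtype=np.float64).reshape(3, 4)
                    setattr(self, name.rstrip(':'), matrix)  # self.P0 .. self.P3

        # Ground-truth poses: one flattened 3x4 matrix per line.
        with open(poses_file) as f:
            self.gt_poses = [np.array(l.split(), dtype=np.float64).reshape(3, 4)
                             for l in f]

        # Sorted, synchronized grayscale image paths for both cameras.
        self.left_paths = sorted(os.path.join(sequence_dir, 'image_0', n)
                                 for n in os.listdir(os.path.join(sequence_dir, 'image_0')))
        self.right_paths = sorted(os.path.join(sequence_dir, 'image_1', n)
                                  for n in os.listdir(os.path.join(sequence_dir, 'image_1')))

    def stereo_pair(self, i):
        """Return the i-th left/right image pair as grayscale arrays."""
        return (cv2.imread(self.left_paths[i], cv2.IMREAD_GRAYSCALE),
                cv2.imread(self.right_paths[i], cv2.IMREAD_GRAYSCALE))
```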

Stereo Depth Estimation

The next step is to estimate the depth of the scene captured by the stereo cameras. This is done by computing a disparity map between the left and right images. The disparity map records the horizontal shift of each pixel between the two views, and this shift is inversely proportional to depth:

  • Disparity Map Computation: The project supports both Stereo Block Matching (StereoBM) and Semi-Global Block Matching (StereoSGBM) algorithms for calculating the disparity map. StereoBM is a more traditional approach, while StereoSGBM provides improved accuracy by considering pixel neighborhoods.
  • Depth Map Calculation: Once the disparity map is generated, it is converted into a depth map using the formula:

$$\text{Depth} = \frac{f \cdot B}{\text{Disparity}}$$

where \(f\) is the focal length and \(B\) is the baseline distance between the two cameras. The depth map is a crucial intermediate result that provides 3D information about the environment, which is essential for accurate motion estimation.
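
Below is a minimal sketch of this step using OpenCV's StereoSGBM. The matcher parameters are generic starting values rather than the tuned settings used in the project, and the baseline is recovered from the right camera's projection matrix.

```python
import cv2
import numpy as np

def compute_depth(left, right, P_left, P_right):
    """Disparity via StereoSGBM, then depth = f * B / disparity."""
    # numDisparities must be a multiple of 16; these are generic
    # starting values, not tuned settings.
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=96,
                                    blockSize=11,
                                    P1=8 * 11 ** 2,
                                    P2=32 * 11 ** 2)
    # StereoSGBM returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(left, right).astype(np.float32) / 16.0

    f = P_left[0, 0]                        # focal length in pixels
    B = abs(P_right[0, 3] / P_right[0, 0])  # baseline from the right projection matrix

    disparity[disparity <= 0] = 0.1         # avoid division by zero on invalid pixels
    return f * B / disparity
```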

Feature Handling

Feature handling is a critical component of visual odometry: it identifies and matches key points between consecutive frames (a short sketch of this stage follows the list):

  • Feature Extraction: The project utilizes the SIFT (Scale-Invariant Feature Transform) algorithm to detect features in the left image of each frame. SIFT is known for its robustness in detecting distinctive points in images that remain consistent across different scales and rotations.
  • Feature Matching: After extracting the features, they are matched across consecutive frames using the Brute Force matcher. The matcher pairs features based on their descriptors, with a focus on minimizing the distance between matched pairs.
  • Filtering Matches: To improve the reliability of the matches, a distance ratio test is applied. This test filters out weak matches, retaining only those that have a clear distinction between the best and second-best matches, thereby reducing the likelihood of incorrect correspondences.

Matching features between consecutive frames
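
A compact sketch of this stage using OpenCV's SIFT and brute-force matcher; the 0.75 ratio is the conventional value from Lowe's paper, and the function name is illustrative.

```python
import cv2

def match_features(img_prev, img_next, ratio=0.75):
    """SIFT detection plus brute-force matching with a distance ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_prev, None)
    kp2, des2 = sift.detectAndCompute(img_next, None)

    # k=2 returns the two nearest neighbours needed for the ratio test.
    bf = cv2.BFMatcher()
    candidates = bf.knnMatch(des1, des2, k=2)

    # Keep a match only when it is clearly better than the runner-up.
    good = [m for m, n in candidates if m.distance < ratio * n.distance]
    return kp1, kp2, good
```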

Motion Estimation

Once the features are matched, the relative motion between consecutive frames is estimated (a sketch of this step follows the list):

  • PnP with RANSAC: The Perspective-n-Point (PnP) algorithm is used to estimate the camera pose (rotation and translation) between frames. PnP requires the 2D-3D correspondences provided by the matched features and the depth information from the stereo cameras.
  • Robust Estimation: To ensure that the motion estimation is robust against outliers, RANSAC (Random Sample Consensus) is employed. RANSAC iteratively selects random subsets of the data to fit the model, ensuring that the final pose estimation is not skewed by incorrect matches.
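
A hedged sketch of this step: the helper below back-projects matched keypoints from the previous frame into 3D using the depth map and the intrinsic matrix K, then calls cv2.solvePnPRansac. The function name, the max_depth cutoff, and the pixel rounding are illustrative choices, not necessarily those used in the project.

```python
import cv2
import numpy as np

def estimate_motion(matches, kp_prev, kp_next, K, depth_prev, max_depth=300.0):
    """Frame-to-frame pose from 2D-3D correspondences via PnP + RANSAC."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    object_points, image_points = [], []
    for m in matches:
        u, v = kp_prev[m.queryIdx].pt
        z = depth_prev[int(round(v)), int(round(u))]
        if z <= 0 or z > max_depth:      # discard invalid or unreliable depths
            continue
        # Back-project the previous-frame pixel into 3D camera coordinates.
        object_points.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
        image_points.append(kp_next[m.trainIdx].pt)

    object_points = np.array(object_points, dtype=np.float32)
    image_points = np.array(image_points, dtype=np.float32)

    # RANSAC rejects outlier correspondences while fitting the pose.
    _, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                                K, None)
    R, _ = cv2.Rodrigues(rvec)           # rotation vector -> 3x3 matrix
    return R, tvec
```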

Trajectory Reconstruction & Visualization

With the motion between frames estimated, the full trajectory of the vehicle is reconstructed (see the sketch after the list):

  • Frame-to-Frame Transformation: The relative transformations (rotation and translation) between each pair of consecutive frames are accumulated. This accumulation process constructs the camera’s trajectory through space, providing a complete path from the start to the end of the sequence.
  • Full Trajectory Reconstruction: The final trajectory is a composite of all the individual transformations, representing the vehicle’s movement over time. This trajectory can be compared against the ground truth poses to evaluate the accuracy of the visual odometry system.
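
A minimal sketch of the accumulation, assuming each PnP result (R, t) maps points from the previous camera frame into the current one, so the global pose advances by the inverse of that transform:

```python
import numpy as np

def accumulate_trajectory(relative_motions):
    """Chain frame-to-frame (R, t) estimates into a global trajectory."""
    pose = np.eye(4)                     # start at the origin
    trajectory = [pose[:3, 3].copy()]
    for R, t in relative_motions:
        T = np.eye(4)                    # build the homogeneous transform
        T[:3, :3] = R
        T[:3, 3] = t.ravel()
        pose = pose @ np.linalg.inv(T)   # accumulate the inverse motion
        trajectory.append(pose[:3, 3].copy())
    return np.array(trajectory)
```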

Visualization plays a vital role in understanding and evaluating the performance of the visual odometry system (a plotting sketch follows the list):

  • 2D Trajectory Plot: The estimated trajectory is plotted against the ground truth in a 2D space. This visualization helps in assessing how closely the estimated path follows the actual path.
  • Intermediate Result Visualization: The project also includes the ability to visualize intermediate results such as the disparity map, depth map, and feature matches. These visualizations provide insight into the intermediate steps of the pipeline, allowing for a better understanding of the process and identification of potential areas for improvement.
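
A minimal matplotlib sketch of the 2D comparison, assuming KITTI's camera convention (x right, y down, z forward), so the x-z plane gives the bird's-eye view of the route:

```python
import matplotlib.pyplot as plt

def plot_trajectory(estimated, gt_poses):
    """Plot the estimated path against ground truth in the x-z plane."""
    gt_xz = [(p[0, 3], p[2, 3]) for p in gt_poses]   # translation columns
    plt.plot(*zip(*gt_xz), label='Ground truth')
    plt.plot(estimated[:, 0], estimated[:, 2], label='Estimated')
    plt.xlabel('x [m]')
    plt.ylabel('z [m]')
    plt.axis('equal')                                # keep the path undistorted
    plt.legend()
    plt.show()
```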

2D path reconstructions from stereo image data of two KITTI dataset sequences


In the trajectory reconstruction process, it’s common to encounter some inaccuracies, particularly in the form of drift over time. This drift occurs because small errors in estimating the camera’s motion accumulate with each frame. Factors contributing to this include imperfect feature matching, noise in the stereo depth estimation, and limitations in the PnP and RANSAC algorithms, especially when handling challenging scenes with repetitive textures or low contrast. Since the system relies on frame-to-frame transformations, any minor errors can propagate through the trajectory, leading to deviations from the true path. This cumulative effect is a key reason why the position estimate may drift, causing the reconstructed trajectory to gradually diverge from the actual movement.


Future Improvements

While the current implementation provides a solid foundation for visual odometry, several areas offer opportunities for enhancement:

  • Real-Time Performance Optimization: To make the system viable for real-time applications, further optimization is necessary. This could involve parallelizing parts of the pipeline or implementing more efficient algorithms for feature detection and matching.
  • Integration with SLAM: By integrating the visual odometry system with mapping algorithms, the project could evolve into a full Simultaneous Localization and Mapping (SLAM) solution. This would enable the creation of detailed maps of the environment while simultaneously tracking the vehicle’s position within it.

Acknowledgments

  • KITTI Dataset for providing the stereo image sequences and ground truth
  • OpenCV community for computer vision tools and algorithms
  • This project was inspired by the tutorial series: Visual Odometry for Beginners
  • The accompanying GitHub repository: KITTI_visual_odometry by FoamoftheSea