KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images

Abstract

The six-degree-of-freedom (6-DoF) pose and metric size estimation of multiple objects from RGB images alone remains challenging, particularly due to large variations in object shape and appearance and frequent occlusions in complex scenes. To address these challenges, we introduce KMOPS, a Keypoint-driven method tailored for Multi-Object Pose and metric Size estimation from a single calibrated stereo image pair. Leveraging the stereo input, our approach first extracts the 2D keypoints of each object's enclosing bounding box in both views and then triangulates them to obtain metric 3D positions. We then recover each object's rotation, translation, and dimensions by aligning the triangulated 3D keypoints to their canonical counterparts with a closed-form solution. Our formulation eliminates the need for predefined 3D search spaces or volumetric anchors, which other methods often require to constrain the vast 3D solution space. Through extensive experiments on the challenging Transparent Object Dataset (TOD) and StereOBJ-1M benchmarks, we show that our method outperforms all competing methods with a simple and effective architecture.
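The triangulation step described above can be sketched as follows. For a rectified stereo pair, a matched keypoint's disparity gives its depth via Z = f·B/d, which is then back-projected with the left camera intrinsics. This is a minimal illustration of the standard geometry, not the paper's actual code; the function name and parameters are hypothetical.

```python
import numpy as np

def lift_keypoints(kps_left, disparity, fx, fy, cx, cy, baseline):
    """Lift 2D keypoints with per-keypoint disparities to metric 3D.

    Assumes a rectified stereo pair: depth Z = fx * baseline / disparity,
    then back-projection with the left camera intrinsics.

    kps_left:  (K, 2) pixel coordinates (u, v) in the left image.
    disparity: (K,) positive disparities in pixels.
    Returns:   (K, 3) metric 3D points in the left camera frame.
    """
    u, v = kps_left[:, 0], kps_left[:, 1]
    z = fx * baseline / disparity          # depth from disparity
    x = (u - cx) * z / fx                  # back-project along x
    y = (v - cy) * z / fy                  # back-project along y
    return np.stack([x, y, z], axis=1)
```

For example, with fx = 500 px and a 0.1 m baseline, a keypoint at the principal point with a 50 px disparity lifts to a depth of 1 m on the optical axis.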

Overview of KMOPS

Figure: Overview of the KMOPS framework.

Overview of our model. We first extract features from both stereo views with a shared RT-DETR encoder and fuse them using an MLP fusion block. From the fused features we build N × (K+1) queries, composed of N object tokens and N × K keypoint tokens. These queries are iteratively refined by three Transformer decoder layers and then passed to prediction heads that output, for each object, its category scores, K averaged 2D keypoint positions in image coordinates, and per-keypoint disparities and visibility scores. Using stereo geometry, we lift the predictions to metric 3D keypoints and fit them to canonical keypoints to recover each object's pose and size.
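The closed-form fit of lifted 3D keypoints to canonical keypoints can be illustrated with a standard Umeyama similarity alignment, which recovers a uniform scale, rotation, and translation in one shot. This is a generic sketch of that technique; the paper's exact formulation (e.g. how per-axis dimensions are obtained) may differ.

```python
import numpy as np

def umeyama_align(src, dst):
    """Closed-form similarity transform (Umeyama, 1991) mapping src -> dst.

    src: (K, 3) canonical keypoints; dst: (K, 3) triangulated 3D keypoints.
    Returns scale s, rotation R (3x3), translation t (3,) such that
    dst ~= s * R @ src + t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    # Cross-covariance between centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Sign correction guards against a reflection solution.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```

Because the solution is closed-form (one SVD of a 3×3 matrix per object), no iterative optimization or 3D anchor search is needed at inference time.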

Comparison with SOTA

We compare with CubeRCNN, DetAny3D, CenterPose, and PETR on the validation set of the StereOBJ-1M dataset.

Figure: Quantitative comparison on the StereOBJ-1M validation set.

BibTeX

@InProceedings{Wu_2026_WACV,
    author    = {Wu, Ying-Kun and Shen, Yi and Huang, Tzuhsuan and Fang, I-Sheng and Chen, Jun-Cheng},
    title     = {KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {7730-7739}
}