Active Perception and Mapping for Open Vocabulary Object Goal Navigation

Master's thesis, University of Bonn, 2024

Household robots can improve quality of life by automating routine tasks, particularly finding user-specified items within the home. This capability is essential for robots performing tasks such as tidying up, cooking, or assisting individuals with limited mobility. This thesis addresses the challenging task of finding user-specified objects in indoor environments.

In this task, a robot must locate a target object based on a natural language description provided by the user in an initially unknown environment. The task presents two main challenges. First, the robot needs reliable detection across the wide range of items a user may request, a perception challenge. Second, it must locate these items within cluttered, varied room layouts, an exploration challenge.

Recent research suggests that semantically guided, goal-directed exploration improves the efficiency of locating a target object by steering the robot toward target-relevant areas. State-of-the-art methods provide this guidance with vision language models (VLMs), comparing observed images of regions against the linguistic description of the target object to estimate their semantic similarity. However, these approaches often overlook the inherent ambiguity of natural language descriptions, which introduces uncertainty into VLM-based predictions. Ignoring this uncertainty can lead to overconfidence, misdirected exploration, and reduced success in locating the target object. Additionally, VLM-guided exploration is often myopic: at each step, the robot moves toward the most semantically similar region without explicitly accounting for future observations. While effective in many cases, such greedy strategies can fall short in complex environments where the best immediate step does not lead toward the target.

We present a novel semantic uncertainty-informed active perception framework to address these challenges. Our framework integrates perception, mapping, and planning for effective object search in household environments. We leverage VLMs for perception, enabling the robot to identify arbitrary objects in the environment from natural language descriptions. Recognizing the uncertainty that linguistic ambiguity introduces into VLM-based perception, we quantify it by generating a range of linguistic descriptions that convey the same semantic context but capture diverse interpretations (see the first sketch below). Using this uncertainty, we construct a probabilistic metric-semantic map that guides exploration based on the estimated semantic similarity between the target object and the various regions of the environment.

Our contributions are threefold. First, we propose a method to quantify the uncertainty in semantic similarity derived from VLM-based perception. Second, we develop a probabilistic map that captures this uncertainty. Third, to evaluate the effectiveness of our framework in finding objects, we develop both myopic and non-myopic planners that use this map for exploration. Including both approaches allows us to assess how each strategy performs under uncertainty, particularly in complex environments where exploration demands a balance of immediate and future-oriented decisions. Our planners employ an information-theoretic reward function that balances exploiting regions with high expected semantic similarity against exploring regions with high uncertainty (see the second sketch below).

Experimental evaluations demonstrate that our approach achieves success rates comparable to, or marginally below, state-of-the-art approaches on this task while performing uncertainty-informed exploration. Finally, we open-source our code for use by the community.
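To make the paraphrase-based uncertainty quantification concrete, the following is a minimal sketch, not the thesis's implementation: it assumes a CLIP-style VLM (here the Hugging Face `openai/clip-vit-base-patch32` checkpoint), a hand-written list of paraphrases, and a hypothetical image file `observation.png`; the thesis may generate paraphrases differently (e.g., with a language model) and may use another VLM. The idea is to score one observation against several descriptions of the same target and treat the spread of similarities as the uncertainty due to linguistic ambiguity.

```python
# Sketch: paraphrase-ensemble uncertainty for VLM similarity.
# Assumptions (not from the thesis): CLIP as the VLM, hand-written
# paraphrases, and a placeholder image path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def similarity_stats(image: Image.Image, paraphrases: list[str]) -> tuple[float, float]:
    """Score one observation against several descriptions of the same target.

    Returns the mean similarity (semantic estimate) and its standard
    deviation (uncertainty arising from linguistic ambiguity).
    """
    inputs = processor(text=paraphrases, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_paraphrases)
    sims = logits.squeeze(0)
    return sims.mean().item(), sims.std().item()

# Hypothetical paraphrases conveying the same semantic context.
descriptions = [
    "a coffee mug on a kitchen counter",
    "a ceramic cup used for hot drinks",
    "a mug someone drinks coffee from",
]
mean_sim, sim_std = similarity_stats(Image.open("observation.png"), descriptions)
```

In a pipeline of this kind, the mean and standard deviation per observed region would then populate the probabilistic metric-semantic map.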
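The abstract does not spell out the reward function, so the sketch below is an illustrative assumption rather than the thesis's exact formulation: a common way to realize "balance exploitation of high expected similarity with exploration of high uncertainty" is an upper-confidence-style score over candidate regions, here using the differential entropy of a Gaussian as the exploration bonus. The `Region` structure, the weight `beta`, and the example values are all hypothetical.

```python
# Illustrative sketch: choosing the next region to explore from a
# probabilistic metric-semantic map. Each region stores the mean and
# standard deviation of VLM similarity estimated from paraphrases.
# The reward form (mean + beta * Gaussian entropy) is an assumption;
# the thesis's information-theoretic formulation may differ.
import math
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    mean_sim: float  # expected semantic similarity to the target
    std_sim: float   # uncertainty from linguistic ambiguity

def gaussian_entropy(std: float) -> float:
    """Differential entropy of a 1-D Gaussian with the given std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

def reward(region: Region, beta: float = 0.1) -> float:
    """Trade off exploitation (high mean) against exploration (high entropy)."""
    return region.mean_sim + beta * gaussian_entropy(region.std_sim)

# Hypothetical map regions with example statistics.
regions = [
    Region("kitchen counter", mean_sim=0.72, std_sim=0.05),
    Region("hallway", mean_sim=0.31, std_sim=0.20),
    Region("living room shelf", mean_sim=0.55, std_sim=0.12),
]

# A myopic planner greedily picks the highest-reward region each step;
# a non-myopic planner would instead score multi-step sequences of regions.
next_goal = max(regions, key=reward)
```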

Recommended citation: Bajpai, Utkarsh. (2024). "Active Perception and Mapping for Open Vocabulary Object Goal Navigation." Master's thesis, University of Bonn.