From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection

National University of Singapore
RA-L 2026

Our approach goes beyond merely reaching a goal: it respects social etiquette and minimizes disruption and risk during navigation.

Abstract

Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning.

This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller.

This task-specific VLM distills social reasoning from large foundation models into a smaller, more efficient model, allowing the framework to adapt in real time across diverse human–robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance, with the lowest personal space violation duration, the shortest pedestrian-facing time, and no social zone intrusions.

System

The social robot navigation task can be understood conceptually as a multi-objective optimization problem, where one set of costs captures path specifications (e.g., time to goal) and environment constraints (e.g., untraversable obstacles), and another encodes social considerations. We assume the overall objective is decomposable into geometric feasibility and in-context semantics, with the semantic optima forming a well-covered subset of the geometrically optimal set.
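One way to write this decomposition down is sketched below. The notation is assumed for illustration and is not taken from the paper: \tau is a path, c_geo and c_soc are the geometric and social costs, \epsilon is a tolerance on geometric optimality, s is the scene context, and \hat{T}_geo is the finite set of candidates sampled from T_geo.

```latex
\begin{aligned}
\mathcal{T}_{\mathrm{geo}} &= \bigl\{\, \tau \;:\; \tau \text{ is collision-free},\;
  c_{\mathrm{geo}}(\tau) \le \min_{\tau'} c_{\mathrm{geo}}(\tau') + \epsilon \,\bigr\}, \\
\tau^{\star} &= \arg\min_{\tau \,\in\, \hat{\mathcal{T}}_{\mathrm{geo}}} \; c_{\mathrm{soc}}(\tau \mid s),
  \qquad \hat{\mathcal{T}}_{\mathrm{geo}} \subset \mathcal{T}_{\mathrm{geo}} .
\end{aligned}
```

Under this reading, the "well-covered" assumption says that the sampled set \hat{T}_geo is likely to contain a path whose social cost is close to the best achievable in T_geo, so selecting among candidates loses little compared with optimizing both objectives jointly.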

Core idea: Generate path candidates that satisfy geometric constraints from sensor data, then use a fine-tuned VLM to select the socially compliant path.
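The two-stage structure can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: candidates are straight lines bowed sideways by a lateral offset, feasibility is a simple clearance threshold, and `social_score` is a hypothetical callable standing in for the fine-tuned VLM.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Path = List[Point]


def generate_candidates(start: Point, goal: Point,
                        offsets: Sequence[float], n: int = 21) -> List[Path]:
    """One candidate per lateral offset: a straight line bowed by a sine bump."""
    dx, dy = goal[0] - start[0], goal[1] - start[1]
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length  # unit normal to the start->goal line
    paths = []
    for off in offsets:
        path = []
        for i in range(n):
            t = i / (n - 1)
            bump = off * math.sin(math.pi * t)  # zero at both endpoints
            path.append((start[0] + t * dx + bump * nx,
                         start[1] + t * dy + bump * ny))
        paths.append(path)
    return paths


def min_clearance(path: Path, obstacles: Sequence[Point]) -> float:
    """Smallest distance between any waypoint and any obstacle point."""
    if not obstacles:
        return math.inf
    return min(math.dist(p, o) for p in path for o in obstacles)


def select_path(candidates: Sequence[Path], obstacles: Sequence[Point],
                social_score: Callable[[Path], float],
                min_safe: float = 0.5) -> Optional[Path]:
    """Stage 1: keep geometrically feasible paths. Stage 2: pick the best-scored one."""
    feasible = [p for p in candidates if min_clearance(p, obstacles) >= min_safe]
    if not feasible:
        return None
    return max(feasible, key=social_score)


# Two people facing each other; the straight path is collision-free
# but cuts through the space between them.
people = [(5.0, -1.0), (5.0, 1.0)]
group_center = (5.0, 0.0)
candidates = generate_candidates((0.0, 0.0), (10.0, 0.0), offsets=[0.0, 2.0, -2.0])

# Hypothetical stand-in for the VLM: prefer paths that stay away from the group center.
chosen = select_path(candidates, people,
                     lambda p: min(math.dist(q, group_center) for q in p))
```

In this toy scene all three candidates pass the clearance check, so the choice is made entirely by the scoring stage, which rejects the straight path in favor of a detour around the pair.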

Results

Without social compliance, the goal is set directly ahead of the start position (red map pin), producing the shortest collision-free path. However, such paths violate social norms. Our method performs well across different scenarios, enabling collision-free navigation while respecting social zones and minimizing personal space violations.

We compare social compliance against five representative baselines—including group-based, RL-based, and VLM-based methods—across four scenarios. Our method consistently achieves the lowest proportion of socially non-compliant behavior relative to navigation time on all metrics: Personal Space Violation duration (PSV), Time Facing Pedestrians (TFP), Social-zone Interruption Time (SIT), and Maximum Social-zone Interruption Ratio (Max. SIR).
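To make the "proportion relative to navigation time" idea concrete, here is a minimal sketch of how a PSV-style ratio could be computed from logged trajectories. The exact metric definitions are not given on this page, so the details are assumptions: trajectories are sampled at a shared fixed timestep, personal space is a circle, and the 1.2 m default radius is an illustrative value rather than the paper's.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def psv_ratio(robot_traj: Sequence[Point],
              pedestrian_trajs: Sequence[Sequence[Point]],
              radius: float = 1.2) -> float:
    """Fraction of timesteps in which the robot is inside any pedestrian's
    personal space (assumed circular, with an assumed radius in meters)."""
    if not robot_traj:
        return 0.0
    violations = 0
    for t, robot_pos in enumerate(robot_traj):
        # A timestep counts once, no matter how many pedestrians are too close.
        if any(math.dist(robot_pos, ped[t]) < radius for ped in pedestrian_trajs):
            violations += 1
    return violations / len(robot_traj)


# Robot drives along y = 0 past a pedestrian standing at (2.0, 0.5).
robot = [(float(x), 0.0) for x in range(5)]
pedestrian = [(2.0, 0.5)] * len(robot)
ratio = psv_ratio(robot, [pedestrian])
```

With a fixed timestep, the fraction of violating samples equals the fraction of navigation time spent in violation, which is why the function divides by the trajectory length instead of tracking timestamps.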

Robustness

Across multiple trials and varying goal positions (yellow triangle), the system maintains socially compliant navigation and exhibits multi-modal detouring behaviors around human groups (e.g., detouring from either side).
