CyclingVQA

From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?
[Preprint, 2026]

Munich, Bavaria, Germany (Webpage under construction)

🔥 Highlights

  1. Cyclist-Centric Dataset. We introduce a new dataset of 2,053 multiple-choice visual question–answer pairs derived from 705 real-world egocentric cyclist images, addressing the scarcity of ego-perspective urban cycling data (one plausible record layout is sketched after this list).

  2. Specialized Spatial Tasks. We define eight specific evaluation tasks that probe cyclist-centric spatial perception, traffic rule compliance, and navigation-relevant reasoning within complex urban environments, moving beyond general object detection.

  3. Benchmarking VLMs. We conduct a comprehensive evaluation of state-of-the-art Vision-Language Models (VLMs), including general-purpose, spatially enhanced, and Autonomous Driving (AD)-focused models, revealing significant room for improvement in cyclist-specific reasoning.

  4. Systematic Failure Analysis. We perform a granular analysis of model failure modes, categorizing recurring error types to provide a clear technical roadmap for the development of future cyclist-assistive intelligent systems.
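
For concreteness, the sketch below shows one plausible way a single CyclingVQA item could be represented. All field names and the example content are illustrative assumptions, not the released schema.

```python
# Hypothetical layout of one CyclingVQA item. Field names and example
# content are illustrative assumptions, not the released schema.
sample = {
    "image": "images/cyclist_ego_0001.jpg",  # one of the 705 egocentric frames
    "task": "TSR",                           # one of the eight benchmark tasks
    "question": "Does the sign ahead permit cycling in this lane?",
    "options": {"A": "Yes, cycling is permitted",
                "B": "No, the lane is for buses only",
                "C": "Only outside rush hours",
                "D": "Cannot be determined from the image"},
    "answer": "B",                           # ground-truth option letter
}
```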

Benchmark Tasks

Benchmark tasks. Illustration of the eight benchmark tasks in CyclingVQA, showing example question prompts together with visual inputs augmented by lane annotations and bounding-box supervision.
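
The caption above mentions lane annotations and bounding-box supervision overlaid on the visual inputs. Below is a minimal sketch of that kind of overlay, assuming pixel-space boxes and polylines; the benchmark's actual annotation format is not specified here.

```python
from PIL import Image, ImageDraw

def overlay_annotations(image_path, boxes, lane_polyline):
    """Overlay bounding boxes and a lane polyline on an egocentric frame.

    A minimal sketch of the visual augmentation described in the caption;
    the annotation format (pixel-space boxes and polylines) is an assumption.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for (x0, y0, x1, y1) in boxes:                       # e.g. signs, vehicles
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=3)
    draw.line(lane_polyline, fill=(0, 255, 0), width=4)  # lane boundary
    return img

# Hypothetical usage on one frame:
# overlay_annotations("frame_0001.jpg", [(320, 180, 400, 260)],
#                     [(100, 720), (360, 400), (380, 380)]).save("frame_vis.jpg")
```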

Full Leaderboard

| Model | Size | Type | Release | SU | TSG | TO | TSR+S | RED | TSR | LR | SAA | Avg | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | - | - | - | 49.5 | 16.3 | 51.3 | 12.9 | 42.4 | 13.4 | 36.0 | 51.7 | 34.2 | 31 |
| **Proprietary VLMs** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemini-2.5-Flash | N/A | Reason | 07/2025 | 77.5 | 98.1 | 56.5 | 82.2 | 93.7 | 83.5 | 72.6 | 86.3 | 81.3 | 1 |
| GPT-5.1 | N/A | Reason | 11/2025 | 63.7 | 89.8 | 59.1 | 82.8 | 93.2 | 85.4 | 60.4 | 83.7 | 77.3 | 2 |
| **Generalist VLMs** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen3-VL (8B) | 8B | Instruct | 11/2025 | 75.8 | 89.3 | 53.0 | 78.5 | 94.8 | 80.8 | 58.5 | 81.7 | 76.6 | 3 |
| Ovis2.5-9B | 9B | Instruct | 08/2025 | 70.9 | 97.3 | 53.0 | 81.5 | 93.2 | 72.0 | 63.4 | 79.5 | 76.4 | 4 |
| InternVL3.5-8B | 8B | Instruct | 08/2025 | 57.7 | 88.6 | 54.8 | 63.5 | 86.9 | 62.8 | 62.2 | 78.7 | 69.4 | 6 |
| Qwen3-VL (2B) | 2B | Instruct | 11/2025 | 51.1 | 97.3 | 53.9 | 74.0 | 83.8 | 73.9 | 37.2 | 76.0 | 68.4 | 8 |
| Ovis2.5-2B | 2B | Instruct | 08/2025 | 72.5 | 96.4 | 51.3 | 63.1 | 84.3 | 59.0 | 45.1 | 79.1 | 68.8 | 7 |
| Eagle2.5-8B | 8B | Instruct | 04/2025 | 53.8 | 87.8 | 53.0 | 50.4 | 81.7 | 39.8 | 52.4 | 82.1 | 62.7 | 12 |
| Phi-4 | 8B | Instruct | 02/2025 | 44.5 | 73.7 | 51.3 | 59.2 | 75.9 | 59.8 | 51.2 | 73.8 | 61.2 | 14 |
| InternVL3 | 8B | Instruct | 04/2025 | 51.1 | 87.8 | 49.6 | 49.6 | 81.7 | 48.3 | 41.5 | 73.4 | 60.4 | 15 |
| InternVL3.5-2B | 2B | Instruct | 08/2025 | 62.6 | 77.4 | 53.9 | 47.4 | 74.3 | 48.7 | 43.3 | 66.9 | 59.3 | 17 |
| Qwen2.5-VL | 7B | Instruct | 02/2025 | 52.7 | 81.3 | 53.0 | 44.0 | 70.2 | 42.9 | 47.6 | 75.7 | 58.4 | 18 |
| FoundationMotion | 7B | Instruct | 12/2025 | 49.5 | 82.5 | 53.0 | 43.3 | 68.6 | 39.5 | 59.1 | 68.8 | 58.0 | 19 |
| Molmo2-8B | 8B | Instruct | 12/2025 | 56.6 | 88.1 | 53.0 | 34.1 | 73.3 | 37.2 | 49.4 | 43.7 | 54.4 | 22 |
| LLaVA-OneVision | 7B | Instruct | 06/2024 | 54.4 | 65.5 | 50.4 | 37.6 | 65.4 | 34.1 | 37.2 | 71.1 | 52.0 | 24 |
| LLaVA-Next | 8B | Instruct | 04/2024 | 44.5 | 36.0 | 53.0 | 25.1 | 53.9 | 37.5 | 27.4 | 33.8 | 38.9 | 28 |
| LLaVA-1.6 | 7B | Instruct | 12/2023 | 47.8 | 23.1 | 53.0 | 26.2 | 47.6 | 32.6 | 15.2 | 62.7 | 38.5 | 29 |
| **Spatial-Aware VLMs** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| PerceptionLM (8B) | 8B | Instruct | 04/2025 | 79.1 | 95.1 | 50.4 | 68.5 | 85.3 | 78.5 | 67.1 | 58.2 | 72.8 | 5 |
| SpatialThinker | 7B | Reason | 11/2025 | 58.8 | 94.9 | 52.2 | 57.5 | 86.9 | 47.9 | 43.3 | 71.9 | 64.2 | 11 |
| SenseNova | 8B | Instruct | 10/2025 | 78.0 | 70.6 | 53.0 | 48.3 | 82.2 | 49.8 | 50.0 | 68.8 | 62.6 | 13 |
| PerceptionLM (3B) | 3B | Instruct | 04/2025 | 56.0 | 87.6 | 53.0 | 49.8 | 85.9 | 66.3 | 43.3 | 35.0 | 59.6 | 16 |
| VST | 7B | Reason | 11/2025 | 78.6 | 72.3 | 53.0 | 32.4 | 57.1 | 30.7 | 51.2 | 73.0 | 56.0 | 21 |
| SpatialReasoner | 7B | Reason | 04/2025 | 37.4 | 55.0 | 45.2 | 33.7 | 54.5 | 31.0 | 56.1 | 54.8 | 45.9 | 27 |
| **Driving-Centric VLMs** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Cosmos-Reason2 | 8B | Reason | 12/2025 | 52.2 | 79.3 | 54.8 | 73.4 | 86.9 | 63.2 | 56.7 | 71.1 | 67.2 | 9 |
| DriveLMM-o1 | 8B | Reason | 03/2025 | 57.7 | 76.2 | 52.2 | 43.3 | 72.3 | 46.0 | 42.1 | 71.9 | 57.7 | 20 |
| Cosmos-Reason1 | 7B | Reason | 03/2025 | 45.6 | 60.8 | 55.7 | 35.2 | 64.4 | 42.1 | 51.2 | 78.7 | 54.2 | 23 |
| ReCogDrive | 8B | Instruct | 06/2025 | 47.8 | 50.1 | 53.0 | 37.1 | 54.5 | 38.7 | 53.7 | 66.2 | 50.1 | 25 |
| DriveMM | 7B | Instruct | 12/2024 | 54.4 | 54.3 | 53.0 | 30.0 | 59.7 | 29.5 | 45.7 | 66.2 | 49.1 | 26 |
| Dolphins | 7B | Instruct | 12/2023 | 45.6 | 15.3 | 37.4 | 13.9 | 44.0 | 16.1 | 51.2 | 73.4 | 37.1 | 30 |

Table 2. Evaluation of VLMs on the CyclingVQA benchmark. Accuracy (%) is reported for each of the eight tasks; Avg is the unweighted mean over the eight task accuracies. General tasks comprise SU, TSG, and TO, while domain-specific tasks comprise TSR+S, RED, TSR, LR, and SAA.
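
As a quick check of the Avg column, the sketch below recomputes it for the Gemini-2.5-Flash row; each task counts equally, regardless of how many questions it contains.

```python
# Per-task accuracies for Gemini-2.5-Flash, in the table's task order
# (SU, TSG, TO, TSR+S, RED, TSR, LR, SAA).
scores = [77.5, 98.1, 56.5, 82.2, 93.7, 83.5, 72.6, 86.3]

# Unweighted (macro) mean over the eight tasks.
avg = sum(scores) / len(scores)
print(f"{avg:.1f}")  # -> 81.3, matching the Avg column in Table 2
```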

Impact of CoT Prompting

CoT vs. Standard Prompting. Overall accuracy degrades under chain-of-thought (CoT) prompting across the three instruct models, suggesting that current VLMs may struggle to maintain spatial consistency over extended reasoning chains in cyclist-centric tasks.
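
To illustrate the two prompting conditions being compared, here is one hypothetical pair of templates; the exact wording used in the benchmark is an assumption and is not reproduced here.

```python
# Hypothetical prompt templates for the two conditions; the benchmark's
# exact wording is an assumption, not quoted from the paper.
STANDARD_PROMPT = (
    "{question}\n"
    "Options:\n{options}\n"
    "Answer with the letter of the correct option only."
)

COT_PROMPT = (
    "{question}\n"
    "Options:\n{options}\n"
    "Reason step by step about the spatial layout of the scene, "
    "then end with 'Answer: <letter>'."
)
```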

Systematic Failure Analysis

Figure 3. Taxonomy of Failure Modes. We characterize model errors across four recurring categories, providing a systematic overview of current VLM limitations in cyclist-centric scenarios. This analysis serves as a roadmap for developing more robust spatial reasoning capabilities in future cyclist-assistive intelligent systems.

Generation Verbosity Analysis

Characterizing Generation Verbosity. We report the mean number of tokens generated per response across different model families. This analysis reveals how different VLM architectures balance conciseness with reasoning depth when addressing cyclist-centric spatial queries.
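
One reasonable way to obtain such token counts is with each model's own tokenizer. The sketch below assumes Hugging Face checkpoints and is not necessarily the exact counting procedure used for the figure above.

```python
from transformers import AutoTokenizer  # assumes Hugging Face checkpoints

def mean_response_tokens(responses: list[str], model_name: str) -> float:
    """Mean number of generated tokens per response, counted with the
    model's own tokenizer. One plausible measurement, not necessarily
    the procedure used in the analysis above."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lengths = [len(tok.encode(r, add_special_tokens=False)) for r in responses]
    return sum(lengths) / len(lengths)

# Example (the model id is a real Hugging Face repo, used for illustration):
# mean_response_tokens(model_outputs, "Qwen/Qwen2.5-VL-7B-Instruct")
```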

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are grateful to CDA and BIA for releasing their pretrained models.