🐺 Wolf: Captioning Everything with a World Summarization Framework

¹NVIDIA, ²UC Berkeley, ³MIT, ⁴UT Austin, ⁵University of Toronto, ⁶Stanford University


Method                               Caption Similarity ↑         Caption Quality (e.g., reduced hallucination) ↑
Scenarios       Organization         nuScenes  Pexels  Robotics   nuScenes  Pexels  Robotics
CogAgent        Tsinghua & Zhipu AI  0.18      0.68    0.38       0.24      0.72    0.43
GPT-4V          OpenAI               0.31      0.72    0.34       0.36      0.75    0.35
VILA-1.5-13b    NVIDIA               0.21      0.85    0.62       0.25      0.86    0.67
Gemini-Pro-1.5  Google               0.42      0.87    0.63       0.45      0.87    0.67
Wolf (ours)     NVIDIA               0.55      0.88    0.72       0.56      0.89    0.75

We are continuously adding more models to the leaderboard. If you would like to add your model, please contact us!