We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated
captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of
Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different
levels of information and summarizes them efficiently. Our approach can be applied to enhance video
understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an
LLM-based metric that assesses the similarity and quality of generated captions against the ground-truth
captions. We further build four human-annotated datasets in three domains: autonomous driving, general
scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior
captioning performance compared to state-of-the-art approaches from the research community (VILA1.5,
CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, compared with GPT-4V, Wolf
improves CapScore (caption quality) by 55.6% and CapScore (caption similarity) by 77.4% on challenging
driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming
to accelerate advancements in video understanding, captioning, and data alignment.
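
The mixture-of-experts idea described above can be sketched roughly as follows. This is only an illustrative outline, not the Wolf implementation: the names `summarize_video`, `image_expert_stub`, `video_expert_stub`, and `join_summarizer` are hypothetical, and the stubs stand in for the real image-level VLMs, video-level VLMs, and LLM summarizer, which are not specified here.

```python
from typing import Callable, List, Sequence

# Hypothetical types (assumptions for illustration only):
# an expert maps a sequence of frames to one caption string,
# and a summarizer merges several captions into one.
Expert = Callable[[Sequence[object]], str]
Summarizer = Callable[[List[str]], str]


def summarize_video(
    frames: Sequence[object],
    experts: Sequence[Expert],
    summarizer: Summarizer,
) -> str:
    """Collect one caption per expert, then merge them into a single caption."""
    captions = [expert(frames) for expert in experts]
    return summarizer(captions)


# Stub experts standing in for real models (assumed, not from the paper).
def image_expert_stub(frames: Sequence[object]) -> str:
    # A real image-level expert would caption individual key frames.
    return f"{len(frames)} key frames described"


def video_expert_stub(frames: Sequence[object]) -> str:
    # A real video-level expert would caption motion over the whole clip.
    return "motion across the clip described"


def join_summarizer(captions: List[str]) -> str:
    # A real summarizer would likely be an LLM call; here we just concatenate.
    return " | ".join(captions)


caption = summarize_video(
    ["frame0", "frame1"],
    [image_expert_stub, video_expert_stub],
    join_summarizer,
)
print(caption)
```

Passing the experts and the summarizer as plain callables keeps the sketch independent of any particular model API, so image-level and video-level models can be swapped without changing the aggregation step.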