DOI

10.5703/1288284318544

Description

Many existing software tutorials lack clear interaction cues (clicks, drags, keyboard shortcuts), causing learners to rewind frequently and lose flow. To address this, we propose a system that infers user actions by analyzing frame-to-frame changes with a multimodal LLM, then overlays predefined visual indicators, such as click ripples, drag trails, and shortcut labels, at precise cursor positions. To ensure reliable action classification and avoid hallucinations, we integrate retrieval-augmented generation (RAG), grounding the model in official software documentation. This approach aims to enhance older tutorials with accurate, actionable interaction feedback, improving clarity and learning efficiency.
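The overlay step described above can be sketched as a simple mapping from an inferred action to a predefined visual indicator anchored at the detected cursor position. This is a minimal illustrative sketch, not the authors' implementation; the action names, `Overlay` type, and `overlay_for` function are all assumptions, and the multimodal-LLM inference and RAG grounding are stubbed out entirely.

```python
from dataclasses import dataclass

# Illustrative action types a multimodal LLM might return for a frame pair.
# These names are assumptions for this sketch, not the system's actual API.
ACTIONS = {"click", "drag", "shortcut"}

@dataclass
class Overlay:
    kind: str       # "ripple", "trail", or "label"
    x: int          # cursor x position in the frame
    y: int          # cursor y position in the frame
    text: str = ""  # shortcut label text, e.g. "Ctrl+Z"

def overlay_for(action: str, x: int, y: int, keys: str = "") -> Overlay:
    """Map an inferred user action to a predefined visual indicator
    anchored at the detected cursor position."""
    if action == "click":
        return Overlay("ripple", x, y)
    if action == "drag":
        return Overlay("trail", x, y)
    if action == "shortcut":
        return Overlay("label", x, y, keys)
    raise ValueError(f"unknown action: {action}")
```

In the full system, the `action` argument would come from the LLM's classification of consecutive frames, with RAG-retrieved documentation constraining the set of plausible actions for the software being demonstrated.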


Visual Augmented Tutorial Agent for Software Tutorial
