Acceleration and accuracy of manufacturing process capture with the use of VLMs
How vision-language models compress hours of manual capture, editing, and maintenance of work instructions into minutes — and what it takes to make this reliable in production.
TL;DR
Problem
- Creating manufacturing work instructions manually is labor-intensive (capturing, handling media, writing).
- Instructions are hard to maintain; updates often mean redoing the whole workflow.
Context
- Visual language models (VLMs) join vision and text: they can interpret demonstrations, frames, and prompts in one pipeline—and they’re getting more capable and deployable (cloud and increasingly local).
Solution
- VLMs enable faster, richer process documentation: in our trials, creation time dropped by as much as 90% vs fully manual methods.
- Owning the creation pipeline (tuning capture, prompts, review, and where the model runs) is what makes it work in production.
Problems
Capturing
Creating accurate and accessible manufacturing procedures is a critical but time-intensive task for many organizations. The process is complicated by several persistent challenges that slow down the pace of progress and introduce opportunities for error. First and foremost are the technical obstacles. Capturing the necessary photos and videos to illustrate each step in a procedure often requires juggling multiple devices—cameras, tablets, smartphones—all of which create their own media formats and require different methods for transferring files. These assets frequently must be moved manually, typically with SD cards, cables, or unreliable wireless transfers, resulting in a workflow that is both tedious and error-prone. Files can become misplaced, duplicated, or corrupted, adding further frustration and lost time.
Editing
Once these images and videos are finally collected, the challenge shifts to documentation. There is no universally accepted tool specifically designed for creating rich, multimedia work instructions in the manufacturing sector. Many teams default to generic office applications—Word, PowerPoint, Excel—to organize their content. Unfortunately, none of these tools are truly optimized for this purpose. Embedding and arranging media can be clumsy, and collaboration is limited or unwieldy, especially when teams need to keep documentation up-to-date as processes evolve. Furthermore, these formats make it difficult to standardize procedures across departments or sites, leading to inconsistencies in quality and presentation.
Maintenance
This lack of effective tools and integrated workflows directly impacts operational efficiency. Technically oriented staff are forced to spend a disproportionate amount of time on basic documentation tasks rather than on higher-value activities such as process improvement. Maintenance of instructions is also tedious—updates or revisions to existing documentation typically require repeating the entire manual workflow, making it less likely that procedures reflect the latest best practices on the shop floor. Ultimately, these technical and procedural bottlenecks slow down the transfer of knowledge in manufacturing settings, limit the ability to respond to changes quickly, and hinder the adoption of new or improved work methods.
Given / Context
At a high level, a vision-language model (VLM) feeds an image path (preprocessor → ViT → projection into token space) and a text path (tokenizer → embeddings) into a shared LLM, which then decodes its output tokens back to text (embeddings → de-tokenizer). The schematic below follows that multimodal layout and matches the prose in the following subsections.
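The layout described above can be illustrated with a toy sketch. It is not a real VLM: the ViT stage is skipped entirely, the weights are random, and the sizes (`D_MODEL`, `PATCH`), the whitespace "tokenizer", and all function names are assumptions made for illustration. It only shows how image patches and text tokens end up as vectors of the same width in one sequence that a shared LLM could consume.

```python
import random

D_MODEL = 8  # toy embedding width (assumption)
PATCH = 2    # toy patch edge length in pixels (assumption)


def image_to_patches(image, patch=PATCH):
    """Split a 2-D grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def project(vectors, weights):
    """Linear projection of each vector by an (in_dim x out_dim) matrix."""
    out_dim = len(weights[0])
    return [[sum(v[i] * weights[i][j] for i in range(len(v)))
             for j in range(out_dim)] for v in vectors]


def embed_text(prompt, rng):
    """Whitespace 'tokenizer' plus a random embedding table built on the fly."""
    tokens = prompt.lower().split()
    table = {}
    for token in tokens:
        if token not in table:
            table[token] = [rng.uniform(-1, 1) for _ in range(D_MODEL)]
    return tokens, [table[token] for token in tokens]


def build_multimodal_sequence(image, prompt, seed=0):
    """Return (sequence, number_of_image_tokens, text_tokens)."""
    rng = random.Random(seed)
    patches = image_to_patches(image)
    # Stand-in for "ViT -> projection": a single random linear layer.
    weights = [[rng.uniform(-1, 1) for _ in range(D_MODEL)]
               for _ in range(PATCH * PATCH)]
    image_tokens = project(patches, weights)
    tokens, text_tokens = embed_text(prompt, rng)
    # Image tokens first, then text tokens, all of width D_MODEL.
    return image_tokens + text_tokens, len(image_tokens), tokens


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
sequence, n_image, tokens = build_multimodal_sequence(
    image, "Describe this assembly step")
print(len(sequence), n_image, tokens)
```

In a production model the projection is learned so that image tokens land in the same embedding space the LLM was trained on; here it is random and serves only to show the shapes involved.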
What Are Visual Language Models?
Visual Language Models (VLMs), also known as Vision-Language Models, represent the state of the art in integrating visual and textual information. VLMs extend the transformer architecture that underpins recent advances in natural language processing, adapting it to jointly process images (or videos) and text. At the technical core, these models receive visual inputs (such as frames from a camera or screenshots) and textual inputs (such as instructions or prompts), and learn to encode both modalities into a shared semantic space. This enables the models to “understand” not just what is shown in an image, but also how it relates to language, tasks, or step-by-step procedures.
How Are VLMs Trained and What Can They Do?
Contemporary VLMs, including OpenAI’s GPT-4V, Google’s Gemini, the open-source LLaVA, Hugging Face’s IDEFICS, and Alibaba’s Qwen-VL, are pretrained on massive datasets composed of image-text pairs scraped from the web, as well as increasingly curated collections such as instructional manuals or domain-specific datasets. During training, they learn to generate descriptions (captions), answer questions about images (“what is happening here?”), and even generate step lists or detailed instructions grounded in what’s visible. Their vision encoders are typically based on high-capacity models such as CLIP or other vision transformers (ViT), while their language components are large transformer LLMs. After pretraining, these models can be further fine-tuned for specialized tasks, such as process documentation or manufacturing instructions.
Multi-Modal Processing and Practical Benefits
Importantly, VLMs are multimodal: they connect what they “see” with what they “read” or “write.” For example, given a demonstration video of an assembly process, a VLM can segment the process into discrete steps, extract key frames, and generate concise, human-readable explanations for each stage. This streamlines documentation, because the model handles much of the tedious manual description and screenshotting. Advanced VLMs can run in the cloud and, increasingly, on local hardware, allowing flexible deployment depending on organizational security or data privacy needs. As the technology matures, the performance gap between proprietary (cloud-based) offerings and open-source, locally deployable models is closing rapidly, making VLMs an attractive option for accelerating and improving process knowledge capture.
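One way to keep a video-to-steps pipeline cheap is to pre-select candidate key frames before any model call. The sketch below is a simple frame-difference heuristic, not something a VLM does internally and not necessarily the method used in the trials described here. Frames are assumed to be flat lists of grayscale pixel values of equal length, and the function names and the `threshold` value are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_key_frames(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last kept frame.

    The first frame is always kept; each later frame is kept only if its
    mean absolute difference to the most recently kept frame exceeds
    `threshold` (an assumed tuning value).
    """
    if not frames:
        return []
    kept = [0]
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[index], frames[kept[-1]]) > threshold:
            kept.append(index)
    return kept


# Five tiny synthetic "frames": two near-identical pairs and one outlier.
frames = [[0] * 4, [1] * 4, [50] * 4, [52] * 4, [120] * 4]
print(select_key_frames(frames))  # [0, 2, 4]
```

Each kept frame could then be sent to a VLM together with a prompt asking for a step description. Comparing against the last kept frame, rather than the immediately preceding one, avoids missing slow, gradual changes.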
Solution / Results
Accelerated Process Creation
Introducing Visual Language Models (VLMs) into the process documentation workflow dramatically reduces the time and effort required to create accurate manufacturing procedures. In our trials, using VLMs for process capture and step description cut process creation time by as much as 90% compared to traditional, fully manual methods. Tasks that used to take hours, such as extracting key frames from video, generating step-by-step instructions, and organizing multimedia, can now be completed in minutes. VLMs can automatically interpret demonstration videos, segment procedures into logical steps, generate concise captions, and even enrich documentation with contextually relevant details that might otherwise have been overlooked.
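To make the "as much as 90%" figure concrete, the short sketch below converts a before/after duration into a time reduction and a speedup factor. The durations are hypothetical, chosen only to illustrate the arithmetic, and are not measurements from the trials.

```python
def time_reduction(manual_minutes, assisted_minutes):
    """Fraction of creation time saved relative to the manual baseline."""
    if manual_minutes <= 0:
        raise ValueError("manual_minutes must be positive")
    return 1 - assisted_minutes / manual_minutes


def speedup_factor(manual_minutes, assisted_minutes):
    """How many times faster the assisted workflow is."""
    if assisted_minutes <= 0:
        raise ValueError("assisted_minutes must be positive")
    return manual_minutes / assisted_minutes


# Hypothetical figures: 3 hours manually vs 18 minutes with VLM assistance.
manual, assisted = 180, 18
print(f"{time_reduction(manual, assisted):.0%} less time, "
      f"{speedup_factor(manual, assisted):.0f}x faster")
```

A 90% reduction therefore corresponds to a tenfold speedup. When measuring this in practice, the assisted duration should include human review time.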
Quality, Security, and Control
Beyond speed, VLMs offer advantages in maintaining both quality and security. The most powerful VLMs currently come from proprietary cloud providers, but recent advances in open-weight, locally deployable models have begun to close the gap. Many organizations are understandably concerned about the privacy of their manufacturing data. VLMs can increasingly be run on local hardware, so sensitive videos and process information never leave the organization’s secure environment. Although open-weight models previously lagged behind commercial alternatives, they are now reliable enough to deliver substantial efficiency gains and offer a promising path for companies seeking more control over their data.
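The cloud-versus-local decision can be made explicit in the pipeline instead of being left to convention. The sketch below is a minimal illustration: the `CaptureJob` type, its fields, and the backend labels `"local"` and `"cloud"` are hypothetical, and a real policy would likely depend on more than a single sensitivity flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureJob:
    """A recorded demonstration waiting to be turned into instructions."""
    name: str
    contains_sensitive_footage: bool


def choose_backend(job, local_available=True):
    """Pick where the model runs for a given job.

    Sensitive footage must stay on local hardware; if no local model is
    available the job is refused rather than silently sent to the cloud.
    """
    if job.contains_sensitive_footage:
        if not local_available:
            raise RuntimeError(
                f"{job.name}: sensitive footage requires a local model")
        return "local"
    return "cloud"


jobs = [
    CaptureJob("torque-wrench-calibration", contains_sensitive_footage=False),
    CaptureJob("prototype-line-assembly", contains_sensitive_footage=True),
]
for job in jobs:
    print(job.name, "->", choose_backend(job))
```

Failing closed (raising an error instead of falling back to the cloud) is a deliberate choice in this sketch, so that a misconfigured deployment cannot leak footage.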
Impact on Documentation
In summary, by integrating VLMs into manufacturing process capture, organizations are able to greatly increase documentation speed, reduce repetitive manual tasks, and ensure that work instructions are kept accurate and up to date. Whether using state-of-the-art cloud services or privacy-preserving local deployments, the current wave of VLMs makes it possible to systematize and scale process knowledge in a way that was not previously feasible.