Modularity
VINE's architecture is designed with flexibility at its core, allowing seamless integration with various state-of-the-art segmentation models. VINE can leverage Grounding DINO, YOLO, or SAM for highly accurate zero-shot segmentation, all without modifying the core framework. This modular design lets you choose the segmentation backend that best fits your use case: prioritize speed with YOLO for live applications, accuracy with SAM for detailed analysis, or flexibility with Grounding DINO for open-world detection. The segmentation module feeds its masks or bounding boxes into our fine-tuned CLIP model, which then generates the spatio-temporal scene graph. This plug-and-play approach keeps VINE current with advances in computer vision while maintaining consistent downstream performance.
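Below is a minimal sketch of what such a plug-and-play interface could look like; `SegmentationBackend`, `SceneGraphHead`, and `build_scene_graph` are hypothetical names used for illustration, not VINE's actual API.

```python
# Illustrative sketch only: the class and function names here are hypothetical,
# not VINE's actual API. Any backend that returns region proposals can be swapped in.
from dataclasses import dataclass
from typing import List, Protocol

import numpy as np


class SegmentationBackend(Protocol):
    """Any backend (Grounding DINO, YOLO, SAM, ...) that returns region proposals."""

    def segment(self, frame: np.ndarray) -> List[dict]:
        """Return a list of {'mask' or 'box', 'label', 'score'} dicts for one frame."""
        ...


@dataclass
class SceneGraphHead:
    """Placeholder for the fine-tuned CLIP model that turns regions into a scene graph."""

    def __call__(self, frame: np.ndarray, regions: List[dict]) -> dict:
        # In the real system this would embed each region with CLIP and score
        # object/relationship prompts; here we simply echo the inputs.
        return {"nodes": regions, "edges": []}


def build_scene_graph(frames: List[np.ndarray],
                      backend: SegmentationBackend,
                      head: SceneGraphHead) -> List[dict]:
    """Run any segmentation backend, then feed its regions to the scene-graph head."""
    return [head(frame, backend.segment(frame)) for frame in frames]
```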
Efficiency
VINE is engineered for speed and accessibility, running smoothly on everything from consumer CPUs to high-end GPUs and cloud TPUs. Unlike heavyweight video understanding models that demand specialized hardware, VINE's efficient late fusion architecture keeps computational requirements minimal while maintaining real-time performance. The framework is compatible with both PyTorch and JAX, allowing developers to leverage their preferred ecosystem and hardware acceleration.
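As a generic illustration of the late fusion idea (not necessarily VINE's exact mechanism), per-frame features can be computed independently and only combined at the end, so the per-frame cost stays constant; the encoder and head below are placeholder modules.

```python
# Generic late-fusion illustration (not necessarily VINE's exact mechanism):
# frames are encoded independently and aggregated only at the end.
import torch

frame_encoder = torch.nn.Linear(3 * 224 * 224, 512)   # placeholder per-frame encoder
classifier = torch.nn.Linear(512, 10)                  # placeholder clip-level head

frames = torch.randn(16, 3, 224, 224)                  # 16 frames of one clip
per_frame = frame_encoder(frames.flatten(1))           # encode each frame independently
clip_feature = per_frame.mean(dim=0)                   # late fusion: aggregate afterwards
logits = classifier(clip_feature)
```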
Inference Performance
| Hardware | Average Time per Frame | FPS | Framework |
|---|---|---|---|
| H100 GPU | 15.447 ms | 64.7 | PyTorch |
| CPU | 56.939 ms | 17.6 | PyTorch |
| H100 GPU | Coming soon | Coming soon | JAX |
Forward pass timing results for VINE model inference per frame across different hardware configurations.
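A minimal PyTorch timing harness along these lines can reproduce per-frame latency and FPS numbers; the placeholder model and the 3×224×224 input shape below are assumptions for illustration, not VINE's actual architecture or preprocessing.

```python
# Minimal timing sketch for per-frame latency/FPS, as in the table above.
# The Sequential module and input shape are placeholders, not the real VINE model.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(          # substitute the real VINE model here
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).to(device).eval()

frame = torch.randn(1, 3, 224, 224, device=device)
n = 100

with torch.no_grad():
    for _ in range(10):               # warm-up iterations
        model(frame)
    if device == "cuda":
        torch.cuda.synchronize()      # make sure warm-up kernels finished
    start = time.perf_counter()
    for _ in range(n):
        model(frame)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for all queued kernels
    elapsed = time.perf_counter() - start

print(f"{1000 * elapsed / n:.3f} ms/frame, {n / elapsed:.1f} FPS on {device}")
```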
Hardware Compatibility
CPU Support
Lightweight design enables fast inference on commodity CPUs
GPU Acceleration
CUDA-enabled GPUs for high-performance inference
Framework Flexibility
Compatible with both PyTorch and JAX ecosystems
Memory Efficient
Low memory footprint enables deployment on resource-constrained devices
Zero-shot Generalizability
VINE has learned a general notion of what scene graphs are; this understanding enables zero-shot generalization to unfamiliar objects and actions without additional training. We demonstrate VINE on a variety of action localization tasks without any further finetuning.
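As a hedged sketch of how zero-shot localization can be assembled from per-frame confidences, the helper below thresholds per-frame action probabilities into intervals; the prompts and scores are made up, and in practice the per-frame scores would come from prompting VINE with the action descriptions.

```python
# Hedged sketch: turn per-frame action probabilities into localized segments by
# thresholding. The scores below are made up; in practice they would come from
# prompting VINE with the action descriptions for every frame.
from typing import Dict, List, Tuple


def localize_actions(frame_scores: List[Dict[str, float]],
                     threshold: float = 0.5) -> Dict[str, List[Tuple[int, int]]]:
    """Convert per-frame probabilities into [start, end) frame intervals per action."""
    segments: Dict[str, List[Tuple[int, int]]] = {}
    actions = frame_scores[0].keys() if frame_scores else []
    for action in actions:
        active, start = False, 0
        for t, scores in enumerate(frame_scores):
            if scores[action] >= threshold and not active:
                active, start = True, t
            elif scores[action] < threshold and active:
                segments.setdefault(action, []).append((start, t))
                active = False
        if active:
            segments.setdefault(action, []).append((start, len(frame_scores)))
    return segments


# Made-up scores for two action prompts over five frames:
scores = [{"person jumping": p, "person sitting": 1 - p} for p in (0.1, 0.7, 0.8, 0.6, 0.2)]
print(localize_actions(scores))  # {'person jumping': [(1, 4)], 'person sitting': [(0, 1), (4, 5)]}
```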
Promptability and Finetunability
VINE's foundation architecture enables powerful downstream adaptation through both prompting and fine-tuning strategies. The model can be dynamically prompted to focus on specific objects and relationships, returning probabilistic confidence scores for detected entities and their interactions.
Probabilistic Prompting
VINE operates probabilistically, allowing you to prompt for specific objects, actions, or relationships and receive confidence scores for all detected entities. Rather than binary detection, VINE provides probability distributions across the entire scene graph, enabling fine-grained control over what the model focuses on during inference.
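The snippet below illustrates the general idea of turning prompt similarities into a probability distribution with a temperature-scaled softmax, CLIP-style; the prompt set, similarity values, and temperature are illustrative assumptions rather than VINE's actual interface.

```python
# Illustrative only: converting raw prompt similarities into a probability
# distribution with a temperature-scaled softmax, the way a CLIP-style head can
# expose confidences for prompted concepts. Prompts, scores, and the temperature
# value are assumptions, not outputs of a real VINE run.
import numpy as np


def prompt_probabilities(similarities: dict, temperature: float = 0.07) -> dict:
    """Softmax over prompt similarities -> one probability per prompted concept."""
    prompts = list(similarities)
    logits = np.array([similarities[p] for p in prompts]) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return {p: round(float(v), 3) for p, v in zip(prompts, probs)}


print(prompt_probabilities({
    "person holding cup": 0.31,   # hypothetical cosine similarities
    "person drinking":    0.27,
    "dog on sofa":        0.12,
}))
```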
Fine-tuning & Adaptation
Beyond prompting, VINE can be efficiently fine-tuned for specialized tasks. The modular architecture supports task-specific adaptation through either full finetuning or parameter-efficient techniques while preserving the core video understanding capabilities, making it well suited to domain-specific applications.
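The sketch below shows one common parameter-efficient recipe (freeze the backbone, train a small task head) under stated assumptions; the placeholder backbone, feature size, and class count stand in for the real VINE/SGClip weights and are not the project's actual fine-tuning code.

```python
# Parameter-efficient adaptation sketch under stated assumptions: the backbone
# (a frozen placeholder here) stays fixed and only a small task head is trained.
# In the real workflow the backbone would be the pretrained VINE/SGClip model.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # placeholder feature extractor
for p in backbone.parameters():
    p.requires_grad = False                                 # keep core weights frozen

head = nn.Linear(512, 200)                                  # e.g. 200 action classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 512)            # dummy clip-level features
labels = torch.randint(0, 200, (8,))      # dummy action labels

for _ in range(5):                        # tiny training loop for illustration
    logits = head(backbone(features))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```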
Finetuned Action Recognition Performance
VINE demonstrates strong performance on action recognition across different training scenarios on ActivityNet. VINE uses SGClip as its backbone architecture for scene graph generation. We compare against state-of-the-art action recognition models including BIKE, Text4Vis, ResT, and E2E, showing competitive zero-shot and finetuned capabilities.
| Category | Model | ActivityNet Accuracy (%) |
|---|---|---|
| Zero-shot | VINE | 76.34 |
| Zero-shot | CLIP | 74.37 |
| Zero-shot | BIKE | 80.00 |
| Zero-shot | Text4Vis | 77.40 |
| Zero-shot | ResT | 26.30 |
| Zero-shot | E2E | 20.00 |
| Few-shot (1%) | VINE | 80.10 |
| Few-shot (1%) | CLIP | 78.79 |
| Few-shot (5%) | VINE | 86.05 |
| Few-shot (5%) | CLIP | 80.02 |
Action recognition accuracy on ActivityNet for zero-shot and few-shot (finetuned on 1% and 5% of the data) models. Zero-shot baselines include state-of-the-art action recognition models (BIKE, Text4Vis, ResT, E2E) and our models evaluated without training.
Dataset
ESCA-Video-87K
A new benchmark for video understanding
87,045 video clips curated and annotated to push the boundaries of video understanding. Each clip is paired with rich, natural language captions crafted by GPT-4.
The dataset provides precise object traces, dynamically segmented using Grounding DINO and SAM2. With programmatic specifications written in linear temporal logic, every clip becomes a structured video you can track, query, and reason about, frame by frame.
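As a purely illustrative example of the kind of temporal property such specifications can express (not the dataset's actual specification syntax), a linear temporal logic formula over object traces might read:

```latex
% Purely illustrative LTL-style property, not the dataset's actual spec syntax:
% "whenever the person holds the cup they eventually drink from it,
%  and the cup never rests on the floor."
\mathbf{G}\big(\mathrm{holds}(\mathit{person},\mathit{cup}) \rightarrow \mathbf{F}\,\mathrm{drinks}(\mathit{person},\mathit{cup})\big)
\;\wedge\; \mathbf{G}\,\neg\,\mathrm{on}(\mathit{cup},\mathit{floor})
```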
Team
Core Contributors
Collaborators
Faculty

