Modularity

VINE's architecture is designed with flexibility at its core, allowing seamless integration with various state-of-the-art segmentation models. VINE can leverage Grounding DINO, YOLO, or SAM for highly accurate zero-shot segmentation, all without modifying the core framework. This modular design means you can choose the segmentation backend that best fits your use case: prioritize speed with YOLO for live applications, accuracy with SAM for detailed analysis, or flexibility with Grounding DINO for open-world detection. The segmentation module feeds its masks or bounding boxes into our fine-tuned CLIP model, which then generates the spatio-temporal scene graph (see the interface sketch below). This plug-and-play approach ensures VINE stays current with advances in computer vision while maintaining consistent downstream performance.

VINE running with SAM (mask-based segmentation)
VINE running with Grounding DINO (bounding-box detection)
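
To make the plug-and-play idea concrete, here is a minimal sketch of a backend-agnostic segmentation interface. The Region, SegmentationBackend, and build_scene_graphs names are illustrative assumptions, not the actual VINE API; the relation head stands in for the fine-tuned CLIP model.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple

import numpy as np


@dataclass
class Region:
    """One detected region: a mask (SAM-style) and/or a box (detector-style)."""
    label: str
    score: float
    mask: Optional[np.ndarray] = None                # H x W boolean mask
    box: Optional[Tuple[int, int, int, int]] = None  # x1, y1, x2, y2


class SegmentationBackend(Protocol):
    """Anything that turns a frame into Regions can serve as the backend."""
    def segment(self, frame: np.ndarray) -> List[Region]: ...


def build_scene_graphs(frames, backend, relation_head):
    """Run the chosen backend per frame, then hand its regions to the
    relation head (the fine-tuned CLIP model in VINE) for scoring."""
    return [relation_head(frame, backend.segment(frame)) for frame in frames]
```

Because the framework only depends on the Region contract, swapping SAM for YOLO or Grounding DINO is a one-line change at the call site.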

Efficiency

VINE is engineered for speed and accessibility, running smoothly on everything from consumer CPUs to high-end GPUs and cloud TPUs. Unlike heavyweight video understanding models that demand specialized hardware, VINE's efficient late fusion architecture keeps computational requirements minimal while maintaining real-time performance. The framework is compatible with both PyTorch and JAX, allowing developers to leverage their preferred ecosystem and hardware acceleration.

Inference Performance

Hardware    Average Time per Frame     FPS           Framework
H100 GPU    0.015447 s (15.447 ms)     64.7          PyTorch
CPU         0.056939 s (56.939 ms)     17.6          PyTorch
H100 GPU    Coming soon                Coming soon   JAX

Forward pass timing results for VINE model inference per frame across different hardware configurations.
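
For reference, here is a minimal sketch of how per-frame forward-pass latencies like the numbers above can be measured in PyTorch. The model handle and the N x C x H x W input layout are placeholders, not VINE's actual entry point.

```python
import time

import torch


@torch.no_grad()
def time_per_frame(model, frames, warmup=10, device="cuda"):
    """Return (seconds per frame, implied FPS) for single-frame inference."""
    model = model.to(device).eval()
    frames = frames.to(device)
    for _ in range(warmup):          # warm up kernels and caches
        model(frames[:1])
    if device == "cuda":
        torch.cuda.synchronize()     # flush queued GPU work before timing
    start = time.perf_counter()
    for f in frames:
        model(f.unsqueeze(0))
    if device == "cuda":
        torch.cuda.synchronize()     # wait for the last kernel to finish
    per_frame = (time.perf_counter() - start) / len(frames)
    return per_frame, 1.0 / per_frame
```

The explicit synchronization matters on GPU: without it, the timer would stop before the asynchronously queued kernels actually complete.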

Hardware Compatibility

🖥️

CPU Support

Lightweight design enables fast inference on consumer CPUs

⚡

GPU Acceleration

CUDA-enabled GPUs for high-performance inference

🔧

Framework Flexibility

Compatible with both PyTorch and JAX ecosystems

💾

Memory Efficient

Low memory footprint enables deployment on resource-constrained devices

Zero-shot Generalizability

VINE has learned a general notion of what scene graphs are, and this understanding enables zero-shot generalization to unfamiliar objects and actions without additional training. We demonstrate VINE on a variety of action localization tasks without any further fine-tuning.

Zero-shot action localization across diverse scenarios without additional training

Promptability and Finetunability

VINE's foundation architecture enables powerful downstream adaptation through both prompting and fine-tuning strategies. The model can be dynamically prompted to focus on specific objects and relationships, returning probabilistic confidence scores for detected entities and their interactions.

🎯

Interactive demo: select objects and click play to see VINE in action

Probabilistic Prompting

VINE operates probabilistically, allowing you to prompt for specific objects, actions, or relationships and receive confidence scores for all detected entities. Rather than binary detection, VINE provides probability distributions across the entire scene graph, enabling fine-grained control over what the model focuses on during inference.
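
Below is a minimal sketch of what prompting with confidence scores can look like. It assumes a hypothetical predict call that returns a graph with scored edges; none of these names are the published interface.

```python
def prompt_scene_graph(model, video, objects, relations, threshold=0.5):
    """Query a (hypothetical) VINE interface for specific entities and
    report a confidence for every candidate rather than a binary hit."""
    graph = model.predict(video, objects=objects, relations=relations)
    results = []
    for edge in graph.edges:
        results.append((edge.subject, edge.relation, edge.object, edge.score))
        flag = "*" if edge.score >= threshold else " "
        print(f"{flag} {edge.subject} --{edge.relation}--> {edge.object}"
              f"  p={edge.score:.2f}")
    return results


# e.g. prompt_scene_graph(model, video,
#                         objects=["person", "bicycle"],
#                         relations=["riding", "next to"])
# might print (illustrative output):
# * person --riding--> bicycle  p=0.91
```

The threshold here is only for display; the full probability distribution is returned so downstream logic can make its own decisions.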

Fine-tuning & Adaptation

Beyond prompting, VINE can be efficiently fine-tuned for specialized tasks. The modular architecture enables task-specific adaptation through either full fine-tuning or parameter-efficient techniques while preserving the core video understanding capabilities, making it suitable for domain-specific applications in various fields.
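
As one illustration of a parameter-efficient recipe, the sketch below freezes the backbone and trains only a lightweight task head in PyTorch. The head attribute and training loop are assumptions about the model layout, not VINE's documented fine-tuning procedure.

```python
import torch
from torch import nn


def freeze_all_but_head(model: nn.Module, head_attr: str = "head"):
    """Freeze the backbone, leaving only a small task head trainable.
    The `head` attribute name is an assumption about the model layout."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, head_attr).parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


def finetune(model, loader, epochs=3, lr=1e-4, device="cuda"):
    """Standard supervised loop over (clip, label) batches."""
    opt = torch.optim.AdamW(freeze_all_but_head(model), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for clips, labels in loader:
            loss = loss_fn(model(clips.to(device)), labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Training only the head keeps the optimizer state and gradient memory small, which is what makes this kind of adaptation feasible on modest hardware.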

Finetuned Action Recognition Performance

VINE demonstrates strong action recognition performance on ActivityNet across different training scenarios. VINE uses SGClip as its backbone architecture for scene graph generation. We compare against state-of-the-art action recognition models, including BIKE, Text4Vis, ResT, and E2E, and show competitive zero-shot and fine-tuned accuracy.

Category         Model      ActivityNet Accuracy (%)
Zero-shot        VINE       76.34
Zero-shot        CLIP       74.37
Zero-shot        BIKE       80.00
Zero-shot        Text4Vis   77.40
Zero-shot        ResT       26.30
Zero-shot        E2E        20.00
Few-shot (1%)    VINE       80.10
Few-shot (1%)    CLIP       78.79
Few-shot (5%)    VINE       86.05
Few-shot (5%)    CLIP       80.02

Action recognition accuracy on ActivityNet for zero-shot and few-shot (fine-tuned on 1% and 5% of the data) models. Zero-shot baselines include state-of-the-art action recognition models (BIKE, Text4Vis, ResT, E2E) and our models evaluated without training.

Dataset

ESCA-Video-87K

A new benchmark for video understanding

87,045 video clips curated and annotated to push the boundaries of video understanding. Each clip is paired with rich, natural language captions crafted by GPT-4.

Our dataset provides precise object traces, dynamically segmented using Grounding DINO and SAM2. With programmatic specifications written in linear temporal logic, every clip becomes a structured video you can track, query, and reason about, frame by frame.
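
As a flavor of what such a specification can look like (this particular formula is an illustrative assumption, not an actual annotation from the dataset), a clip of a person picking up a cup might carry:

□( reaches(person, cup) → ◇ holds(person, cup) )

read as "always, if the person reaches for the cup, they eventually hold it", where □ means "always" and ◇ means "eventually" over the clip's frames.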

ESCA-Video-87K dataset samples
87K+ video clips · 100 trajectories per video · 500K+ masks · GPT-4 captions

Team

Core Contributors

University of Pennsylvania

Collaborators

University of Pennsylvania

Faculty

University of Pennsylvania
University of Central Florida