# Jormungandr: End-to-End Video Object Detection with Spatial-Temporal Mamba
## 📋 Table of contents
- [Jormungandr: End-to-End Video Object Detection with Spatial-Temporal Mamba](#jormungandr-end-to-end-video-object-detection-with-spatial-temporal-mamba)
  - [Description](#description)
  - [Getting started](#getting-started)
    - [Installation](#installation)
  - [Usage](#usage)
    - [Still Image Detection (Fafnir)](#still-image-detection-fafnir)
    - [Video Object Detection (Jormungandr)](#video-object-detection-jormungandr)
  - [Pretrained Models](#pretrained-models)
  - [Documentation](#documentation)
  - [Authors](#authors)
  - [License](#license)

## Description
Jormungandr is a novel end-to-end video object detection system that leverages the Spatial-Temporal Mamba architecture to accurately detect and track objects across video frames. By combining spatial and temporal information, Jormungandr improves detection accuracy and robustness, making it suitable for applications such as surveillance, autonomous driving, and video analytics.
## Getting started

### Prerequisites
Before installing this package, ensure that your system meets the following requirements:
- Operating System: Linux
- Python: version 3.12 or higher
- Hardware: CUDA-enabled GPU
- Software dependencies:
  - NVIDIA drivers compatible with your GPU
  - CUDA Toolkit properly installed and configured; verify with `nvidia-smi`
### Installation

Install the PyPI package:

```shell
pip install jormungandr-ssm
```

Alternatively, install from source:

```shell
pip install git+https://github.com/Knolaisen/jormungandr
```
## Usage

We expose several levels of interface for the Fafnir still-image detector and the Jormungandr video object detection (VOD) model. Both models follow a simple PyTorch-style API. Because of the Mamba architecture, the models are optimized for GPU execution and require CUDA for both inference and training.
### Still Image Detection (Fafnir)

Use Fafnir when performing object detection on single images.

```python
import torch

from jormungandr import Fafnir

device = torch.device("cuda")
batch, channels, height, width = 2, 3, 224, 224
x = torch.randn(batch, channels, height, width).to(device)

# Initialize the model
model = Fafnir(variant="fafnir-b", pretrained=True).to(device)
model.eval()

# Inference
with torch.no_grad():
    detections = model(x)
```
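The exact structure of `detections` is not documented in this section; assuming each detection is a mapping with hypothetical `box`, `score`, and `label` fields (our illustration, not the package's confirmed API), a common next step is confidence filtering, sketched below in plain Python:

```python
# Hypothetical detection records; Fafnir's real output format may differ.
detections = [
    {"box": [10, 20, 50, 80], "score": 0.92, "label": "person"},
    {"box": [30, 40, 60, 90], "score": 0.35, "label": "car"},
]

def filter_detections(dets, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in dets if d["score"] >= threshold]

kept = filter_detections(detections)
print(len(kept))  # → 1 (only the 0.92-score detection survives the 0.5 cutoff)
```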
### Video Object Detection (Jormungandr)

Use Jormungandr for end-to-end video object detection using spatial-temporal modeling.

```python
import torch

from jormungandr import Jormungandr

device = torch.device("cuda")
batch, frames, channels, height, width = 32, 8, 3, 224, 224
x = torch.randn(batch, frames, channels, height, width).to(device)

# Initialize the model
model = Jormungandr(variant="jormungandr-b", pretrained=True).to(device)
model.eval()

# Inference
with torch.no_grad():
    detections = model(x)
```
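To run a longer video through the model, frames are typically grouped into fixed-length clips matching the `frames` dimension above. A minimal sketch in plain Python (the `make_clips` helper is our illustration, not part of the package):

```python
def make_clips(frame_list, clip_len=8):
    """Split a frame sequence into non-overlapping clips of clip_len frames,
    dropping a trailing remainder shorter than clip_len."""
    return [
        frame_list[i : i + clip_len]
        for i in range(0, len(frame_list) - clip_len + 1, clip_len)
    ]

# 20 frames with clip_len=8 yield two full clips; the last 4 frames are dropped.
clips = make_clips(list(range(20)), clip_len=8)
print([len(c) for c in clips])  # → [8, 8]
```

Each clip can then be stacked into a `(frames, channels, height, width)` tensor and batched along the first dimension.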
## Pretrained Models

We provide pretrained models hosted on Hugging Face.

- The Fafnir models (`fafnir-t`, `fafnir-s`, `fafnir-b`) are pretrained on the COCO dataset.
- The Jormungandr models (`jormungandr-t`, `jormungandr-s`, `jormungandr-b`) are pretrained on the MOT17 dataset.

These models are downloaded automatically when initialized in your code.
## Documentation
## Authors

- Kristoffer Nohr Olaisen
- Sverre Nystad
## License
Distributed under the MIT License. See LICENSE for more information.

