# Jormungandr: End-to-End Video Object Detection with Spatial-Temporal Mamba
## 📋 Table of contents
- [Jormungandr: End-to-End Video Object Detection with Spatial-Temporal Mamba](#jormungandr-end-to-end-video-object-detection-with-spatial-temporal-mamba)
  - [Description](#description)
  - [Getting started](#getting-started)
    - [Installation](#installation)
  - [Usage](#usage)
    - [Still Image Detection (Fafnir)](#still-image-detection-fafnir)
    - [Video Object Detection (Jormungandr)](#video-object-detection-jormungandr)
  - [Pretrained Models](#pretrained-models)
  - [Documentation](#documentation)
  - [Authors](#authors)
  - [License](#license)

## Description
Jormungandr is a novel end-to-end video object detection system that leverages the Spatial-Temporal Mamba architecture to accurately detect and track objects across video frames. By combining spatial and temporal information, Jormungandr improves detection accuracy and robustness, making it suitable for applications such as surveillance, autonomous driving, and video analytics.
## Getting started

### Prerequisites
Before installing this package, ensure that your system meets the following requirements:
- Operating System: Linux
- Python: version 3.12 or higher
- Hardware: CUDA-enabled GPU
- Software dependencies:
  - NVIDIA drivers compatible with your GPU
  - CUDA Toolkit properly installed and configured; verify with `nvidia-smi`
### Installation

Install the PyPI package:

```shell
pip install jormungandr-ssm
```

Alternatively, install from source:

```shell
pip install git+https://github.com/Knolaisen/jormungandr
```
## Usage

We expose several levels of interface for the Fafnir still-image detector and the Jormungandr video object detection (VOD) model. Both models follow a simple PyTorch-style API. Because of the Mamba architecture, the models are optimized for GPU execution and require CUDA for both inference and training.
### Still Image Detection (Fafnir)

Use Fafnir when performing object detection on single images.

```python
import torch

from jormungandr import Fafnir

device = torch.device("cuda")
batch, channels, height, width = 2, 3, 224, 224
x = torch.randn(batch, channels, height, width).to(device)

# Initialize the model
model = Fafnir(variant="fafnir-b", pretrained=True).to(device)
model.eval()

# Inference
with torch.no_grad():
    detections = model(x)
```
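The exact structure of `detections` is not documented in this section; assuming each detection is a mapping with hypothetical `box`, `score`, and `label` fields (our illustration, not the package's confirmed API), a common next step is confidence filtering, sketched below in plain Python:

```python
# Hypothetical detection records; Fafnir's real output format may differ.
detections = [
    {"box": [10, 20, 50, 80], "score": 0.92, "label": "person"},
    {"box": [30, 40, 60, 90], "score": 0.35, "label": "car"},
]

def filter_detections(dets, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in dets if d["score"] >= threshold]

kept = filter_detections(detections)
print(len(kept))  # → 1 (only the 0.92-score detection survives the 0.5 cutoff)
```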
### Video Object Detection (Jormungandr)

Use Jormungandr for end-to-end video object detection using spatial-temporal modeling.

```python
import torch

from jormungandr import Jormungandr

device = torch.device("cuda")
batch, frames, channels, height, width = 32, 8, 3, 224, 224
x = torch.randn(batch, frames, channels, height, width).to(device)

# Initialize the model
model = Jormungandr(variant="jormungandr-b", pretrained=True).to(device)
model.eval()

# Inference
with torch.no_grad():
    detections = model(x)
```
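To run a longer video through the model, frames are typically grouped into fixed-length clips matching the `frames` dimension above. A minimal sketch in plain Python (the `make_clips` helper is our illustration, not part of the package):

```python
def make_clips(frame_list, clip_len=8):
    """Split a frame sequence into non-overlapping clips of clip_len frames,
    dropping a trailing remainder shorter than clip_len."""
    return [
        frame_list[i : i + clip_len]
        for i in range(0, len(frame_list) - clip_len + 1, clip_len)
    ]

# 20 frames with clip_len=8 yield two full clips; the last 4 frames are dropped.
clips = make_clips(list(range(20)), clip_len=8)
print([len(c) for c in clips])  # → [8, 8]
```

Each clip can then be stacked into a `(frames, channels, height, width)` tensor and batched along the first dimension.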
## Pretrained Models

We provide pretrained models hosted on Hugging Face.

- The Fafnir models (`fafnir-t`, `fafnir-s`, `fafnir-b`) are pretrained on the COCO dataset.
- The Jormungandr models (`jormungandr-t`, `jormungandr-s`, `jormungandr-b`) are pretrained on the MOT17 dataset.

These models are downloaded automatically when initialized in your code.
## Documentation
## Authors

- Kristoffer Nohr Olaisen
- Sverre Nystad
## License
Distributed under the MIT License. See LICENSE for more information.

