| Model | Size | Architecture | Paper | Access | Input | Output | Notes |
|---|---|---|---|---|---|---|---|
| Vid2Seq: Visual Language Model for Dense Video Captioning | 500M parameters (200M text, 300M video) | Page 3 | Paper | Open | Text + Video | Text | Pretrained on an unlabeled narrated-video dataset, so it can be fine-tuned on our dataset via transfer learning (whether the curation can be automated still needs to be checked). |
| OpenFlamingo: Vision Language Model | 3B to 9B parameter variants | Flamingo arch., DeepMind | Paper | Open | Image + Text | Text | Open-source reproduction of DeepMind's closed Flamingo; builds on CLIP (vision encoder) and LLaMA (language model). See the loading sketch below the table. |
| LLaMA: Open and Efficient Foundation Language Models | 7B to 65B parameter variants | Open-source research from Meta | Paper | Open | Text | Text | Open LLM from Meta that is competitive with closed or restricted LLMs from other organizations. See the inference sketch below the table. |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 1.3B parameters | Fig 2 | Paper | Open | Text + Video | Various vision tasks (described in their GitHub repo) | Trained on many open-source video-text datasets. |
| InternImage: Exploring Large-Scale Vision Foundation Models | Unknown | Fig 3 | Paper | Open | Image + Text | General image-related tasks | Open-source model from OpenGVLab. |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | Unknown | Fig 2 | Paper | Closed | Text / Image / Video | Video generation / image animation / video variation | Their specially curated dataset is likewise restricted in its use. |
| CogVideo: Transformer Model for Text-to-Video Generation | 9.4B parameters (trained on 5.4M text-video pairs) | Fig 2 | Paper | Open | Text | Video | Demos of the model are available on Hugging Face. |
| CLAP: Contrastive Language-Audio Pretraining | Unknown | Arch | Paper | Open | Audio + Text | Audio-text embeddings | Maps audio and text into a shared embedding space. See the embedding sketch below the table. |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | 512M parameters | Fig 1 | Paper | Closed | Text + Image | Vision-language tasks | Developed by Microsoft. Code repo under Microsoft license. |
| CogView: Transformer Model for Text-to-Image Generation | 4B parameters | Fig 3 | Paper | Open | Text | Image | Open-source model from THUDM. |
| Jukebox: A Generative Model for Music | 5B parameters | Fig 8 | Paper | Open | Audio | Audio | Open-source project from OpenAI. |
| Gorilla: Large Language Model Connected with Massive APIs | 7B parameters | Fig 3 | Paper | Open | Text | API call code | Fine-tuned LLaMA-based model that outperforms GPT-4 at writing API calls from natural-language input. See the usage sketch below the table. |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | 330M parameters | Not in the paper | Paper | Closed | Text + Audio | Audio | Meta presents it as the first generative speech model to generalize across tasks with state-of-the-art performance. |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | 1.3B parameters | Arch | Paper | Open | Text + Image | Text | Multi-modal model based on OpenFlamingo. |
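
A minimal loading sketch for OpenFlamingo, assuming the `open-flamingo` pip package. The encoder and checkpoint names follow the project README at the time of the LLaMA-based release and are assumptions; verify the current names in the repo.

```python
# Minimal OpenFlamingo loading sketch (checkpoint names are assumptions;
# check the open_flamingo repository for the current ones).
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",           # CLIP vision backbone
    clip_vision_encoder_pretrained="openai",       # pretrained CLIP weights
    lang_encoder_path="decapoda-research/llama-7b-hf",  # assumed LLaMA HF mirror
    tokenizer_path="decapoda-research/llama-7b-hf",
    cross_attn_every_n_layers=4,                   # Flamingo-style gated cross-attention
)
```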
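For LLaMA, a common way to run inference is the Hugging Face `transformers` integration; `path/to/llama-7b` below is a placeholder for locally converted weights, since the official checkpoints are distributed under Meta's research license.

```python
# LLaMA text generation via Hugging Face transformers.
# "path/to/llama-7b" is a placeholder for locally converted weights.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b", torch_dtype=torch.float16)

inputs = tokenizer("The key idea of transfer learning is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```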
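Because CLAP embeds audio and text into a shared space, it supports zero-shot audio classification by scoring a clip against text labels. A sketch using the `transformers` port; the `laion/clap-htsat-unfused` checkpoint id and the silent placeholder waveform are assumptions.

```python
# Zero-shot audio classification with CLAP: score one clip against text labels.
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["a dog barking", "a piano melody", "rain falling"]
# Placeholder input: one second of silence; use a real mono 48 kHz waveform.
audio_array = torch.zeros(48_000).numpy()

inputs = processor(text=texts, audios=[audio_array], sampling_rate=48_000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_audio.softmax(dim=-1)  # clip-to-label similarity
print(dict(zip(texts, probs[0].tolist())))
```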
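Since Gorilla is a fine-tuned LLaMA, inference is a plain causal-LM `generate()` call. This is only a sketch: the merged-checkpoint path is a placeholder (the project initially released delta weights that must be merged with base LLaMA), and the real model may expect a specific prompt template; see the Gorilla repo.

```python
# Gorilla usage sketch: turn a natural-language request into an API call.
# "path/to/merged-gorilla-7b" is a placeholder for weights merged per the repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/merged-gorilla-7b")
model = AutoModelForCausalLM.from_pretrained("path/to/merged-gorilla-7b")

prompt = "I would like to translate English text to French."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # expected: API call code
```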