| Name | Size (parameters) | Arch | Link | Licence | Input | Output | Remark |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vid2Seq: Visual Language Model for Dense Video Captioning | 500M (200M text, 300M video) | Page 3 | Paper | Open | Text + Video | Text | Trained on an unlabeled narrated dataset, so it can be fine-tuned on our dataset via transfer learning (the feasibility of automated curation still needs to be checked). |
| OpenFlamingo: Vision Language Model | 3B to 9B variants | Flamingo arch. (DeepMind) | Paper | Open | Image + Text | Text | Open-source reproduction of DeepMind's closed Flamingo model; depends on CLIP and LLaMA. |
| LLaMA: Open and Efficient Foundation Language Models | 7B to 65B variants | Open Source Research from Meta | Paper | Open | Text | Text | Open-source LLM from Meta that competes with closed or restricted LLMs from other organizations. |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 1.3B | Fig 2 | Paper | Open | Text + Video | Various vision-related tasks (described in the Git repo) | Trained on many open-source video-text datasets. |
| InternImage: Exploring Large-Scale Vision Foundation Models | Unknown | Fig 3 | Paper | Open | Image + Text | General image-related tasks | Open-source model from OpenGVLab. |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | Unknown | Fig 2 | Paper | Closed | Text / Image / Video | Video generation / image animation / video variation | Its specially curated dataset is also restricted in use. |
| CogVideo: Transformer Model for Text-to-Video Generation | 9.4B, trained on 5.4M text-video pairs | Fig 2 | Paper | Open | Text + Video | Video | Demos of the model are available on Hugging Face. |
| CLAP | Unknown | Arch | Paper | Open | Audio + Text | Audio | – |
| VLMo: Unified Vision-Language Pre-Training | 512M | Fig 1 | Paper | Closed | Text + Image | Vision-language tasks | Developed by Microsoft; code repo under the Microsoft licence. |
| CogView: Transformer Model for Text-to-Image Generation | 4B | Fig 3 | Paper | Open | Text + Image | Image | Open-source model from THUDM. |
| Jukebox: A Generative Model for Music | 5B | Fig 8 | Paper | Open | Audio | Audio | Open-source project from OpenAI. |
| Gorilla: Large Language Model Connected with Massive APIs | 7B | Fig 3 | Paper | Open | Text | API-call code | Fine-tuned LLaMA-based model that outperforms GPT-4 at writing API calls from natural-language input. |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | 330M | Not in the paper | Paper | Closed | Text + Audio | Audio | Presented by Meta as the first generative speech model to generalize across tasks with state-of-the-art results. |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | 1.3B | Arch | Paper | Open | Text + Image | Text | Multi-modal model built on OpenFlamingo. |