What a deluge lately!
DeepSeek 67B is an amazing coder. Their ~33B coder model was already awesome. Now this!
Qwen 72B. Qwen's earlier models were awesome, so I expect a lot from this one.
Qwen 1.8B for those on a diet.
Qwen Audio for arbitrary audio -> “reasoned” text, so not just transcription.
XVERSE 65B. I haven't played with this series; how is it?
AquilaChat2-70B. I'm not familiar with this one either.
Those are all heavy hitter foundation LLMs (and all from China).
One more noteworthy LLM is RWKV. Its author keeps releasing larger versions as they finish training. It's an RNN (no transformers) that competes with transformers per parameter count, but has constant O(1) per-token memory and time complexity, so long context windows stay cheap. It's also far lighter to train.
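To make that concrete, here's a toy back-of-the-envelope sketch (pure Python, made-up sizes, not RWKV's actual math) of why a fixed-size recurrent state beats a growing KV cache at long contexts:

```python
# Toy illustration (not RWKV's actual math): why a fixed-size RNN state
# beats a growing transformer KV cache at long contexts.
# All sizes below are hypothetical.

d_model = 1024    # hypothetical hidden size
n_layers = 24     # hypothetical layer count

def transformer_cache_floats(context_len: int) -> int:
    # A transformer keeps keys AND values for every past token:
    # 2 * layers * tokens * hidden size -> grows linearly with context.
    return 2 * n_layers * context_len * d_model

def rnn_state_floats() -> int:
    # An RNN like RWKV carries one fixed-size state per layer,
    # no matter how many tokens it has already consumed.
    return n_layers * d_model

for ctx in (1_000, 10_000, 100_000):
    print(f"{ctx:>7} tokens: cache={transformer_cache_floats(ctx):>13,} "
          f"state={rnn_state_floats():,}")
```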
Then for Images, Stability has been on a roll:
Stable Video Diffusion. First open-source video model I've seen.
SDXL Turbo: Stable Diffusion XL distilled down to single-step sampling, fast enough to spit out an image per keystroke as you type (quick sketch after this list).
Stability also has an ultrafast upscaler that should come out any day now (Enhance!).
Russia’s Kandinsky is staying fresh with updates.
Fuyu is a noteworthy img->text model because of its simple architecture: image patches are fed straight into the decoder as tokens (no separate vision encoder such as a CNN or ViT), which allows for arbitrarily-sized images.
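For the SDXL Turbo item above, here's a minimal single-step generation sketch with Hugging Face diffusers; the model id and arguments follow the model card as I understand it, but double-check before relying on them:

```python
# Minimal single-step generation with SDXL Turbo via diffusers.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to("cuda")

# Turbo is distilled for one-step sampling; guidance must be disabled.
image = pipe(
    prompt="a watercolor fox in the snow",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```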
For audio:
Whisper v3 recently landed for awesome transcription (quick sketch after this list).
Facebook’s MusicGen for music.
Some Text-To-Speech that I’m forgetting now.
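For the Whisper v3 item above, a minimal transcription sketch using OpenAI's whisper package (the audio path is a placeholder):

```python
# Quick transcription with OpenAI's whisper package
# (pip install -U openai-whisper); "large-v3" is the v3 checkpoint.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("meeting.mp3")  # placeholder path
print(result["text"])
```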
For making use of all this:
UniteAI is an OSS project I've been involved with that plugs local models (LLMs, STT, RAG, etc.) into your text editor of choice. Updates forthcoming.
llama.cpp is a leader in running arbitrary LLMs, especially heavily quantized ones that can run on CPU+RAM instead of GPU+VRAM (minimal example after this list).
ComfyUI ties all the image and video generation together in a node-based web UI.
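For the llama.cpp item above, a minimal sketch using the llama-cpp-python bindings; the model path is a placeholder for whatever GGUF quant you've downloaded:

```python
# Running a quantized GGUF model on CPU via llama-cpp-python
# (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="./models/model.Q4_K_M.gguf", n_ctx=2048)  # placeholder path
out = llm("Q: What is RWKV?\nA:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```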
Lastly, I'm struggling to find an example, but I've seen some GitHub projects tie together the latent spaces of multiple models with different modalities to create multimodal models. They do this by lightly training a projection layer between the latent spaces. So we could be close to amazing multimodal models.
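A hand-wavy PyTorch sketch of that projection-layer trick (all dimensions hypothetical; roughly the recipe LLaVA-style models use):

```python
# Freeze a vision encoder and an LLM, and train only a small linear map
# that carries image embeddings into the LLM's token-embedding space.
# All dimensions are hypothetical.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096               # hypothetical sizes

projector = nn.Linear(vision_dim, llm_dim)     # the only trainable piece

image_feats = torch.randn(1, 256, vision_dim)  # stand-in for frozen encoder output
soft_tokens = projector(image_feats)           # now shaped like LLM embeddings

# These "soft tokens" get prepended to the text embeddings; only
# projector.parameters() goes into the optimizer.
print(soft_tokens.shape)  # torch.Size([1, 256, 4096])
```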
I know I’m missing tons, but these are the highlights on my radar. How about you?
We also have LLaVA and BakLLaVA, two multimodal models: the former built on Llama, the latter on Mistral.
New to me, but they've been around for a bit: vLLM and AutoAWQ.
I'm still looking for a good text-to-speech or speech-to-speech model that works with your own voice recordings. Any ideas?
See the SeamlessM4T model by Facebook.
https://github.com/facebookresearch/seamless_communication?s=03