yogthos in technology @lemmy.ml

Latest in Open Source Multimodal AI

https://open.substack.com/pub/thelivingedge/p/last-week-in-multimodal-ai-38-from

PE-AV - Audiovisual Perception with Code
Meta's perception encoder for audio-visual understanding with open code release. Processes both visual and audio information to isolate sound sources.
Paper | Code
https://preview.redd.it/k6lp7cgbou8g1.png?width=1456&format=png&auto=webp&s=f928bbd8d184e9094e7130cb36adff5f51830a80

T5Gemma 2 - Open Encoder-Decoder
Next generation encoder-decoder model with full open-source weights. Combines bidirectional understanding with flexible text generation.
Blog | Model

Qwen-Image-Layered - Open Image Decomposition
Decomposes images into editable RGBA layers with full model release. Each layer can be independently manipulated for precise editing.
Hugging Face | Paper | Demo
https://reddit.com/link/1ptg2x9/video/72skjufkou8g1/player

N3D-VLM - Open 3D Vision-Language Model
Native 3D spatial reasoning with open weights and code. Understands depth and spatial relationships without 2D distortions.
GitHub | Model
https://reddit.com/link/1ptg2x9/video/h1npuq1mou8g1/player

Generative Refocusing - Open Depth Control
Controls depth of field in images with full code release. Simulates camera focus changes through 3D scene inference.
Website | Demo | Paper | GitHub

StereoPilot - Open 2D to 3D Conversion
Converts 2D videos to stereo 3D with open model and code. Full source release for VR content creation.
Website | Model | GitHub | Paper
https://reddit.com/link/1ptg2x9/video/homrv9tmou8g1/player

Chatterbox Turbo - MIT Licensed TTS
State-of-the-art text-to-speech under permissive MIT license. No commercial restrictions or cloud dependencies.
Hugging Face
https://reddit.com/link/1ptg2x9/video/iceqr03jou8g1/player

FunctionGemma - Open Function Calling
Lightweight 270M parameter model for function calling with full weights. Creates specialized function calling models without commercial restrictions.
Model

FoundationMotion - Open Motion Analysis
Labels spatial movement in videos with full code and dataset release. Automatic motion pattern identification without manual annotation.
Paper | GitHub | Demo | Dataset

DeContext - Open Image Protection
Protects images from unwanted AI edits with open-source implementation. Adds imperceptible perturbations that block manipulation while preserving quality.
Website | Paper | GitHub

EgoX - Open Perspective Transformation
Transforms third-person videos to first-person with full code release. Maintains spatial coherence during viewpoint conversion.
Website | Paper | GitHub
https://reddit.com/link/1ptg2x9/video/2h8x59qpou8g1/player

Step-GUI - Open GUI Automation
SOTA GUI automation with self-evolving pipeline and open weights. Full code and model release for interface control.
Paper | GitHub | Model

IC-Effect - Open Video Effects
Applies video effects through in-context learning with code release. Learns effect patterns from examples without fine-tuning.
Website | GitHub | Paper