German researchers achieved 71.6% on ARC-AGI using a regular GPU for 2 cents per task. OpenAI's o3 gets 87% but costs $17 per task making it 850x more expensive.
https://arxiv.org/abs/2505.07859
That score is seriously impressive because it actually beats the average human performance of 60.2% and completely changes the narrative that you need massive proprietary models to do abstract reasoning. They used a fine-tuned version of Mistral-NeMo-Minitron-8B and brought the inference cost down to an absurdly cheap level compared to OpenAI's o3 model.
The methodology is really clever because they started by nuking the standard tokenizer and stripping it down to just 64 tokens to stop the model from accidentally merging digits and confusing itself. They also leaned heavily on test-time training where the model fine-tunes itself on the few example pairs of a specific puzzle for a few seconds before trying to solve the test input. For the actual generation they ditched standard sampling for a depth-first search that prunes low-probability paths early so they do not waste compute on obvious dead ends.
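To make the search part concrete, here's a toy sketch in Python of depth-first decoding with early pruning. This is not the paper's implementation: next_token_logprobs is a made-up stand-in for a model call that returns log probabilities for the next token, and the 1% cutoff is just an arbitrary example value.
import math

# Toy sketch of DFS decoding that prunes low-probability branches early.
# NOT the paper's code: next_token_logprobs(prefix) is a hypothetical
# stand-in for a model call returning {token: log probability}.
def dfs_decode(prefix, next_token_logprobs, eos, min_logp=math.log(0.01), logp=0.0):
    """Return every completion whose cumulative probability stays above the cutoff."""
    results = []
    for tok, lp in next_token_logprobs(prefix).items():
        total = logp + lp
        if total < min_logp:          # obvious dead end, stop exploring this branch
            continue
        if tok == eos:
            results.append((prefix + [tok], total))
        else:
            results += dfs_decode(prefix + [tok], next_token_logprobs, eos, min_logp, total)
    return results

# Tiny fake "model" so the sketch actually runs: always 80/20 between "a" and EOS.
fake = lambda prefix: {"a": math.log(0.8), "<eos>": math.log(0.2)}
print(len(dfs_decode([], fake, "<eos>")))  # number of completions that survive the cutoff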
The most innovative part of the paper is their Product of Experts selection strategy. Once the model generates a candidate solution they do not just trust it blindly. They take that solution and re-evaluate its probability across different augmentations of the input like rotating the grid or swapping colors. If the solution is actually correct it should look plausible from every perspective so they calculate the geometric mean of those probabilities to filter out hallucinations. It is basically like the model peer reviewing its own work by looking at the problem from different angles to make sure the logic holds up.
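And here's a rough sketch of that selection step, just to make the idea concrete. Again, not the paper's code: score_fn is a hypothetical stand-in for the model's log-probability of a candidate, and the augmentation functions (rotations, color permutations, and so on) are placeholders for whatever the task setup provides.
# Sketch of the Product of Experts style selection: score one candidate under
# several augmented views of the task and combine with a geometric mean.
# score_fn and the augmentation functions are hypothetical placeholders.
def poe_score(puzzle, candidate, score_fn, augmentations):
    logps = [score_fn(aug(puzzle), aug(candidate)) for aug in augmentations]
    # geometric mean of probabilities == arithmetic mean of log probabilities
    return sum(logps) / len(logps)

def pick_best(puzzle, candidates, score_fn, augmentations):
    # keep the candidate that still looks plausible from every angle
    return max(candidates, key=lambda c: poe_score(puzzle, c, score_fn, augmentations))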
What's remarkable is that all of this was done with smart engineering rather than raw compute. You can literally run this tonight on your own machine.
The code is fully open-source: https://github.com/da-fr/Product-of-Experts-ARC-Paper
Nice! I keep seeing unsloth being used, guess I gotta learn what the heck that is.
☆ Yσɠƚԋσʂ ☆ - 2w
So, unsloth is basically an optimization hack for fine-tuning LLMs that got popular because it solves the headaches of running out of VRAM and waiting forever for training to finish. Using this library makes it possible to fine-tune models on consumer GPUs. It's essentially a drop-in replacement for the standard Hugging Face transformers + peft stack. The API is designed to look almost exactly like Hugging Face, so you just change your import from AutoModelForCausalLM to FastLanguageModel and you're pretty much good to go.
# Instead of this:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
# You do this:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit",  # Pre-quantized for speed
    max_seq_length = 2048,
    load_in_4bit = True,
)
But under the hood, it’s doing something much smarter than standard PyTorch, and the secret sauce is actually pretty interesting from a programming perspective. Standard PyTorch relies on an autograd engine to handle backpropagation, which is great for flexibility but heavy on memory because it has to cache intermediate activations. The guys who built unsloth looked at the transformer architecture and manually derived the backward pass steps mathematically. Since they aren't relying on the generic autograd engine, they stripped out a ton of overhead. The result is that you get fine tuning that is about 2 to 5x faster and uses roughly half the memory, without losing any accuracy.
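To make that concrete, here's a tiny conceptual example of a hand-derived backward pass using torch.autograd.Function. This is not unsloth's actual code, just the general pattern: save only what you need, recompute the rest, and write the gradient out by hand instead of letting autograd cache every intermediate.
import torch

# Hand-derived backward for SiLU, y = x * sigmoid(x). Generic autograd would
# keep extra intermediates around; here we save only the input and recompute
# sigmoid(x) in the backward pass, trading a bit of compute for memory.
class ManualSiLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # keep only the input tensor
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)          # recompute instead of caching
        # d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x)), derived by hand
        return grad_out * (s + x * s * (1 - s))

x = torch.randn(4, requires_grad=True)
ManualSiLU.apply(x).sum().backward()
print(x.grad)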
JoeByeThen [he/him, they/them] - 2w
Huh, good to know, thanks! One day I'll move beyond my 1080 and get back into the nitty gritty. As it stands now I'm trying to find the time to properly use my ollama that's wired into n8n to automate a bunch of my home productivity stuff. Feeling really old and slow with how quick this stuff is happening nowadays.
☆ Yσɠƚԋσʂ ☆ - 2w
It's pretty hard to keep up with. I find I tend to wait till things make it to mainstream stuff like ollama as well. The effort of setting up something custom is usually not worth it cause it'll probably be all obsolete in a few months anyways. There's basically a lot of low hanging fruit in terms of optimizations that people are discovering, and we'll probably see things moving really fast for the next few years, but once all the easy improvements are plucked, things will start stabilizing.