What Isn't Scaling in Local LLMs

Community findings

🧠

Multi-step Reasoning Depth

"Intelligence" compresses fine — knowing facts scales to 7B. But chained logical reasoning with intermediate state doesn't. Small models confidently hallucinate by step 3 on hard debugging chains that a 70B walks through cleanly.

u/sysflux

📄

Effective Context Window

128k token context is a marketing number. The actual reliable retrieval-and-reasoning window in small models has barely moved. The gap between context length spec and actual usable context is growing, not shrinking.

u/sysflux

🌍

Physical World Coherence

Small models don't have robust implicit world models. Classic test: "if you haven't removed your shoes, you can't take off just your socks." Only higher-parameter models reliably handle these implicit physical constraints — likely tied to knowledge density, not reasoning per se.

u/ArsNeph

📦

Factual Knowledge Density

Intelligence and knowledge are decoupling. Small models keep getting more "intelligent" but still hallucinate facts constantly. The ~700B–1T range is where LLMs become genuinely reliable on common knowledge without RAG augmentation.

u/TechnoByte_

🗂️

Structured Output Fidelity

A 7B can write a convincing essay but fails at consistently producing valid nested JSON schemas. The "intelligence" compresses — the precision required to maintain strict structural contracts across a full output doesn't.

u/GroundbreakingMall54

💾

MoE Memory Overhead

The MoE pivot trades VRAM (the limiting resource for local runners) for training compute savings. For maximizing inference quality per GB of memory, dense models remain optimal — the community is rediscovering this with Qwen3-27B dense.

u/ttkciar

Thread deep-dive · MoE & sparsity

The MoE Sparsity Problem

The most technically rich thread in this discussion was between u/ttkciar and u/ROS_SDN, who identified a real tension in how current MoE models are being deployed locally.

"I feel the issue is that most local MoEs are being pushed too sparse. 8% sparsity at 1T is still 80B active, which is likely a strong model itself. But bring that to Qwen3.5-35B and that's 3B active — I just feel 3B is such a small number to trust." — u/ROS_SDN

The community's napkin heuristic for comparing MoE to dense is sqrt(Total × Active). For Qwen3.5-35B-A3B that gives sqrt(35B × 3B) ≈ 10.2B "effective" parameters; for a 27B dense model it gives 27B. That's a meaningful gap, though u/ttkciar correctly notes that the heuristic is incomplete: it ignores how well the gate picks experts.
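The napkin heuristic above is easy to check by hand. A minimal sketch follows, with parameter counts in billions; the function name is made up for illustration, and the heuristic itself is a community rule of thumb rather than a measured law.

```python
import math

def effective_params_b(total_b, active_b):
    """Napkin heuristic: geometric mean of total and active params (in billions)."""
    return math.sqrt(total_b * active_b)

moe = effective_params_b(35, 3)          # 35B total, 3B active   -> about 10.25
dense = effective_params_b(27, 27)       # dense: all params active -> 27.0
big_moe = effective_params_b(1000, 80)   # the 1T / 80B-active case -> about 282.8

print(f"35B-A3B: {moe:.2f}B, 27B dense: {dense:.1f}B, 1T-A80B: {big_moe:.1f}B")
```

The gating-quality term that u/ttkciar points out is deliberately absent here, since the thread gives no formula for it.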

The key insight: MoE gating logic has historically been good at routing to experts that hold memorized factual knowledge but poor at routing to experts encoding generalized heuristics (instruction following, reasoning chains). This gap has narrowed in recent generations — GLM-4.5-Air is cited as evidence that heuristic routing has improved meaningfully.

My analysis · Beyond the thread

Intelligence vs. Knowledge is the real schism

The thread correctly identifies this but doesn't fully separate the two. Intelligence — the ability to follow instructions, reason step-by-step, structure outputs — is compressing very well with distillation. Knowledge — the stored, retrievable factual substrate — is not compressing at the same rate because it scales with parameter count in a fundamentally different way. A distilled 7B inherits reasoning patterns from its teacher but cannot inherit its parameter-embedded knowledge. This is why RAG remains non-negotiable for local deployments doing anything knowledge-intensive.

Context utilization is architecturally under-addressed

The gap between marketed context length and actual reliable attention range is one of the most under-discussed issues in local LLMs. RoPE scaling and ALiBi give models long windows but don't fix the attention pattern degradation that causes "lost in the middle" failures. The problem isn't primarily model size; it's that attention heads trained on short sequences don't generalize to full-document retrieval tasks. This is an architectural problem that small, dense, and sparse models all suffer from in similar proportions.
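One way to measure usable context rather than trust the spec sheet is a needle-in-a-haystack probe: hide one fact at different depths of a long prompt and check whether the model can retrieve it. The sketch below is an illustration only. `ask_model` is a hypothetical callable you would wire to your own runtime, and `fake_model` is a stand-in that sees only the first and last 200 characters, to mimic a lost-in-the-middle failure.

```python
def build_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    pos = round(depth * len(filler_sentences))
    parts = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(parts)

def retrieval_by_depth(ask_model, filler_sentences, needle, answer, depths):
    """Map each depth to True/False: did the reply contain the answer?"""
    results = {}
    for d in depths:
        prompt = build_prompt(filler_sentences, needle, d)
        reply = ask_model(prompt + " What is the passcode?")
        results[d] = answer in reply
    return results

# Hypothetical stand-in model: only the edges of the prompt are "visible".
def fake_model(prompt):
    visible = prompt[:200] + prompt[-200:]
    return "7391" if "7391" in visible else "unknown"

filler = ["The sky was grey that morning."] * 100
scores = retrieval_by_depth(
    fake_model, filler, "The passcode is 7391.", "7391", [0.0, 0.5, 1.0]
)
print(scores)
```

With a real model, sweeping both depth and total prompt length gives a rough map of where retrieval starts to fail.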

Quantization hurts MoE more than dense, proportionally

When you quantize a dense 27B to Q4, you degrade 27B parameters. When you quantize a 35B-A3B MoE to Q4, you degrade 35B parameters but only 3B of them are active per token. The active weights are doing all the real work, and they're being quantized from an already small pool, so the rounding noise has a proportionally larger effect on output quality. u/ROS_SDN's concern about "not trusting quantized small active weights" is spot on and under-discussed in quantization guides. Q8 on a sparse MoE is not the same quality trade-off as Q8 on a dense model.
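The memory side of this trade-off is simple arithmetic. The sketch below assumes nominal bit widths (exactly 4 or 8 bits per weight); real quantization formats store extra scale data, so actual files are somewhat larger, and KV cache and activations are ignored entirely. The function name is illustrative.

```python
def weight_gb(params_b, bits_per_weight):
    """Approximate weight storage in GB: billions of params x bits / 8."""
    return params_b * bits_per_weight / 8

for name, total_b, active_b in [("35B-A3B MoE", 35, 3), ("27B dense", 27, 27)]:
    for bits in (4, 8):
        resident = weight_gb(total_b, bits)   # everything that must sit in memory
        touched = weight_gb(active_b, bits)   # weights actually used per token
        print(f"{name} @ {bits}-bit: {resident:.1f} GB resident, "
              f"{touched:.1f} GB active per token")
```

Under these assumptions the MoE needs more resident memory than the dense model at the same bit width (17.5 GB against 13.5 GB at 4-bit) while only about 1.5 GB of it does the work for any given token.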

Structured output failures are a failure of consistency, not intelligence

The JSON fidelity problem is interesting because it's not that small models don't understand JSON schema — they do. The failure is in maintaining constraints across long token sequences. This is a working-memory analog: the model knows the schema rules but "forgets" them by token 200. This is solvable with constrained decoding (llama.cpp grammar mode, Outlines, structured outputs) — which means it's more of an ecosystem/tooling gap than a fundamental model limitation.
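To make the idea of constrained decoding concrete, here is a toy sketch: at each step, tokens that would break bracket nesting are masked out before the best remaining token is picked. The per-step score dictionaries are a hypothetical stand-in for model output. Real tools such as llama.cpp grammars or Outlines apply the same principle over the model's full vocabulary and a full grammar, not this simplified bracket check.

```python
PAIRS = {"{": "}", "[": "]"}
CLOSERS = set(PAIRS.values())

def allowed(token, stack):
    """Return True if emitting `token` keeps the nesting structurally valid."""
    if token in CLOSERS:
        # A closer is only legal if it matches the most recent opener.
        return bool(stack) and PAIRS[stack[-1]] == token
    return True  # openers and plain content tokens are always legal here

def constrained_greedy(step_scores):
    """Pick the best-scoring allowed token at every step."""
    stack, out = [], []
    for scores in step_scores:
        legal = {t: s for t, s in scores.items() if allowed(t, stack)}
        token = max(legal, key=legal.get)
        if token in PAIRS:
            stack.append(token)
        elif token in CLOSERS:
            stack.pop()
        out.append(token)
    return "".join(out), stack

# At step 4 the highest-scoring token is the wrong closer.
steps = [
    {"{": 0.9, "[": 0.1},
    {"[": 0.8, "x": 0.2},
    {"x": 0.7, "]": 0.3},
    {"}": 0.6, "]": 0.4},
    {"}": 0.9, "x": 0.1},
]
text, still_open = constrained_greedy(steps)
print(text, still_open)
```

An unconstrained greedy pick would emit `}` at step 4 and produce invalid nesting; the mask forces `]` instead, so the output stays well-formed regardless of what the model "forgot".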

The Densing Law and MoE are on a collision course

If capability per parameter doubles every ~3.5 months, the question becomes: which model type captures that gain most efficiently for local runners? MoE wins on training economics and absolute speed. Dense wins on VRAM efficiency and quality-per-GB at quantized inference. As RAM constraints tighten (RAMageddon, as u/ttkciar calls it), the community will push back toward dense models for quality-sensitive work — we're already seeing this with the Qwen3-27B dense resurgence. The ideal future isn't one model type winning — it's tunable sparsity at inference time, which nobody has shipped reliably yet.
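Taking the doubling premise above at face value, the implied shrinkage in parameters needed for a fixed capability level can be worked out directly. This is a sketch of the arithmetic only: the 3.5-month figure comes from the text, the function name is illustrative, and extrapolating the trend is an assumption rather than a prediction.

```python
def params_for_same_capability(params_b, months, doubling_months=3.5):
    """Params (billions) needed to match today's model after `months`,
    if capability per parameter doubles every `doubling_months`."""
    return params_b / 2 ** (months / doubling_months)

for months in (3.5, 7, 12):
    needed = params_for_same_capability(70, months)
    print(f"after {months} months: ~{needed:.1f}B to match a 70B today")
```

On that premise, a 70B-class capability level would fit in roughly 6.5B parameters a year later, which is why the choice between MoE and dense for capturing the gain matters so much for local runners.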

Model selection · When to pick what

MoE

Mixture of Experts

Interactive chat requiring fast token throughput
Latency-sensitive agentic loops with many short calls
RAM-constrained setups running large effective parameter counts
Tasks where speed matters more than peak accuracy
Broad general assistant use: drafting, summarization, light coding

⚠ Avoid for: precision-heavy structured output, hard multi-step reasoning, or heavy quantization scenarios

Dense · Small

7B–14B Dense

Edge / embedded deployment where VRAM is under 8GB
Simple instruction following and template filling
High-throughput batch classification pipelines
Quick semantic search re-ranking
Always pair with RAG — do not rely on internal knowledge

⚠ Avoid for: open-ended reasoning, long context retrieval, creative writing requiring world coherence

Dense · Large

27B–70B Dense

Complex multi-step reasoning and debugging chains
Strict JSON / structured output without constrained decoding
Long document analysis where context fidelity matters
Creative writing requiring physical/logical world coherence
Quality-first, latency-tolerant batch inference

⚠ Trade-off: higher VRAM cost, slower tokens/sec — but highest quality-per-GB at Q4/Q8
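The three lists above can be condensed into a rough decision rule. The sketch below is one possible encoding, not a definitive policy: the function name, the flag, and the order in which the rules are checked are assumptions made for illustration, and only the 8GB threshold comes from the lists.

```python
def pick_model_class(vram_gb, precision_critical=False):
    """Rule-of-thumb model class from the selection lists above."""
    if vram_gb < 8:
        # Edge / embedded: small dense model, always paired with RAG.
        return "dense-small (7B-14B) + RAG"
    if precision_critical:
        # Strict structured output, hard multi-step reasoning, long documents.
        return "dense-large (27B-70B)"
    # Chat, agentic loops, general assistant work where speed matters most.
    return "moe"

print(pick_model_class(6))
print(pick_model_class(24, precision_critical=True))
print(pick_model_class(24))
```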

How Sparsity Shapes Output & Knowledge

Sparsity in MoE means that for any given token, only a small fraction of the model's total parameters are activated — selected by a gating network that routes each token to its most relevant experts.
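The routing step described above can be sketched as a top-k gate: score every expert, keep the k best, and renormalize their weights. This is a simplified illustration with made-up router logits. Production MoE layers differ in detail (some apply softmax after selection, add shared experts, or use load-balancing terms), so treat it as the general shape rather than any specific model's router.

```python
import math

def top_k_gate(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts and renormalize their weights.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# Eight experts, two active per token: the other six are skipped entirely.
gates = top_k_gate([0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.9], k=2)
print(gates)
```

Only experts 1 and 3 run for this token; whatever the other six encode is unavailable for this forward pass, which is the coverage risk the following sections discuss.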

Effective "intelligence" ≈ sqrt(Total_params × Active_params)
// Qwen3.5-35B-A3B: sqrt(35B × 3B) ≈ 10.2B effective
// Qwen3-27B dense: sqrt(27B × 27B) = 27B effective
// Missing term: gating quality (0.0 → 1.0)

What sparsity costs

When only 3B of 35B parameters activate per token, the model is essentially running a dynamic 3B dense model per forward pass — but one chosen by the gating logic. If the gate misroutes even one or two experts, you lose coverage of a specific skill domain entirely for that token. This is why sparse models can seem surprisingly competent most of the time but fail catastrophically on edge cases: the relevant expert simply wasn't activated.

Knowledge is distributed, not localized

Factual knowledge in LLMs isn't cleanly separated into "one expert per domain." It's distributed across many experts. Sparse models with low active counts risk partial knowledge retrieval: they activate experts holding information adjacent to the correct answer, but not the specific experts encoding the exact fact. This is why small sparse models hallucinate more than their effective parameter count would predict: the gating is routing to near-neighbors, not the right experts.

Where sparsity works well

High sparsity works best when the task domain is narrow and well-represented in training — code completion, instruction following, summarization. The gating network has learned clean routing for common task types. It breaks down on long-tail knowledge, implicit physical reasoning, and cross-domain synthesis — exactly the tasks that already struggle at small scale.

The quantization compounding problem

Quantizing a sparse MoE to Q4 means the already-small active weight pool is represented with 4-bit precision, so every token's computation runs entirely through 3B rounded weights. The same Q4 applied to a 27B dense model spreads its rounding error over a pool 9× larger, where errors average out more gracefully. This is why a Q4 MoE can feel noticeably worse than a Q4 dense model of comparable total size: the math is unforgiving at small active sizes.
