A breakdown of the r/LocalLLaMA thread on which aspects of local inference refuse to compress, followed by a deeper analysis of MoE vs. dense model tradeoffs and how sparsity shapes output quality.
"Intelligence" compresses fine: knowing facts scales to 7B. But chained logical reasoning with intermediate state doesn't. Small models confidently hallucinate by step 3 on hard debugging chains that a 70B walks through cleanly.
128k token context is a marketing number. The actual reliable retrieval-and-reasoning window in small models has barely moved. The gap between context length spec and actual usable context is growing, not shrinking.
Small models don't have robust implicit world models. Classic test: "if you haven't removed your shoes, you can't take off just your socks." Only higher-parameter models reliably handle these implicit physical constraints, which is likely tied to knowledge density, not reasoning per se.
Intelligence and knowledge are decoupling. Small models are getting more "intelligent" but still hallucinate facts constantly. The ~700B–1T range is where LLMs become genuinely reliable on common knowledge without RAG augmentation.
A 7B can write a convincing essay but fails at consistently producing valid nested JSON schemas. The "intelligence" compresses; the precision required to maintain strict structural contracts across a full output doesn't.
The MoE pivot trades VRAM (the limiting resource for local runners) for training compute savings. For maximizing inference quality per GB of memory, dense models remain optimal; the community is rediscovering this with Qwen3-27B dense.
The most technically rich thread in this discussion was between u/ttkciar and u/ROS_SDN, who identified a real tension in how current MoE models are being deployed locally.
The community's napkin heuristic for comparing MoE to dense is sqrt(Total × Active). For Qwen3.5-35B-A3B:
sqrt(35B × 3B) ≈ 10B effective intelligence. For a 27B dense: 27B.
That's a meaningful gap, and u/ttkciar correctly notes that this heuristic is incomplete: it misses the expert-picking quality term entirely.
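The napkin arithmetic above can be reproduced in a few lines. This is only a sketch of the thread's heuristic, not a validated metric; the function name `effective_params_b` is made up for illustration, and the parameter counts are the ones quoted in the thread.

```python
import math

def effective_params_b(total_b: float, active_b: float) -> float:
    """Geometric-mean heuristic: sqrt(total * active), in billions of parameters."""
    return math.sqrt(total_b * active_b)

# Figures quoted in the thread (billions of parameters).
moe = effective_params_b(35, 3)      # 35B total, 3B active per token
dense = effective_params_b(27, 27)   # a dense model has total == active

print(f"MoE 35B-A3B: ~{moe:.1f}B effective, dense 27B: {dense:.1f}B effective")
```

Note that the heuristic has no term for routing quality, which is exactly the omission u/ttkciar points out.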
The key insight: MoE gating logic has historically been good at routing to experts that hold memorized factual knowledge but poor at routing to experts encoding generalized heuristics (instruction following, reasoning chains). This gap has narrowed in recent generations; GLM-4.5-Air is cited as evidence that heuristic routing has improved meaningfully.
The thread correctly identifies the decoupling of intelligence and knowledge but doesn't fully separate the two. Intelligence (the ability to follow instructions, reason step-by-step, structure outputs) is compressing very well with distillation. Knowledge (the stored, retrievable factual substrate) is not compressing at the same rate because it scales with parameter count in a fundamentally different way. A distilled 7B inherits reasoning patterns from its teacher but cannot inherit its parameter-embedded knowledge. This is why RAG remains non-negotiable for local deployments doing anything knowledge-intensive.
The gap between marketed context length and actual reliable attention range is one of the most understated issues in local LLMs. RoPE scaling and ALiBi give models long windows but don't fix the attention pattern degradation that causes "lost in the middle" failures. The problem isn't primarily model size; it's that attention heads trained on short sequences don't generalize to full-document retrieval tasks. This is an architectural problem that smaller, denser, or sparser models all suffer from in similar proportions.
When you quantize a dense 27B to Q4, you degrade 27B parameters. When you quantize a 35B-A3B MoE to Q4, you degrade 35B parameters but only 3B of them are active per token. The active weights are doing all the real work, and they're being quantized from an already small pool, so the rounding noise has a proportionally larger effect on output quality. u/ROS_SDN's concern about "not trusting quantized small active weights" is spot on and under-discussed in quantization guides. Q8 on a sparse MoE is not the same quality trade-off as Q8 on a dense model.
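To put rough numbers on the two cases above, here is a weights-only memory estimate. It is a sketch under simplifying assumptions: nominal 4 and 8 bits per weight (real quantization formats also store scales, so their effective bits per weight are somewhat higher), no KV cache, no activations. The helper name `weight_gib` is made up for illustration.

```python
def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GiB; ignores KV cache, activations and quant metadata."""
    total_bytes = params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# (name, total params in billions, active params per token in billions)
models = [("dense 27B", 27, 27), ("MoE 35B-A3B", 35, 3)]

for name, total_b, active_b in models:
    for bits in (4, 8):
        resident = weight_gib(total_b, bits)   # must fit in memory
        touched = weight_gib(active_b, bits)   # weights used for one token
        print(f"{name} @ {bits}-bit: {resident:.1f} GiB resident, "
              f"{touched:.1f} GiB touched per token")
```

The MoE needs more memory resident than the dense model while doing each token's work with a much smaller slice of it, which is the asymmetry the paragraph above describes.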
The JSON fidelity problem is interesting because it's not that small models don't understand JSON schema; they do. The failure is in maintaining constraints across long token sequences. This is a working-memory analog: the model knows the schema rules but "forgets" them by token 200. This is solvable with constrained decoding (llama.cpp grammar mode, Outlines, structured outputs), which means it's more of an ecosystem/tooling gap than a fundamental model limitation.
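The idea behind constrained decoding can be shown with a toy example. This is a minimal sketch, not how llama.cpp or Outlines implement it: the vocabulary, the finite-state grammar, and the `fake_scores` stand-in for model logits are all hypothetical. The point is only that masking the candidate set at each step makes invalid output impossible, regardless of what the model prefers.

```python
import json

# Hypothetical toy vocabulary; 'maybe' is never valid in our grammar.
VOCAB = ['{', '}', '"ok"', ':', 'true', 'false', 'maybe']

# Finite-state grammar for objects like {"ok":true}: state -> {token: next state}
GRAMMAR = {
    "start": {'{': "key"},
    "key":   {'"ok"': "colon"},
    "colon": {':': "value"},
    "value": {'true': "close", 'false': "close"},
    "close": {'}': "done"},
}

def fake_scores(prefix):
    """Stand-in for model logits; deliberately prefers the invalid token 'maybe'."""
    return {tok: (2.0 if tok == 'maybe' else 1.0 / (1 + i))
            for i, tok in enumerate(VOCAB)}

def constrained_decode(score_fn):
    state, out = "start", []
    while state != "done":
        allowed = GRAMMAR[state]          # tokens the grammar permits here
        scores = score_fn(out)
        # Greedy pick, but only among allowed tokens (everything else is masked).
        tok = max(allowed, key=lambda t: scores[t])
        out.append(tok)
        state = allowed[tok]
    return "".join(out)

text = constrained_decode(fake_scores)
print(text, json.loads(text))
```

Even though the fake model always scores `maybe` highest, the output parses as JSON, because the grammar never offers that token as a candidate.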
If capability per parameter doubles every ~3.5 months, the question becomes: which model type captures that gain most efficiently for local runners? MoE wins on training economics and absolute speed. Dense wins on VRAM efficiency and quality-per-GB at quantized inference. As RAM constraints tighten (RAMageddon, as u/ttkciar calls it), the community will push back toward dense models for quality-sensitive work; we're already seeing this with the Qwen3-27B dense resurgence. The ideal future isn't one model type winning; it's tunable sparsity at inference time, which nobody has shipped reliably yet.
Avoid for: precision-heavy structured output, hard multi-step reasoning, or heavy quantization scenarios
Avoid for: open-ended reasoning, long context retrieval, creative writing requiring world coherence
Trade-off: higher VRAM cost, slower tokens/sec, but highest quality-per-GB at Q4/Q8
Sparsity in MoE means that for any given token, only a small fraction of the model's total parameters are activated, selected by a gating network that routes each token to its most relevant experts.
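A toy version of that routing step looks like this. It is a sketch under stated assumptions: the experts are plain scalar functions, the router logits are made up, and the weights are a softmax over the selected experts only, which is one common convention (real models differ in how they normalize). The names `top_k_gate` and `moe_forward` are illustrative.

```python
import math

def top_k_gate(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)            # for numerical stability
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts; unselected experts never run."""
    gate = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in gate.items())

# Hypothetical toy experts and router scores for a single token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
logits = [0.1, 2.0, -1.0, 2.0]

gate = top_k_gate(logits, k=2)
print(gate, moe_forward(3.0, experts, logits, k=2))
```

With k=2 of 4 experts, half the "model" contributes nothing to this token; if the router scores had favoured the wrong pair, the useful expert would simply not have run.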
When only 3B of 35B parameters activate per token, the model is essentially running a dynamic 3B dense model per forward pass, but one chosen by the gating logic. If the gate misroutes even one or two experts, you lose coverage of a specific skill domain entirely for that token. This is why sparse models can seem surprisingly competent most of the time but fail catastrophically on edge cases: the relevant expert simply wasn't activated.
Factual knowledge in LLMs isn't cleanly separated into "one expert per domain." It's distributed across many experts. Sparse models with low active counts risk partial knowledge retrieval: they activate experts that hold knowledge adjacent to the correct answer but not the specific experts encoding the exact fact. This is why small sparse models hallucinate more than their effective parameter count would predict: the gating is routing to near-neighbors, not the right experts.
High sparsity works best when the task domain is narrow and well-represented in training: code completion, instruction following, summarization. The gating network has learned clean routing for common task types. It breaks down on long-tail knowledge, implicit physical reasoning, and cross-domain synthesis, which are exactly the tasks that already struggle at small scale.
Quantizing a sparse MoE to Q4 means the already-small active weight pool is being represented with 4-bit precision. Rounding errors in 3B active weights affect 100% of the active computation. The same Q4 applied to a 27B dense model affects a pool 9× larger, so errors average out more gracefully. This is why Q4 MoE feels noticeably worse than Q4 dense at comparable active parameter counts: the math is unforgiving at small active sizes.
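To make "rounding error" concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single scale per block. This is a simplification and an assumption on our part: the quantization schemes used in practice are more elaborate, and the weights below are made up. It shows how much coarser 4-bit is than 8-bit on the same values; it does not by itself demonstrate the averaging argument above.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric round-to-nearest with one scale for the whole block."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights), 0.0
    # Map each weight to an integer level, clamp, then map back to floats.
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in levels], scale

# Made-up weights standing in for one block of a weight matrix.
block = [0.31, -0.12, 0.05, -0.44, 0.27, 0.02, -0.09, 0.18]

for bits in (4, 8):
    restored, scale = quantize_dequantize(block, bits)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{bits}-bit: scale={scale:.4f}, worst-case error={worst:.4f}")
```

The worst-case error per weight is bounded by half the scale, so halving the bit width from 8 to 4 makes each individual weight far noisier.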