r/LocalLLaMA · Community Analysis

What Isn't Scaling
in Local LLMs

A breakdown of the r/LocalLLaMA thread on which aspects of local inference refuse to compress, followed by a deeper analysis of MoE vs. dense model tradeoffs and how sparsity shapes output quality.

Community findings
🧠
Multi-step Reasoning Depth

"Intelligence" compresses fine: knowing facts scales down to 7B. But chained logical reasoning with intermediate state doesn't. Small models confidently hallucinate by step 3 on hard debugging chains that a 70B walks through cleanly.

u/sysflux
📄
Effective Context Window

128k token context is a marketing number. The actual reliable retrieval-and-reasoning window in small models has barely moved. The gap between context length spec and actual usable context is growing, not shrinking.

u/sysflux
🌍
Physical World Coherence

Small models don't have robust implicit world models. Classic test: "if you haven't removed your shoes, you can't take off just your socks." Only higher-parameter models reliably handle these implicit physical constraints, which is likely tied to knowledge density rather than reasoning per se.

u/ArsNeph
📦
Factual Knowledge Density

Intelligence and knowledge are decoupling. Small models are more "intelligent" but hallucinate facts constantly. The ~700B–1T range is where LLMs become genuinely reliable on common knowledge without RAG augmentation.

u/TechnoByte_
🗂️
Structured Output Fidelity

A 7B can write a convincing essay but fails at consistently producing valid nested JSON schemas. The "intelligence" compresses; the precision required to maintain strict structural contracts across a full output doesn't.

u/GroundbreakingMall54
💾
MoE Memory Overhead

The MoE pivot trades VRAM (the limiting resource for local runners) for training compute savings. For maximizing inference quality per GB of memory, dense models remain optimal; the community is rediscovering this with Qwen3-27B dense.

u/ttkciar
Thread deep-dive · MoE & sparsity

The MoE Sparsity Problem

The most technically rich thread in this discussion was between u/ttkciar and u/ROS_SDN, who identified a real tension in how current MoE models are being deployed locally.

"I feel the issue is that most local MoEs are being pushed too sparse. 8% sparsity at 1T is still 80B active, which is likely a strong model itself. But bring that to Qwen3.5-35B and that's 3B active. I just feel 3B is such a small number to trust." - u/ROS_SDN

The community's napkin heuristic for comparing MoE to dense is sqrt(Total × Active). For Qwen3.5-35B-A3B: sqrt(35B × 3B) ≈ 10B effective. For a 27B dense: 27B. That's a meaningful gap, though u/ttkciar correctly notes that the heuristic is incomplete: it omits the quality of expert selection entirely.

The key insight: MoE gating logic has historically been good at routing to experts that hold memorized factual knowledge but poor at routing to experts encoding generalized heuristics (instruction following, reasoning chains). This gap has narrowed in recent generations; GLM-4.5-Air is cited as evidence that heuristic routing has improved meaningfully.

My analysis · Beyond the thread
01

Intelligence vs. Knowledge is the real schism

The thread correctly identifies this but doesn't fully separate the two. Intelligence (the ability to follow instructions, reason step-by-step, and structure outputs) is compressing very well with distillation. Knowledge (the stored, retrievable factual substrate) is not compressing at the same rate because it scales with parameter count in a fundamentally different way. A distilled 7B inherits reasoning patterns from its teacher but cannot inherit the full body of its parameter-embedded knowledge. This is why RAG remains non-negotiable for local deployments doing anything knowledge-intensive.

02

Context utilization is architecturally under-addressed

The gap between marketed context length and actual reliable attention range is one of the most under-discussed issues in local LLMs. RoPE scaling and ALiBi give models long windows but don't fix the attention pattern degradation that causes "lost in the middle" failures. The problem isn't primarily model size; it's that attention heads trained on short sequences don't generalize to full-document retrieval tasks. This is an architectural problem that smaller, denser, and sparser models all suffer from in similar proportions.
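One common way to measure that gap is a needle-in-a-haystack probe: plant a known fact at varying depths in filler text and check whether the model can retrieve it. The sketch below only builds the prompts. The helper name `build_probe`, the filler sentence, the needle, and the chosen depths are all illustrative assumptions, and the call to an actual model is deliberately left out.

```python
def build_probe(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Return a prompt with `needle` placed `depth` (0.0 to 1.0) of the way through."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    # Index of the filler sentence the needle is inserted in front of.
    position = round(depth * n_filler)
    sentences.insert(position, needle)
    return " ".join(sentences)


# Hypothetical needle and filler, chosen only for illustration.
NEEDLE = "The access code for the vault is 7341."
FILLER = "The weather report mentioned light rain over the hills."

# One prompt per depth. A real run would send each prompt to the model with
# a question about the access code and score exact-match answers per depth.
probes = {d: build_probe(NEEDLE, FILLER, 200, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Plotting accuracy against depth and total length is what typically exposes a "lost in the middle" dip, if the model has one.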

03

Quantization hurts MoE more than dense, proportionally

When you quantize a dense 27B to Q4, you degrade 27B parameters. When you quantize a 35B-A3B MoE to Q4, you degrade all 35B parameters, but only 3B of them are active per token. Those active weights do all the real work for that token, and they come from an already small pool, so the rounding noise has a proportionally larger effect on output quality. u/ROS_SDN's concern about "not trusting quantized small active weights" is spot on and under-discussed in quantization guides. Q8 on a sparse MoE is not the same quality trade-off as Q8 on a dense model.

04

Structured output failures are a failure of consistency, not intelligence

The JSON fidelity problem is interesting because it's not that small models don't understand JSON schema; they do. The failure is in maintaining constraints across long token sequences. This is a working-memory analog: the model knows the schema rules but "forgets" them by token 200. This is solvable with constrained decoding (llama.cpp grammar mode, Outlines, structured-output APIs), which means it's more of an ecosystem/tooling gap than a fundamental model limitation.
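To make the constrained-decoding idea concrete, here is a toy sketch of the core mechanism: at each step, any token that cannot lead to a valid output is masked before the best-scoring one is picked. It is not how llama.cpp or Outlines are implemented. Those work on real logits with a grammar or schema, while this version simply enumerates the valid outputs. The vocabulary, the `PREFERENCE` scores, and the function name are assumptions made for illustration.

```python
import json


def constrained_decode(score, vocab, valid_outputs, max_steps=32):
    """Greedy decoding that masks any token which cannot lead to a valid output."""
    out = ""
    for _ in range(max_steps):
        if out in valid_outputs:
            return out
        # Keep only tokens that leave `out` a prefix of some valid output.
        allowed = [t for t in vocab
                   if t and any(v.startswith(out + t) for v in valid_outputs)]
        if not allowed:
            raise ValueError("no token keeps the output valid")
        out += max(allowed, key=lambda t: score(out, t))
    raise ValueError("max_steps reached before a valid output was completed")


# Toy vocabulary and the complete set of schema-valid outputs (assumed).
VOCAB = ['{"ok": ', "true", "false", "maybe", "}"]
VALID = {'{"ok": true}', '{"ok": false}'}

# Stand-in for model logits: this "model" prefers the invalid token "maybe".
PREFERENCE = {"maybe": 3.0, "false": 2.0, "true": 1.0, "}": 0.5, '{"ok": ': 0.1}

result = constrained_decode(lambda prefix, tok: PREFERENCE[tok], VOCAB, VALID)
parsed = json.loads(result)  # always parses, because invalid tokens were masked
```

Even though the stand-in scorer ranks "maybe" highest, the mask never lets it through, so the output stays parseable however unreliable the underlying preferences are.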

05

The Densing Law and MoE are on a collision course

If capability per parameter doubles every ~3.5 months, the question becomes: which model type captures that gain most efficiently for local runners? MoE wins on training economics and absolute speed. Dense wins on VRAM efficiency and quality-per-GB at quantized inference. As RAM constraints tighten ("RAMageddon", as u/ttkciar calls it), the community will likely push back toward dense models for quality-sensitive work; we're already seeing this with the Qwen3-27B dense resurgence. The ideal future isn't one model type winning; it's tunable sparsity at inference time, which nobody has shipped reliably yet.
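The VRAM side of that trade-off is easy to estimate on a napkin. The sketch below assumes weights are stored at exactly 4 or 8 bits each and ignores KV cache, activations, and the per-block scale metadata that real quantization formats add, so treat the numbers as lower bounds. The function name is hypothetical.

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for `total_params_b` billion parameters."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return total_params_b * bits_per_weight / 8


# An MoE must keep every expert resident, so its total count sets the
# footprint, not its active count.
models = {"35B-A3B MoE": 35.0, "27B dense": 27.0}
for name, total_b in models.items():
    for bits in (4, 8):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(total_b, bits):.1f} GB")
```

Under these assumptions the 35B-A3B MoE needs about 17.5 GB at 4-bit against 13.5 GB for the 27B dense model, despite activating only 3B parameters per token.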

MoE
Mixture of Experts
  • Interactive chat requiring fast token throughput
  • Latency-sensitive agentic loops with many short calls
  • RAM-constrained setups running large effective parameter counts
  • Tasks where speed matters more than peak accuracy
  • Broad general assistant use: drafting, summarization, light coding

⚠ Avoid for: precision-heavy structured output, hard multi-step reasoning, or heavy quantization scenarios

Dense · Small
7B–14B Dense
  • Edge / embedded deployment where VRAM is under 8GB
  • Simple instruction following and template filling
  • High-throughput batch classification pipelines
  • Quick semantic search re-ranking
  • Always pair with RAG; do not rely on internal knowledge

⚠ Avoid for: open-ended reasoning, long context retrieval, creative writing requiring world coherence

Dense · Large
27B–70B Dense
  • Complex multi-step reasoning and debugging chains
  • Strict JSON / structured output without constrained decoding
  • Long document analysis where context fidelity matters
  • Creative writing requiring physical/logical world coherence
  • Quality-first, latency-tolerant batch inference

⚠ Trade-off: higher VRAM cost, slower tokens/sec, but highest quality-per-GB at Q4/Q8

How Sparsity Shapes Output & Knowledge

Sparsity in MoE means that for any given token, only a small fraction of the model's total parameters are activated, selected by a gating network that routes each token to its most relevant experts.

Effective "intelligence" ≈ sqrt(Total_params × Active_params)
// Qwen3.5-35B-A3B: sqrt(35B × 3B) ≈ 10.2B effective
// Qwen3-27B dense: sqrt(27B × 27B) = 27B effective
// Missing term: gating quality (0.0 → 1.0)
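The same napkin math as a small runnable sketch, in case you want to plug in other models. It implements only the geometric-mean heuristic quoted above. The gating-quality term is left out because the thread never specifies how it should enter the formula. The function name is an assumption.

```python
import math


def effective_params_b(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)


moe_small = effective_params_b(35, 3)     # Qwen3.5-35B-A3B
dense = effective_params_b(27, 27)        # dense model: total == active
moe_large = effective_params_b(1000, 80)  # the 1T / 80B-active case from the thread

print(f"35B-A3B MoE: ~{moe_small:.1f}B effective")
print(f"27B dense:   ~{dense:.1f}B effective")
print(f"1T-A80B MoE: ~{moe_large:.1f}B effective")
```

By this heuristic the 1T model with 80B active lands near 283B effective, which fits u/ROS_SDN's point that the same sparsity ratio is far less worrying at large scale.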

What sparsity costs

When only 3B of 35B parameters activate per token, the model is essentially running a dynamic 3B dense model per forward pass, but one chosen by the gating logic. If the gate misroutes even one or two experts, you lose coverage of a specific skill domain entirely for that token. This is why sparse models can seem surprisingly competent most of the time but fail catastrophically on edge cases: the relevant expert simply wasn't activated.
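A minimal sketch of top-k gating shows how fragile that selection can be. It assumes one common formulation (pick the k highest gate logits, then softmax over just those). Real models differ in details such as normalization order, shared experts, and load-balancing terms. The logits below are made up for illustration.

```python
import math


def top_k_route(gate_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the kept weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical gate scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 1.9, 0.3, -0.5, 0.0, 1.85]
routes = top_k_route(logits, k=2)

# Nudge expert 7's score by 0.1: it now displaces expert 3 entirely.
nudged = list(logits)
nudged[7] += 0.1
nudged_routes = top_k_route(nudged, k=2)
```

Expert 3 goes from carrying almost half the token's computation to contributing nothing after a 0.1 shift in a single logit, which is the all-or-nothing behavior the paragraph above describes.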

Knowledge is distributed, not localized

Factual knowledge in LLMs isn't cleanly separated into "one expert per domain." It's distributed across many experts. Sparse models with low active counts risk partial knowledge retrieval: they activate experts holding knowledge adjacent to the correct answer but not the specific experts encoding the exact fact. This is why small sparse models hallucinate more than their effective parameter count would predict: the gating is routing to near-neighbors, not the right experts.

Where sparsity works well

High sparsity works best when the task domain is narrow and well-represented in training: code completion, instruction following, summarization. The gating network has learned clean routing for common task types. It breaks down on long-tail knowledge, implicit physical reasoning, and cross-domain synthesis, which are exactly the tasks that already struggle at small scale.

The quantization compounding problem

Quantizing a sparse MoE to Q4 means the already-small active weight pool is being represented with 4-bit precision. Rounding errors in 3B active weights affect 100% of the active computation. The same Q4 applied to a 27B dense model affects a pool 9× larger, so errors average out more gracefully. This is why Q4 MoE can feel noticeably worse than Q4 dense at comparable total size: the math is unforgiving at small active sizes.
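To get a feel for the size of the rounding noise, here is a simplified round-trip through symmetric absmax quantization at 4 and 8 bits. It assumes one scale for the whole block and synthetic Gaussian weights. Real Q4/Q8 formats use smaller blocks and more elaborate schemes, so only the relative gap is meaningful. The sketch measures per-weight noise only. It does not by itself test the claim that errors average out better in a larger active pool.

```python
import math
import random


def quantize_roundtrip(weights, bits):
    """Symmetric absmax quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)
    # |w / scale| never exceeds qmax, so no clamping is needed.
    return [round(w / scale) * scale for w in weights]


def rms_error(original, restored):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original))


rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, not real model weights

err_q4 = rms_error(block, quantize_roundtrip(block, 4))
err_q8 = rms_error(block, quantize_roundtrip(block, 8))
print(f"RMS error, 4-bit: {err_q4:.6f}")
print(f"RMS error, 8-bit: {err_q8:.6f}")
```

The 4-bit error comes out many times larger than the 8-bit error, because the grid spacing is set by `qmax`. Whether that noise matters more for a 3B active path than for a 27B dense one is the empirical question the thread raises.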