Available for AI Consulting

Sahil Chachra

“Architecting Intelligence, From the Ground Up.”

AI Architect · Building AI Platforms That Scale

Most engineers pick a lane. I build the stack end-to-end — from LLM orchestration and Multi-agent reasoning to VLM-powered perception systems running in real time at scale.

5+Years in AI & ML

600+Cameras in Production

30+CV Use Cases Built

2AI Stacks Built from Scratch

In Progress

Now

What I'm building, thinking about, and poking at right now.

Shipping

Real-time VLM surveillance pipeline

Multi-camera asyncio system where raw model events pass through a sliding-window persistence gate before they touch the backend — temporal confidence filtering as the anti-hallucination layer.

Quantization across formats and use cases

Exploring how affine, MX FP4/FP8, and OptiQ quants behave differently across reasoning, translation, and code workloads — and fine-tuning small models (1B–8B) with LoRA to close the gap where quantization hurts most.

Thinking About

Failure modes under degraded visual conditions

VLMs hallucinate more in low-light and noisy frames. What does a deterministic noise-rejection layer look like when the model itself is the noisy signal?

Where the edge–cloud inference boundary actually sits

Not a binary choice — it's a latency/cost/accuracy curve. I'm mapping where that curve bends for real-time multi-camera workloads.

Exploring

Temporal reasoning across frames

Single-frame inference misses context that spans seconds. Exploring how to pass temporal state to models that weren't designed for it.

Where mixed-bit beats uniform quantization

For translation vs. reasoning workloads, the optimal bit-width allocation differs significantly. Mapping when mixed4_6 is worth the complexity over straight 4-bit.

HuggingFace

Open Models

Quantizing and fine-tuning open-source models for the community — every variant benchmarked and published to HuggingFace, ready to run.

60+

Total downloads

60+

Models published

MLX · AWQ · NVFP4 · OptiQ

Quant formats

M4 Pro · M5 Pro

Benchmarked on

Downloads by Model

Also Published

Find Your Model

Quantized with mlx-bench — a model-agnostic pipeline for MLX quantization and benchmarking across affine, MX, and OptiQ formats.

Career

Experience

Where I've built things that matter.

AI Architect

Current

BLUE

Vision Analytics for Warehouses and Retail · Pre-Seed Stage

Apr 2026 — PresentBangalore, India

Joined a pre-seed startup building vision analytics for physical retail spaces. An AI stack existed — but it was expensive to run and didn't scale economically. My job: rebuild the pipeline to run on cheaper infrastructure while supporting far more cameras, inference throughput, and VLM API calls simultaneously.

▸Built a real-time multi-camera VLM pipeline from scratch — asyncio-native, with a sliding-window persistence gate that filters hallucinations before they reach the backend, cutting false alert rates without adding latency.
▸Contributing to a patented hardware-accelerated video compression algorithm (Intel VAAPI) and building cross-platform utility applications (Windows, macOS, Linux) that onboard customers into the BLUE ecosystem.
▸Profiled the full inference stack across quantization levels and GPU/edge targets — found and eliminated the bottlenecks preventing real-time SLA compliance.

VLMsVideo CompressionEdge AIasyncioIntel VAAPIvLLMLiteLLMPython

Founding AI Engineer

Stealth Startup

AI Platform for Factories · Pre-Seed Stage

Sep 2025 — Apr 2026Bangalore, India

First engineer hired at a pre-seed startup building AI for factory floors. The domain was manufacturing — CAD files, machine catalogs, compliance specs, and factory-floor constraints. The job was to make an LLM reason reliably across all of it, at production scale, with no prior art to follow.

▸Designed a 3-tier, 27-agent orchestration framework — role-based CrewAI agents, async direct-LLM agents, and an 11-stage DAG task executor — decomposing complex manufacturing tasks into verifiable, structured sub-problems.
▸Built a 7-stage document enrichment pipeline (extract → plan → classify → enrich → normalize → cross-doc → assemble) with conditional enrichers and concurrency semaphores handling 5 document modalities in parallel.
▸Integrated bidirectional MES sync with Odoo (JSON-RPC 2.0), connecting AI-generated factory plans to live shop floor data for planned-vs-actual dashboards — the first time those two worlds talked to each other.

LLMsMulti-Agent SystemsRAGCrewAIFastAPIPydanticFirestoreGCP

Visiting AI Mentor

CurrentFreelance

Mesa School of Business

PG & UG AI Hackathons · Neos Kosmos Technologies

Jul 2025 — PresentBangalore, India

Not everything I do is full-time. At Mesa I mentor student teams during intensive AI hackathons — helping founders and postgrads go from idea to working prototype using modern AI tooling, without needing deep engineering backgrounds.

▸Guided teams across PG and UG cohorts through hands-on builds using Google AI Studio, Relevance AI, and n8n — covering prompt design, agent workflows, and no-code/low-code AI automation.
▸Helped non-technical teams translate business problems into AI-powered prototypes within hackathon time constraints.

Google AI StudioRelevance AIn8nAI AgentsMentorshipNo-Code AI

Senior AI Engineer

Avathon

Computer Vision AI Platform · Series D

Dec 2022 — Aug 2025Bengaluru, India

Three years scaling a computer vision platform from early contracts to enterprise deployments across hundreds of cameras. I went from training models to owning system architecture — and eventually added LLMs to a stack that was purely CV when I joined.

▸Scaled a computer vision platform to 600+ cameras across enterprise clients, maintaining production reliability across 30+ active use cases simultaneously.
▸Built a person re-ID pipeline that tracked individual customer journeys across camera zones — measuring dwell time and staff engagement patterns that didn't exist in any other data source.
▸Engineered an LLM + RAG layer that translated natural language requirements directly into structured CV pipeline configurations, cutting new use-case deployment cycles by 20%.
▸Upgraded the platform with LLaVA and Qwen VLMs, achieving up to 95% accuracy improvement across 350 cameras.

PythonC++Computer VisionLLMsDeepStreamCUDAPyTorchYOLOMultimodal Models

Deep Learning Engineer

Tata Consultancy Services

IT Services & Consulting · Fortune 500

Jul 2021 — Dec 2022Remote

Started my career applying deep learning to real automotive and mobility problems. The fundamentals I built here — training discipline, production mindset, and understanding hardware constraints — shaped everything that followed.

▸Accelerated data annotation pipelines using deep learning models for the Smart Mobility Group, cutting manual labeling time on automotive datasets.
▸Integrated trained CV models into customer-facing products across mobility and automotive domains — one of the first times I shipped ML to real users.
▸Awarded the Technical Excellence Award for outstanding contributions to the organization.

PythonDeep LearningComputer VisionPyTorchData Annotation

Capabilities

Skills

The tools and technologies I work with every day.

LLMs·VLMs·Computer Vision·Multi-Agent Systems·RAG·Fine-tuning·Multimodal Models·MLOps·Embeddings·Video Compression·Local Inferencing·LLMs·VLMs·Computer Vision·Multi-Agent Systems·RAG·Fine-tuning·Multimodal Models·MLOps·Embeddings·Video Compression·Local Inferencing·

PyTorch·LangChain·FastAPI·OpenCV·CUDA·DeepStream·CrewAI·Python·C++·PyTorch·LangChain·FastAPI·OpenCV·CUDA·DeepStream·CrewAI·Python·C++·

GCP·Docker·Redis·QDrant·Flask·TensorRT·NVIDIA Jetson·GitHub Actions·Claude Code·GCP·Docker·Redis·QDrant·Flask·TensorRT·NVIDIA Jetson·GitHub Actions·Claude Code·

Drag to scrub · Hover to highlight

Medium

Writing

Thoughts on AI, engineering craft, and what it takes to build at the frontier.

Fine-Tuning LLMs for Refusal

Image Generated using Nano Banana Modern instruction-tuned LLMs often fail not because they lack knowledge , but because they: Answer when they should not…

Jan 20265 min read

Read on Medium

Merged > Fine-Tuned? A Case Study on Qwen3 and Domain Fusion

I wanted to explore model merging using mergekit library but found that the library doesn’t support Qwen3 architectures yet! So I went ahead and tried few…

Jul 202510 min read

Read on Medium

Common questions while working with Large Language Models

Image source — Link So lately I’ve been spending time in digging the basics of LLMs to get a good brief overview of the basics. These question are…

Mar 202412 min read

Read on Medium

Paper Summary — Deep Leakage from Gradients

Paper Summary — Deep Leakage from Gradients Paper Link — https://arxiv.org/abs/1906.08935 Why this topic? While performing distributed training to speed up…

Dec 20237 min read

Read on Medium

Get In Touch

Let's Build Something

Open for AI consulting, platform architecture reviews, and genuinely interesting problems. If you're building at the frontier of AI, I'd love to hear about it.

or find me on

GitHub HuggingFace LinkedIn Twitter / X Medium Email