Available for AI Consulting

Sahil Chachra

Architecting Intelligence, From the Ground Up.

AI Architect  ·  Building AI Platforms That Scale

Most engineers pick a lane. I build the stack end to end, from LLM orchestration and multi-agent reasoning to VLM-powered perception systems running in real time at scale.

5+ Years in AI & ML
600+ Cameras in Production
30+ CV Use Cases Built
2 AI Stacks Built from Scratch

In Progress

Now

What I'm building, thinking about, and poking at right now.

Shipping

Real-time VLM surveillance pipeline

Multi-camera asyncio system where raw model events pass through a sliding-window persistence gate before they touch the backend — temporal confidence filtering as the anti-hallucination layer.
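A minimal sketch of how a sliding-window persistence gate could work, assuming one confidence score per camera, label, and frame. The class name, window size, hit threshold, and confidence floor are all hypothetical values chosen for illustration, not the production settings.

```python
from collections import defaultdict, deque


class PersistenceGate:
    """Forward an event only after it persists across recent frames.

    Hypothetical sketch: window, min_hits, and min_conf are illustrative.
    """

    def __init__(self, window: int = 5, min_hits: int = 3, min_conf: float = 0.6):
        self.min_hits = min_hits
        self.min_conf = min_conf
        # One fixed-length history of pass/fail flags per (camera, label) pair.
        self._history = defaultdict(lambda: deque(maxlen=window))
        # Pairs that are currently above the persistence threshold.
        self._active = set()

    def observe(self, camera_id: str, label: str, confidence: float) -> bool:
        """Record one frame's result; return True only when an alert should fire.

        Callers should pass confidence 0.0 for frames where the label is absent,
        so that misses push old hits out of the window.
        """
        key = (camera_id, label)
        history = self._history[key]
        history.append(confidence >= self.min_conf)
        persistent = sum(history) >= self.min_hits
        if persistent and key not in self._active:
            self._active.add(key)
            return True  # rising edge: first frame at which the event is persistent
        if not persistent:
            self._active.discard(key)
        return False


gate = PersistenceGate(window=5, min_hits=3)
# A one-frame spike never reaches the backend; a sustained detection fires once.
spike = [gate.observe("cam-1", "spill", c) for c in (0.95, 0.1, 0.1, 0.1, 0.1)]
sustained = [gate.observe("cam-2", "spill", c) for c in (0.9, 0.2, 0.8, 0.7, 0.9)]
```

The trade-off in this sketch is that an event cannot fire before `min_hits` qualifying frames have arrived, so the window length and frame rate together set the minimum detection delay.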

Cross-platform utility apps

Building Windows, macOS, and Linux desktop apps that onboard customers into the BLUE ecosystem — paired with contributions to a patented Intel VAAPI video compression pipeline.

MLX-quantized models for Apple Silicon

Publishing every quantization variant (affine 4/5/6/8-bit, mixed-bit, MX FP4/FP8) of Granite 4.1 8B and Hy-MT2 to HuggingFace — with benchmark reports comparing quality and throughput against FP16 baselines on M5 Pro.
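For readers new to the terminology, here is a rough sketch of affine quantization on a single group of weights: each value is mapped to an integer code using a per-group scale and bias. This is a generic pure-Python illustration with hypothetical function names; it does not reproduce MLX's kernels, group sizes, or bit packing.

```python
def quantize_affine(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a scale and a bias."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # A constant group has no range; any non-zero scale reproduces it exactly.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_affine(codes, scale, bias):
    """Reconstruct approximate floats from codes."""
    return [c * scale + bias for c in codes]


def max_abs_error(values, bits):
    """Worst-case reconstruction error for one group at a given bit width."""
    codes, scale, bias = quantize_affine(values, bits)
    restored = dequantize_affine(codes, scale, bias)
    return max(abs(a - b) for a, b in zip(values, restored))


group = [-1.0, -0.3, 0.0, 0.42, 1.0]  # illustrative weights, not real model data
err_4bit = max_abs_error(group, bits=4)
err_8bit = max_abs_error(group, bits=8)
```

In this scheme the rounding error is bounded by half the scale, which is why halving the bit width from 8 to 4 widens the error band by roughly a factor of 17 for the same group.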

Thinking About

Failure modes under degraded visual conditions

VLMs hallucinate more in low-light and noisy frames. What does a deterministic noise-rejection layer look like when the model itself is the noisy signal?

Where the edge–cloud inference boundary actually sits

Not a binary choice — it's a latency/cost/accuracy curve. I'm mapping where that curve bends for real-time multi-camera workloads.

Exploring

Temporal reasoning across frames

Single-frame inference misses context that spans seconds. Exploring how to pass temporal state to models that weren't designed for it.

Where mixed-bit beats uniform quantization

For translation vs. reasoning workloads, the optimal bit-width allocation differs significantly. Mapping when mixed4_6 is worth the complexity over straight 4-bit.

Work

Projects

Things I've built, shipped, and open-sourced.

LLM Safety Middleware

Featured

Most teams bolt safety on after deployment. This is a drop-in proxy that intercepts every LLM call — running configurable safety checks, content filters, and policy guardrails before the request reaches the model. Open-source, production-grade, zero changes to your existing LLM integration.

LLMs · Python · Flask · Safety · Proxy · Local LLM

VLM-Bench

Featured

Evaluating VLMs in production requires more than accuracy scores — you need throughput, latency, and cost curves too. Self-hosted benchmarking platform: upload image datasets, register HuggingFace or local models, run GPU-accelerated evaluation via vLLM, and compare the full performance profile on a live leaderboard.

Python · FastAPI · React · TypeScript · vLLM · Docker · BERT Score · SQLAlchemy

Refusal Fine-tuning

Featured

LLMs have a refusal problem — they refuse too much or not enough. This project maps the refusal decision boundary, then reshapes it through targeted fine-tuning. The goal: principled control over refusal behavior without degrading general capability.

Python · Fine-tuning · Evaluation · Hugging Face · Alignment

LLM Merging

When you merge two models each fine-tuned on a different domain, do you get a smarter generalist or a confused compromise? This project maps the answer empirically — testing weight interpolation, SLERP, and task vectors across domain-specialized checkpoints.

Python · Model Merging · SLERP · Embeddings · Research
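As an illustration of two of the merge methods named above, the following sketch compares plain weight interpolation with SLERP on flattened weight vectors. It assumes each checkpoint tensor has been flattened to a list of floats; the function names are hypothetical, and real merges would apply this per tensor with an array library.

```python
import math


def lerp(w_a, w_b, t):
    """Plain weight interpolation: a straight line between the two checkpoints."""
    return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical interpolation: follow the arc between the two weight directions."""
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    if norm_a < eps or norm_b < eps:
        return lerp(w_a, w_b, t)  # a zero vector has no direction
    dot = sum(a * b for a, b in zip(w_a, w_b))
    # Clamp to guard acos against floating-point drift outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp(w_a, w_b, t)  # (anti)parallel vectors: the arc is undefined
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


midpoint_linear = lerp([1.0, 0.0], [0.0, 1.0], 0.5)
midpoint_spherical = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For two orthogonal unit vectors, the linear midpoint shrinks in norm while the spherical midpoint stays on the unit circle, which is the usual argument for trying SLERP when the two checkpoints point in noticeably different directions.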

Career

Experience

Where I've built things that matter.

AI Architect

Current
BLUE

Vision Analytics for Warehouses and Retail · Pre-Seed Stage

Apr 2026 – Present · Bangalore, India

Joined a pre-seed startup building vision analytics for physical retail spaces. An AI stack existed — but it was expensive to run and didn't scale economically. My job: rebuild the pipeline to run on cheaper infrastructure while supporting far more cameras, inference throughput, and VLM API calls simultaneously.

  • Built a real-time multi-camera VLM pipeline from scratch — asyncio-native, with a sliding-window persistence gate that filters hallucinations before they reach the backend, cutting false alert rates without adding latency.
  • Contributing to a patented hardware-accelerated video compression algorithm (Intel VAAPI) and building cross-platform utility applications (Windows, macOS, Linux) that onboard customers into the BLUE ecosystem.
  • Profiled the full inference stack across quantization levels and GPU/edge targets — found and eliminated the bottlenecks preventing real-time SLA compliance.
VLMs · Video Compression · Edge AI · asyncio · Intel VAAPI · vLLM · LiteLLM · Python

Founding AI Engineer

Stealth Startup

AI Platform for Factories · Pre-Seed Stage

Sep 2025 – Apr 2026 · Bangalore, India

First engineer hired at a pre-seed startup building AI for factory floors. The domain was manufacturing — CAD files, machine catalogs, compliance specs, and factory-floor constraints. The job was to make an LLM reason reliably across all of it, at production scale, with no prior art to follow.

  • Designed a 3-tier, 27-agent orchestration framework — role-based CrewAI agents, async direct-LLM agents, and an 11-stage DAG task executor — decomposing complex manufacturing tasks into verifiable, structured sub-problems.
  • Built a 7-stage document enrichment pipeline (extract → plan → classify → enrich → normalize → cross-doc → assemble) with conditional enrichers and concurrency semaphores handling 5 document modalities in parallel.
  • Integrated bidirectional MES sync with Odoo (JSON-RPC 2.0), connecting AI-generated factory plans to live shop floor data for planned-vs-actual dashboards — the first time those two worlds talked to each other.
LLMs · Multi-Agent Systems · RAG · CrewAI · FastAPI · Pydantic · Firestore · GCP
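The document enrichment pipeline described above can be sketched roughly as follows, assuming per-document stages run in order under a shared concurrency semaphore, with cross-document work done once all documents finish. The modality names, the `ENRICHERS` table, and the stage bodies are hypothetical placeholders; only the stage order comes from the description.

```python
import asyncio

# Per-document stages, in the order listed in the pipeline description.
PER_DOC_STAGES = ("extract", "plan", "classify", "enrich", "normalize")

# Hypothetical conditional enrichers: modalities not listed here skip "enrich".
ENRICHERS = {"cad": "geometry", "catalog": "machine_specs"}


async def run_stage(stage: str, doc: dict) -> dict:
    """Stand-in for a real stage; records which stages actually ran."""
    await asyncio.sleep(0)  # placeholder for model or I/O latency
    if stage == "enrich":
        enricher = ENRICHERS.get(doc["modality"])
        if enricher is None:
            return doc  # conditional enricher: nothing applies to this modality
        doc = {**doc, "enriched_with": enricher}
    return {**doc, "trace": doc["trace"] + [stage]}


async def process_document(doc: dict, limit: asyncio.Semaphore) -> dict:
    """Run one document through every per-document stage under the semaphore."""
    async with limit:
        for stage in PER_DOC_STAGES:
            doc = await run_stage(stage, doc)
    return doc


async def run_pipeline(docs: list, max_concurrency: int = 3) -> dict:
    limit = asyncio.Semaphore(max_concurrency)
    tasks = [process_document({**d, "trace": []}, limit) for d in docs]
    processed = await asyncio.gather(*tasks)  # results keep the input order
    # The cross-doc and assemble stages see the whole batch at once.
    modalities = sorted({d["modality"] for d in processed})
    return {"documents": list(processed), "modalities": modalities}


result = asyncio.run(run_pipeline([
    {"id": 1, "modality": "cad"},
    {"id": 2, "modality": "compliance"},
]))
```

Bounding concurrency at the document level keeps slow modalities from starving the rest of the batch while still capping the number of simultaneous model calls.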

Visiting AI Mentor

Current · Freelance

Mesa School of Business

PG & UG AI Hackathons · Neos Kosmos Technologies

Jul 2025 – Present · Bangalore, India

Not everything I do is full-time. At Mesa I mentor student teams during intensive AI hackathons — helping founders and postgrads go from idea to working prototype using modern AI tooling, without needing deep engineering backgrounds.

  • Guided teams across PG and UG cohorts through hands-on builds using Google AI Studio, Relevance AI, and n8n — covering prompt design, agent workflows, and no-code/low-code AI automation.
  • Helped non-technical teams translate business problems into AI-powered prototypes within hackathon time constraints.
Google AI Studio · Relevance AI · n8n · AI Agents · Mentorship · No-Code AI

Senior AI Engineer

Avathon

Computer Vision AI Platform · Series D

Dec 2022 – Aug 2025 · Bengaluru, India

Nearly three years scaling a computer vision platform from early contracts to enterprise deployments across hundreds of cameras. I went from training models to owning system architecture, and eventually added LLMs to a stack that was purely CV when I joined.

  • Scaled a computer vision platform to 600+ cameras across enterprise clients, maintaining production reliability across 30+ active use cases simultaneously.
  • Built a person re-ID pipeline that tracked individual customer journeys across camera zones — measuring dwell time and staff engagement patterns that didn't exist in any other data source.
  • Engineered an LLM + RAG layer that translated natural language requirements directly into structured CV pipeline configurations, cutting new use-case deployment cycles by 20%.
  • Upgraded the platform with LLaVA and Qwen VLMs, achieving up to 95% accuracy improvement across 350 cameras.
Python · C++ · Computer Vision · LLMs · DeepStream · CUDA · PyTorch · YOLO · Multimodal Models

Deep Learning Engineer

Tata Consultancy Services

IT Services & Consulting · Fortune 500

Jul 2021 – Dec 2022 · Remote

Started my career applying deep learning to real automotive and mobility problems. The fundamentals I built here — training discipline, production mindset, and understanding hardware constraints — shaped everything that followed.

  • Accelerated data annotation pipelines using deep learning models for the Smart Mobility Group, cutting manual labeling time on automotive datasets.
  • Integrated trained CV models into customer-facing products across mobility and automotive domains — one of the first times I shipped ML to real users.
  • Awarded the Technical Excellence Award for outstanding contributions to the organization.
Python · Deep Learning · Computer Vision · PyTorch · Data Annotation

HuggingFace

Open Models

I quantize open-source LLMs into every MLX variant and publish them for the community — so anyone on Apple Silicon can run frontier models locally without writing a line of quantization code.

14+

Models published

3

Base models ported

7

Quant formats per model

M5 Pro

Benchmarked on

Quantized with mlx-bench — a model-agnostic pipeline for MLX quantization and benchmarking. Point it at any HuggingFace repo, get every variant and a side-by-side perf/quality report.

Capabilities

Skills

The tools and technologies I work with every day.

LLMs · VLMs · Computer Vision · Multi-Agent Systems · RAG · Fine-tuning · Multimodal Models · MLOps · Embeddings · Video Compression · Local Inferencing
PyTorch · LangChain · FastAPI · OpenCV · CUDA · DeepStream · CrewAI · Python · C++
GCP · Docker · Redis · QDrant · Flask · TensorRT · NVIDIA Jetson · GitHub Actions · Claude Code


Get In Touch

Let's Build Something

Open for AI consulting, platform architecture reviews, and genuinely interesting problems. If you're building at the frontier of AI, I'd love to hear about it.
