Stop Guessing the Stack.
Before You Spend Crores on On-Prem AI.
Most enterprise AI projects in India fail not at deployment, but at the architecture decision made two months earlier. The wrong model. The wrong hosting choice. The wrong GPU sizing. By the time the mistake shows up as wasted hardware capex, runaway API bills, or a stalled compliance review, you're already six figures in. BiltIQ AI Factory Advisory helps Indian CIOs, CTOs, and boards make the call before they commit hardware, headcount, or a 3-year vendor contract, at roughly half of what Big 4 and Indian AI specialist firms charge for the same answer.
Three Expensive Mistakes We Stop You From Making
Picking the Model First, Finding the Use Case Later
Teams sign up for a foundation model subscription, build a demo, then realise the workload doesn't justify the price, or that the model can't actually do the job. Stack decisions made on hype, not evaluation, routinely waste ₹40–60L in the first year.
Cloud-First Reflex on Sensitive Data
The default is to call an external API because it's fast to ship. Then compliance, legal, or the board ask where customer data is going. Months of rework follow. Per-query costs compound. IP exits the building with every prompt.
On-Premise Because It Sounds Safer
Buying a GPU box without an inference engine, evaluation harness, or sizing model is how on-prem AI projects die quietly. Hardware shows up. Throughput is 3 tokens/sec for 5 users. The CFO asks what happened.
The Four Axes — Score Every AI Use Case
No single answer fits all AI workloads inside one company. BiltIQ evaluates each candidate use case across four orthogonal axes. The score vector — not opinion — drives the deployment tier.
Data Sensitivity
PII, PHI, financial records, regulated content, trade secrets, board-level IP. The higher the regulatory or competitive exposure, the harder the case for on-premise.
Volume Economics
Queries per month × tokens per query. Above roughly 500K queries/month, on-prem TCO beats API spend within 8–14 months. Below 50K queries/month, cloud wins on simplicity.
Latency & Throughput
Sub-200ms interactive chat needs either a frontier-grade hosted API or properly sized GPU capacity. Batch, async, and document processing tolerate single-digit tokens/second comfortably.
Capability Ceiling
Frontier reasoning, complex agentic tool use, top-tier multimodal vision — these still favor proprietary cloud models. Extraction, classification, RAG, mid-tier reasoning — open-weight models are more than sufficient.
The Four Deployment Tiers — Where Each Workload Lands
Pure On-Premise
High sensitivity + high volume. Open-weight models on dedicated GPU infrastructure inside your perimeter. Default for regulated verticals: healthcare, government, defence, BFSI core.
Hybrid with Sovereign Gateway
Sensitive workloads stay on-prem. Frontier reasoning or low-volume edge cases call out through a PII-redaction proxy. The gateway logs every external call and enforces policy at the request layer.
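As a rough illustration of what redaction and logging at the request layer can look like, here is a minimal Python sketch. The patterns, the `redact` and `gateway_call` names, and the log format are all assumptions made for illustration; a production gateway would need a much broader PII taxonomy and would forward the redacted text to a real API client.

```python
import re

# Illustrative patterns only (assumptions, not a complete PII taxonomy).
# Order matters: 12-digit Aadhaar-style numbers are replaced before phone numbers.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "AADHAAR": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"),
}


def redact(prompt):
    """Replace each match with a typed placeholder and count what was removed."""
    counts = {}
    for label, pattern in PATTERNS.items():
        prompt, n = pattern.subn(f"[{label}]", prompt)
        if n:
            counts[label] = n
    return prompt, counts


def gateway_call(prompt, audit_log):
    """Redact, record the outbound event, and return the text that would leave the perimeter."""
    safe, counts = redact(prompt)
    audit_log.append({"redactions": counts, "chars_out": len(safe)})
    return safe


log = []
print(gateway_call("PAN ABCDE1234F, phone +91 9876543210, email ravi@example.com", log))
print(log)
```

Only the redacted string and a count of what was removed reach the log in this sketch, so the audit trail itself does not become a second copy of the sensitive data.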
Cloud-First with Guardrails
Client has no IT footprint to host on-prem, but needs compliance discipline. We deploy a managed gateway, redaction layer, and audit trail. Stack is rented; sovereignty is procedural.
Pure Cloud, No Sovereignty Requirement
Public-facing content, low-stakes automation, no regulated data. There is no value BiltIQ adds over the customer wiring up an API key themselves. We will tell you so — and refer you out.
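The mapping from axis scores to a tier can be sketched as a small decision function. This is a deliberately simplified reading of the four tiers above, not BiltIQ's scoring model: the `UseCase` fields, the boolean treatment of sensitivity, and the reuse of the 500K queries/month figure as a hard threshold are all assumptions, and the latency axis is left out.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    sensitive_data: bool        # Data Sensitivity axis, collapsed to yes/no (assumption)
    queries_per_month: int      # Volume Economics axis
    needs_frontier_model: bool  # Capability Ceiling axis
    can_host_on_prem: bool      # does the client have an IT footprint to host hardware?


HIGH_VOLUME = 500_000  # queries/month figure quoted under Volume Economics


def deployment_tier(uc):
    """Return the deployment tier name for one use case (simplified sketch)."""
    if not uc.sensitive_data:
        return "Pure Cloud, No Sovereignty Requirement"
    if not uc.can_host_on_prem:
        return "Cloud-First with Guardrails"
    if uc.needs_frontier_model or uc.queries_per_month < HIGH_VOLUME:
        return "Hybrid with Sovereign Gateway"
    return "Pure On-Premise"


claims_triage = UseCase(sensitive_data=True, queries_per_month=1_000_000,
                        needs_frontier_model=False, can_host_on_prem=True)
print(deployment_tier(claims_triage))
```

Running the same function over every candidate use case gives a portfolio view in which different workloads inside one company land on different tiers.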
Quick Map — Sensitivity vs Volume
The Hardware Question Most Vendors Avoid
Every cloud vendor will tell you hardware is yesterday's problem. They are wrong — at least for the workloads where AI actually matters for Indian enterprise. Below ₹50L/year in inference spend, cloud APIs are a rational choice. Above that, every month you delay owning the infrastructure is rent paid to someone else's balance sheet. Eight reasons regulated Indian enterprise AI sits on owned hardware:
Cloud APIs Compound. Hardware Amortises.
A workload doing 1M queries/month at average reasoning length costs ₹18–35L/year on a frontier proprietary API. An equivalent on-prem deployment amortises ₹40–60L of GPU capex over 3 years — and the asset still has residual value at year four.
DPDP Act Is Not Optional.
For healthcare (DISHA), banking (RBI data localisation), government (citizen data), and defence — there is no compliant cloud path that doesn't involve heroic legal contortion. Owned hardware inside your perimeter is the only architecturally clean answer.
Hardware Cost Doesn't Scale With Usage.
Cloud API spend grows linearly with adoption. Hardware capex is bounded — once installed, your 100,000th query costs the same as your 100th. CFOs prefer fixed costs they can model; AI adoption inside the company is the variable they can't.
You Cannot Fine-Tune What You Don't Host.
Domain adaptation, the single highest-leverage move for accuracy on private data, is most practical with open-weight models on owned hardware. Proprietary cloud models offer fine-tuning, but at premium rates, with restrictions, and with your training data leaving your perimeter for the vendor's infrastructure.
Local Inference Beats Network Round-Trip.
Even a perfect cloud API adds 80–250ms of network latency. For real-time voice agents, IVR, and interactive co-pilots, that window kills user experience. Local inference removes the round-trip entirely.
Open Weights + Owned Hardware = Portable.
Cloud APIs are a switching tax disguised as a subscription. Your prompts, fine-tunes, RAG indices, and evals are all tightly coupled to one vendor's surface. Owned infrastructure runs whatever open-weight model is best this quarter — and the next.
Capex Builds Equity. Opex Builds Bills.
GPU infrastructure is a depreciable capital asset. It improves enterprise value, qualifies for IT infrastructure depreciation under the Income Tax Act, and survives executive turnover. API spend is consumed and gone.
Cloud Goes Down. Your Business Does Not.
Major cloud AI APIs have had multi-hour outages in 2025 and 2026. If your customer-facing AI is hosted in someone else's datacentre, their incident is your incident. On-prem stays up when your network does.
Indicative 3-Year TCO — 1M Queries/Month Workload
Ranges depend on model size, average tokens per query, and inference batching. BiltIQ models your actual workload in the Blueprint engagement, with no vendor-favourable assumptions.
Cloud is rent. Hardware is equity. The crossover happens faster than vendors admit.
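The crossover arithmetic can be sketched in a few lines of Python. The function name and the `onprem_opex_lakh_per_year` parameter are assumptions for illustration; the example inputs take one end of each range quoted earlier for the 1M queries/month workload (₹40L capex, ₹35L/year API spend) and leave running costs at zero, which favours on-prem.

```python
def breakeven_months(capex_lakh, api_spend_lakh_per_year, onprem_opex_lakh_per_year=0.0):
    """Months until avoided API spend has repaid the hardware capex.

    All amounts are in lakh rupees. Returns None when on-prem running
    costs meet or exceed the API bill, so the capex is never repaid.
    """
    monthly_saving = (api_spend_lakh_per_year - onprem_opex_lakh_per_year) / 12
    if monthly_saving <= 0:
        return None
    return capex_lakh / monthly_saving


# One corner of the quoted ranges: Rs 40L capex against Rs 35L/year of API spend.
months = breakeven_months(capex_lakh=40, api_spend_lakh_per_year=35)
print(f"break-even after {months:.1f} months")
```

The result is sensitive to every input, so power, cooling, and staffing should go into `onprem_opex_lakh_per_year` before the number is used in a business case.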
Hardware Tiers — Sized for Your Scale
There is no one-size-fits-all GPU stack. BiltIQ sizes to actual concurrent users, model class, and growth horizon. Three reference configurations for on-prem LLM inference in India:
Edge / Departmental
5–50 concurrent users · single use case · 7B–14B models
- 1× DGX Spark or RTX 6000 Ada
- 128GB unified / 48GB VRAM
- Inference-only workloads
- Single-rack footprint
Multi-Use / Multi-Model
50–500 concurrent · 3–6 use cases · 30B–72B reasoning models
- 3–4× DGX Spark / RTX 6000 Blackwell
- Dedicated training node
- vLLM + Triton serving
- Half-rack to full-rack
Multi-Tenant Inference
500–10,000+ concurrent · agentic workloads · frontier-class on-prem
- 8–16× H100 / H200 SXM5
- NVLink fabric, InfiniBand
- Liquid cooling, redundant power
- Dedicated facility or DC slot
The Software Stack We Recommend
No proprietary lock-in. Every layer is replaceable. Choices below reflect 2026 production reality — what BiltIQ runs, evaluates, and trusts at scale.
Open weights. Owned hardware. Replaceable components. No lock-in anywhere.
Inside the Blueprint — Sample Output
Most AI advisory engagements end with a slide deck of recommendations and a vendor partner list. BiltIQ's Blueprint ends with engineering artifacts — a complete bill of materials with India landed pricing, per-model concurrency tables, KV cache math, and the formulas your CFO can audit. Below is a sample fragment from a real Blueprint deliverable. The full document runs 60–80 pages.
Sample: GPU BOM Comparison — 70B Production Inference Workload
| Model | 4K ctx users | 32K ctx users | 128K ctx users | Aggregate tok/s |
|---|---|---|---|---|
| Llama 3.3 70B FP8 (TP=4 · NVLink coherent) | 1,500+ | 190 | 47 | 3,000–4,500 |
| Llama 3.1 405B FP8 (frontier dense · TP=4) | ~325 | 41 | 10 | 600–1,200 |
| DeepSeek V3 671B (8-GPU config required) | 8-GPU only | — | — | — |

| Model | 4K ctx users | 32K ctx users | 128K ctx users | Aggregate tok/s |
|---|---|---|---|---|
| Llama 3.3 70B FP8 (TP=4 · PCIe P2P) | ~960 | 120 | 30 | 900–1,600 |
| Llama 3.1 405B INT4 (quantised · TP=4) | ~370 | 46 | 11 | 250–500 |
| DeepSeek V3 671B (does not fit) | ✗ | ✗ | ✗ | — |

| Model | 4K ctx users | 32K ctx users | 128K ctx users | Aggregate tok/s |
|---|---|---|---|---|
| Llama 3.3 70B FP8 (TP=2 · PCIe P2P) | ~380 | 48 | 12 | 250–500 |
| Qwen 2.5 32B FP8 (single card or TP=2) | ~640 | 80 | 20 | 800–1,400 |
| Llama 405B (does not fit) | ✗ | ✗ | ✗ | — |
The Math — Not the Marketing
// factor of 2 = K + V matrices; FP16 KV = 2 bytes/element
// at 32K context per user: 32,768 × 0.32 MB = 10.2 GB per user
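The two comments above describe the standard KV cache sizing calculation, which can be sketched as follows. The model shape used here (80 layers, 8 KV heads under grouped-query attention, head dimension 128) is an assumption taken from the published Llama 3 70B-class configuration, and the 100 GiB cache budget is purely hypothetical. With exact arithmetic the per-token figure is 0.3125 MiB, which rounds to the 0.32 MB quoted, and a 32K-context user needs about 10 GiB.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """KV cache bytes stored per token.

    The leading factor of 2 covers the K and V tensors kept for every layer;
    bytes_per_element=2 corresponds to an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def kv_gib_per_user(context_tokens, per_token_bytes):
    """KV cache in GiB for one user holding a full context window."""
    return context_tokens * per_token_bytes / 2**30


def max_concurrent_users(kv_budget_gib, context_tokens, per_token_bytes):
    """How many full-context users fit in the memory left over for KV cache."""
    return int(kv_budget_gib // kv_gib_per_user(context_tokens, per_token_bytes))


# Assumed Llama 3 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token / 2**20)                          # MiB per token
print(kv_gib_per_user(32_768, per_token))         # GiB per user at 32K context
print(max_concurrent_users(100, 32_768, per_token))  # hypothetical 100 GiB KV budget
```

Real serving engines page and share the cache, and an FP8 cache halves `bytes_per_element`, so this gives a worst-case bound on memory per user and not a throughput prediction.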
If the math doesn't work, you'll know in week one — not after procurement.
Every Blueprint Includes
Engineering artifacts, not slide decks. The full document runs 60–80 pages.
Productized Engagements
Three engagement shapes. Each ends in a written deliverable you can act on without BiltIQ. Fee credits 100% toward implementation if you sign a build contract within 90 days. Priced at roughly half of comparable Indian AI specialist firms — because BiltIQ is entering the market, not defending it.
Indian AI Advisory Market — Where We Sit
AI Discovery Sprint
- Workshop with technical and business stakeholders to define one priority use case
- Data inventory, sensitivity classification, volume estimation
- Stack recommendation memo: model class, deployment tier, build vs buy
- Indicative TCO range across three viable architectures
- Go / no-go signal: should this project proceed at all
AI Factory Blueprint
- Use-case portfolio mapping: every candidate AI initiative scored against the 4-axis framework
- Deployment-tier classification for each workload, with written rationale
- Reference architecture: compute, storage, networking, inference, retrieval, orchestration, observability
- Hardware BOM with India landed pricing across 3 vendor options
- Per-model concurrency tables (4K / 32K / 128K context) with KV cache math
- Three-year TCO model with honest sensitivity ranges
- 18-month phased roadmap: quick wins, foundation builds, scaled rollouts
- DPDP/DISHA/RBI compliance gap analysis
- Final readout to leadership, with a board-ready slide pack included
AI Architect Retainer
- The CTO-in-the-room presence for your AI program: architecture reviews, vendor calls, hiring decisions
- Quarterly stack audit: what's working, what's drifting, what to retire
- RFP review and vendor evaluation support: we read the contract before you sign it
- Direct technical escalation path during incidents and migrations
- Optional model evaluation runs against your private data on BiltIQ infrastructure
Half the market price. Same engineering depth. Founder-led delivery.
Frequently Asked Questions
How is BiltIQ different from a Big 4 AI advisory engagement?
Big 4 engagements run 4–6 months, cost ₹40L–2Cr, and the senior names in the pitch room are typically not the ones on the project. Ours run 4–6 weeks, cost a fraction, and the founder personally leads delivery. Our advice comes from people running a live multi-node GPU cluster — not from a generalist consulting practice with an AI service line.
Why are you half the market price? What's the catch?
There's no catch. BiltIQ is entering the advisory market, not defending an established price book. Half-market entry pricing is a deliberate beachhead strategy. We are also a product company: advisory is a paid funnel, not the P&L we have to maximise. The trade-off: we take fewer engagements at this rate because we run them ourselves, so our calendar fills faster than a larger firm's.
Do I really need to invest in on-prem GPU hardware?
For some workloads, cloud is the right answer — Tier C and D in our framework. For workloads with regulated data, high volumes, or fine-tuning needs (Tier A and B), owned hardware almost always wins on TCO within 8–14 months, and is the only path to compliant deployment under DPDP, DISHA, or RBI data localisation rules. The Blueprint engagement gives you the honest math on your specific workload.
What if your recommendation is "don't use BiltIQ products"?
Then that's the recommendation. BiltIQ has walked away from build engagements where the right answer for the client was a different partner or a pure-cloud setup. Our advisory credibility is more valuable than any single contract. If you're a Tier D workload, we'll tell you so and refer you out.
How much hardware capex should I plan for?
Depends entirely on your scale and use case. Entry / departmental deployments start at ₹8–18L for a single-node setup serving 5–50 users. Mid-enterprise deployments serving 50–500 concurrent users typically run ₹35–75L. Datacentre-class deployments with 500+ users on frontier-class on-prem models run ₹4–15Cr. The Blueprint sizes hardware to your actual workload, not to a vendor catalogue.
Can we start small and grow?
Yes — and BiltIQ usually recommends it. The Discovery Sprint is designed exactly for this: validate one use case for ₹50K, get a written stack recommendation, and decide whether to expand to the Blueprint. Many clients start with a Tier A entry-level cluster (one node), validate two use cases, then expand. You don't need to commit the full architecture upfront.
Are you actually vendor-neutral?
No — and we say so openly. BiltIQ is sovereignty-first. We start from the lens that regulated Indian enterprise needs data control, predictable cost, and freedom from foreign vendor dependency. When cloud is genuinely the right answer, we say so. When it isn't, we say that too. What we are not: a Big 4 firm with seven cloud partnerships obligating us to recommend their stacks.
What if our team can't operate the stack after you deliver?
Three options. One: the Architect Retainer keeps us embedded post-delivery. Two: our build team takes operations on a managed-services basis. Three: we train your team during the Blueprint engagement and hand off — this works for clients with existing platform teams. The Blueprint deliverable explicitly names the operational capability gaps and the staffing or partnership plan to close them.
How long until we see ROI?
Depends on what you're measuring. Cost-substitution ROI (cloud API replacement): 8–14 months for high-volume workloads. Productivity ROI (knowledge worker time saved): 3–6 months for document-heavy workflows. Compliance ROI (avoided audit cost, avoided breach exposure): difficult to quantify, but routinely the largest line item for regulated clients.
Do you help with iDEX, AmplifAI, or government tenders?
Yes — BiltIQ is itself an active applicant to iDEX, ADITI, DRISHTI, and IndiaAI initiatives. We help clients navigate technical RFP responses, sovereign-AI compliance documentation, and DPIIT-aligned bid drafting. This is included in the Architect Retainer and available as an add-on to the Blueprint.
What's the difference between Discovery Sprint and Blueprint?
Discovery Sprint is a single-use-case go/no-go in one day — ideal when you have one specific AI initiative to validate. Blueprint is the full architecture for an AI program across 3–5 use cases, with hardware sizing, model selection, TCO modelling, and an 18-month roadmap. If you're not sure which fits, start with Discovery — its fee credits against the Blueprint if you upgrade within 30 days.
Who actually delivers the engagement?
Harish Subramanian — founder, CEO, Chief AI Architect at BiltIQ — leads every Discovery Sprint and Blueprint personally for the first 12 months of this offering. Beyond that, senior delivery moves to lead architects with the founder remaining accountable. You will not be handed off to a generalist consultant six weeks into the engagement.
Start with the Discovery Sprint
One day. ₹50,000 flat. One workload. A written stack recommendation in your hands by end-of-week. If the math doesn't work, we will tell you in the room — not three months later.
Your Data. Your Premises. Your AI.