Architecture before code. Decisions before procurement.

Stop Guessing the Stack.

Before You Spend Crores on On-Prem AI.

Most enterprise AI projects in India fail not at deployment, but at the architecture decision two months earlier. The wrong model. The wrong hosting choice. The wrong GPU sizing. By the time it shows up as wasted hardware capex, runaway API bills, or a stalled compliance review, you're already six figures in. BiltIQ AI Factory Advisory helps Indian CIOs, CTOs, and boards make the call before they commit hardware, headcount, or a 3-year vendor contract, at roughly half of what Big 4 and Indian AI specialist firms charge for the same answer.

Sovereign by default. Hybrid by design. Cloud only where it's the honest answer.
4–6 Weeks to Decision · 4-Axis Decision Framework · 18-mo Roadmap Delivered · ½ Price vs Market Comp
DPIIT Recognized · NVIDIA Inception Partner · DPDP Act Aligned
Book a Discovery Sprint
30-min call · We map your workload and recommend the right stack tier.
Your data stays private. We never share your information.
01 — Mistakes

Three Expensive Mistakes We Stop You From Making

MISTAKE 01

Picking the Model First, Finding the Use Case Later

Teams sign up for a foundation model subscription, build a demo, then realize the workload doesn't justify the price — or the model can't actually do the job. Stack decisions made on hype, not evaluation, routinely waste ₹40–60L in the first year.

We start with the use case. The model is the last decision, not the first.
MISTAKE 02

Cloud-First Reflex on Sensitive Data

The default is to call an external API because it's fast to ship. Then compliance, legal, or the board ask where customer data is going. Months of rework follow. Per-query costs compound. IP exits the building with every prompt.

We map data sensitivity before stack — not after audit.
MISTAKE 03

On-Premise Because It Sounds Safer

Buying a GPU box without an inference engine, evaluation harness, or sizing model is how on-prem AI projects die quietly. Hardware shows up. Throughput is 3 tokens/sec for 5 users. The CFO asks what happened.

We size capacity against real concurrency, not vendor datasheets.
02 — Framework

The Four Axes — Score Every AI Use Case

No single answer fits all AI workloads inside one company. BiltIQ evaluates each candidate use case across four orthogonal axes. The score vector — not opinion — drives the deployment tier.

AXIS 01

Data Sensitivity

PII, PHI, financial records, regulated content, trade secrets, board-level IP. The higher the regulatory or competitive exposure, the harder the case for on-premise.

HIGH → on-prem mandatory · LOW → cloud acceptable
AXIS 02

Volume Economics

Queries per month × tokens per query. Above roughly 500K queries/month, on-prem TCO beats API spend within 8–14 months. Below 50K queries/month, cloud wins on simplicity.

HIGH VOLUME → own the infra · LOW VOLUME → rent it
AXIS 03

Latency & Throughput

Sub-200ms interactive chat needs either a frontier-grade hosted API or properly sized GPU capacity. Batch, async, and document processing tolerate single-digit tokens/second comfortably.

REAL-TIME → size carefully · BATCH → on-prem trivial
AXIS 04

Capability Ceiling

Frontier reasoning, complex agentic tool use, top-tier multimodal vision — these still favor proprietary cloud models. Extraction, classification, RAG, mid-tier reasoning — open-weight models are more than sufficient.

FRONTIER → proprietary API · STANDARD → open weights
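
As a rough illustration of the Volume Economics axis, the sketch below computes monthly token volume (queries per month × tokens per query) and sorts a workload into a volume band using the ~500K and 50K queries/month figures quoted above. The `Workload` type, the band labels, and the "model-it" middle band are assumptions made for this example; the page does not say how workloads between the two thresholds are treated.

```python
from dataclasses import dataclass

# Thresholds quoted under Volume Economics; both are rough guides, not hard limits.
OWN_INFRA_QUERIES_PER_MONTH = 500_000
RENT_QUERIES_PER_MONTH = 50_000


@dataclass(frozen=True)
class Workload:
    """Hypothetical container for one candidate use case."""
    name: str
    queries_per_month: int
    tokens_per_query: int  # average prompt + completion tokens

    @property
    def tokens_per_month(self) -> int:
        # Volume = queries per month x tokens per query.
        return self.queries_per_month * self.tokens_per_query


def volume_band(workload: Workload) -> str:
    """Return 'own', 'rent', or 'model-it' (assumed label for the middle band)."""
    if workload.queries_per_month >= OWN_INFRA_QUERIES_PER_MONTH:
        return "own"
    if workload.queries_per_month < RENT_QUERIES_PER_MONTH:
        return "rent"
    return "model-it"  # between the two thresholds: needs a TCO model


if __name__ == "__main__":
    w = Workload("claims-triage", queries_per_month=1_000_000, tokens_per_query=1_500)
    print(w.tokens_per_month, volume_band(w))
```

The middle band is left deliberately undecided: between 50K and 500K queries/month the answer depends on token length and model size, which is what a TCO model is for.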

The Four Deployment Tiers — Where Each Workload Lands

TIER A

Pure On-Premise

High sensitivity + high volume. Open-weight models on dedicated GPU infrastructure inside your perimeter. Default for regulated verticals: healthcare, government, defence, BFSI core.

BiltIQ-native · Air-gap optional · DPDP aligned
TIER B

Hybrid with Sovereign Gateway

Sensitive workloads stay on-prem. Frontier reasoning or low-volume edge cases call out through a PII-redaction proxy. The gateway logs every external call and enforces policy at the request layer.

Most enterprises · Where reality actually lives
TIER C

Cloud-First with Guardrails

Client has no IT footprint to host on-prem, but needs compliance discipline. We deploy a managed gateway, redaction layer, and audit trail. Stack is rented; sovereignty is procedural.

SMB / startup · Discipline without datacentre
TIER D

Pure Cloud, No Sovereignty Requirement

Public-facing content, low-stakes automation, no regulated data. There is no value BiltIQ adds over the customer wiring up an API key themselves. We will tell you so — and refer you out.

Not our client · Honest scope-out

Quick Map — Sensitivity vs Volume

High Sensitivity · High Volume → TIER A · Pure on-prem
High Sensitivity · Low Volume → TIER B · Hybrid with gateway
Low Sensitivity · High Volume → TIER B / C · On-prem for cost, cloud for speed
Low Sensitivity · Low Volume → TIER D · Just buy an API
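
The quick map can be read as a two-key lookup. The sketch below encodes only the four cells listed above; the `Tier` enum and `quick_tier` function are names invented for this example, and the lookup ignores the latency and capability axes, which the full framework also scores.

```python
from enum import Enum


class Tier(Enum):
    A = "Pure on-prem"
    B = "Hybrid with gateway"
    B_OR_C = "On-prem for cost, cloud for speed"
    D = "Just buy an API"


# Keys are (high_sensitivity, high_volume), taken from the four quick-map cells.
QUICK_MAP = {
    (True, True): Tier.A,
    (True, False): Tier.B,
    (False, True): Tier.B_OR_C,
    (False, False): Tier.D,
}


def quick_tier(high_sensitivity: bool, high_volume: bool) -> Tier:
    """First-pass tier from two of the four axes; not a substitute for full scoring."""
    return QUICK_MAP[(high_sensitivity, high_volume)]


if __name__ == "__main__":
    print(quick_tier(high_sensitivity=True, high_volume=False).value)
```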
03 — Hardware Investment

The Hardware Question Most Vendors Avoid

Every cloud vendor will tell you hardware is yesterday's problem. They are wrong — at least for the workloads where AI actually matters for Indian enterprise. Below ₹50L/year in inference spend, cloud APIs are a rational choice. Above that, every month you delay owning the infrastructure is rent paid to someone else's balance sheet. Eight reasons regulated Indian enterprise AI sits on owned hardware:

01 · TCO CROSSOVER

Cloud APIs Compound. Hardware Amortises.

A workload doing 1M queries/month at average reasoning length costs ₹18–35L/year on a frontier proprietary API. An equivalent on-prem deployment amortises ₹40–60L of GPU capex over 3 years — and the asset still has residual value at year four.

02 · DATA SOVEREIGNTY

DPDP Act Is Not Optional.

For healthcare (DISHA), banking (RBI data localisation), government (citizen data), and defence — there is no compliant cloud path that doesn't involve heroic legal contortion. Owned hardware inside your perimeter is the only architecturally clean answer.

03 · PREDICTABLE OPEX

Hardware Cost Doesn't Scale With Usage.

Cloud API spend grows linearly with adoption. Hardware capex is bounded — once installed, your 100,000th query costs the same as your 100th. CFOs prefer fixed costs they can model; AI adoption inside the company is the variable they can't.

04 · FINE-TUNE FREEDOM

You Cannot Fine-Tune What You Don't Host.

Domain adaptation — the single highest-leverage move for accuracy on private data — requires open-weight models on owned hardware. Proprietary cloud models offer fine-tuning, but at premium rates, with restrictions, and with your data flowing into their training pipeline.

05 · LATENCY FLOOR

Local Inference Beats Network Round-Trip.

Even a perfect cloud API adds 80–250ms of network latency. For real-time voice agents, IVR, and interactive co-pilots, that window kills user experience. Local inference removes the round-trip entirely.

06 · NO VENDOR LOCK

Open Weights + Owned Hardware = Portable.

Cloud APIs are a switching tax disguised as a subscription. Your prompts, fine-tunes, RAG indices, and evals are all tightly coupled to one vendor's surface. Owned infrastructure runs whatever open-weight model is best this quarter — and the next.

07 · ASSET ON BALANCE SHEET

Capex Builds Equity. Opex Builds Bills.

GPU infrastructure is a depreciable capital asset. It improves enterprise value, qualifies for IT infrastructure depreciation under the Income Tax Act, and survives executive turnover. API spend is consumed and gone.

08 · OUTAGE RESILIENCE

Cloud Goes Down. Your Business Does Not.

Major cloud AI APIs have had multi-hour outages in 2025 and 2026. If your customer-facing AI is hosted in someone else's datacentre, your incident is their incident. On-prem stays up when your network does.

Indicative 3-Year TCO — 1M Queries/Month Workload

Line Item | Cloud API (Y1–3) | On-Prem (Y1–3)
Inference compute | ₹54L–1.05Cr (compounding) | ₹40–60L capex, one-time
Power + colo + maintenance | Included in API price | ₹3–6L/year × 3 = ₹9–18L
Data egress / compliance | Add ₹4–10L for audit overhead | Negligible: data stays put
Residual asset value (Y4) | Zero. Cancel = lose everything. | ~30% of capex on secondary market
3-Year Total | ₹58L – 1.15Cr | ₹49L – 78L

Ranges depend on model size, average tokens per query, and inference batching. BiltIQ models your actual workload in the Blueprint engagement — no vendor-favorable assumptions.
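
For readers who want to audit the arithmetic, the sketch below rebuilds the two 3-year totals from the indicative line items (all figures in ₹ lakh) and adds a simple break-even helper. The `Range` type and `break_even_months` function are assumptions made for this example, not part of the Blueprint tooling, and the break-even formula ignores residual value, financing cost, and utilisation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    """A low/high estimate in Rs lakh."""
    low: int
    high: int

    def __add__(self, other: "Range") -> "Range":
        return Range(self.low + other.low, self.high + other.high)

    def times(self, k: int) -> "Range":
        return Range(self.low * k, self.high * k)


YEARS = 3
api_per_year = Range(18, 35)    # frontier API spend at 1M queries/month
audit_overhead = Range(4, 10)   # cloud compliance overhead across the 3 years
capex = Range(40, 60)           # one-time GPU capex
opex_per_year = Range(3, 6)     # power + colo + maintenance

cloud_total = api_per_year.times(YEARS) + audit_overhead
onprem_total = capex + opex_per_year.times(YEARS)


def break_even_months(capex_lakh: float, api_lakh_per_year: float,
                      opex_lakh_per_year: float) -> Optional[float]:
    """Months until cumulative API spend exceeds capex plus running cost."""
    monthly_saving = (api_lakh_per_year - opex_lakh_per_year) / 12
    if monthly_saving <= 0:
        return None  # on-prem never catches up at these rates
    return capex_lakh / monthly_saving


if __name__ == "__main__":
    print("cloud 3-year total:", cloud_total)
    print("on-prem 3-year total:", onprem_total)
```

The break-even result is highly sensitive to the monthly API figure, so it is worth running with measured spend rather than the indicative ranges.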

Cloud is rent. Hardware is equity. The crossover happens faster than vendors admit.

04 — The Right Stack

Hardware Tiers — Sized for Your Scale

There is no one-size-fits-all GPU stack. BiltIQ sizes to actual concurrent users, model class, and growth horizon. Three reference configurations for on-prem LLM inference in India:

ENTRY · SMB

Edge / Departmental

5–50 concurrent users · single use case · 7B–14B models

  • 1× DGX Spark or RTX 6000 Ada
  • 128GB unified / 48GB VRAM
  • Inference-only workloads
  • Single-rack footprint
₹8–18L capex
MID · ENTERPRISE

Multi-Use / Multi-Model

50–500 concurrent · 3–6 use cases · 30B–72B reasoning models

  • 3–4× DGX Spark / RTX PRO 6000 Blackwell
  • Dedicated training node
  • vLLM + Triton serving
  • Half-rack to full-rack
₹35–75L capex
SCALE · DATA CENTRE

Multi-Tenant Inference

500–10,000+ concurrent · agentic workloads · frontier-class on-prem

  • 8–16× H100 / H200 SXM5
  • NVLink fabric, InfiniBand
  • Liquid cooling, redundant power
  • Dedicated facility or DC slot
₹4–15Cr capex
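
As a first-pass illustration, the three reference configurations can be treated as a lookup keyed on concurrent users. The `ReferenceConfig` type and `reference_config` function are hypothetical names for this sketch. Because the published bands share their endpoints (50 and 500), the sketch assigns a boundary value to the smaller tier, which is an assumption; real sizing also depends on model class and growth horizon.

```python
from typing import NamedTuple, Optional


class ReferenceConfig(NamedTuple):
    key: str
    max_concurrent: Optional[int]  # None means no upper bound
    model_class: str
    capex: str


# Bands and capex ranges copied from the three reference tiers above.
REFERENCE_CONFIGS = [
    ReferenceConfig("entry", 50, "7B–14B models", "₹8–18L"),
    ReferenceConfig("mid", 500, "30B–72B reasoning models", "₹35–75L"),
    ReferenceConfig("scale", None, "frontier-class on-prem", "₹4–15Cr"),
]


def reference_config(concurrent_users: int) -> ReferenceConfig:
    """Pick the smallest reference tier whose band covers the user count."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    for cfg in REFERENCE_CONFIGS:
        if cfg.max_concurrent is None or concurrent_users <= cfg.max_concurrent:
            return cfg
    raise AssertionError("unreachable: the last tier has no upper bound")


if __name__ == "__main__":
    cfg = reference_config(120)
    print(cfg.key, cfg.model_class, cfg.capex)
```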

The Software Stack We Recommend

No proprietary lock-in. Every layer is replaceable. Choices below reflect 2026 production reality — what BiltIQ runs, evaluates, and trusts at scale.

INFERENCE
vLLM for production throughput · TensorRT-LLM for NVIDIA-optimised paths · SGLang for structured generation · Ollama for development.
Why: vLLM consistently delivers 2–4× the throughput of naive serving at the same hardware budget.
MODELS · GENERAL
Qwen3 family (8B / 30B-A3B / 32B) · Llama 4 · DeepSeek V3 · Mistral Large.
Why: open-weight, Apache-friendly licensing, competitive with frontier proprietary on most reasoning benchmarks.
MODELS · INDIC
Sarvam-M for Indic reasoning · IndicTrans2 for translation · AI4Bharat models for low-resource languages.
Why: outperform general models on Hindi, Tamil, Telugu, Bengali, and other Indian languages by significant margins.
MODELS · DOMAIN
MedGemma / Med42 for clinical · BioMistral for life sciences · domain-fine-tuned Qwen / Llama variants.
Why: 15–40% accuracy improvement over general models on domain tasks.
MULTIMODAL
Qwen3-Omni (text + image + audio + video) · InternVL · MiniCPM-V for vision-language.
Why: production-grade VLMs that run on owned hardware without per-image cloud charges.
EMBEDDINGS
Qwen3-VL-Embedding · BGE-M3 · Snowflake Arctic Embed · Nomic Embed.
Why: multilingual, multimodal embeddings without sending documents to a third party.
SPEECH
Qwen3-ASR · Whisper Large v3 · Sarvam-ASR for Indic STT · Sarvam-TTS / OpenVoice for synthesis.
Why: voice agents and IVR in 22 Indian languages without cloud dependency.
RETRIEVAL
Qdrant for production vector search · pgvector for Postgres-native · Weaviate for hybrid retrieval.
Why: self-hosted, horizontally scalable, no per-query pricing.
DATA LAYER
PostgreSQL for OLTP · Redis for cache + queue · MinIO for S3-compatible object storage · Celery for async jobs.
Why: battle-tested OSS, runs anywhere, no surprises.
ORCHESTRATION
MCP (Model Context Protocol) for agent tools · LangGraph for stateful agent flows · FastAPI for service layer.
Why: MCP is becoming the de facto standard for tool calling; LangGraph beats LangChain for production reliability.
GOVERNANCE
Microsoft Presidio for PII redaction · LiteLLM for routing & budgets · Open Policy Agent (OPA) for access control.
Why: redact before any prompt leaves perimeter; enforce policy at request layer.
OBSERVABILITY
Langfuse for LLM tracing & eval · Phoenix for RAG diagnostics · Prometheus + Grafana for infra.
Why: you cannot improve what you cannot see — and every regulated workload needs the audit trail.

Open weights. Owned hardware. Replaceable components. No lock-in anywhere.
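
To make the governance rule "redact before any prompt leaves perimeter" concrete, here is a minimal sketch of a redact-then-log wrapper around an outbound call. It uses plain regular expressions as a stand-in and is not Microsoft Presidio or the BiltIQ gateway. The patterns (email, PAN-style ID, 12-digit Aadhaar-style number, Indian mobile number) and the `redact` and `call_external` names are assumptions for illustration, and the patterns are far from exhaustive.

```python
import logging
import re
from typing import Callable

log = logging.getLogger("gateway")

# Illustrative patterns only; a production redactor needs far broader coverage.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    ("AADHAAR", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("PHONE", re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)")),
]


def redact(prompt: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a <LABEL> placeholder and count the replacements."""
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS:
        prompt, n = pattern.subn(f"<{label}>", prompt)
        if n:
            counts[label] = n
    return prompt, counts


def call_external(prompt: str, send: Callable[[str], str]) -> str:
    """Redact, log the redaction counts, then hand the safe prompt to `send`."""
    safe_prompt, counts = redact(prompt)
    log.info("external call: redactions=%s", counts)
    return send(safe_prompt)


if __name__ == "__main__":
    text = "Contact ravi@example.com or +91 9876543210, PAN ABCDE1234F."
    print(redact(text)[0])
```

Here `send` stands for whatever client actually calls the external API. Keeping it as a parameter means only redacted text ever reaches it, and the log records counts rather than the sensitive values themselves.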

05 — Sample Deliverable

Inside the Blueprint — Sample Output

Most AI advisory engagements end with a slide deck of recommendations and a vendor partner list. BiltIQ's Blueprint ends with engineering artifacts — a complete bill of materials with India landed pricing, per-model concurrency tables, KV cache math, and the formulas your CFO can audit. Below is a sample fragment from a real Blueprint deliverable. The full document runs 60–80 pages.

Sample: GPU BOM Comparison — 70B Production Inference Workload

Option A · Dell PowerEdge XE7745
4× H200 NVL @ 600W · 564 GB pooled VRAM · 4-way NVLink · 4U rack
India Landed: ₹2.2 Cr
Model | 4K ctx users | 32K ctx users | 128K ctx users | Aggregate tok/s
Llama 3.3 70B FP8 (TP=4 · NVLink coherent) | 1,500+ | 190 | 47 | 3,000–4,500
Llama 3.1 405B FP8 (frontier dense · TP=4) | ~325 | 41 | 10 | 600–1,200
DeepSeek V3 671B (8-GPU config required) | 8-GPU only
Option B · Threadripper PRO 9995WX Workstation
4× RTX PRO 6000 Blackwell · 384 GB pooled VRAM · PCIe Gen5 P2P · tower
India Landed: ₹62–80 L
Model | 4K ctx users | 32K ctx users | 128K ctx users | Aggregate tok/s
Llama 3.3 70B FP8 (TP=4 · PCIe P2P) | ~960 | 120 | 30 | 900–1,600
Llama 3.1 405B INT4 (quantised · TP=4) | ~370 | 46 | 11 | 250–500
DeepSeek V3 671B | does not fit
Option C · Intel Xeon 600 Workstation
2× RTX PRO 6000 Blackwell · 192 GB pooled VRAM · single-phase 230V · tower
India Landed: ₹42–58 L
Model | 4K ctx users | 32K ctx users | 128K ctx users | Aggregate tok/s
Llama 3.3 70B FP8 (TP=2 · PCIe P2P) | ~380 | 48 | 12 | 250–500
Qwen 2.5 32B FP8 (single card or TP=2) | ~640 | 80 | 20 | 800–1,400
Llama 405B | does not fit

The Math — Not the Marketing

KV Cache Memory per Token
2 × num_layers × num_kv_heads × head_dim × dtype_bytes
// factor of 2 = K + V matrices; FP16 KV = 2 bytes/element

Llama 3.3 70B example
2 × 80 layers × 8 KV heads × 128 head_dim × 2 bytes = 327,680 bytes ≈ 0.31 MiB / token
// at 32K context per user: 32,768 tokens × 327,680 bytes = 10 GiB (≈ 10.7 GB) per user

Max KV-limited concurrent users
(pooled_VRAM − model_weights − 10% activation overhead) / KV_per_user
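
The two formulas above can be checked in a few lines. This is a minimal sketch: the function names are made up for the example, the 10% overhead is taken as a share of pooled VRAM (the formula does not say what it is 10% of), and the hardware numbers in the usage section are hypothetical rather than drawn from the BOM options.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token; the factor of 2 covers the K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_kv_limited_users(pooled_vram_bytes: int, model_weight_bytes: int,
                         kv_bytes_per_user: int, overhead_percent: int = 10) -> int:
    """Users that fit once weights and activation overhead are set aside.

    Assumption: overhead is a percentage of pooled VRAM.
    """
    overhead = pooled_vram_bytes * overhead_percent // 100
    free = pooled_vram_bytes - model_weight_bytes - overhead
    if free <= 0:
        return 0
    return free // kv_bytes_per_user


if __name__ == "__main__":
    # Llama 3.3 70B shape from the example above: 80 layers, 8 KV heads, head_dim 128.
    per_token = kv_bytes_per_token(80, 8, 128, dtype_bytes=2)
    per_user_32k = per_token * 32_768
    print(per_token, per_user_32k / GIB)

    # Hypothetical box: 96 GiB pooled VRAM, 40 GiB of weights.
    print(max_kv_limited_users(96 * GIB, 40 * GIB, per_user_32k))
```

The result is a conservative bound: it assumes FP16 KV and every user filling the full context window. A quantised KV cache or shorter average contexts allow more concurrent users on the same hardware.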

If the math doesn't work, you'll know in week one — not after procurement.

06 — Deliverables

Every Blueprint Includes

Engineering artifacts, not slide decks. The full document runs 60–80 pages.

  • Use-case portfolio mapping with 4-axis scoring
  • Hardware BOM with India landed pricing
  • Per-model concurrency tables at 4K/32K/128K context
  • KV cache math, throughput limits, batching analysis
  • Model shortlist per workload with eval results
  • Three-year TCO with sensitivity ranges
  • 18-month phased roadmap with dependencies
  • Power, cooling, rack, and network specifications
  • DPDP/DISHA/RBI compliance gap analysis
  • Vendor & partner shortlist where BiltIQ is not the answer
  • Operational capability plan: staff, train, or partner
  • Board-ready slide pack for leadership readout
07 — Engagements

Productized Engagements

Three engagement shapes. Each ends in a written deliverable you can act on without BiltIQ. Fee credits 100% toward implementation if you sign a build contract within 90 days. Priced at roughly half of comparable Indian AI specialist firms — because BiltIQ is entering the market, not defending it.

Indian AI Advisory Market — Where We Sit

Provider Tier | 1-Day Sprint | 4–6 Week Blueprint | Monthly Retainer
Big 4 and Accenture (Deloitte, EY, KPMG, PwC, Accenture) | ₹3L–8L/day | ₹40L–2Cr | ₹8L–25L
Indian AI Specialists (Quantiphi, Tiger, Course5) | ₹1L–3L/day | ₹15L–50L | ₹3L–8L
Boutique AI consultancies | ₹75K–2L/day | ₹5L–25L | ₹1.5L–4L
BiltIQ AI Factory Advisory | ₹50K flat | ₹7.5L–12L | ₹1.5L

AI Discovery Sprint

One day · One use case · One memo
ENTRY
Duration: 1 day
Format: On-site / remote
Investment: ₹50,000 flat (Market: ₹1L–2L)
  • Workshop with technical and business stakeholders to define one priority use case
  • Data inventory, sensitivity classification, volume estimation
  • Stack recommendation memo: model class, deployment tier, build vs buy
  • Indicative TCO range across three viable architectures
  • Go / no-go signal — should this project proceed at all

AI Factory Blueprint

Four to six weeks · Three to five use cases · Full architecture document
FLAGSHIP
Duration: 4–6 weeks
Format: Hybrid engagement
Investment: ₹7.5L–12L (Market: ₹15L–50L)
  • Use-case portfolio mapping — every candidate AI initiative scored against the 4-axis framework
  • Deployment-tier classification for each workload, with written rationale
  • Reference architecture — compute, storage, networking, inference, retrieval, orchestration, observability
  • Hardware BOM with India landed pricing across 3 vendor options
  • Per-model concurrency tables (4K / 32K / 128K context) with KV cache math
  • Three-year TCO model with honest sensitivity ranges
  • 18-month phased roadmap — quick wins, foundation builds, scaled rollouts
  • DPDP/DISHA/RBI compliance gap analysis
  • Final readout to leadership — board-ready slide pack included

AI Architect Retainer

Ongoing · Two days per month · Embedded technical authority
ONGOING
Duration: 2 days / month
Format: 6-month minimum
Investment: ₹1.5L / month (Market: ₹3L–4L)
  • The CTO-in-the-room presence for your AI program — architecture reviews, vendor calls, hiring decisions
  • Quarterly stack audit — what's working, what's drifting, what to retire
  • RFP review and vendor evaluation support — we read the contract before you sign it
  • Direct technical escalation path during incidents and migrations
  • Optional model evaluation runs against your private data on BiltIQ infrastructure

Half the market price. Same engineering depth. Founder-led delivery.

08 — FAQ

Frequently Asked Questions

How is BiltIQ different from a Big 4 AI advisory engagement?

Big 4 engagements run 4–6 months, cost ₹40L–2Cr, and the senior names in the pitch room are typically not the ones on the project. Ours run 4–6 weeks, cost a fraction, and the founder personally leads delivery. Our advice comes from people running a live multi-node GPU cluster — not from a generalist consulting practice with an AI service line.

Why are you half the market price? What's the catch?

There's no catch. BiltIQ is entering the advisory market, not defending an established price book. Half-market entry pricing is a deliberate beachhead strategy. We are also a product company: advisory is a paid funnel, not the P&L we have to maximise. The trade-off: we take fewer engagements at this rate because we run them ourselves, so our calendar fills faster than those of larger firms.

Do I really need to invest in on-prem GPU hardware?

For some workloads, cloud is the right answer — Tier C and D in our framework. For workloads with regulated data, high volumes, or fine-tuning needs (Tier A and B), owned hardware almost always wins on TCO within 8–14 months, and is the only path to compliant deployment under DPDP, DISHA, or RBI data localisation rules. The Blueprint engagement gives you the honest math on your specific workload.

What if your recommendation is "don't use BiltIQ products"?

Then that's the recommendation. BiltIQ has walked away from build engagements where the right answer for the client was a different partner or a pure-cloud setup. Our advisory credibility is more valuable than any single contract. If you're a Tier D workload, we'll tell you so and refer you out.

How much hardware capex should I plan for?

It depends entirely on your scale and use case. Entry / departmental deployments start at ₹8–18L for a single-node setup serving 5–50 users. Mid-enterprise deployments serving 50–500 concurrent users typically run ₹35–75L. Datacentre-class deployments with 500+ users on frontier-class on-prem models run ₹4–15Cr. The Blueprint sizes hardware to your actual workload, not to a vendor catalogue.

Can we start small and grow?

Yes — and BiltIQ usually recommends it. The Discovery Sprint is designed exactly for this: validate one use case for ₹50K, get a written stack recommendation, and decide whether to expand to the Blueprint. Many clients start with a Tier A entry-level cluster (one node), validate two use cases, then expand. You don't need to commit the full architecture upfront.

Are you actually vendor-neutral?

No — and we say so openly. BiltIQ is sovereignty-first. We start from the lens that regulated Indian enterprise needs data control, predictable cost, and freedom from foreign vendor dependency. When cloud is genuinely the right answer, we say so. When it isn't, we say that too. What we are not: a Big 4 firm with seven cloud partnerships obligating us to recommend their stacks.

What if our team can't operate the stack after you deliver?

Three options. One: the Architect Retainer keeps us embedded post-delivery. Two: our build team takes operations on a managed-services basis. Three: we train your team during the Blueprint engagement and hand off — this works for clients with existing platform teams. The Blueprint deliverable explicitly names the operational capability gaps and the staffing or partnership plan to close them.

How long until we see ROI?

Depends on what you're measuring. Cost-substitution ROI (cloud API replacement): 8–14 months for high-volume workloads. Productivity ROI (knowledge worker time saved): 3–6 months for document-heavy workflows. Compliance ROI (avoided audit cost, avoided breach exposure): difficult to quantify, but routinely the largest line item for regulated clients.

Do you help with iDEX, AmplifAI, or government tenders?

Yes — BiltIQ is itself an active applicant to iDEX, ADITI, DRISHTI, and IndiaAI initiatives. We help clients navigate technical RFP responses, sovereign-AI compliance documentation, and DPIIT-aligned bid drafting. This is included in the Architect Retainer and available as an add-on to the Blueprint.

What's the difference between Discovery Sprint and Blueprint?

Discovery Sprint is a single-use-case go/no-go in one day — ideal when you have one specific AI initiative to validate. Blueprint is the full architecture for an AI program across 3–5 use cases, with hardware sizing, model selection, TCO modelling, and an 18-month roadmap. If you're not sure which fits, start with Discovery — its fee credits against the Blueprint if you upgrade within 30 days.

Who actually delivers the engagement?

Harish Subramanian — founder, CEO, Chief AI Architect at BiltIQ — leads every Discovery Sprint and Blueprint personally for the first 12 months of this offering. Beyond that, senior delivery moves to lead architects with the founder remaining accountable. You will not be handed off to a generalist consultant six weeks into the engagement.

Start with the Discovery Sprint

One day. ₹50,000 flat. One workload. A written stack recommendation in your hands by end-of-week. If the math doesn't work, we will tell you in the room — not three months later.

Architecture first. Procurement second.
Get in Touch
Website: biltiq.ai
LinkedIn: /company/biltiq-ai
DPIIT Recognized Startup · NVIDIA Inception Partner · DPDP Act Aligned

Your Data. Your Premises. Your AI.