LOCAL · ON-PREM · OFFLINE

Run AI on your own hardware.

Stop renting it.

VYROX builds a private Large Language Model on hardware you own - not a cloud server. 100% data privacy, full offline use, and zero monthly subscriptions. We size the GPU, install Ollama / vLLM / LM Studio with Qwen3.6, Kimi K2.6, GLM-5.1 or DeepSeek V4, tune it, and hand it over working.

  • PDPA-aligned by design
  • Live in 4-8 weeks
  • Break-Even Guarantee
  • No vendor lock-in
100%
Your data stays on your hardware
RM0/mo
No subscription after setup
51+
Open models we deploy locally
7 mo
Typical break-even vs cloud
VRAM · Qwen3-Coder 30B @ Q418 / 32 GB

Private. Offline-capable. No tokens metered. Runs on a single workstation in your office.

Engineered in Kuala Lumpur · deployed across Malaysia & Southeast Asia · MY · SG · TH · ID · PH · VN · BN

OllamavLLMLM StudioOpen WebUIQwen3.6Kimi K2.6GLM-5.1DeepSeek V4Gemma 4Llama 4GPT-OSSNVIDIAApple SiliconAMDLangGraphMCP
10+years building production systems
51+open models we deploy & tune
100%on hardware you own & keep
7motypical payback vs cloud AI

Founded 2015 · led by Ts. Dr. Leong Yee Rock - engineers who size, build & commission the system, then hand it over working. No black box, no lock-in.

Why run AI locally

Three things the cloud can never fully give you.

Running a Large Language Model locally means the AI runs on your own computer or server - not a remote cloud like OpenAI or Anthropic. Your data never leaves the building.

01 / PRIVACY

100% data privacy

Contracts, financials, source code and customer records are processed on a machine you own. Nothing is uploaded, logged by a vendor, or used to train someone else's model. PDPA-aligned by default.

PDPAair-gap optional
02 / OFFLINE

Works fully offline

No internet, no problem. The AI keeps running on the factory floor, in a clinic, at a remote site, or during an outage. No API downtime, no rate limits, no "service unavailable."

0ms WAN depno rate limit
03 / COST

No monthly subscription

Cloud AI charges per user, per month, per token - forever. A local model is a one-time setup that then serves your whole team for the price of electricity. No per-seat licence.

pay onceRM0/seat
Interactive · where your data goes

Cloud sends your data out. Local keeps it in.

Toggle to see the difference.

☁ Cloud AI 🖥 Local (VYROX)
  • Prompts & documents stay on your server
  • Nothing sent to OpenAI / Anthropic / Google
  • No third-party logging or training on your data
  • Works with the internet unplugged
YOUR OFFICE 🏢 Staff ON-PREM SERVER 🖥 LLM stays in your building ✓
What you'll actually build

17 things businesses run on a private local LLM.

A local model isn't a chatbot toy - it's the engine behind real systems your team uses daily. Filter by function; each shows the model + tool stack we'd deploy.

All Knowledge & docs Extraction Comms & language Meetings & voice Engineering & data Industry
Chatbot · Workflow · Agentic

Three ways to put your local AI to work.

The same private model on the same hardware runs in three modes - along a spectrum of autonomy. Start with a chatbot for quick wins, add workflows for high-volume operations, then deploy agents for complex 24/7 work.

Chatbot
Workflow
Agentic
← human-driven · simple · instantautonomous · multi-step · 24/7 →
Chatbot AI Workflow AI Agentic AI
Conversational · human-in-every-turn

A private assistant your staff chat with.

The simplest, fastest mode: a ChatGPT-style window grounded in your own documents (RAG). The human reads every answer, so it's low-risk and quick to deploy.

Ask Retrieve (RAG) Answer + cite Human decides
STACK

Open WebUI / LibreChat / AnythingLLM front-end + Qwen3 or Gemma 3 + RAG (nomic-embed + Qdrant) · Ollama / LM Studio

USE CASES

Internal knowledge assistant, customer-support chatbot, HR/policy helper, documentation Q&A, onboarding buddy.

EFFORT

Days to deploy · lowest run-cost · human always in the loop.

At a glance💬 Chatbot🔀 Workflow🤖 Agentic
AutonomyLow - human every turnMedium - runs unattendedHigh - decides its own steps
PathFree conversationFixed pipeline (DAG)Planned at runtime
Determinismn/aHigh - repeatableLower - reasoned each run
Human oversightReads every answerReviews exceptionsApproves irreversible actions
Build effortDays1-3 weeks3-8 weeks
Run cost (local)LowestLowHigher (many calls/run)
Best local modelsQwen3.6, Gemma 4, LlamaQwen3.6, Qwen3-VL, MistralKimi K2.6, GLM-5.1, DeepSeek V4, Qwen3.6
FrameworksOpen WebUI, LibreChatn8n, LangGraph, Flowise, DifyLangGraph, CrewAI, AutoGen, OpenHands, MCP
Best forQ&A, support, knowledgeHigh-volume repetitive opsComplex multi-step, 24/7 automation

The mature path: most clients start with a chatbot (value in days), automate their highest-volume process as a workflow, then add agents where the work is genuinely multi-step. All three run on the same local box - and VYROX engineers tool-calling, MCP connectors, guardrails and an eval harness so agents are reliable, not just impressive.

Co-pilots for regulated professions

AI Doctor. AI Accountant. AI Lawyer. On your own hardware.

These three professions handle data that is legally confidential - patient records, privileged files, financial accounts. For them, local AI isn't just cheaper; it's often the only compliant option. Each is a co-pilot that keeps the licensed professional firmly in the loop.

AI Doctor AI Accountant AI Lawyer
For clinics · GPs · specialists · dental · allied health

A clinical co-pilot that never sends a patient record to the cloud.

It listens, drafts and retrieves - so clinicians spend less time on paperwork and more with patients. Every record stays inside the clinic.

  • Ambient scribe - transcribes the consultation and drafts structured SOAP notes (Whisper + LLM)
  • History summary - condenses long patient files before each visit
  • Letters - drafts referral & discharge letters for clinician review
  • Patient education - leaflets in EN/BM/ZH at the right reading level
  • Coding assist - suggests ICD-10 codes for claims
  • Guideline lookup - surfaces references from your own formulary & protocols (RAG)
  • Front-desk triage - structures symptom intake for the queue
STACK

Whisper large-v3 (transcription) + Qwen3 + RAG over clinical guidelines · air-gapped Desk or Studio build

PRIVACY & COMPLIANCE

Patient data never leaves the clinic - supports PDPA and medical confidentiality; air-gap option for full isolation. MMC-aware.

TYPICAL ROI

Clinicians recover ~1-2 hours/day of documentation time. Build from RM9k-32k, one-time.

⚕️ Decision-support only. A registered medical practitioner makes every clinical decision; the system does not diagnose or treat and is not a registered medical device. It assists documentation and retrieval, with the clinician reviewing all output.

Connect your own data · RAG

Make the model answer from YOUR files - privately.

A general model knows the public internet. It does not know your contracts, SOPs, pricing or customer history. RAG (Retrieval-Augmented Generation) grounds every answer in your own documents - with citations - and on a local deployment that data never leaves the building.

How it works - 6 steps

  1. Ingest - pull in PDFs, Word, Excel, scans, emails, DB records, intranet pages.
  2. Chunk - split documents into small, meaningful passages for precise retrieval.
  3. Embed - convert each chunk into a numeric fingerprint with a local model (nomic-embed / bge). No cloud calls.
  4. Store - fingerprints go into a local vector database (pgvector, Qdrant or Chroma) on your server.
  5. Retrieve - a question finds the most relevant chunks from your data.
  6. Ground & answer - the local LLM answers using only those chunks, and cites the source.
Why this is the killer use case
  • Turns a generic model into one that knows your business
  • Answers carry citations - staff verify, don't just trust
  • Slashes hallucination: it answers from retrieved facts
  • Update knowledge by adding files - no retraining
Why local makes it safe
  • Documents, index and model all sit on your server
  • Nothing sent to OpenAI, Google or any external API
  • For patient records, financials, legal files - often the only acceptable answer
  • PDPA-aligned by design, not by a vendor's promise
Open vs cloud · the quality question

"Is local as good as ChatGPT?" On the work you do - close enough to matter.

The best open models now land within a few points of frontier cloud on knowledge, science and coding benchmarks - and run on hardware you own. Scores are approximate, drawn from public leaderboards and model cards (mid-2026); harnesses differ, so treat ±2-3 points as noise.

ModelTypeMMLU-Pro
knowledge
GPQA
science
SWE-bench
coding
✓ Where open wins for you

On knowledge (MMLU-Pro ~84-90), graduate science (GPQA ~80) and coding (SWE-bench ~65-67), top open models match or beat older cloud models like GPT-4o - running privately on your hardware, with no per-token bill.

⚠ Where cloud still leads

The hardest agentic, long-horizon tasks - large-repo autonomous coding, sustained multi-step planning - still favour the latest frontier Claude/Gemini/OpenAI. That's exactly what our hybrid setups route to the cloud, on demand.

The honest bottom line

For drafting, summarising, extraction, Q&A, translation and everyday coding - the bulk of business work - open models are more than good enough. You keep ~all the quality and stop paying for everything.

Trending local models · updated 2026

The open models worth running - filter by what you do.

Now including the April-May 2026 wave - Qwen3.6, Kimi K2.6, GLM-5.1, DeepSeek V4, MiniMax M2.7 and Gemma 4. Every model below runs locally with Ollama, LM Studio or vLLM. Filter by task, sort by size, and see the VRAM each needs at Q4.

All Coding Reasoning General Writing Vision Edge Embed Audio

Everything that runs a local LLM

GPUs, Apple Silicon, mini-PCs, servers - the complete map.

The practical accelerators for genuine local LLM hosting. Prices are indicative mid-2026 street estimates (an active DRAM/GPU shortage is inflating prices - we confirm exact figures at quote time).

NVIDIA & AMD GPUs

GPUVRAMBandwidthApprox RMBiggest model @ Q4Class
Intel Arc B58012 GB456 GB/sRM 1.2k-1.6k7-8B (IPEX/Vulkan)Budget
RTX 306012 GB360 GB/sRM 1.3k-1.8k7-8BBudget
Intel Arc A77016 GB560 GB/sRM 1.4k-1.9k13-14BBudget
RTX 4060 Ti 16GB16 GB288 GB/sRM 2.2k-2.8k13-14BBudget
RTX 407012 GB504 GB/sRM 2.8k-3.2k13BBudget
RTX 507012 GB672 GB/sRM 3k-3.7k13BConsumer
RTX 5070 Ti16 GB896 GB/sRM 4.2k-5.5k14B; GPT-OSS 20BConsumer
RTX 3090 / Ti24 GB936 GB/sRM 4.1k-6k (used)32B dense / 30B-A3BBudget
RTX 4080 / Super16 GB717 GB/sRM 4.6k-6k14B; GPT-OSS 20BConsumer
RTX 409024 GB~1008 GB/sRM 8.7k-12k32B dense / Gemma 3 27BProsumer
RTX 508016 GB~960 GB/sRM 5.5k-8k14B; GPT-OSS 20BProsumer
RTX 509032 GB1792 GB/sRM 14k+32B comfortably; 70B Q3 tightProsumer
RTX PRO 5000 Blackwell48 / 72 GB1344 GB/sRM 19k+70B dense Q4Workstation
RTX 6000 Ada48 GB960 GB/sRM 31k-37k70B dense Q4Workstation
AMD Radeon PRO W790048 GB864 GB/sRM 16k-18k70B (ROCm)Workstation
RTX PRO 6000 Blackwell96 GB1792 GB/sRM 39k-44k120B-class on ONE cardWorkstation
NVIDIA L424 GB300 GB/sRM 11k+32B (low-power, slow)Datacenter
NVIDIA L40S48 GB864 GB/sRM 32k-41k70B dense Q4Datacenter
A100 80GB80 GB2039 GB/sRM 41k-69k120B-class; 235B (2×)Datacenter
H10080 GB3.35 TB/sRM 115k-147k120B single; frontier multiDatacenter
H200141 GB~4.8 TB/sRM 115k-161k235B singleDatacenter
B200 (Blackwell)180 GB~8.0 TB/sRM 161k+235B+ single; frontierDatacenter
AMD Radeon 7900 XTX24 GB960 GB/sRM 4.1k-5k32B (ROCm/Vulkan)Budget
AMD Instinct MI300X192 GB5.3 TB/sRM 46k-69k235B+ single GPUDatacenter
AMD Instinct MI325X256 GB6.0 TB/sRM 92k+300B+ single GPUDatacenter

Multi-GPU note: consumer 40/50-series have no NVLink - GPUs talk over PCIe (tensor-parallel via vLLM, with overhead). 2× 24GB ≈ 48GB → 70B Q4; 4× 3090 ≈ 96GB → 120B-class. A single RTX PRO 6000 96GB often beats multi-GPU on simplicity and power.

Apple Silicon - unified memory advantage

ChipMax memoryBandwidthProductApprox RMBiggest model @ Q4
M416-32 GB120 GB/sMac Mini / AirRM 2.8k+14B-30B-A3B
M4 Pro64 GB273 GB/sMac Mini Pro / MBPRM 6.4k+32B; 70B Q4 tight
M4 Max128 GB546 GB/sMac Studio / MBP 16RM 9.2k+70B dense; GPT-OSS 120B
M5 Max NEW 2026128 GB~546 GB/sMac Studio / MBP 16RM 10.5k+70B dense; faster GPU + NPU, better tok/s
M3 Ultra256 GB819 GB/sMac StudioRM 18.4k+235B-class; Qwen3-235B Q4

Apple's unified memory lets the GPU address all RAM - a cheap path to huge models, limited by bandwidth not capacity. A 256GB M3 Ultra Mac Studio is the most popular single-box big-model machine. (Note: M4 Ultra was never released; Apple withdrew the 512GB option in 2026.)

Mini-PCs & "AI-in-a-box"

DeviceChipMemoryWhat it runsApprox RM
NVIDIA DGX SparkGB10 Grace-Blackwell128 GBup to ~200B Q4; CUDA-native dev boxRM 21.6k
Framework DesktopRyzen AI Max+ 395128 GB unified70B, GPT-OSS 120B, Qwen3-235B Q4RM 9.2k-13k
GMKtec / Minisforum / CorsairRyzen AI Max+ 395128 GBsame Strix Halo classRM 11k-16k
Jetson Thor (AGX)Blackwell edge128 GB70B-class at the edgeRM 16k
Jetson Orin Nano/AGXAmpere edge8-64 GB7B-13B (robotics / CCTV)RM 1.1k-9.2k

DGX Spark vs Strix Halo: DGX Spark wins on CUDA software compatibility; Strix Halo wins on price (~half) and x86/Linux. Both are bandwidth-limited (~256-273 GB/s) - great for memory-heavy MoE inference, weaker on fast prefill.

TPUs & other accelerators - the honest truth

AcceleratorLocal LLM?Reality
Google Coral Edge TPU✗ NoBuilt for tiny vision CNNs. No DRAM, int8 only, no transformer/attention support - cannot run even a 1B LLM.
Google Cloud TPU (Trillium/Ironwood)⚠ Cloud-onlyPowerful for training/serving via JAX, but rented hourly - not on-prem hardware you own.
Groq LPU⚠ CloudUltra-low-latency inference as a cloud API; real deployments need racks. Not a consumer-local box.
Cerebras WSE-3✗ NoWafer-scale, $2M+ datacenter systems. Not local in any SME sense.
Hailo-8/10 NPU⚠ NicheEdge vision / very small on-device models only.

Takeaway: for genuine local LLM hosting, the practical accelerators are NVIDIA GPUs, AMD Instinct/Radeon, Apple Silicon, and unified-memory mini-PCs. We'll tell you honestly which fits - never sell you a Coral stick for an LLM.

Multi-GPU servers & rack

TierGPUsHostUse caseApprox RM
Entry rig2× RTX 3090/4090Threadripper, 128GB, 1500W70B Q4, small teamRM 23k-41k
Prosumer WS1-2× RTX PRO 6000 96GBTR PRO, 256GB ECC120B single-boxRM 55k-100k
4-GPU server4× L40S / RTX 6000 AdaDual EPYC, 512GB-1TB ECC, 4U235B Q4, multi-user vLLMRM 184k-322k
8-GPU HGX8× H100/H200 SXMNVLink/NVSwitch, 2TB RAM, liquidFrontier inference + trainingRM 1.4M+

Engineering: 8× H100 ≈ 5.6kW (needs 3-phase power); 4U GPU servers are loud (liquid cooling above 4× SXM); EPYC/Threadripper PRO for PCIe 5.0 lanes; platinum/titanium PSUs with N+1 redundancy. VYROX sources, assembles, commissions and supports the whole node.

Will it fit? · interactive

Pick a model. See the VRAM it needs and what runs it.

Choose a model, quantization and context length - we compute the memory and light up the hardware that fits.

Quantization
FP16 Q8 Q6 Q4 Q3
Context length 8K
Concurrent users 1
VRAM / memory required
18 GB
Qwen3-Coder 30BQ4
Quantization explained

Smaller numbers, almost the same brain.

Quantization shrinks a model by storing weights at lower precision. Drag to see size vs quality - Q4 is the local sweet spot.

Precision Q4_K_M

Q4_K_M - the local default. ~4× smaller than FP16 with only ~2-3% quality loss. The best fit-vs-quality tradeoff for almost every build.

Llama 70B: 140 GB → 40 GB

Size
28%
Quality
98%
FP162.0 B/paramreference, fine-tuning
Q8~1.1 B/paramnear-lossless
Q4_K_M~0.6 B/paramsweet spot
Q2~0.4 B/paramsqueeze big models, quality drops
What can I run on MY machine?

Tell us your hardware. We'll list what fits.

Usable memory for models
~30 GB

A capable single-GPU / Apple-Silicon class machine.

Models that run on your machine (best practical quant)
Sizing cheat-sheet

Model size → memory → hardware, at a glance.

Rule of thumb: Q4 VRAM (GB) ≈ params(B) × 0.6. MoE models size to total params for memory, but run at the speed of their active params. Always add KV-cache for long context.

Model sizeQ4_K_MQ8FP16Example hardware (Q4)
1-3B~1-2 GB~3 GB~6 GBAny iGPU, phone, Jetson, 8GB GPU
7-8B~5-6 GB~9 GB~16 GBRTX 4060 8GB, M4 16GB
13-14B~9-10 GB~16 GB~28 GBRTX 4070/4080, M4
24-32B~16-20 GB~34 GB~65 GBRTX 3090/4090/5090, M4 Pro
70B dense~40-43 GB~75 GB~140 GB2× 3090/4090, RTX 6000 Ada, M4 Max
120B MoE (GPT-OSS/Scout)~60-65 GB~120 GB-RTX PRO 6000 96GB, A100, M3 Ultra, DGX Spark
235B MoE~135-145 GB~250 GB-2× A100, MI300X, M3 Ultra 256GB
671B (DeepSeek)~380-400 GB~700 GB-8× H100/H200, multi-MI300X
~1T MoE (Kimi K2.6 / DeepSeek V4)~550-880 GB~1-1.6 TB-8× H200/B200 server

Context cost: KV-cache grows with context length × layers. A 7-8B model adds ~0.5-1 GB per 8K tokens; a 70B at full 128K context can add 20-40GB+ - often the hidden cost that blows a VRAM budget. We size for weights plus your real context window.

Licensing, power & operations

The details that decide whether you can actually run it.

Open weights don't all mean "free for business." And a multi-GPU box has real power, heat and uptime needs. We handle all of it - here's the honest picture.

Open-weight licensing (commercial use)

LicenseModelsCommercial use
Apache-2.0Qwen3 & Qwen3-Coder, GPT-OSS, Mistral Small 3.1 / Nemo / Devstral / Magistral, Gemma-adjacentFree, permissive - yes
MITDeepSeek R1 / V3.1 (+ distills), Phi-4, GLM-4.5/4.6, bge-m3, WhisperFree, permissive - yes
GemmaGemma 3 (1B-27B)Yes, under Google's Gemma terms (prohibited-use policy applies)
Llama Community / Llama 4Llama 3.x, Llama 4 Scout / MaverickYes - but a special licence is required above 700M monthly active users
Mistral Research (MRL)Mistral Large 2, Ministral 8BResearch/non-commercial weights - commercial needs a Mistral licence
MNPL (non-production)Codestral 22BNot for production without a commercial agreement

VYROX defaults your build to permissively-licensed models (Apache/MIT) so you own your deployment outright - and flags any model with commercial restrictions before it's used.

Power, heat & electrical

TierPeak drawHeat outputPower / coolingNoise
Desk AI~0.4-0.6 kW~2,000 BTU/hStandard 13A wall socket, room airQuiet desktop
Studio AI~0.6-0.9 kW~3,000 BTU/hDedicated 13-15A circuit, ventilatedLow-moderate
Engine AI~1.2-1.8 kW~6,000 BTU/h20A circuit, UPS, server cupboard / airconServer-grade fans
Rack AI (8-GPU)~6-7 kW~22,000 BTU/h3-phase power + room cooling / liquid, N+1 PSULoud - data-centre/room

A Desk/Studio build sips a few ringgit of electricity a day. Engine tiers want a UPS and a ventilated cupboard. Rack tiers need real facilities - we assess your site and spec power, cooling and UPS as part of delivery.

Networking & access

Staff reach it on your LAN via a browser (Open WebUI) or an OpenAI-compatible API. Remote access over VPN; a reverse proxy + gateway handles auth, SSO and rate-limiting; vLLM load-balances many users across GPUs.

Backup & high availability

Models, the vector DB and configs are backed up and reproducible. For mission-critical use we add a warm spare, a cloud-failover line, or a second node so a single box is never a silent point of failure.

Warranty & support

Hardware carries manufacturer warranty (typically 3 years on workstation/datacentre parts); we handle RMA. Optional managed-service tier adds monitoring, model upgrades and a same-business-day SLA. Leasing / instalment options available.

Feel the speed · interactive

How fast is local, really?

Pick a model and a machine - we'll stream sample text at the estimated tokens/sec so you can feel it.

Qwen3-Coder 30B · RTX 5090~48 tok/s
Press “Run it” to watch the model generate at its estimated local speed…

Estimated single-stream throughput; real speed varies with quant, context and batching. Most people read at ~5 tok/s - anything above that feels instant.

The money math · interactive

Cloud AI is a bill that never stops. Local is a cost that ends.

Move the sliders to your team. Watch the cumulative cost diverge and find your break-even month.

Team size 20
Cloud cost / person / month RM 180
Years to compare 5
One-time local build (RM) RM 27,000

Build cost auto-suggests from the tier you'd need; drag to match a real quote.

You save over 5 years
RM 138,000
Cloud subscriptionLocal (VYROX)Break-even
Break-even
7 months
Cloud over period
RM 216,000

Illustrative estimate. Cloud ≈ ChatGPT Team / Claude Team seats (~RM130-280/user/mo) at ≈ RM4.6/USD; local figure includes hardware + commissioning; electricity ~RM1-6/day. Your exact crossover is calculated in the free audit.

Honest decision guide

Local, cloud, or hybrid? The straight trade-off.

No single answer is right for everyone. Here's the comparison without the spin - including when cloud or hybrid is genuinely the better call.

DimensionLocal / On-premCloud APIHybrid
Data privacyHighest - never leaves youLowest - sent to a third partyHigh - sensitive stays local
Recurring costLow & fixed (power + support)Variable - can balloonMixed - base + overflow
Upfront costHigher - hardwareNear zeroModerate
LatencyLow, predictable (your LAN)Internet + provider loadDepends on path
Offline operationYesNoPartial
Frontier reasoning ceilingVery good (hardware-capped)HighestBest of both
Scaling to spikesHardware-limitedElasticBurst to cloud
MaintenanceManaged by VYROXVendor-managedMost complex
When cloud is actually better
  • You need the absolute top frontier model and data isn't sensitive
  • Usage is very low or unpredictable - hardware would sit idle
  • You're prototyping before committing to hardware
  • Sudden massive spikes on-prem can't economically cover
When hybrid wins
  • Routine work runs local; a few hard tasks need a frontier model
  • Sensitive data strictly on-prem, non-sensitive bursts to cloud
  • You're migrating cloud → local and want a gradual cutover
  • You need an overflow valve for seasonal peaks
Our honest take

For most SMEs handling private or regulated data with steady daily volume, local pays for itself in months and removes per-token billing risk. If cloud or hybrid fits you better, we'll tell you - and build that instead.

Designed builds · costed

Four ready-to-deploy local-AI builds.

Complete, VYROX-commissioned systems - hardware sized, models loaded, runtime and agents wired in, staff trained.

Tier 01 · Solo desk

Desk AI

RM 9k-13k once
  • Mac Mini M4 Pro 48GB or RTX 4090 24GB + 128GB RAM
  • Qwen3-Coder 30B, DeepSeek R1 Distill 32B, Gemma 3 27B
  • ~30-45 tok/s single-stream
  • ~1-3 concurrent · team of 3-8
  • Ollama + Roo Code / Continue.dev
Replaces ~RM12k/yr cloud + Copilot. Payback ~12-18 mo incl. setup.
Tier 02 · Team - popular

Studio AI

RM 22k-32k once
  • Mac Studio M4 Max 128GB (runs 70B) or RTX 5090 32GB (runs 32B-class)
  • Qwen3 32B, DeepSeek R1 70B (Mac), GLM 4.5 Air
  • ~60-100 tok/s on 32B single-stream
  • ~5-10 concurrent · team of 20-50
  • Ollama + Open WebUI + Cline / Aider
Replaces ~RM33k/yr cloud seats. Payback ~10-14 mo - then no per-seat licence.
Tier 03 · Heavy / agents

Engine AI

RM 55k-75k once
  • 2× RTX 5090 (64GB) or RTX PRO 6000 96GB, Threadripper, 256GB
  • Qwen3.6-35B, GLM-5.1, Kimi K2.6 / DeepSeek V4 (server)
  • 100+ tok/s aggregate (vLLM batching)
  • ~15-30 concurrent · 60-150 staff + agents
  • vLLM + Open WebUI + agent fleet
Agent token-burn alone runs RM100k+/yr in cloud. Payback ~6-9 mo.
Tier 04 · Org / frontier

Rack AI

RM 180k-500k+ once
  • 4-8× RTX 5090 / RTX PRO 6000 / H100 / H200 / DGX
  • DeepSeek V3.1 671B, Qwen3-235B, frontier open models
  • Hundreds tok/s aggregate
  • ~100+ concurrent · whole organisation
  • vLLM cluster + gateway + SSO + monitoring
Replaces RM300k+/yr enterprise AI - with full data sovereignty.

How "users supported" is calculated: concurrent users = free VRAM after model weights ÷ KV-cache per session (≈ 2 × layers × kv-dim × context × precision), capped by GPU throughput ÷ a 15 tok/s per-user floor and by the runtime's parallel slots (Ollama ~8, vLLM many). "Team size" assumes typical intermittent office use (~1 active generation per 5 staff). Numbers shown are at ~8K context with FP16 KV-cache; Q8 KV-cache roughly doubles concurrency and shorter context increases it further. Use the cost configurator to model your exact model, context and concurrency.

Build advisor · interactive

Tell us your purpose. Get your build, cost and payback.

Pick what you'll use AI for and your team - the recommendation, hardware, model and 5-year savings update instantly.

What will you use AI for? pick all
Coding Research Writing Vision / docs 24/7 agents
People using it 10
Cloud cost / person / mo RM 180
Hardware preference
No preference Apple Mac NVIDIA GPU
Recommended build
Studio AI

A shared server for your team - big-model quality for everyone, fully private.

HardwareRTX 5090 / Mac Studio
ModelQwen3 32B / R1 70B
Memory32-128 GB
Users~5-10 concurrent
ToolsOllama + Open WebUI
RM 27k
One-time
11 mo
Break-even
RM 138k
5-yr saved
Get this build quoted free
Cost configurator · build it your way

Adjust the model, the hardware and the usage. See the full cost.

Two ways to use it: set how many concurrent users you need and leave hardware on Auto - we pick the cheapest build that serves them - or drive every variable yourself (LLM size, quantization, GPU, context, hours/day). The itemised build cost, capacity, payback and cloud savings update live. Indicative estimates; your exact quote comes from the audit.

1 · Primary purpose
Coding Research Writing Vision / docs RAG chatbot 24/7 agents
2 · LLM size 24-32B
3 · Quantization Q4_K_M
4 · Context length 8K
5 · Concurrent users needed 4

On Auto, we size the cheapest build that serves this many at your chosen model & context.

7 · Usage 8 h/day
8 · Support plan
Self-managed Standard Premium
Cloud you'd otherwise pay · seats 10
Cloud cost / seat / mo RM 180
Compare over 5 yrs
Fits - Qwen3 32B class on this build
Users this build supports
~6 concurrent · team of ~30
One-time build
RM 27,000
Payback vs cloud
11 mo
Running / yr
RM 2,400
Performance
~60 t/s
Saved / 5 yrs
RM 138k
Total one-timeRM 27,000
Cumulative cost - cloud vs your build
CloudYour buildBreak-even
Get this exact build quoted free

Indicative estimate. Hardware ≈ Malaysia street pricing during the 2026 GPU/DRAM shortage; commissioning, electricity (TNB commercial ~RM0.50/kWh) and optional support included as shown. Cloud baseline ≈ Team-tier seats (+ agent API where selected). VYROX confirms exact specs, prices and a measured savings projection in the free audit.

Stop guessing - get the numbers

We'll size your exact build and put the break-even date in writing.

A free 45-minute Local-AI Audit: your real cloud spend today, the right hardware, the models, and the costed payback - no obligation, no sales deck.

Security & compliance

"Where does my data go?" Nowhere you don't control.

Built for the privacy-driven buyer. Choose a deployment mode, and layer on the controls your auditor expects.

MODE 01

On-premise server

The LLM runs on a server physically inside your office, factory or data centre. Reachable over your LAN/VPN; the public internet cannot touch it. Best balance of control and convenience.

MODE 02

Air-gapped

No internet connection at all. Updates applied manually via controlled media. For the most sensitive environments - defence-adjacent, critical infrastructure, regulated health/finance.

MODE 03

Private cloud / VPC

Deployed inside your own cloud tenancy or a Malaysian data centre. You keep data residency and isolation, with cloud scalability - local-grade control without owning physical servers.

PDPA alignmentArchitected so personal data stays within your control and within Malaysia, supporting your PDPA 2010 obligations.
Data residencyYou choose exactly where data physically lives - it can stay entirely on your premises / in Malaysia.
No third-party sharingNo prompts, documents or outputs are sent to any external AI provider. Full stop.
Role-based access (RBAC)Users and teams only see the data and tools they're permitted to.
Audit loggingEvery query and access is logged for traceability and incident review.
EncryptionAt rest (documents, vector DB, model data on disk) and in transit (TLS across LAN/VPN).
Single Sign-OnIntegrates with Microsoft Entra/Azure AD, Google Workspace, or LDAP.
ISO 27001-aligned processWe follow ISO 27001-aligned practices for access, change and key management during delivery.

VYROX implements ISO 27001-aligned and PDPA-aligned controls and supports your compliance posture; formal certification of your organisation remains with you and your auditor.

RAG vs fine-tuning

Most customers need RAG, not fine-tuning. We'll tell you which.

Use RAG when…

You want the model to know your facts - documents, policies, products, prices. Faster, cheaper, instantly updatable (just add files), and it cites sources. This covers the large majority of business needs.

Fine-tune (LoRA) when…

You need to change the model's behaviour or style - a consistent house tone, a strict output format, domain jargon, or a narrow task it must perform identically every time. RAG adds knowledge; fine-tuning shapes behaviour.

  • What it needs: a few hundred to a few thousand good example pairs - quality matters far more than quantity.
  • What it costs: LoRA is efficient - a single GPU, hours-to-days, not weeks of full retraining.
  • It stays private: training runs on your hardware; your data never leaves, and the resulting model is yours.
  • VYROX handles it end-to-end - curate examples, train, validate against real cases, and roll back if it doesn't beat the RAG baseline. We only fine-tune when it earns its keep.
The runtime & agent stack

What you actually drive the model with.

EASIEST · GUI

LM Studio

The most polished, beginner-friendly graphical app. Browse, download and chat with models - no command line.

POPULAR · CLI

Ollama

Developer favourite. One command - ollama run qwen3 - pulls and serves a model in the background.

API · OFFLINE

LocalAI

Drop-in replacement for the OpenAI API. Point existing apps at your local server - text, audio and image, fully offline.

PROD · THROUGHPUT

vLLM

High-throughput serving for teams: tensor-parallel multi-GPU, batching, OpenAI-compatible endpoints.

Open WebUI · team chat Roo Code · IDE agent Cline · autonomous coding Aider · terminal pair-programmer Continue.dev · IDE copilot llama.cpp · the engine MCP · tool connectors

Which runtime, when?

RuntimeBest forInterfaceMulti-GPUScale
OllamaEasiest one-dev prototyping, any OSCLI + APIWeak (1 GPU/req)Single user
LM StudioGUI-first desktop for non-CLI usersGUI + serverLimitedSolo → small team
vLLMProduction multi-user servingServer + APIStrong (TP/PP)Production team
llama.cppEdge / embedded / max format controlCLI + serverYes (layers)Single / edge
LocalAIOpenAI-compatible API gateway over many backendsREST APIVia backendTeam / gateway
Open WebUIShared ChatGPT-style team interfaceWeb GUIN/A (front-end)Team chat

Rule of thumb: solo + simplicity → Ollama / LM Studio · edge/embedded → llama.cpp · concurrent production serving → vLLM · one API over mixed backends → LocalAI · team chat UI → Open WebUI on top. VYROX picks and configures the right combination for your build.

How VYROX delivers

From first call to in-production - in 4 to 8 weeks.

A clear, six-phase engagement. Indicative timelines for a typical SME deployment; you get something concrete at every step.

Free Audit 2-5 days

We map use cases, review your data and infrastructure, and assess fit. You get: a written findings summary and an honest local-vs-cloud-vs-hybrid recommendation. No cost, no obligation.

Spec & Quote 3-7 days

We design models, hardware sizing, deployment mode, integrations and security controls. You get: a detailed spec and a fixed-price quote in RM with timeline.

Procure & Build 1-3 weeks

We source and configure hardware (or provision your VPC), install the stack, and build your RAG/data pipeline. You get: a configured, tested system; procurement handled for you.

Install & Integrate 3-7 days

We deploy on your premises (or tenancy) and connect data sources, SSO and existing tools. You get: a live, integrated, security-hardened system.

Tune & Train 1-2 weeks

We tune retrieval and prompts on your real data, run accuracy checks, and train your staff. You get: a validated system, trained users, and docs (EN/BM).

Ongoing Support continuous

Monitoring, updates, model upgrades and a support line. You get: SLA-backed support and a roadmap for new use cases.

Straight answers

"Sounds good - but is local really practical?" Yes.

MYTH
"Local is a downgrade from ChatGPT."

Fact: for everyday work - drafting, summarising, extraction, coding, internal Q&A - Qwen3.6, Kimi K2.6, GLM-5.1 and DeepSeek V4 run at quality very close to the big clouds. We build hybrids that call cloud only when it genuinely wins.

MYTH
"It'll be obsolete in a year."

Fact: new open models ship monthly and are free. Swapping today's model for next year's best is a one-line change on the same hardware. Your rig gets smarter over time, for RM0.

MYTH
"There's no support if it breaks."

Fact: every build ships with remote monitoring, free model upgrades, and a same-business-day SLA. Standard open-source, documented, your IT trained. No black box, no lock-in.

The Break-Even Guarantee

We won't quote a build unless it pays back within 12 months versus your current cloud bill.

If your measured first-year savings don't beat the subscriptions it replaced, we re-tune the system at our cost until they do - and you keep the hardware either way. Every build also includes remote health monitoring, free model & runtime upgrades, and a same-business-day response SLA.

Jargon-buster

Every term you'll hear, in plain English.

LLM, open-weight model & parameters
LLM - the AI that understands and generates text; the "brain." Open-weight model - one whose internals are published so you can download and run it yourself (Qwen3, Llama 4, Gemma 3) - what makes local deployment possible. Parameters - the model's internal settings, counted in billions (8B, 70B); more generally means more capability and more hardware.
Quantization, VRAM & GGUF
Quantization (Q4/Q8) - compressing a model so it runs on smaller, cheaper hardware with minimal quality loss. GGUF - a common model file format. VRAM - the memory on a GPU; the single biggest factor in which models you can run and how fast (the model must fit in VRAM).
Tokens, context window & tokens/sec
Tokens - the chunks of text models read and write, roughly ¾ of a word each. Context window - how much text the model can consider at once (its short-term memory), in tokens. Tokens/sec - how fast it generates text; higher feels snappier.
KV-cache & MoE (Mixture of Experts)
KV-cache - a speed optimisation that avoids re-reading earlier text on every word; uses VRAM but makes responses much faster. MoE - a design that activates only the relevant "expert" sub-networks per task, giving big-model quality at lower running cost (e.g. Qwen3-30B-A3B uses ~3B active params of 30B).
RAG, embeddings & vector database
RAG - connecting the model to your own documents so it answers from them, with citations. Embeddings - numeric "fingerprints" of text that let a computer find passages by meaning, not keywords. Vector database - the store (pgvector, Qdrant, Chroma) that holds embeddings and finds the most relevant ones fast: the search index for RAG.
Inference, fine-tuning / LoRA & hallucination
Inference - running the model to get an answer (vs training it). Fine-tuning / LoRA - further-training a model on your data to change its style or specialise it; LoRA is the efficient way to do it without retraining the whole model. Hallucination - when a model states something false but confidently; RAG and grounding are the main defences.
MCP, on-prem vs cloud & the runtimes
MCP (Model Context Protocol) - an open standard that lets the LLM securely connect to tools and data sources. On-prem vs cloud - on-prem runs on hardware you control; cloud runs on someone else's servers via an API. Ollama / vLLM - the engines that serve the model (Ollama simple, vLLM high-throughput); Open WebUI - the ChatGPT-style web interface your staff use.
The honest answers

Local LLM questions you're actually wondering.

Is a local LLM really as good as ChatGPT or Claude?
For everyday business work - drafting, summarising, extraction, classification, internal Q&A, coding help and chat - the 2026 open models (Qwen3.6, Kimi K2.6, GLM-5.1, DeepSeek V4) run at quality very close to the big clouds; Kimi K2.6 even leads open agentic coding. For rare frontier-reasoning tasks we build a hybrid that calls cloud Claude or GPT only when it genuinely wins. You keep almost all the quality and stop paying for everything.
What hardware do I actually need?
It scales with model size: a 7-14B model runs on a 16GB GPU or M4 Mac; a 32B needs a 24-32GB GPU (RTX 3090/4090/5090) or M4 Pro; a 70B needs ~48GB (2× 24GB, RTX 6000 Ada, or a 128GB Mac Studio); 120B-class fits a single RTX PRO 6000 96GB, an A100, a 128GB Mac, or a DGX Spark / Strix Halo mini-PC. Use the VRAM calculator above - or let us size it exactly.
What's the cheapest way to start?
A used RTX 3090 (24GB, ~RM4-6k) or a Mac Mini M4 Pro runs Qwen3-Coder 30B and DeepSeek R1 Distill 32B comfortably - a complete Desk AI build is RM9k-13k. For many one-to-three-person teams that pays for itself inside a year versus cloud seats.
What if a better model launches next year - is my hardware wasted?
No. New open models drop monthly and are free. Swapping today's model for next year's best is a one-line change on the same hardware - no new licence, no migration. GPUs stay useful for several model generations and hold strong resale value.
Can a TPU or Coral stick run a local LLM?
No. The Coral Edge TPU is built for tiny vision models and can't run modern LLMs (no DRAM, no transformer support). Cloud TPUs and Groq/Cerebras are rented cloud services, not on-prem hardware you own. For genuine local hosting the practical accelerators are NVIDIA GPUs, AMD Instinct/Radeon, Apple Silicon, and unified-memory mini-PCs.
Who maintains it? What if it breaks at 2am?
We don't install and vanish. Every build ships with remote health monitoring, free model & runtime upgrades, and a same-business-day response SLA. It's standard open-source (Ollama, vLLM, Open WebUI) - fully documented, your IT trained. You own the box; we keep it running.
How much does it cost to run - power, maintenance?
A Desk/Studio build draws roughly the power of a gaming PC - a few ringgit of electricity a day. The one-time price includes hardware, setup, tuning, integration and training; ongoing cost is electricity plus an optional support plan. We put the full picture into the costed audit.
Can it connect to our tools - email, docs, ERP, WhatsApp?
Yes. Over MCP we wire the local model into email, documents, AutoCount/ERP and WhatsApp - the same integrations as cloud AI, with none of the data leaving your network.
Your move

Stop renting your AI. Own it by next quarter.

Book a free 45-minute Local-AI Audit. We measure your current cloud spend, spec the exact build, and give you the costed break-even date - in writing, no obligation.

No deck pitch. Just engineers sizing your build.

Free Local-AI audit