Default answer to every AI problem: "just call the API." GPT-4, Claude, pick one, ship it.
Until you get the bill. Or the latency spikes. Or the model hallucinates on your users' language.
Grab's problem: extract text from ID cards across Southeast Asia. Thai, Vietnamese, Bahasa Indonesia.
Big models? Hallucinations on non-Latin scripts. P99 latency 3-4x higher than P50. Inconsistent outputs that broke their verification pipeline.
So they built their own. 1B parameters. Trained from scratch on their exact use case.
The results were stark. GPT-4 Vision had a P50 latency of 800ms and P99 of 2,400ms. Grab's model hit 416ms P50 and 1,056ms P99. Monthly cost dropped from $15,000 to $2,000 at a million images per month. Accuracy on Southeast Asian IDs went from 94% to 98%. And data stays internal—no more sending ID cards to a third party.
That's $156,000 in annual savings, better accuracy, and faster P99.
When to call the API: You're prototyping or running low volume—under 10K requests a month. The task is generic, like summarization or translation to major languages. You don't have ML engineers on the team. Data can leave your infrastructure. Time to market matters more than cost.
When to train your own: You're in a specific domain—medical, legal, regional languages. You have high volume where API costs exceed infrastructure costs. You have strict latency requirements, like P99 under 500ms. The task is narrow enough that a small model wins. Data must stay internal for compliance or privacy.
The crossover point is usually around 100K requests per month. That's when running your own model becomes cheaper than API calls.
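Rough math, with made-up prices. The per-request API rate, the fixed hosting cost, and the marginal inference cost below are placeholders, not Grab's numbers. Plug in your own quotes.

```python
# Back-of-the-envelope break-even: hosted API vs. self-hosted model.
# All prices are hypothetical placeholders -- replace with your own quotes.

API_COST_PER_REQUEST = 0.015           # e.g. ~$15 per 1K vision requests
SELF_HOSTED_FIXED_MONTHLY = 1500       # GPUs, storage, monitoring
SELF_HOSTED_COST_PER_REQUEST = 0.0005  # marginal compute per inference

def monthly_cost_api(requests: int) -> float:
    return requests * API_COST_PER_REQUEST

def monthly_cost_self_hosted(requests: int) -> float:
    return SELF_HOSTED_FIXED_MONTHLY + requests * SELF_HOSTED_COST_PER_REQUEST

def break_even_requests() -> float:
    # Solve: r * api_rate = fixed + r * marginal_rate
    return SELF_HOSTED_FIXED_MONTHLY / (API_COST_PER_REQUEST - SELF_HOSTED_COST_PER_REQUEST)

if __name__ == "__main__":
    print(f"Break-even: ~{break_even_requests():,.0f} requests/month")
    for r in (10_000, 100_000, 1_000_000):
        print(f"{r:>9,} req/mo  API ${monthly_cost_api(r):>9,.0f}  "
              f"self-hosted ${monthly_cost_self_hosted(r):>9,.0f}")
```

With these placeholder numbers the break-even lands just above 100K requests a month. Engineering time isn't in there, so the real bar is higher.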
Here's the counterintuitive part.
GPT-4 has 1.7 trillion parameters, trained on everything. Grab's model has 1 billion parameters, trained specifically on ID cards.
For ID card text extraction in Southeast Asian languages, GPT-4 Vision hits 94% accuracy. Grab's model hits 98%. Why?
1.7 trillion parameters spread across all tasks means diluted knowledge. 1 billion parameters focused on one task means concentrated expertise. Grab trained on real Southeast Asian ID cards, not internet text. They fine-tuned on edge cases that matter—blur, glare, stamps, worn corners.
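What "edge cases that matter" looks like in practice: augment your training images so the model sees blur, glare, and warped cards before production does. A minimal sketch using torchvision transforms; the parameters are assumptions, not Grab's actual pipeline.

```python
# Sketch: synthesize the edge cases that break OCR on real ID cards --
# blur, glare, rotation, warped or worn cards. Hypothetical parameters.
from PIL import Image
from torchvision import transforms

id_card_augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),                        # slightly skewed photos
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),   # warped / worn cards
    transforms.ColorJitter(brightness=0.4, contrast=0.3),        # glare and poor lighting
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),    # camera blur
])

img = Image.open("sample_id_card.jpg")   # placeholder path
augmented = id_card_augment(img)         # returns a PIL image, ready for labeling
```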
A small, focused model trained on your exact task beats a giant general-purpose model. Every time.
Grab didn't just train a model. They built infrastructure.
First, an auto-labeling platform with synthetic data generation, human-in-the-loop validation, and quality scoring. Second, a data pipeline: ingest, clean, annotate, validate, store. They accumulated 500K labeled examples over six months. Third, a three-phase training approach: pre-train, fine-tune, alignment.
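Here's roughly what that pipeline skeleton looks like. The stage names mirror the ones above; every function body, threshold, and field name is a hypothetical stub, not Grab's system.

```python
# Hypothetical skeleton of the ingest -> clean -> annotate -> validate -> store
# flow described above. Every stage is stubbed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    image_path: str
    label: Optional[dict] = None   # extracted fields: name, id_number, dob...
    quality_score: float = 0.0     # 0..1, set during validation

def ingest(paths: list[str]) -> list[Example]:
    return [Example(image_path=p) for p in paths]

def clean(examples: list[Example]) -> list[Example]:
    # drop corrupt files, deduplicate, normalize resolution (stubbed)
    return examples

def annotate(examples: list[Example]) -> list[Example]:
    # auto-label with the current model, route low-confidence items
    # to human reviewers (human-in-the-loop) -- both stubbed here
    return examples

def validate(examples: list[Example], min_score: float = 0.9) -> list[Example]:
    # keep only examples whose label passes quality scoring
    return [ex for ex in examples if ex.quality_score >= min_score]

def store(examples: list[Example]) -> None:
    # write to the training datastore (stubbed)
    pass

def run_pipeline(paths: list[str]) -> None:
    store(validate(annotate(clean(ingest(paths)))))
```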
Time allocation: model architecture decisions took 20% of the effort. The data pipeline and labeling took 80%.
The model is 20% of the work. The data pipeline is 80%.
If you don't have the data infrastructure, you don't have a model. You have a side project that will never ship.
Before you build, check yourself.
Do you have 100K+ labeled examples? Is the task narrow and well-defined? Will you actually save money at your volume? (Do the math.) Do you have ML engineers who can maintain it? Is latency a hard requirement—P99 under 500ms? Must data stay internal for compliance or privacy?
If you checked four or more, build. Fewer than four? Keep calling the API.
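If you want the checklist to live in a planning doc rather than your head, it fits in a dozen lines. The four-out-of-six threshold is the rule above; the example answers are placeholders.

```python
# The build-vs-buy checklist from above, encoded as a simple score.
checklist = {
    "100K+ labeled examples": False,
    "task is narrow and well-defined": True,
    "self-hosting is cheaper at our volume": False,
    "ML engineers available to maintain it": True,
    "hard latency requirement (P99 < 500ms)": False,
    "data must stay internal": True,
}

score = sum(checklist.values())
print(f"{score}/6 boxes checked ->", "build" if score >= 4 else "keep calling the API")
```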
Most teams should keep calling the API. The ones who shouldn't will know—because they've done the math, hit the latency wall, or can't send data externally.
The best model isn't the biggest one. It's the one that solves your problem without bankrupting you.
— blanho