Engineering · 6 min read

Building a Taglish-Aware NLP Parser: Lessons from Elizabeth.ai

A technical deep-dive into building an NLP order parser that handles Taglish (Filipino-English code-switching), abbreviations, misspellings, and emoji-based ordering.

By Elizabeth.ai Team

The Taglish Problem in NLP

Taglish — the fluid mix of Filipino (Tagalog) and English that 90+ million Filipinos use daily — is one of the hardest code-switching challenges in natural language processing. Unlike formal bilingual text where language switches happen at sentence boundaries, Taglish switches mid-sentence, mid-phrase, and sometimes mid-word.

Consider this real order message:

"Hi ate, pa-order po ng 2 pcs chicken adobo tsaka 3 sinigang na baboy. Gcash po ako magbabayad. Deliver po sa may 7-11 sa Balibago."

In one message we have:

  • English product names ("chicken adobo") with Filipino connectors ("tsaka" = and)
  • Filipino honorifics ("ate" = older sister, "po" = polite particle)
  • Abbreviated units ("pcs" = pieces)
  • Brand names used as landmarks ("7-11 sa Balibago")
  • Payment method mentioned casually mid-sentence
  • Filipino verb conjugation ("magbabayad" = will pay)

Standard NLP pipelines — tokenizers, named entity recognizers, intent classifiers — are trained on monolingual corpora. They handle English well. They handle Filipino decently. They handle Taglish poorly, because the statistical models break down when two languages interleave unpredictably.

Our Approach: Hybrid Regex + AI

We rejected the idea of building a single monolithic NLP model for Taglish order parsing. Instead, we built a two-layer hybrid system.

Layer 1: Regex Parser (Zero AI Cost)

The regex parser is a collection of carefully crafted patterns that catch structured order formats. It handles roughly 60% of incoming orders.

Why regex first? Because most orders follow predictable patterns:

  • {quantity} {unit}? {product} — "2 pcs adobo", "3 sinigang"
  • {product} x{quantity} — "adobo x2"
  • {product} {quantity} — "ube cake 3"

These patterns work regardless of language because the structure is universal. A customer typing "2 chicken adobo" follows the same pattern as "2 adobo ng manok" — the regex captures the quantity and product tokens without needing to understand Filipino grammar.
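As a rough sketch, the three shapes above can be compiled along these lines (the pattern names and unit list here are illustrative, not our production config):

```python
import re

# Illustrative sketch of the three order patterns described above.
# UNIT covers a few common abbreviations; PRODUCT is a lazy catch-all
# that a real parser would anchor against the merchant's catalog.
QTY = r"(?P<qty>\d+)"
UNIT = r"(?:pcs?|pc|dz|box(?:es)?)"
PRODUCT = r"(?P<product>[\w\s-]+?)"

PATTERNS = [
    re.compile(rf"^{QTY}\s*(?:{UNIT}\s+)?{PRODUCT}$", re.IGNORECASE),  # "2 pcs adobo"
    re.compile(rf"^{PRODUCT}\s*x\s*{QTY}$", re.IGNORECASE),            # "adobo x2"
    re.compile(rf"^{PRODUCT}\s+{QTY}$", re.IGNORECASE),                # "ube cake 3"
]

def parse_line(line: str):
    """Return (product, quantity) if any pattern matches, else None."""
    for pattern in PATTERNS:
        m = pattern.match(line.strip())
        if m:
            return m.group("product").strip(), int(m.group("qty"))
    return None
```

Pattern order matters: the quantity-first pattern is tried before the quantity-last one, so "3 sinigang" binds the leading digit as a quantity rather than part of a product name.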

The parser config is merchant-specific. Each seller registers their product names, common abbreviations, and variants. The regex engine matches against this catalog, handling:

  • Abbreviations: "pcs", "dz" (dozen), "box"
  • Filipino quantity words: "dalawa" (2), "tatlo" (3)
  • Common misspellings: Levenshtein distance matching within configured thresholds
  • Emoji quantities: counts conveyed by emoji rather than digits (e.g., a number emoji following "chicken adobo")

The key design decision: the regex parser never calls any AI service. It either matches with confidence or it escalates. This keeps 60% of order processing at zero marginal cost.
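The match-or-escalate contract can be sketched as follows (the result type and matcher interface are hypothetical, not our actual code):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the "match with confidence or escalate" contract.
# The regex layer never calls an AI service: it either returns a fully
# parsed order or flags the message for the Layer-2 AI cascade.

@dataclass
class ParseResult:
    items: list = field(default_factory=list)  # [(product, qty), ...]
    escalate: bool = False                     # True -> hand off to AI cascade

def parse_order(message: str, catalog: set, matcher) -> ParseResult:
    """matcher: callable(line) -> (product, qty) or None (Layer-1 regex)."""
    items = []
    for line in message.splitlines():
        parsed = matcher(line)
        if parsed is None or parsed[0] not in catalog:
            # Any unmatched or off-catalog line makes the whole message
            # ambiguous: escalate rather than guess.
            return ParseResult(escalate=True)
        items.append(parsed)
    return ParseResult(items=items)
```

Escalating the whole message, rather than returning a partial parse, keeps the AI layer's job simple: it always sees the full raw text with full context.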

Layer 2: AI Cascade (Complex Orders)

For the 40% of orders that the regex parser cannot confidently handle — ambiguous quantities, unclear product references, heavily conversational Taglish — we escalate to an AI cascade.

The cascade routes through three providers sequentially:

  1. Claude Sonnet 4 — Primary provider, excellent at structured extraction and Taglish comprehension
  2. Gemini 2.5 Flash — Fast fallback with strong multilingual capabilities
  3. GPT-4o — Final fallback with broad language coverage

Each provider is wrapped in a circuit breaker (see our circuit breaker design post). If a provider is down or returning errors, the circuit breaker trips and the cascade skips to the next healthy provider automatically.
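A minimal sketch of that cascade loop, assuming a simple count-based breaker with a cooldown (the breaker interface and thresholds are illustrative, not our production implementation):

```python
import time

# Illustrative circuit breaker: opens after `threshold` consecutive
# failures, then allows a retry ("half-open") after `cooldown` seconds.
class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

def cascade(message, providers):
    """providers: list of (name, call, breaker); returns first success."""
    for name, call, breaker in providers:
        if not breaker.available():
            continue  # tripped breaker: skip to the next healthy provider
        try:
            result = call(message)
            breaker.record(ok=True)
            return name, result
        except Exception:
            breaker.record(ok=False)
    raise RuntimeError("all providers unavailable")
```

A failing primary still costs one attempt here; a production breaker would also track error rates and latency so a degraded (not just down) provider trips early.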

The AI receives a structured prompt with:

  • The raw customer message
  • The merchant's product catalog
  • Parsing instructions with Taglish-specific examples
  • Expected output schema (JSON with items, quantities, notes)
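The prompt assembly might look like this (field names and wording are an assumed shape, not our production schema):

```python
import json

# Illustrative output schema: items with quantities and free-form notes.
OUTPUT_SCHEMA = {
    "items": [{"product": "string", "quantity": "number", "notes": "string"}],
}

def build_prompt(message: str, catalog: list, examples: list) -> str:
    """Assemble the four prompt components described above."""
    return "\n\n".join([
        "You are an order parser for Taglish (Filipino-English) messages.",
        "Product catalog:\n" + "\n".join(f"- {p}" for p in catalog),
        "Taglish parsing examples:\n" + "\n".join(examples),
        "Respond with JSON only, matching this schema:\n"
        + json.dumps(OUTPUT_SCHEMA, indent=2),
        "Customer message:\n" + message,
    ])
```

Including the merchant's catalog in the prompt constrains the model to known products, which makes the JSON output directly usable without a second resolution pass.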

Why Not AI-Only?

Cost. At scale, sending every order through an AI API becomes expensive. A seller processing 5,000 orders per month would generate significant API costs if every message required a Claude or GPT call. By handling the majority with regex, we keep per-order costs negligible for most messages and reserve AI budget for the messages that genuinely need it.

Taglish-Specific Challenges We Solved

Challenge 1: Code-Switching Boundaries

Taglish does not switch at predictable boundaries. "Pa-order po ng 2 pcs chicken" mixes Filipino particles ("pa-", "po", "ng") with English nouns ("chicken") and abbreviated English units ("pcs"). Our tokenizer treats the entire message as one stream rather than trying to detect language boundaries.

Challenge 2: Filipino Abbreviations and Slang

Filipino internet slang is creative and constantly evolving:

  • "po" → polite particle (not a product)
  • "lang" → "only" (not a quantity modifier)
  • "pa-" / "pa" → request prefix ("pa-order" = "please order") or particle meaning "still" / "additional"
  • "tsaka" / "saka" → "and" / "also"
  • "pang" → "for"

Our parser config includes a stop-word list of Filipino particles that should be ignored during quantity/product extraction, along with connectors that signal additional items.
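A sketch of that filtering step, with abbreviated word lists (the real lists live in per-merchant config):

```python
import re

# Sketch: strip Filipino particles before quantity/product extraction,
# and split on connectors that signal additional items.
# Word lists here are short examples, not the full config.
PARTICLES = {"po", "lang", "na", "ng", "sa", "ba", "yung"}
CONNECTORS = re.compile(r"\b(?:tsaka|saka|at)\b", re.IGNORECASE)

def normalize(message: str) -> list:
    """Split into item segments and drop particles from each."""
    cleaned = []
    for segment in CONNECTORS.split(message):
        words = [w for w in segment.lower().split() if w not in PARTICLES]
        if words:
            cleaned.append(" ".join(words))
    return cleaned
```

After this pass, "2 adobo po tsaka 3 sinigang" becomes two clean segments that the quantity/product patterns can match independently.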

Challenge 3: Product Name Variations

A single product might be referenced as:

  • "chicken adobo"
  • "adobong manok"
  • "adobo chicken"
  • "adobo (chicken)"
  • "adobo c"

The parser config allows merchants to register aliases and the regex engine normalizes against them. For fuzzy matches, we apply Levenshtein distance within a configurable threshold per product.
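That normalization can be sketched as an alias map plus a plain Levenshtein check (the alias names and threshold are illustrative; a production system would likely use a faster distance library):

```python
# Sketch of alias normalization with fuzzy fallback. `aliases` maps each
# registered variant to its canonical product name; `max_distance` plays
# the role of the per-product threshold described above.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def resolve_product(token: str, aliases: dict, max_distance: int = 2):
    """Exact alias match first, then closest fuzzy match within threshold."""
    token = token.lower().strip()
    if token in aliases:
        return aliases[token]
    best, best_d = None, max_distance + 1
    for alias, canonical in aliases.items():
        d = levenshtein(token, alias)
        if d < best_d:
            best, best_d = canonical, d
    return best  # None when nothing is within the threshold
```

Checking exact matches before computing any distances keeps the common case cheap; the O(n·m) DP only runs for the long tail of misspellings.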

Challenge 4: Implicit Quantities

"Mine po" (I want it) or "Kuha ako nito" (I'll take this) imply a quantity of 1 without stating a number. The regex parser recognizes these claim patterns and defaults to quantity 1. The AI layer handles more ambiguous implicit cases.
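The claim-pattern recognition is a small rule, sketched here with just the two phrases above (a real phrase list is longer and merchant-configurable):

```python
import re

# Sketch: claim phrases that imply quantity 1 without stating a number.
CLAIM_PATTERNS = re.compile(
    r"^\s*(?:mine(?:\s+po)?|kuha\s+ako\s+nito)\s*[.!]*\s*$",
    re.IGNORECASE,
)

def implicit_quantity(message: str):
    """Return 1 for a recognized claim phrase, else None (keep parsing)."""
    return 1 if CLAIM_PATTERNS.match(message) else None
```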

Challenge 5: Multi-Intent Messages

A single message might contain an order, a question, and a payment method mention:

"Order po ng 2 adobo. May sinigang pa ba? GCash po."

Our intent classifier (pure regex, no AI) segments the message into intents before routing. The order portion goes to the parser, the question goes to the FAQ handler, and the payment mention is tagged for the confirmation flow.
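The segmentation step above can be sketched as sentence splitting plus ordered regex checks (the rules and keyword lists are illustrative, not our full classifier):

```python
import re

# Sketch of regex-only intent segmentation: split on sentence boundaries,
# then tag each sentence as payment, question, or order.
PAYMENT = re.compile(r"\b(?:gcash|maya|cod|bank)\b", re.IGNORECASE)
QUESTION = re.compile(r"\?|\b(?:may|meron|magkano|ilan)\b.*\bba\b", re.IGNORECASE)
ORDER = re.compile(r"\border\b|\b\d+\s*(?:pcs?|x)?\b", re.IGNORECASE)

def segment_intents(message: str) -> list:
    """Return [(intent, sentence), ...] for routing to the right handler."""
    intents = []
    for sentence in re.split(r"(?<=[.?!])\s+", message.strip()):
        if not sentence:
            continue
        if PAYMENT.search(sentence):
            intents.append(("payment", sentence))
        elif QUESTION.search(sentence):
            intents.append(("question", sentence))
        elif ORDER.search(sentence):
            intents.append(("order", sentence))
        else:
            intents.append(("other", sentence))
    return intents
```

The check order encodes priority: a payment keyword wins even inside an order sentence, because the confirmation flow needs that tag regardless of what else the sentence says.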

Lessons Learned

  1. Start with regex, add AI. Not the other way around. Getting regex patterns right for 60% of cases was far more impactful than improving AI accuracy from 90% to 95%.

  2. Merchant-specific configs are essential. There is no universal Filipino product dictionary. Each seller's catalog, abbreviations, and customer slang are unique. The parser config system makes the engine adaptable without code changes.

  3. Taglish is a spectrum, not a binary. Some sellers communicate in near-pure Filipino, others in near-pure English. The parser cannot assume a single language mix ratio — it must handle the full spectrum.

  4. Test with real messages. Synthetic Taglish test data is unconvincing. We built our test suite from realistic customer message patterns based on how Filipino buyers actually communicate, which revealed patterns we never would have anticipated.

For more on our architecture decisions, see Hybrid Regex + AI: Why We Parse 60% Without AI and our about page.

Update: Our parser has evolved since this post — structured confidence scoring, AI hallucination guards, and per-merchant language configuration. Read 4 Things We Got Wrong in Our First Hybrid Parser for what changed and why.


Interested in the technology behind Filipino commerce automation? Learn more about Elizabeth.ai or explore our other engineering posts.