Engineering

Turning UPI Transactions Into Savings Signals

June 23, 202612 min read

Three bank alerts landed before breakfast, the import job was green, and my Money tab still could not explain why the month was over budget. The culprit was not a bad chart: three recurring UPI debits were hiding as three different merchants, so the ledger had evidence but no trustworthy claim. Raw evidence had never been turned into inspectable, normalized claims.

The framework that fixed it has four stages: raw evidence, normalized claim, inspectable grouping, reversible export. This piece walks through one stage at a time, with the tradeoffs that actually bit me.

TL;DR

UPI transaction analysis becomes useful only after messy bank alerts are converted into a stable read model. Parse alerts into a ledger, normalize merchant names, classify UPI separately from cards and transfers, detect recurrence with explainable rules, then graph the grouped spend so savings opportunities are visible.

Stage 1: Raw evidence starts before the chart

The first wrong instinct is to start with categories. Food, transport, shopping, bills. It feels productive because it produces labels quickly, and labels make dashboards look as if someone has been busy.

Transaction history lies by omission and formatting. A single coffee shop can appear as a VPA, a payment gateway string, a settlement descriptor, or a trimmed subject line in an alert email. A wallet top-up can look like spending. A self-transfer can look like a debit. An alert arriving by email tells you something different from a line in a bank statement, and an SMS database tells you something else again.

The chart is the last mile. The first artifact was a boring pipeline:

That pipeline is the difference between a finance toy and something I would actually use. It shows me the repeated debits that created a spending bucket, the raw evidence behind each grouping, and the places where the grouping is uncertain. Less glamorous than a score, and the only surface that has a chance of being trusted.

The ingestion choice was not philosophical

The source options were bank email alerts, SMS alerts from an iPhone, statement files, and eventually proper account aggregation imports. The tempting question was whether to use IMAP or SMS. The better question was which source gave me stable, cheap, recoverable data without requiring invasive access.

Email won as the first source because the alerts were already arriving and already labeled as Bank-Alerts. That made backfill practical. SMS on iPhone can work, but it quickly drags you into backups, permissions, and the awkward question nobody enjoys answering: why does this personal finance script need broad disk access just to read thirty days of messages?

The hourly job was simple enough to run under launchd, using the same boring scheduling shape I wrote about in Running background jobs on macOS with launchd. And the manual refresh stayed explicit:

launchctl kickstart -k gui/$(id -u)/com.example.money-ledger

Not sophisticated infrastructure. Boring scheduling beats a clever daemon that quietly fails for three days and then asks to be congratulated for retrying.

Backfill first, then freshness

When a Money tab shows nothing for today and yesterday, the question is usually not charting. It is freshness: did the scanner run, did the label match, did the alerts arrive, did parsing reject the message, did the dedupe key collapse two different debits?

I kept backfill and freshness as separate commands because they answer different questions:

cd /srv/money
python3 bin/import_alerts.py --label Bank-Alerts --since 2026-05-17 --backfill
python3 bin/import_alerts.py --label Bank-Alerts --since 2d --verbose

Backfill is idempotent. The freshness check is loud. Combining the two produces the familiar personal automation experience where nothing happens and the logs say ok, which is a tiny form of vandalism.

Stage 2: Normalized claims, not raw strings

The read model needed to preserve raw transactions while creating analysis-friendly keys. Three layers:

Layer	Purpose	Example field
Raw event	Keep source evidence unchanged	`raw_subject`
Normalized transaction	Parse amount, date, rail, direction	`rail = 'upi'`
Spend group	Join aliases into a human label	`merchant_key = 'corner-cafe'`

The schema looked roughly like this:

CREATE TABLE raw_alerts (
  id TEXT PRIMARY KEY,
  source TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL,
  raw_subject TEXT NOT NULL,
  raw_body TEXT NOT NULL,
  parsed_at TIMESTAMPTZ
);
 
CREATE TABLE transactions (
  id TEXT PRIMARY KEY,
  raw_alert_id TEXT REFERENCES raw_alerts(id),
  posted_at TIMESTAMPTZ NOT NULL,
  direction TEXT NOT NULL CHECK (direction IN ('debit', 'credit')),
  rail TEXT NOT NULL CHECK (rail IN ('upi', 'card', 'netbanking', 'transfer', 'unknown')),
  amount_paise INTEGER NOT NULL,
  payee_raw TEXT NOT NULL,
  payee_key TEXT NOT NULL,
  category TEXT,
  confidence NUMERIC NOT NULL DEFAULT 0.0
);
 
CREATE TABLE payee_aliases (
  alias_pattern TEXT PRIMARY KEY,
  payee_key TEXT NOT NULL,
  category TEXT NOT NULL
);

I kept payee_raw because normalization will be wrong. Not sometimes in theory. Wrong in the ordinary way that bank descriptors are built from whatever string happened to survive a payment processor. If the UI cannot show the raw descriptor beside the grouped name, the user cannot correct the system.

A small parser was enough for the first pass:

import re
from decimal import Decimal
 
DEBIT_RE = re.compile(
    r"debited\s+by\s+INR\s+(?P<amount>[0-9,]+(?:\.[0-9]{2})?).*?to\s+(?P<payee>.+?)(?:\.|$)",
    re.IGNORECASE | re.DOTALL,
)
 
UPI_RE = re.compile(r"\b(?:UPI|VPA|@upi|@ok[a-z]+)\b", re.IGNORECASE)
 
def parse_alert(subject: str, body: str) -> dict | None:
    text = f"{subject}\n{body}"
    match = DEBIT_RE.search(text)
    if not match:
        return None
 
    amount = Decimal(match.group("amount").replace(",", ""))
    payee_raw = " ".join(match.group("payee").split())[:180]
    return {
        "direction": "debit",
        "rail": "upi" if UPI_RE.search(text) else "unknown",
        "amount_paise": int(amount * 100),
        "payee_raw": payee_raw,
        "payee_key": normalize_payee(payee_raw),
        "confidence": 0.7,
    }

The refusal to let the regex become the model matters more than the regex itself. Alerts are just one witness. The normalized transaction is the claim. The raw alert remains the evidence.

What one alert actually becomes

To make that concrete: an email arrives with subject Your a/c XX1234 debited by INR 499.00 to STREAMBOX INDIA UPI. The parser pulls amount_paise = 49900, rail = 'upi', payee_raw = 'STREAMBOX INDIA UPI'. The alias table matches (?i).*streambox.* and resolves payee_key = 'streambox', category = 'subscriptions'. The recurrence query then finds two prior rows with the same payee_key and amount_paise, gaps averaging 30 days, and promotes the row to a recurrence candidate flagged for review. At every step the raw_alert_id foreign key stays intact, so clicking the recurrence label in the UI can still surface the original email body. That chain, from noisy descriptor to grouped recurring debit, is what "trust comes from traceability" means in practice.

Stage 3: Inspectable grouping, not silent merges

Grouping payees is the unglamorous center of the work. Without it, repeat spend is invisible. With over-aggressive grouping, the tool invents patterns. Both failures are expensive, because they point the user at the wrong habit.

Deterministic rules came first:

aliases:
  - pattern: "(?i).*corner cafe.*"
    payee_key: "corner-cafe"
    category: "food"
  - pattern: "(?i).*metro pass.*"
    payee_key: "metro-pass"
    category: "transport"
  - pattern: "(?i).*streambox.*"
    payee_key: "streambox"
    category: "subscriptions"
  - pattern: "(?i).*fresh basket.*"
    payee_key: "fresh-basket"
    category: "groceries"

Similarity candidates went into a review queue rather than automatic merges. Similarity is useful for building a queue, not for silently rewriting history.

SELECT
  a.payee_key AS left_key,
  b.payee_key AS right_key,
  similarity(a.payee_key, b.payee_key) AS score
FROM transactions a
JOIN transactions b ON a.payee_key < b.payee_key
WHERE a.rail = 'upi'
  AND b.rail = 'upi'
  AND similarity(a.payee_key, b.payee_key) > 0.62
GROUP BY 1, 2, 3
ORDER BY score DESC;

Grouping is not a cleanup step. It is the place where the system decides whether a habit exists.

That sentence became the operating rule for the UI. Each grouped merchant needs a drill-down. Each category total needs its member transactions. Each suggested merge needs a way to say no.

The harder product judgment sits underneath that rule: deterministic aliases scale further than most people expect, but they eventually run out against long-tail merchant noise. At that point the choice is between a review queue and a learned classifier, and the two fail differently. A review queue produces false negatives quietly: merchants stay ungrouped until a human acts. A learned classifier produces false positives confidently, silently merging two merchants the user knows are different, after which the user loses trust in every number downstream. False negatives are recoverable. A user who sees an ungrouped merchant can fix it. A user who spots a confident wrong merge has to wonder what else the system invented. That asymmetry is why the review queue is the right default, even though it creates ongoing work. The threshold for promoting a similarity candidate to an automatic merge should be extremely high, because the cost of a wrong merge is not one bad row: it is the user's willingness to believe the next recommendation.

Recurrence detection should be boring and inspectable

A recurring debit is not simply the same amount every month. UPI payments have noise: late payments, skipped months, annual renewals, partial refunds, and merchants that bill through gateways.

A plain scoring model won out over a learned classifier:

WITH debits AS (
  SELECT
    payee_key,
    amount_paise,
    posted_at::date AS paid_on,
    lag(posted_at::date) OVER (
      PARTITION BY payee_key, amount_paise
      ORDER BY posted_at
    ) AS previous_paid_on
  FROM transactions
  WHERE direction = 'debit'
    AND rail = 'upi'
), gaps AS (
  SELECT
    payee_key,
    amount_paise,
    paid_on,
    paid_on - previous_paid_on AS gap_days
  FROM debits
  WHERE previous_paid_on IS NOT NULL
)
SELECT
  payee_key,
  amount_paise,
  count(*) AS repeat_count,
  avg(gap_days) AS avg_gap_days,
  stddev(gap_days) AS gap_jitter
FROM gaps
WHERE gap_days BETWEEN 25 AND 35
GROUP BY payee_key, amount_paise
HAVING count(*) >= 2
ORDER BY repeat_count DESC, gap_jitter ASC NULLS LAST;

The thresholds are deliberately dull: same payee, same amount, gaps around a month, at least two repeats. It misses some real subscriptions. Fine. A visible false negative (an ungrouped recurring charge the user can spot and promote) is far less damaging than a confident false positive that labels rent a lifestyle habit. Tunable thresholds get tuned. Opaque scores get abandoned.

Why not a black-box score

A single savings score is seductive because it reduces mess to one number. It also hides the only thing that matters: which assumption produced the recommendation.

Inspection-first views are more useful:

View	Question it answers	Interaction
UPI category bars	Where did UPI debits cluster?	Click category to see payees
Repeat payees	Which merchants recur?	Show dates and amounts
Rail split	Is this UPI, card, or transfer?	Filter by rail
Freshness panel	Is ingestion current?	Show last alert and last parse

A bar chart of lifestyle UPI spend is useful only if I can click into food, see corner-cafe, then see the four raw transactions that made the number. Otherwise it is financial astrology with nicer colors.

Stage 4: Reversible export keeps one authority

I considered using ledger-cli as the source of truth. Plain text, mature tooling, and Fava for a web UI are all genuine advantages. The catch: ledger format is best after classification, not before it.

A reasonable export looks like this:

2026-06-03 * Corner Cafe
  Expenses:Food:Coffee        INR 240.00
  Assets:Bank:Checking       INR -240.00
  ; rail: upi
  ; payee_key: corner-cafe
  ; raw_alert: alert_20260603_091122
 
2026-06-07 * Streambox
  Expenses:Subscriptions      INR 499.00
  Assets:Bank:Checking       INR -499.00
  ; rail: upi
  ; payee_key: streambox
  ; recurrence: monthly

Writing half-parsed alerts directly into a ledger file moves the mess into a format that looks authoritative. That is worse than a database table named raw_alerts, because at least the table admits what it is.

The split that worked: use the database as the ingestion and analysis read model, export clean postings to ledger, and let Fava do what it is good at. That mirrors the broader lesson from Postgres Is the Only Backend You Need: keep the durable model where it can be queried, inspected, and regenerated.

cd /srv/money
python3 bin/export_ledger.py --month 2026-06 --output books/2026-06.ledger
ledger -f books/2026-06.ledger balance Expenses
fava books/2026-06.ledger

Deep-dive: The sharp edge in ledger export

The hard part is reversibility. Once transactions are exported, edits can happen in two places: the app database and the ledger file. I avoided two-way sync for the first version. The database remained canonical for imported transactions, and the ledger export was regenerated. Manual ledger entries lived in a separate file included by Fava.

include generated/2026-06.ledger
include manual/opening-balances.ledger
include manual/cash-adjustments.ledger

That choice costs flexibility, but it avoids reconciling two authorities. Personal finance tools already have enough ways to be subtly wrong.

The freshness check is part of the product

Email alerts are a useful source, but they are not a complete ledger. Old emails can be deleted. Alerts can be delayed. Some transactions may not generate the same template. A bank can change wording and ruin the parser before breakfast.

The importer needs a health view alongside the chart view:

SELECT
  max(received_at) AS last_alert_at,
  max(parsed_at) AS last_parsed_at,
  count(*) FILTER (WHERE parsed_at IS NULL) AS unparsed_alerts,
  count(*) FILTER (WHERE received_at > now() - interval '2 days') AS recent_alerts
FROM raw_alerts
WHERE source = 'email';

This query is part of the product, not an operational nicety. If the user asks why nothing appears for yesterday, the system should answer with evidence: no alerts arrived, alerts arrived but parsing failed, or parsed transactions were classified as non-UPI.

The argument generalizes beyond UPI

These stages apply to any personal automation that touches messy external records. Bank alerts, fitness exports, calendar data: the pattern is the same. Trust depends on two properties. First, reversibility: every output can be regenerated from the inputs without losing information. Second, inspection: at any layer, a user can see what the layer below contributed. A pipeline that satisfies both is correctable. One that satisfies neither is a score with a nice color scheme and no appeal mechanism. UPI is just a concrete instance, and the same discipline serves any tool built on records you do not fully control.

FAQ

Why is UPI merchant grouping unreliable?

UPI descriptors often vary by VPA, gateway, settlement text, and alert template. Deterministic aliases and reviewed similarity suggestions handle most cases; silent automatic merges do not, because they can invent repeat spend patterns that were never there.

Can ledger-cli replace the transaction database?

Ledger-cli is a strong export target and reporting tool, but it should not absorb messy raw alerts directly. Keep parsing, dedupe, normalization, and confidence in a database, then export clean postings to ledger and Fava.

The point of the system is not to make a dashboard that scolds me. It is to make the shape of my own spending legible enough that I can argue with it and, occasionally, lose the argument usefully.