Last month I asked Claude to write a dbt model for me. The SQL looked clean. The join made sense. I shipped it.
Three weeks later I found out it was silently dropping 8% of the rows. An INNER JOIN where I needed a LEFT JOIN. Nobody caught it. The dashboard kept looking fine because the missing 8% happened to not move the top-line number much.
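In pandas terms, the bug looks something like this. The table names and numbers are invented to mirror the story (the real model was SQL), but the join semantics are identical:

```python
import pandas as pd

# Hypothetical stand-ins for the real tables: 100 orders, and a customers
# dimension missing two ids (0 and 24), so 8 of the 100 order rows have
# no match -- the same 8% from the story.
orders = pd.DataFrame({"order_id": range(1, 101),
                       "customer_id": [i % 25 for i in range(1, 101)]})
customers = pd.DataFrame({"customer_id": range(1, 24),
                          "region": ["us"] * 23})

inner = orders.merge(customers, on="customer_id", how="inner")
left = orders.merge(customers, on="customer_id", how="left")

# The inner join silently drops the 8 unmatched orders. The left join
# keeps them with a null region -- which at least leaves evidence.
print(len(orders), len(inner), len(left))  # 100 92 100
```

Both versions run without error. Only one of them is the query you meant.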
This is the problem with AI in data work. It does not know what it does not know. And it sounds confident either way.
## Where it bites me
Two places, every week.
ETL. Joins on the wrong key. Filters that look right but silently drop edge cases. Type conversions that cast a timestamp to a date and kill the timezone. Window functions that use ORDER BY in a way that is “correct” but not what you meant. The code runs. The numbers come back. They look fine. They are not fine.
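One concrete version of the timestamp-to-date cast, sketched in pandas (the zone and times are invented for illustration):

```python
import pandas as pd

# A timezone-aware timestamp: 23:30 on March 10 in US/Eastern.
ts = pd.Series(pd.to_datetime(["2024-03-10 23:30:00"])).dt.tz_localize("US/Eastern")

# Casting straight to a date silently throws away the time and the zone.
# This instant is already March 11 in UTC, so which day the row lands on
# is decided by a cast nobody reviewed.
local_date = ts.dt.date.astype(str).iloc[0]                     # '2024-03-10'
utc_date = ts.dt.tz_convert("UTC").dt.date.astype(str).iloc[0]  # '2024-03-11'
print(local_date, utc_date)
```

Neither answer is wrong in the abstract. One of them is wrong for your report.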
Modeling. Feature engineering with leakage. “Cross-validation” written in a way that sees future data. A regression that drops rows with nulls without telling you. A metric computed on a different grain than the label. The model trains. The AUC is high. It will fail on real data and you will find out in production.
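A minimal sketch of the leakage pattern (column names hypothetical): a rolling feature that quietly includes the future, next to one shifted to use only the past.

```python
import pandas as pd

df = pd.DataFrame({"day": range(6), "sales": [10, 12, 9, 14, 13, 15]})

# Leaky: a centered window averages yesterday, today, AND tomorrow --
# the feature for day 1 already contains day 2's sales.
df["feat_leaky"] = df["sales"].rolling(3, center=True, min_periods=1).mean()

# Safe: shift by one first, so the window sees strictly past days.
df["feat_safe"] = df["sales"].shift(1).rolling(3, min_periods=1).mean()

print(df)
```

The leaky version will score beautifully in backtests and fall apart live, precisely because the window looked ahead. Both columns compute without a warning.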
The common thread: plausibility is not correctness. AI has gotten very good at plausible. A wrong answer that reads well is more dangerous than a wrong answer that reads badly, because you skip the review.
## Your review is the whole job now
Here is what I keep telling junior people on my team:
> The AI is only as good as you are at reviewing it.
A senior person with AI ships 10x faster. A junior person with AI ships 10x the bugs. Same tool, opposite outcome. The difference is not the prompt. It is the eye.
When I review AI output I do the same things I would do for any untrusted intern:
- Read every line. Not skim. Read. Say what it does in my own words.
- Run it on a known case. One I already know the answer to. Does it match?
- Check row counts before and after. Surprisingly often, the bug is right there.
- Look for silent failures. Missing keys, dropped nulls, implicit casts. These never throw. They just eat your data.
- Ask “what did this assume?” AI makes assumptions. It will not tell you. You have to dig them out.
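Some of that checklist can be mechanized. Here is a sketch of the kind of wrapper I mean (the helper name and report shape are my invention, not a library API): do the merge, then surface the row counts and unmatched keys that a silent join hides.

```python
import pandas as pd

def reviewed_merge(left, right, **merge_kwargs):
    """Merge, then report what a silent join would hide.

    Hypothetical helper. indicator=True tags each output row with its
    provenance ('both', 'left_only', 'right_only'), which is what lets
    us count unmatched keys.
    """
    out = left.merge(right, indicator=True, **merge_kwargs)
    report = {
        "rows_in": len(left),
        "rows_out": len(out),
        "unmatched": int((out["_merge"] != "both").sum()),
    }
    return out.drop(columns="_merge"), report

orders = pd.DataFrame({"customer_id": [1, 2, 3]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["us", "eu"]})

_, inner_report = reviewed_merge(orders, customers, on="customer_id", how="inner")
_, left_report = reviewed_merge(orders, customers, on="customer_id", how="left")
print(inner_report)  # rows_out < rows_in: the drop is now visible
print(left_report)   # unmatched > 0: the missing key is now visible
```

The point is not this particular helper. It is that the checks you do by hand once should become checks the pipeline does every run.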
This takes time. It is the job.
## The new skill
People talk about prompt engineering like it is the thing to learn. It is not. The thing to learn is faster error detection. How quickly can you spot the lie in a page of generated code? How quickly can you feel the wrongness in a plausible-looking number?
That skill is not new. It is the old skill — reading code, knowing your data, having taste. The AI just raises the stakes. You now produce more code per day, which means more chances to be wrong per day.
I still use AI every day. I would not go back. But I have stopped treating its output as an answer. I treat it as a first draft from an intern who is very fast, very articulate, and sometimes completely wrong. My job is the red pen.
If you are not running that red pen, the AI is not helping you. It is just helping you be confidently incorrect, faster.