How an hour of focused collaboration — part description, part troubleshooting, part domain knowledge — produced a working Python automation tool from scratch.
The person who built this tool is not a software engineer. But they are an inventor — someone with hands-on experience in image processing and OCR, familiar with how cameras and sensors interpret the physical world, and comfortable thinking in systems. They knew what they wanted. They just needed someone to write it.
The problem was genuinely tedious: a mobile game mini-game called the Sushi Station requires constant clicking, dragging, and grid management to be effective. Boring to do manually. Interesting to automate.
"I would like to create a Sushi combiner application that is an advanced auto clicker based on visual items on the screen... A screenshot is supplied of the screen setup currently."
That was enough. A clear goal, a screenshot, and domain context. The AI did not need more than that to produce a first working skeleton.
This was not "prompt in, perfect app out." It was a genuine back-and-forth — the inventor describing, correcting, testing, and redirecting. The AI generated code, explained decisions, diagnosed errors, and responded to pushback. Neither side could have done this alone.
The pixel coordinates were not auto-detected. The inventor eyeballed the game, compared the debug overlay image to the actual grid, and corrected the values manually. That domain judgment — knowing what "correct" looks like — came from the human, not the AI.
The OCR approach — initially chosen by the AI — struggled with the game's styled fonts. The inventor's observation about color-coded tiers led directly to the three-strategy pipeline. The template matching idea was the AI's response to the inventor's push. The inventor did not need to know how template matching works. They needed to know that OCR was failing and say so clearly.
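The shape of that multi-strategy pipeline can be sketched roughly as follows. The strategy order, the tier-to-color bands, and every name here are assumptions for illustration; only the general idea (try a confident signal first, fall back to cheaper ones) comes from the session.

```python
import numpy as np

# Hypothetical tier-to-font-color bands, standing in for the inventor's
# observation that the game colors tier numbers white/amber/yellow.
COLOR_BANDS = {"white": range(1, 10), "amber": range(10, 20), "yellow": range(20, 34)}
COLOR_ANCHORS = {"white": (255, 255, 255), "amber": (255, 165, 0), "yellow": (255, 220, 60)}

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation of two same-shaped grayscale crops.
    1.0 is a pixel-perfect match. A real tool slides the template over
    the cell; this sketch compares pre-aligned crops."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def classify_font_color(rgb) -> str:
    """Nearest-anchor classifier for the tier-number font color."""
    return min(COLOR_ANCHORS,
               key=lambda k: sum((c - a) ** 2 for c, a in zip(rgb, COLOR_ANCHORS[k])))

def read_tier(cell, font_rgb, templates, threshold=0.80):
    """Strategy 1: template-match against known tier badges.
    Strategy 2: let the font color narrow the tier band.
    Strategy 3 (OCR on the cropped digits) would slot in last."""
    if templates:
        scores = {tier: ncc(cell, tpl) for tier, tpl in templates.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            return best
    band = COLOR_BANDS.get(classify_font_color(font_rgb))
    return min(band) if band else None  # coarse fallback: lowest tier in the band
```

The key design point survives even in this toy version: when the confident strategy misses, the pipeline degrades to a coarser answer instead of a wrong one.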
Good collaboration includes knowing when something isn't working and being willing to try a different angle. There were four significant pivots in this session.
The v3 and v4 cascade strategy — organizing the board into a descending staircase to chain 100% tier-ups — was architecturally correct and mathematically sound. But it required precise board state, reliable OCR of 45 cells, and timing coordination that introduced more failure points than it eliminated. The simpler loop from v1 remained the most reliable daily driver. The cascade work was genuinely valuable as design exploration; it just wasn't the final answer.
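Reduced to pseudocode, the staircase idea is just a sort plus a readiness check. The function names and the exactly-one-tier-below rule are assumptions here; the fragile part of v3/v4 was everything around this core, namely reading 45 cells reliably and planning the drags from a live board.

```python
def staircase_order(tiles):
    """Target layout for the cascade strategy: tiles sorted descending,
    so each merge's upgrade feeds the next merge down the line."""
    return sorted(tiles, reverse=True)

def is_cascade_ready(tiles):
    """Assumed reading of the mechanic: a chain of 100% tier-ups needs
    each tile exactly one tier below its predecessor."""
    return all(a - b == 1 for a, b in zip(tiles, tiles[1:]))
```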
One thing the AI built in from the start — without being asked — was a suite of debug image exports. These turned calibration from guesswork into evidence.
This is a pattern worth noting: the AI anticipated what would be hard to debug and built the tools to debug it before any problem was reported. Every time the inventor said "something is wrong," there was already an image file that showed exactly what the tool was seeing. That feedback loop cut troubleshooting time dramatically.
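A dbg_grid.png-style export is cheap to produce. This is a minimal sketch, assuming the grid is described by an origin, a cell size, and row/column counts; the names are illustrative, not the project's actual code.

```python
import numpy as np

def draw_grid_overlay(screenshot: np.ndarray, origin, cell_size, rows, cols):
    """Return a copy of the screenshot with green grid lines drawn on it,
    so calibration errors are visible at a glance.
    origin is the (x, y) of the grid's top-left corner in screen pixels."""
    img = screenshot.copy()
    x0, y0 = origin
    w, h = cell_size
    green = np.array([0, 255, 0], dtype=img.dtype)
    for c in range(cols + 1):                      # vertical lines
        img[y0:y0 + rows * h, x0 + c * w] = green
    for r in range(rows + 1):                      # horizontal lines
        img[y0 + r * h, x0:x0 + cols * w + 1] = green
    return img

# In the real tool the overlay would be written to disk, e.g.:
# from PIL import Image; Image.fromarray(img).save("dbg_grid.png")
```

If the green lines sit one column to the right of the tile borders, the fix is a single change to `origin`; that is the guesswork-to-evidence shift described above.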
It would be dishonest to say the AI did everything. The inventor's background in image processing and OCR shaped the entire session in ways that a complete beginner would have struggled to replicate.
| What the inventor knew | How it shaped the session |
|---|---|
| OCR fails on stylized fonts | Immediately recognized the "13 reads as 3" problem as a classic OCR noise issue, not a bug. Described the font color pattern (white/amber/yellow by tier) — exactly the information the AI needed to design the multi-strategy pipeline. |
| What a grid overlay should look like | Knew that the green lines in dbg_grid.png should align with the tile borders. Could evaluate the image at a glance and report "off by one column to the right" instead of just "it doesn't work." |
| The cascade mechanic logic | Discovered through play that a perfectly descending staircase triggers 100% chain upgrades. Communicated the game mechanic precisely enough for the AI to turn it into a priority-based decision algorithm. |
| When to stop and simplify | Recognized that the sophisticated v4 cascade sorter introduced more fragility than it was worth. Made the call to run v1 in practice. That judgment call is engineering, not just coding. |
| Knowing what "correct" looks like | Every test run, the inventor could evaluate the output against the game state. "It merged the wrong pair" or "it scanned row 0 but missed rows 1 and 2" — that ground truth evaluation came entirely from the human. |
Despite those complexity challenges, there were clear moments of success. The cascade behavior — drag one tile onto a matching tile and watch the whole board upgrade — worked exactly as predicted. This is what it looked like:
The board state the inventor was working toward — shown below — required precise scanning across 45 cells and reliable tier recognition up to T33. That's where the scanning accuracy and template matching work paid off directly.
The architecture of this tool — capture screen, read values, make a decision, act with the mouse — is not unique to games. It appears in almost every hardware development workflow that touches a computer.
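That capture → read → decide → act shape can be written down as a tiny skeleton. The class and stage names here are illustrative rather than the project's actual code; the point is that each stage is injectable, so the same loop drives a game, a camera, or an instrument panel.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class AutomationLoop:
    capture: Callable[[], Any]              # grab a frame: screenshot, camera, PDF page
    read: Callable[[Any], Any]              # frame -> structured state (OCR, templates)
    decide: Callable[[Any], Optional[Any]]  # state -> an action, or None to idle
    act: Callable[[Any], None]              # perform it: clicks, drags, logging

    def step(self) -> bool:
        """Run one tick; return True if an action was taken."""
        state = self.read(self.capture())
        action = self.decide(state)
        if action is None:
            return False
        self.act(action)
        return True
```

Swapping `capture` from a game screenshot to a camera frame, or `act` from mouse clicks to a CSV logger, leaves the loop untouched; that separation is what makes the pattern portable across the scenarios below.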
A camera watches parts moving past. The tool scans for defect signatures using template matching — the same TemplateBank used here for sushi badges. When a match exceeds a confidence threshold, it flags the part and logs the image. No machine vision library expertise required to get the first prototype running.
A power supply shows voltage and current on an LCD. No USB data port. The same OCR pipeline used to read sushi tier numbers reads the instrument display — same crop, threshold, and Tesseract call. Values log to CSV every second. The FuelReader class from this project does exactly this for the game's resource counter.
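The crop-and-threshold step is small enough to show. The box and threshold values are per-setup calibration; the names here are illustrative, not the FuelReader's actual internals.

```python
import numpy as np

def prep_for_ocr(frame: np.ndarray, box, thresh=128):
    """Crop a display region and binarize it before handing it to OCR.
    box is (x, y, w, h) in frame pixels."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    gray = crop.mean(axis=2) if crop.ndim == 3 else crop  # collapse RGB if present
    return ((gray > thresh) * 255).astype(np.uint8)       # white digits on black

# The binarized crop would then go to Tesseract, e.g.:
# import pytesseract
# text = pytesseract.image_to_string(
#     binary, config="--psm 7 -c tessedit_char_whitelist=0123456789.")
```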
A piece-testing machine has a Windows UI with Start, Stop, and a Pass/Fail badge. Replacing manual clicks with the pyautogui layer used here means the rig runs unattended: reads the badge color, clicks Next on Pass, pauses for human review on Fail. The decision logic is simpler than the sushi combiner — two states instead of fifty tile values.
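The two-state decision logic fits in a few lines. The anchor colors and names are assumptions; a real rig would sample its own UI to pick them.

```python
BADGE_ANCHORS = {"pass": (0, 200, 0), "fail": (200, 0, 0)}  # assumed UI colors

def classify_badge(rgb) -> str:
    """Nearest-anchor classifier for the Pass/Fail badge color."""
    return min(BADGE_ANCHORS,
               key=lambda k: sum((c - a) ** 2 for c, a in zip(rgb, BADGE_ANCHORS[k])))

def next_action(badge_rgb) -> str:
    """Two-state decision: advance on Pass, hold for a human on Fail."""
    return "click_next" if classify_badge(badge_rgb) == "pass" else "pause_for_review"

# Acting on "click_next" would go through pyautogui, e.g.:
# import pyautogui
# pyautogui.click(next_x, next_y)  # button coordinates are per-setup calibration
```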
PDFs rendered to images. The grid-scan loop treats each table row as a "cell." Template matching identifies header formats. OCR reads the values. Output goes to a spreadsheet or directly to a KiCad BOM. The same "scan grid → read values → log" loop that runs 45 cells of sushi handles 20 rows of a pin-assignment table.
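The log half of that loop is a sketch like this, with any reader (template match for headers, OCR for values) plugged in; all names here are illustrative.

```python
import csv
import io

def scan_table(page_img, row_boxes, read_row):
    """Run the scan -> read -> log loop over the rows of a rendered page.
    read_row takes (image, box) and returns one record's fields;
    output is CSV text ready for a spreadsheet or BOM import."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for box in row_boxes:
        writer.writerow(read_row(page_img, box))
    return buf.getvalue()
```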
- A working prototype of something non-trivial in under an hour.
- Code that you can read, modify, and extend without starting over.
- An architecture that separates config, perception, decision, and action, ready to swap pieces as needs change.
- A collaborative partner that writes boilerplate, suggests alternatives, and explains tradeoffs on demand.
Domain knowledge. You must know what "correct" looks like in your problem space. You must be able to evaluate output — not as a programmer, but as someone who understands the domain. You must be willing to push back when something isn't working and describe what's wrong precisely enough for the AI to respond usefully.
- Boilerplate: config systems, GUI scaffolding, file I/O, logging, threading.
- Library discovery: the AI knows which Python library handles screen capture, which handles OCR, which handles mouse control, and writes the glue code between them.
- Error diagnosis: paste a traceback and get a root cause, not a Stack Overflow link.
- Calibration to your specific physical setup.
- Judgment calls about when a "theoretically better" approach is too complex to be reliable.
- Testing against real hardware.
- Recognizing when the AI's confident output is wrong, because you're the only one who can see the screen.
This is not a story about AI replacing skilled people. It's a story about a skilled person with a clear problem getting a capable coding partner on demand. The inventor's background in image processing let them recognize OCR failures, diagnose alignment issues, and evaluate cascade behavior — none of that came from the AI.
What the AI provided: speed. The architecture that would have taken days to research and write took an hour of conversation. The debugging that would have required forum searches and documentation reading happened in real time. The pivot from OCR to template matching was a one-exchange decision.
Something worked in an hour. It wasn't perfect. It was real. That's the point.