Show HN: Cua Driver – background multi-cursor via macOS SkyLight.framework

2 points, posted 7 hours ago by frabonacci

1 comment

frabonacci

7 hours ago

Hi HN, Francesco from Cua here. I hacked this together over a weekend after getting curious about whether macOS could support real background computer-use outside a single vendor's agent product.

The first thing we are using it for is recording product demos. We used to use Screen Studio; now we ask Claude Code + cua-driver to drive the app while cua-driver recording start captures the trajectory, screenshots, actions, and click markers. We canceled our Screen Studio subscription, which started as a joke and then became true.

The problem: most GUI agents still assume the desktop has one shared cursor, one focused app, and one human who is okay being interrupted. That makes local desktop agents awkward. The agent can do the task, but it steals your screen while doing it.

cua-driver is our attempt to make background computer-use a commodity primitive for macOS: let an agent drive a real Mac app while your cursor, focus, and Space stay where they are. The default interface is a CLI, so it is easy to script, easy for coding agents to call from a shell, and still compatible with MCP clients when you want that.

You can try it on macOS 14+:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-d...)"

CLI example:

cua-driver serve &

cua-driver recording start ~/cua-trajectories/demo1

cua-driver launch_app '{"bundle_id":"com.apple.calculator"}'

cua-driver list_windows '{"pid":12345}'

cua-driver get_window_state '{"pid":12345,"window_id":67890}'

cua-driver click '{"pid":12345,"window_id":67890,"element_index":14}'

cua-driver recording stop
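If you would rather script these calls from Python than from a shell, here is a minimal sketch that builds the same command lines. It assumes only the `cua-driver <tool> '<json>'` shape shown above; `build_command` and `shell_line` are hypothetical helper names, not part of cua-driver, and the sketch does not cover the `recording start`/`recording stop` subcommands, which take plain arguments rather than JSON.

```python
import json
import shlex


def build_command(tool, args=None):
    """Return an argv list for a JSON-argument cua-driver tool call.

    Hypothetical helper: it only mirrors the CLI shape shown in the post.
    """
    argv = ["cua-driver", tool]
    if args is not None:
        # Compact separators reproduce the whitespace-free JSON used above.
        argv.append(json.dumps(args, separators=(",", ":")))
    return argv


def shell_line(argv):
    """Render an argv list as a single, safely quoted shell line."""
    return shlex.join(argv)


click = build_command(
    "click", {"pid": 12345, "window_id": 67890, "element_index": 14}
)
print(shell_line(click))
```

Passing the argv list to `subprocess.run(click, check=True)` avoids shell quoting entirely; `shell_line` is only there for logging or pasting into a terminal.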

The recording command writes turn-NNNNN/ folders with the post-action app state, screenshot, action JSON, and a click.png marker overlay for click-family actions. You can replay a saved run with cua-driver replay_trajectory '{"dir":"~/cua-trajectories/demo1"}', which is useful for regression captures even when you are not trying to make a polished marketing video.
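To inspect a recorded run without replaying it, something like the following sketch can walk the turn folders in order. It assumes that `NNNNN` means exactly five zero-padded digits and that the marker file is literally named `click.png`; `list_turns` and `has_click_marker` are hypothetical helpers, and the other files inside each turn folder are not modeled here.

```python
import re
from pathlib import Path

# Assumption: turn folders are named turn- followed by exactly five digits.
TURN_DIR = re.compile(r"^turn-(\d{5})$")


def list_turns(trajectory_dir):
    """Return the turn-NNNNN directories under trajectory_dir, in turn order."""
    root = Path(trajectory_dir).expanduser()
    turns = []
    for entry in root.iterdir():
        match = TURN_DIR.match(entry.name)
        if match and entry.is_dir():
            turns.append((int(match.group(1)), entry))
    turns.sort(key=lambda pair: pair[0])
    return [path for _, path in turns]


def has_click_marker(turn_dir):
    """True if the turn folder contains a click.png marker overlay."""
    return (Path(turn_dir) / "click.png").is_file()
```

For example, `[t.name for t in list_turns("~/cua-trajectories/demo1") if has_click_marker(t)]` would list only the turns that recorded a click-family action.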

What made this harder than expected:

- CGEventPost warps the cursor (it goes through the HID stream, same one your physical mouse uses)

- CGEvent.postToPid doesn't warp the cursor but Chromium silently drops the event at the renderer IPC boundary

- Activating the target first raises the window AND drags you across Spaces on multi-monitor setups

- Electron apps stop keeping useful AX trees alive when their windows are occluded, unless you register the observer through a private remote-aware SPI

The unlock was a private Apple framework called SkyLight. SLEventPostToPid is a sibling of the public per-pid call, but it travels through a WindowServer channel Chromium accepts as trusted. Pair it with yabai's focus-without-raise pattern (two SLPSPostEventRecordTo calls, deliberately skipping SLPSSetFrontProcessWithOptions) plus an off-screen primer click at (-1, -1) to tick Chromium's user-activation gate, and the click lands without the window ever raising.

The thing we learned while building it: the primary addressing mode should not be pixels. cua-driver exposes ax, vision, and som (set-of-marks) modes, but element-indexed AX actions are the happy path. Pixels are the fallback for canvas/WebGL/video surfaces. That makes agents much less brittle because they can click "the Send button" instead of guessing coordinates, while still having a screenshot when the AX tree is ambiguous.
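A rough sketch of that "AX first, pixels as fallback" decision, from the caller's side. The element dictionaries with `index` and `title` keys are a hypothetical shape chosen for illustration; the real `get_window_state` output may look different. Only the `pid`, `window_id`, and `element_index` keys of the click payload come from the CLI example above.

```python
def plan_click(pid, window_id, elements, label):
    """Return a click payload addressed by element index, or None.

    elements is a hypothetical list of dicts such as
    {"index": 14, "title": "Send"}. None means the AX tree was missing
    or ambiguous for this label, so the caller should fall back to a
    screenshot-based mode instead.
    """
    matches = [e for e in elements if e.get("title") == label]
    if len(matches) != 1:
        return None
    return {
        "pid": pid,
        "window_id": window_id,
        "element_index": matches[0]["index"],
    }


elements = [
    {"index": 13, "title": "Attach"},
    {"index": 14, "title": "Send"},
]
payload = plan_click(12345, 67890, elements, "Send")
```

Refusing to guess when zero or several elements match keeps the pixel path as an explicit, visible fallback rather than a silent default.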

Other things we have used it for:

- A dev-loop QA agent that reproduces a visual bug, edits code, rebuilds, and verifies the UI while my editor stays frontmost

- A personal-assistant style flow that sends a Messages reply without switching Spaces

- Pulling visual context from Chrome/Figma/Preview/YouTube windows I am not looking at

Long technical writeup: https://github.com/trycua/cua/blob/main/blog/inside-macos-wi...

I would especially like feedback from people building Mac automation, agent harnesses, MCP clients, or accessibility tooling. If you try it and it breaks on an app you care about, that is useful data.