Gaze Control — webcam-only multimodal macOS controller
Goal
Build a Vision Pro–style controller for a flat screen (regular Mac display) driven by eyes + head + fingers + gestures, using only the built-in webcam — no extra hardware.
Hard constraints
- Webcam is the ONLY input sensor. No Tobii, no Leap Motion, no depth cam.
- Target OS: macOS (Apple Silicon).
- The end user (project owner) does not write code directly — explain changes plainly and keep the run/test loop simple.
Core design principle (do not violate)
Do NOT build a floating gaze cursor. Webcam gaze is only accurate to ~1–4 cm and jitters with natural eye saccades — this is a biological ceiling, not a software bug, so more ML will not fix it. Instead, copy how Vision Pro actually works:
- Gaze selects/highlights the UI element you're looking at (focus), it does not paint pixels.
- A pinch activates the focused element.
- Use the macOS Accessibility API (AXUIElement) to read on-screen element frames so gaze can snap focus to the nearest element. This is the secret sauce that makes imprecise gaze usable.
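The snapping idea above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: it assumes element frames have already been read (for example via AXUIElement) into simple rectangles, and the `Element` type, `snap_focus` function, and the 120-point `max_dist` radius are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Element:
    """Hypothetical UI element: a label plus its on-screen frame."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def distance_to_frame(el: Element, gx: float, gy: float) -> float:
    """Distance from the gaze point to the element's rectangle (0 if inside)."""
    dx = max(el.x - gx, 0.0, gx - (el.x + el.w))
    dy = max(el.y - gy, 0.0, gy - (el.y + el.h))
    return (dx * dx + dy * dy) ** 0.5


def snap_focus(
    elements: Sequence[Element],
    gx: float,
    gy: float,
    max_dist: float = 120.0,  # assumed snap radius, a tuning knob
) -> Optional[Element]:
    """Return the element nearest to the gaze point, or None if none is close."""
    best, best_d = None, float("inf")
    for el in elements:
        d = distance_to_frame(el, gx, gy)
        if d <= max_dist and d < best_d:
            best, best_d = el, d
    return best


if __name__ == "__main__":
    buttons = [Element("Save", 100, 100, 80, 30), Element("Cancel", 200, 100, 80, 30)]
    focused = snap_focus(buttons, 185, 140)
    print(focused.label if focused else "nothing in range")
```

Measuring distance to the rectangle rather than to its center means a gaze point anywhere inside a large element counts as distance zero, so big targets are not penalized for their size.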
Modality roles
- Eyes (gaze): coarse target — which region/element.
- Head pose: fine precision adjustment (head is far more stable than gaze).
- Finger pinch (thumb+index): primary click/select.
- Gestures: swipe = scroll, two-finger = drag, dwell = hover-to-click alternative, etc.
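As a rough illustration of the pinch role, the sketch below turns a normalized thumb-to-index distance into click events using two thresholds (hysteresis) and a cooldown. The landmark indices follow the MediaPipe Hands layout (0 wrist, 4 thumb tip, 8 index tip, 9 middle-finger MCP). Everything else is an assumption made for the example: the `PinchDetector` and `pinch_ratio` names, the normalization by wrist-to-knuckle distance, and the threshold and cooldown values.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pinch_ratio(landmarks: Sequence[Point]) -> float:
    """Thumb-index tip distance divided by hand size, so it is scale-invariant."""
    wrist, thumb_tip, index_tip, middle_mcp = (
        landmarks[0], landmarks[4], landmarks[8], landmarks[9],
    )
    hand_size = math.dist(wrist, middle_mcp)
    if hand_size == 0.0:
        return float("inf")  # degenerate detection: treat as "not pinching"
    return math.dist(thumb_tip, index_tip) / hand_size


class PinchDetector:
    """Fires one click per pinch, using hysteresis plus a cooldown."""

    def __init__(self, press_below=0.25, release_above=0.40, cooldown_s=0.30):
        # All three values are assumed starting points, not measured ones.
        self.press_below = press_below
        self.release_above = release_above
        self.cooldown_s = cooldown_s
        self.pinched = False
        self._last_click_t = float("-inf")

    def update(self, ratio: float, t: float) -> bool:
        """Feed one frame; returns True only on the frame a click should fire."""
        if self.pinched:
            # Stay pinched until the fingers clearly separate again.
            if ratio > self.release_above:
                self.pinched = False
            return False
        if ratio < self.press_below:
            self.pinched = True
            if t - self._last_click_t >= self.cooldown_s:
                self._last_click_t = t
                return True
        return False
```

The gap between the two thresholds is what stops a pinch hovering near the boundary from firing repeated clicks. The cooldown additionally suppresses a second click that follows too quickly after the first.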
Stack decision
- Now (validation): Python prototype — MediaPipe (FaceMesh + Hands) + OpenCV + pyautogui/pynput. Purpose is to nail the interaction feel, not to ship. Throwaway-friendly.
- Later (product): native Swift app built on the Vision framework (hand pose, face landmarks) + AVFoundation + CGEvent/AXUIElement + a transparent always-on-top overlay window. Only after the feel is proven.
Current state
gaze_pinch_v0.py is the RAW prototype: iris-ratio gaze → cursor, 9-point quadratic calibration, One Euro smoothing, thumb-index pinch → click (hysteresis + cooldown). No element snapping yet (intentionally raw, to feel the baseline jitter).
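For reference, One Euro smoothing can be sketched as below. This follows the published One Euro filter (an adaptive low-pass filter whose cutoff rises with signal speed). It is not copied from gaze_pinch_v0.py, and the default `min_cutoff`, `beta`, and `d_cutoff` values are assumptions to be tuned by feel. One filter instance per axis (x and y) is assumed.

```python
import math


class OneEuroFilter:
    """Adaptive low-pass filter: heavy smoothing when still, light when moving."""

    def __init__(self, min_cutoff=1.0, beta=0.01, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # Hz; lower = smoother but laggier at rest
        self.beta = beta              # higher = less lag during fast movement
        self.d_cutoff = d_cutoff      # Hz; cutoff for the speed estimate
        self._x_prev = None
        self._dx_prev = 0.0
        self._t_prev = None

    @staticmethod
    def _alpha(cutoff: float, dt: float) -> float:
        """Smoothing factor of a first-order low-pass at the given cutoff."""
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x: float, t: float) -> float:
        """Filter one sample x taken at time t (seconds)."""
        if self._x_prev is None:
            self._x_prev, self._t_prev = x, t
            return x
        dt = t - self._t_prev
        if dt <= 0.0:
            return self._x_prev  # ignore out-of-order or duplicate timestamps
        # Estimate and smooth the signal's speed.
        dx = (x - self._x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self._dx_prev
        # Faster movement raises the cutoff, which reduces lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self._x_prev
        self._x_prev, self._dx_prev, self._t_prev = x_hat, dx_hat, t
        return x_hat


if __name__ == "__main__":
    fx, fy = OneEuroFilter(), OneEuroFilter()
    for i, (gx, gy) in enumerate([(500, 400), (506, 397), (498, 403), (503, 399)]):
        t = i / 30.0  # assumes roughly 30 fps webcam frames
        print(round(fx(gx, t), 1), round(fy(gy, t), 1))
```

When tuning, `min_cutoff` mostly controls jitter while the eyes are resting on a target, and `beta` mostly controls how far the cursor trails behind a fast glance.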
Roadmap
- v1a: Accessibility-API element snapping + focus highlight overlay (biggest felt improvement).
- v1b: head-pose fine-adjust channel (gaze warp → small head move for pixel precision).
- v1c: gesture vocabulary (scroll/drag/back) + dwell-click option.
- v1d: transparent SwiftUI/AppKit overlay for visual feedback.
- v2: port the proven interaction to a native Swift app.
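One possible shape for the v1b idea (gaze warp, then small head moves for precision) is sketched below. It is only an illustration under stated assumptions: the `WarpThenRefine` name, the 150 px warp threshold, the gain in pixels per degree, and the sign conventions (positive yaw moves right, positive pitch moves down) are all hypothetical and would need tuning by feel.

```python
import math
from typing import Optional, Tuple


class WarpThenRefine:
    """Gaze picks the region; head rotation nudges the pointer within it."""

    def __init__(self, warp_dist_px: float = 150.0, head_gain_px_per_deg: float = 12.0):
        self.warp_dist_px = warp_dist_px  # assumed: gaze jump that counts as a new target
        self.head_gain = head_gain_px_per_deg  # assumed: pointer pixels per degree of head turn
        self._anchor: Optional[Tuple[float, float]] = None
        self._yaw0 = 0.0
        self._pitch0 = 0.0

    def update(
        self, gaze_xy: Tuple[float, float], yaw_deg: float, pitch_deg: float
    ) -> Tuple[float, float]:
        """Return the pointer position for this frame."""
        gx, gy = gaze_xy
        moved_far = (
            self._anchor is None
            or math.hypot(gx - self._anchor[0], gy - self._anchor[1]) > self.warp_dist_px
        )
        if moved_far:
            # Gaze landed somewhere new: warp there and re-zero the head reference.
            self._anchor = (gx, gy)
            self._yaw0, self._pitch0 = yaw_deg, pitch_deg
        ax, ay = self._anchor
        # Small gaze jitter is ignored; only head rotation moves the pointer now.
        x = ax + (yaw_deg - self._yaw0) * self.head_gain
        y = ay + (pitch_deg - self._pitch0) * self.head_gain
        return x, y
```

Re-zeroing the head reference at each warp means the user never has to hold an awkward head angle: every new gaze target starts from whatever pose the head is already in.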
Run
source .venv/bin/activate
python gaze_pinch_v0.py
macOS permissions required (System Settings > Privacy & Security): Camera and Accessibility, both granted to the terminal/app that runs Python. Without Accessibility, macOS blocks the synthetic mouse events and the cursor will not move.
Testing note
Claude Code cannot see the webcam feed or feel the cursor. Testing is human-in-the-loop: the owner runs it, reports what feels wrong (drift direction, click misfires, calibration mismatch), and we adjust the tuning knobs at the top of the script accordingly.