Gaze Control — webcam-only multimodal macOS controller

Goal

Build a Vision Pro–style controller for a flat screen (regular Mac display) driven by eyes + head + fingers + gestures, using only the built-in webcam — no extra hardware.

Hard constraints

  • Webcam is the ONLY input sensor. No Tobii, no Leap Motion, no depth cam.
  • Target OS: macOS (Apple Silicon).
  • The end user (project owner) does not write code directly — explain changes plainly and keep the run/test loop simple.

Core design principle (do not violate)

Do NOT build a floating gaze cursor. Webcam gaze is only accurate to ~1–4 cm and jitters with natural eye saccades — this is a biological ceiling, not a software bug, so more ML will not fix it. Instead, copy how Vision Pro actually works:

  • Gaze selects/highlights the UI element you're looking at (focus), it does not paint pixels.
  • A pinch activates the focused element.
  • Use the macOS Accessibility API (AXUIElement) to read on-screen element frames so gaze can snap focus to the nearest element. This is the secret sauce that makes imprecise gaze usable.

Modality roles

  • Eyes (gaze): coarse target — which region/element.
  • Head pose: fine precision adjustment (head is far more stable than gaze).
  • Finger pinch (thumb+index): primary click/select.
  • Gestures: swipe = scroll, two-finger = drag, dwell = hover-to-click alternative, etc.

Stack decision

  • Now (validation): Python prototype — MediaPipe (FaceMesh + Hands) + OpenCV + pyautogui/pynput. Purpose is to nail the interaction feel, not to ship. Throwaway-friendly.
  • Later (product): native Swift app — Vision framework (hand pose, face landmarks) + AVFoundation
    • CGEvent/AXUIElement + a transparent always-on-top overlay window. Only after the feel is proven.

Current state

  • gaze_pinch_v0.py — RAW prototype: iris-ratio gaze → cursor, 9-point quadratic calibration, One Euro smoothing, thumb-index pinch → click (hysteresis + cooldown). No element snapping yet (intentionally raw, to feel the baseline jitter).

Roadmap

  • v1a: Accessibility-API element snapping + focus highlight overlay (biggest felt improvement).
  • v1b: head-pose fine-adjust channel (gaze warp → small head move for pixel precision).
  • v1c: gesture vocabulary (scroll/drag/back) + dwell-click option.
  • v1d: transparent SwiftUI/AppKit overlay for visual feedback.
  • v2: port the proven interaction to a native Swift app.

Run

source .venv/bin/activate python gaze_pinch_v0.py

macOS permissions required (System Settings > Privacy & Security): Camera and Accessibility for the terminal/app running python — without Accessibility the cursor will not move.

Testing note

Claude Code cannot see the webcam feed or feel the cursor. Testing is human-in-the-loop: the owner runs it, reports what feels wrong (drift direction, click misfires, calibration mismatch), and we adjust the tuning knobs at the top of the script accordingly.