---
title: "Gaze Control — webcam-only multimodal macOS controller"
url: https://memory.wiki/xH0i5alB
updated: 2026-06-02T17:31:57.749Z
hub: https://memory.wiki/hub/raymindai
bundle_count: 1
concept_count: 12
source: "memory.wiki"
---
# Gaze Control — webcam-only multimodal macOS controller

## Goal
Build a Vision Pro–style controller for a **flat screen (regular Mac display)** driven by
**eyes + head + fingers + gestures**, using **only the built-in webcam** — no extra hardware.

## Hard constraints
- Webcam is the ONLY input sensor. No Tobii, no Leap Motion, no depth cam.
- Target OS: macOS (Apple Silicon).
- The end user (project owner) does not write code directly — explain changes plainly and
  keep the run/test loop simple.

## Core design principle (do not violate)
**Do NOT build a floating gaze cursor.** Webcam gaze is only accurate to ~1–4 cm and jitters
with natural eye saccades — this is a biological ceiling, not a software bug, so more ML will
not fix it. Instead, copy how Vision Pro actually works:
- Gaze selects/highlights the **UI element** you're looking at (focus), it does not paint pixels.
- A **pinch** activates the focused element.
- Use the macOS **Accessibility API (AXUIElement)** to read on-screen element frames so gaze can
  **snap focus to the nearest element**. This is the secret sauce that makes imprecise gaze usable.

## Modality roles
- **Eyes (gaze):** coarse target — which region/element.
- **Head pose:** fine precision adjustment (head is far more stable than gaze).
- **Finger pinch (thumb+index):** primary click/select.
- **Gestures:** swipe = scroll, two-finger = drag, dwell = hover-to-click alternative, etc.

## Stack decision
- **Now (validation):** Python prototype — MediaPipe (FaceMesh + Hands) + OpenCV + pyautogui/pynput.
  Purpose is to nail the *interaction feel*, not to ship. Throwaway-friendly.
- **Later (product):** native Swift app — Vision framework (hand pose, face landmarks) + AVFoundation
  + CGEvent/AXUIElement + a transparent always-on-top overlay window. Only after the feel is proven.

## Current state
- `gaze_pinch_v0.py` — RAW prototype: iris-ratio gaze → cursor, 9-point quadratic calibration,
  One Euro smoothing, thumb-index pinch → click (hysteresis + cooldown). No element snapping yet
  (intentionally raw, to feel the baseline jitter).

## Roadmap
- v1a: Accessibility-API element snapping + focus highlight overlay (biggest felt improvement).
- v1b: head-pose fine-adjust channel (gaze warp → small head move for pixel precision).
- v1c: gesture vocabulary (scroll/drag/back) + dwell-click option.
- v1d: transparent SwiftUI/AppKit overlay for visual feedback.
- v2: port the proven interaction to a native Swift app.

## Run
```
source .venv/bin/activate
python gaze_pinch_v0.py
```
macOS permissions required (System Settings > Privacy & Security): **Camera** and **Accessibility**
for the terminal/app running python — without Accessibility the cursor will not move.

## Testing note
Claude Code cannot see the webcam feed or feel the cursor. Testing is human-in-the-loop: the owner
runs it, reports what feels wrong (drift direction, click misfires, calibration mismatch), and we
adjust the tuning knobs at the top of the script accordingly.


---

## Summary
A macOS controller project uses only the built-in webcam to enable Vision Pro-style interaction through gaze-based element selection combined with head pose, finger pinches, and gestures. Rather than tracking a cursor, gaze selects UI elements via the Accessibility API while pinches activate them, overcoming webcam gaze imprecision by snapping focus to nearby on-screen elements.

## Themes
- webcam-only input
- vision pro interaction model
- accessibility-driven UI snapping

## Key takeaways
- Webcam is the only input sensor, no additional hardware like Tobii or Leap Motion is permitted.
- Gaze selects UI elements via Accessibility API snapping rather than controlling a floating cursor, because webcam gaze accuracy is only 1-4 cm with natural jitter.
- Pinch (thumb and index finger) is the primary activation mechanism for focused elements, copying Vision Pro interaction.
- The current prototype (gaze_pinch_v0.py) includes iris-ratio gaze tracking, 9-point calibration, One Euro smoothing, and pinch detection but lacks element snapping.
- The roadmap prioritizes element snapping and focus highlight overlay as the next phase (v1a) because it will produce the biggest felt improvement.
- macOS Accessibility permissions are required for the script to move the cursor.

## Insights
- Gaze cursor jitter is a biological ceiling not solvable by ML, so the design copies Vision Pro by snapping gaze to nearby UI elements rather than tracking pixels directly.
- The stack deliberately uses throwaway Python prototyping to validate interaction feel before committing to native Swift, treating the current version as intentionally raw to measure baseline performance.
- Head pose is positioned as a fine-adjustment channel after gaze selects a coarse target, inverting typical pointer design where gaze alone drives movement.

## Open questions / gaps
- What specific visual feedback or highlight style will be used to show which UI element is currently focused by gaze?
- How will the head-pose fine-adjust channel in v1b measure and warp gaze data based on head position changes?

## Concepts in this document
- **macOS Accessibility API** _(entity)_
  Integration point for detecting and snapping to UI elements in the production system.
- **Vision framework** _(entity)_
  Apple's native computer vision framework enabling hand pose detection and facial landmark analysis on macOS.
- **MediaPipe** _(entity)_
  Google's ML framework providing FaceMesh and Hands models for real-time landmark detection.
- **Vision Pro interaction paradigm** _(entity)_
  Reference architecture and design philosophy guiding the macOS controller implementation approach.
- **Multimodal input fusion** _(concept)_
  Design pattern combining eyes, head, and fingers to create robust interaction beyond single modality.
- **Pinch Gesture** _(entity)_
  Thumb-index finger proximity detection with hysteresis thresholds for reliable click activation.
- **One Euro Filter** _(entity)_
  Smoothing algorithm with tunable parameters for balancing responsiveness vs noise reduction.
- **9-Point Calibration** _(concept)_
  Quadratic mapping system that transforms raw gaze ratios to accurate screen coordinates.
- **Snap-to-UI-Element** _(concept)_
  Critical technique using accessibility API to map imprecise gaze to actual UI components.
- **Human-Computer Interaction** _(tag)_
  Design discipline focused on natural, intuitive interfaces that match human capabilities.
- **Computer Vision** _(tag)_
  Core technology domain enabling real-time face and hand tracking from webcam input.
- **Multimodal Input Strategy** _(concept)_
  Combining eyes for coarse targeting, head for precision, and fingers for activation in layered approach.

## Concept relations (within this doc's concepts)
- **Multimodal Input Strategy** uses hand component **Pinch Gesture**
- **Multimodal input fusion** includes modality **Pinch Gesture**
- **Vision Pro interaction paradigm** enables through **Snap-to-UI-Element**
- **macOS Accessibility API** enables implementation **Snap-to-UI-Element**

## Bundles containing this document
- [Gaze control macOS prototype](https://memory.wiki/b/BT59F4sC)

_Hub canonical:_ https://memory.wiki/hub/raymindai
_Concept digest:_ https://memory.wiki/raw/hub/raymindai?digest=1&compact=1