Agency plus automation: Designing artificial intelligence into interactive systems
Alternative Titles
- Why the pursuit of full automation cedes a more valuable design space
- The guide-decide loop: making algorithms suggest while users decide
- Shared representations as the hidden scaffolding of human-AI collaboration
- When machine translation spoils creativity but improves consistency
Key Takeaways
- Heer argues the dominant rhetoric of full automation is myopic: it misreads what people actually need, ignores the human labor sustaining 'automated' services, and cedes a richer design space where computation augments rather than replaces intellectual work. Pure automation also fails under biased data, fixed objectives that don't adapt, and adversarial manipulation, and it erodes users' critical engagement — with real-world consequences ranging from flawed Google Flu Trends predictions to Lion Air-style crashes where automated systems and crew failed to coordinate.
- The integration problem is not new: Bar-Hillel framed it in 1960 as 'determining the region of optimality in the continuum of possible divisions of labor,' and the 1990s direct-manipulation-vs-agents debate concluded in favor of automation that amplifies productivity while preserving user control and responsibility. Heer's contribution is a concrete strategy: design shared representations of possible actions so the system can reason about the user's task while the interface lets people review, select, revise, or dismiss algorithmic suggestions.
- Everyday augmentations (spell/grammar check, query autocomplete) already exhibit the core principles: they add value via efficiency and surfacing alternatives, augment without replacing, require neither perfect accuracy nor a complete task model to be useful (spelling is a subtask of writing), and let both humans and machines learn incrementally from interaction. These, plus Horvitz's 'Principles of Mixed-Initiative User Interfaces,' form the design baseline.
- Data Wrangler (commercialized as Trifacta) uses 'predictive interaction': simple selections on rows/columns/cells trigger ranked transform suggestions drawn from Wrangle, a domain-specific language that doubles as a formal search space. Users can still write code, but typically engage a guide-decide loop where selections provide evidence of intent. Critically, proactive suggestions shown before any user interaction were rejected — but the same suggestions were accepted when interleaved with reactive ones, because users saw themselves as the initiator. Trifacta later invested heavily in structured graphical editors so users could fix 'close, but not perfect' suggestions.
- Voyager treats Vega-Lite as the shared representation and uses logic programming plus perceptual-effectiveness rankings (learned from human studies, retrainable) to recommend related views seeded from the user's current focus chart — not rampant data mining. Studies showed recommendations shifted users from depth-first exploration to broader coverage, and recommended views accounted for a significant share of charts bookmarked as interesting. One participant warned the suggestions were 'so good but it's also spoiling that I start thinking less' — an early signal of the agency/automation trade-off.
- Predictive Translation Memory (commercialized as Lilt) seeds a text editor with machine translation, then retranslates untouched spans in real time as the human edits, while user edits incrementally retrain the MT engine for in-session domain adaptation. Over 99% of characters were entered via interactive aids; PTM beat post-editing on BLEU+1 quality and the MT engine improved more from fine-grained PTM inputs than from post-edits. Interaction details mattered: translation updates had to arrive within 300 ms or the interface felt 'sluggish,' and visualizing raw source-target word alignments was cut as distracting.
- Shared representations sit on a spectrum. Wrangler and Voyager expose an explicit DSL plus predictive model plus graphical mapping; PTM exposes only text because translators don't consciously reason in parse trees — and that text-only interface survived a complete swap from beam search to neural MT without interface rework. Choice of representation depends on task, data, and user mental models. In all three, the collaboration is deliberately asymmetric: the machine proposes, the human decides.
- Automation can reshape behavior in uncomfortable ways. PTM translators reported MT 'puts words in my head' and reduces creativity, and produced more consistent (but more primed) translations — which individual translators may dislike while customers buying team output may prefer. Heer frames this as a genuine design trade-off, not a solved problem, and calls for evaluations that cover perceived autonomy, skill acquisition ('level up, not deskill'), and whether designs prompt critical engagement or passive acceptance throughout the deployment life cycle — not just task time and quality at launch.
- Open frontiers Heer flags: UI toolkits that bundle task modeling, inference, learning, and model monitoring so AI-infused systems don't each require multi-year bespoke builds; methods for designers and end users to inspect models, do error analysis, and track drift; learning shared representations from data (via VAEs, GANs, word embeddings, latent spaces as in astronomy and genomics) rather than hand-engineering them; and principled accounting for bias from limited or unrepresentative input data. The recurring user demand across commercialized deployments was more control, not less — forcing finer-grained interactive specification tools.
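The everyday augmentations bullet (query autocomplete) can be sketched as a "suggest, don't decide" loop: the system surfaces ranked alternatives drawn from prior behavior, and the user is free to pick one or ignore them all. This is a toy illustration, not how any production autocomplete is implemented; the frequency-ranked query log is an assumption for the example.

```python
# Minimal prefix autocomplete: the machine proposes, the human decides.
# The query log and frequency ranking are toy assumptions.
def autocomplete(prefix, log):
    """Rank past queries that extend the prefix by frequency (toy model)."""
    hits = {q: n for q, n in log.items() if q.startswith(prefix) and q != prefix}
    return sorted(hits, key=lambda q: -hits[q])[:3]

log = {"weather seattle": 9, "weather sf": 4, "web fonts": 2}
# User typed "we"; suggestions are offered but typing continues unimpeded.
print(autocomplete("we", log))  # → ['weather seattle', 'weather sf', 'web fonts']
```

Note that the suggestion function needs neither perfect accuracy nor a complete model of the user's task to add value, which is exactly the property Heer highlights in spell check and autocomplete.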
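The Wrangler guide-decide loop described above can be sketched as: a user selection supplies evidence of intent, the system enumerates candidate programs from a transform DSL consistent with that evidence, ranks them, and the user reviews, accepts, revises, or dismisses. The transform names and ranking heuristic below are hypothetical simplifications; the real Wrangle language and its inference are far richer.

```python
# Sketch of predictive interaction: selection -> candidate DSL programs -> ranked suggestions.
# Transform names ("split", "drop_column", ...) are illustrative, not the real Wrangle DSL.
from dataclasses import dataclass

@dataclass(frozen=True)
class Transform:
    name: str
    args: tuple

def candidate_transforms(selection, table):
    """Enumerate transforms consistent with the user's selection (evidence of intent)."""
    cands = []
    if selection["kind"] == "row" and all(v == "" for v in table[selection["index"]]):
        cands.append(Transform("delete_empty_rows", ()))
    if selection["kind"] == "column":
        col = selection["index"]
        if any("," in row[col] for row in table):
            cands.append(Transform("split", (col, ",")))
        cands.append(Transform("drop_column", (col,)))
    return cands

def rank(cands, history):
    """Rank candidates by how often similar transforms were accepted before (toy scoring)."""
    return sorted(cands, key=lambda t: -history.get(t.name, 0))

table = [["a,b", "1"], ["c,d", "2"]]
selection = {"kind": "column", "index": 0}
suggestions = rank(candidate_transforms(selection, table), history={"split": 3})
print([t.name for t in suggestions])  # → ['split', 'drop_column']
```

Treating the DSL as a formal search space is what lets the system enumerate and rank suggestions at all; the interface then owns the decide half of the loop.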
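Voyager's recommendation strategy can be sketched over a Vega-Lite-like spec represented as a plain dict: enumerate neighboring views seeded from the user's focus chart (one added encoding at a time) and rank them by perceptual effectiveness. The effectiveness scores here are invented placeholders; Voyager learns its rankings from human perception studies.

```python
# Sketch of focus-seeded view recommendation over a declarative chart spec.
# The spec shape mimics Vega-Lite; fields and effectiveness scores are made up.
FOCUS = {"mark": "point",
         "encoding": {"x": {"field": "mpg", "type": "quantitative"}}}

FIELDS = [("hp", "quantitative"), ("origin", "nominal")]
EFFECTIVENESS = {("quantitative", "y"): 1.0, ("nominal", "color"): 0.8,
                 ("nominal", "y"): 0.5, ("quantitative", "color"): 0.4}

def neighbors(spec):
    """Yield (score, spec) pairs: the focus chart plus one additional encoding."""
    for field, ftype in FIELDS:
        for channel in ("y", "color"):
            if channel in spec["encoding"]:
                continue  # channel already used by the focus chart
            new = {"mark": spec["mark"],
                   "encoding": {**spec["encoding"],
                                channel: {"field": field, "type": ftype}}}
            yield EFFECTIVENESS[(ftype, channel)], new

ranked = sorted(neighbors(FOCUS), key=lambda p: -p[0])
for score, spec in ranked[:2]:
    print(score, sorted(spec["encoding"]))
```

Seeding from the focus chart, rather than mining the whole dataset, keeps the human's current question at the center while still nudging exploration toward broader coverage.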
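The PTM retranslation behavior can be sketched as prefix-preserving re-decoding: the human-edited prefix is held fixed and only the untouched suffix is regenerated. The word-by-word `toy_mt` below is a stand-in assumption; PTM/Lilt uses full MT decoders (and, as noted above, the text-only interface survived a swap from beam search to neural MT).

```python
# Toy sketch of prefix-preserving retranslation. `toy_mt` and its lexicon are
# illustrative stand-ins for a real MT engine.
def toy_mt(source_words, forced_prefix):
    """Decode a translation whose output is forced to begin with the user's edits."""
    lexicon = {"das": "the", "haus": "house", "ist": "is", "alt": "old"}
    out = list(forced_prefix)
    for w in source_words[len(forced_prefix):]:
        out.append(lexicon.get(w, w))  # retranslate only the untouched span
    return out

source = "das haus ist alt".split()
draft = toy_mt(source, [])                # initial MT seed shown to the translator
edited_prefix = ["the", "building"]       # the human revises the second word
revised = toy_mt(source, edited_prefix)   # suffix re-decoded around the edit
print(draft)    # → ['the', 'house', 'is', 'old']
print(revised)  # → ['the', 'building', 'is', 'old']
```

In the real system this re-decoding must complete within roughly 300 ms to avoid feeling sluggish, and the same edit stream feeds incremental retraining of the MT engine.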