
Modality as a design decision – why we started asking about it earlier

A few weeks ago I came across an article about an internal dispute at Pinterest. The CEO wanted to go all-in on voice, arguing that Gen Z expects something that feels like “talking to a friend”, while the designers and product leaders pushed back, because Pinterest is built around quiet, visual exploration and voice simply doesn’t fit why people go there in the first place.

I’m not bringing this up to take sides. I’m bringing it up because it captures a tension we’re seeing more and more with our clients: the choice of interface modality has become one of the more consequential decisions in product design, and yet many teams make it late, almost in passing, or based on what’s trending – rather than letting it follow from what users actually need and the context in which they operate.

Because we at Boldare work alongside clients throughout the discovery process, these questions started coming up naturally in our workshops, and over time we decided to give them a dedicated space of their own, so there’s actually room to work through them properly.



What’s changed

Not long ago, interface modality was essentially a given – you designed screens, flows, and components, and the user’s input was obvious: keyboard, click, touch. Today a team can choose voice, text, image, video, documents, audio, or some combination, and that freedom is both an opportunity and a source of some fairly serious design mistakes.

Look at what’s been happening across the industry. Revolut deployed voice agents handling customer support in 30+ languages – voice replaced the traditional IVR tree because conversation is simply more natural than pressing numbers on a phone. Salesforce built Agentforce Contact Center, bringing together voice calls, CRM data, and AI agents in a single flow with real-time transcription. Headspace added Ebb, a voice-based mental health companion that listens to spoken emotions and remembers context across sessions – because voice carries emotional weight that text often can’t.

On the other side, Lyft built the Cosmos vision-language platform to process live camera feeds for driver routing, Miro taught its AI Sidekicks to read the full visual context of a canvas before responding, and Google Stitch lets you speak to a design canvas, upload sketches, and describe the “feeling” of an interface – with the agent holding all of that context at once.

In each of these cases, modality wasn’t a feature add-on or a “because we can” decision – it was the architecture of the product itself, shaped by what users are trying to accomplish and the conditions under which they’re doing it.

Three frameworks that help make sense of it

We didn’t start from scratch here – we took tools we’ve been using for years and started asking an additional question about modality alongside them.

Jobs-to-be-Done: what is the user actually trying to do?

JTBD asks what “job” the user is hiring this product to do – and that question leads surprisingly directly to modality, because different jobs happen in different physical and emotional contexts.

If the job is navigation while moving – driving, running, cycling – voice suggests itself naturally, which is exactly why Google Maps has been one of the most widely used voice interfaces for years: the product was being hired for a job that practically demanded voice. The new “Ask Maps” is a logical extension of the same idea: if users already trust voice for navigation, asking “where can I charge my phone without waiting in line” is just the next natural step.

Otter.ai follows the same logic – the job is understanding what was said in a meeting and extracting value from it, and voice isn’t just the natural modality here, it’s the only one that makes sense, because meetings are inherently audio. So its agents transcribe, coach salespeople in real time, and take autonomous notes.

If the job is precise image editing at a desk, in focus, text beats voice – which is what Adobe did with Photoshop’s AI assistant, where you can type “remove the shadow on the left” or “add a soft glow” and get the result without knowing any tool names or keyboard shortcuts.

What we do in the workshop: we ask clients to describe the top three jobs users are hiring their product to do, and then we ask about the user’s physical and emotional state when doing that job – and that answer very often naturally rules out or points to specific modalities before we’ve designed anything.

Opportunity Solution Tree: modality as a hypothesis, not an assumption

Teresa Torres’s OST teaches you not to fall in love with solutions before you understand the opportunity they’re supposed to address – and the same applies to modality, because many teams decide “we’re adding voice” and then look for the justification, rather than checking whether voice actually responds to a real user need.

Zoom identified an opportunity that could be described as: “users want to communicate across language barriers without disrupting the natural flow of a conversation”, and the answer was a live voice translator doing real-time audio translation – because text wouldn’t cut it here, the point is to preserve naturalness, not transcribe it.

DoorDash went a different way and built DashCLIP, a model aligning product images, text descriptions, and search queries in a shared embedding space, because the opportunity was: “users search for food intuitively and don’t always know how to name what they want” – and image plus text together answers that better than either modality alone.
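To make the shared-embedding-space idea a bit more concrete, here is a minimal sketch of the general pattern (to be clear, this is not DoorDash’s DashCLIP code: the model name, file names, and catalogue items below are purely illustrative), using an off-the-shelf CLIP-style model that places images and text in the same vector space:

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # A CLIP-style model embeds images and text into one shared vector space.
    model = SentenceTransformer("clip-ViT-B-32")

    # Illustrative catalogue: each item has a photo and a short description.
    catalog = [
        {"name": "pad thai", "image": "pad_thai.jpg",
         "description": "stir-fried rice noodles with peanuts and lime"},
        {"name": "margherita", "image": "margherita.jpg",
         "description": "pizza with tomato, mozzarella and fresh basil"},
    ]

    # Represent each item by BOTH modalities: embed the photo and the text,
    # then average the two vectors so neither one dominates.
    item_vectors = [
        (model.encode(Image.open(item["image"]), convert_to_tensor=True)
         + model.encode(item["description"], convert_to_tensor=True)) / 2
        for item in catalog
    ]

    # A vague, intuitive query still lands near the right item, because the
    # query shares the embedding space with the photos and the descriptions.
    query = model.encode("something cheesy for a movie night", convert_to_tensor=True)
    scores = [float(util.cos_sim(query, vec)) for vec in item_vectors]

    for item, score in sorted(zip(catalog, scores), key=lambda pair: pair[1], reverse=True):
        print(f"{item['name']}: {score:.3f}")

The details matter much less than the shape of it: the query, the photo, and the description all live in one space, so a fuzzy request can be matched against image and text at the same time.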

Instacart went further still, letting customers complete orders directly inside ChatGPT with AI analysing product images and nutritional data for dietary filtering – modality followed from a very specific opportunity: “the user is mid-conversation with AI and wants to act immediately, not jump to another app”.

What we do in the workshop: we add a “modality hypothesis” column to the solution tree next to each solution node, and for each one we ask whether the assumed modality is a real answer to that opportunity, or just a convenient or fashionable one.

AEIOU / Contextual Inquiry: modality lives in context, not in a lab

AEIOU (Activities, Environments, Interactions, Objects, Users) is a technique for observing users in their real environment, and it’s probably the best tool for validating modality – precisely because modality doesn’t exist in the abstract; it exists in a specific place, at a specific time, in a specific user state.

Headspace designed Ebb with full awareness of this: a voice-based mental health companion is most valuable at 11pm on a Wednesday, when the user is alone in their bedroom and needs to process a difficult day – a very different context from Monday morning before work. That’s why Headspace lets users switch between voice and text at any moment: context shifts, and the product needs to follow.

Google Docs added audio summaries – Gemini generates a spoken summary of any document in a natural voice with adjustable speed and different narration styles, and the AEIOU context is very specific here: the user wants to absorb a document but has their eyes occupied – driving, exercising, cooking – so audio is the only modality that fits the activity.

Lattice’s AI Meeting Agent took a similar approach: analysing meeting audio to surface turnover risk signals and team health patterns from the sound of the conversation itself – because the managerial context carries emotional weight that a text transcript alone would lose.

What we do in the workshop: we add a modality dimension to the standard AEIOU grid, and for each observed activity we ask which modalities are natural in this context, which are physically impossible, and which would just feel invasive or uncomfortable.

Where the real power is: when modalities work together

The most interesting things happen not when one modality is chosen well, but when several work together and each compensates for what the others lack.

Google Stitch is probably the best current example: a designer can upload a sketch, describe the interface’s “feeling” in text, and say out loud what isn’t working – all in one session, with the agent holding the full context simultaneously, and that’s not just three inputs added together, it’s a qualitatively different way of communicating complex creative ideas.

Replit Agent 4 does the same on the development side: you paste a screenshot of a broken interface, describe in plain English what it should do, and speak corrections as the agent iterates in real time, seeing both the code and the rendered output – a feedback loop that used to require switching between several tools has collapsed into a single session.

This also changes how AI handles ambiguity: in a single-modality system a vague description produces a vague result, but when voice, image, and text work together each one fills in what the others are missing, and the output ends up much closer to what the user actually had in mind.

How this looks in practice for us

Modality Discovery in our Product Discovery Workshop is the stage where we work through these questions together with the client’s team – before anything concrete gets designed. The recommendation we leave the client with covers which modality to introduce, in what order, how it might be combined with others, and what happens to the product if that modality fails for some reason.

It doesn’t always lead to surprising conclusions – sometimes text is the right choice and there’s no reason to complicate things – but asking these questions early, before investment is made, helps avoid the kind of situation the Pinterest story describes.


Anna Zarudzka is CEO at Boldare, a company specialising in product discovery and building digital products for scale-ups and enterprises.


A few questions if you’d like to talk:

  • Is your team facing a decision about introducing a new modality and not sure how to think it through?
  • Are you considering voice, image, or video, but unclear whether the timing is right for your product?
  • How does modality discovery fit into your process – is it something you do explicitly, or more on the side?