Picture the planning meeting. You are a curator or a head of visitor experience, and someone across the table says it: "Why don't we just get an AI one?" The phrasing suggests it's a swap — rip out the hardware cart, drop in a chatbot, done by summer. That isn't what is actually being proposed, and the gap between what the phrase sounds like and what the category is has become one of the main reasons procurement goes sideways.
An AI audio guide is not a synthesised voice version of your old tour. It is a different product, built around different assumptions, and the decision to buy one is a different decision than the one your predecessor made fifteen years ago with Antenna or Acoustiguide.
What actually counts as an AI audio guide
A useful working definition: an AI audio guide generates narration in real time, from the museum's own content, in whatever language the visitor asks for, and can answer follow-up questions without breaking the tour.
Three properties matter in ways vendors tend to blur. Real-time generation means the audio doesn't exist until the visitor triggers it. That is what lets the guide handle forty languages without forty recording sessions, and why changing a label means changing data, not booking a recording studio. Grounded in the museum's content means the system is wired to your catalogue, wall text, and curator notes, not the open internet. Curator-shaped voice means someone at your institution writes persona and stop-level direction, the way a film director would: the system performs within your constraints instead of defaulting to Wikipedia-reads-aloud.
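For readers who think in code, the pattern looks roughly like this. Everything below is a hypothetical sketch (the index, the names, the prompt shape are all invented for illustration), not any vendor's actual implementation:

```python
# Hedged sketch of per-request grounded generation. The "index" and
# the prompt layout are stand-ins so the shape of the flow is visible.

MUSEUM_INDEX = {  # hypothetical: stop id -> the museum's own grounding chunks
    "room3-stop7": [
        "Catalogue 1042: oil on panel, commissioned 1487.",
        "Wall text: the ultramarine was ground from lapis lazuli.",
    ],
}

def retrieve(stop_id, top_k=5):
    """Stand-in for retrieval over the museum's catalogue and wall text."""
    return MUSEUM_INDEX.get(stop_id, [])[:top_k]

def build_prompt(persona, stop_direction, chunks, language):
    """Layer curator direction over retrieved sources: persona first,
    stop-level instruction second, grounding material last."""
    return "\n\n".join([
        f"Persona: {persona}",
        f"Stop direction: {stop_direction}",
        "Answer only from these museum sources:",
        *chunks,
        f"Respond in {language}.",
    ])

def narrate_stop(stop_id, language):
    chunks = retrieve(stop_id)
    return build_prompt(
        persona="Warm, precise, never speculative.",
        stop_direction="Linger on the pigment; skip the attribution dispute.",
        chunks=chunks,
        language=language,
    )  # in a real system this prompt goes to an LLM, then to TTS

print(narrate_stop("room3-stop7", "Catalan"))
```

The point of the sketch: nothing is pre-recorded, the language is a parameter, and the grounding material comes from the museum's own data at request time.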
We've sat through demos where the "AI audio guide" was a text-to-speech voice on a conventional script, generated once and served from a CDN. That is an audio guide with an AI-adjacent production step. It isn't an AI audio guide in the sense that matters for the category. If the tour can't respond to a visitor question, can't add Portuguese without a production cycle, and can't be updated by editing a row — it's the old product with new marketing. Our piece on how AI-generated audio guides are actually built goes deeper into this distinction.
What the category actually delivers
Four things change when a guide is generated rather than pre-recorded.
Languages stop being a budget line. The marginal cost of adding Catalan or Korean drops to near zero. Museums that were locked into three or four languages for a decade can open up the whole international-visitor segment in a week.
Updates stop being a project. A temporary exhibition used to mean scripts, voice talent, translation, QA — a six-figure ripple in some cases. Now it means loading the show's content and writing a few paragraphs of curatorial direction. First draft live in hours, refinement ongoing.
Questions become part of the tour. Visitors who want to know who commissioned the piece, what pigment that is, or why the frame matters can ask. The system answers from your data and returns to the tour. People who just want to press play can still press play. The experience spans passive to active in the same product.
The content starts talking back to you. Every question visitors ask is a signal. Which rooms generate the most curiosity, which stops lose people at the 30-second mark, which objects in your Roman gallery nobody ever asks about. That data flows back into exhibition design in a way a recorded guide never allowed. Curators we work with have used a single month of guide analytics to justify rewriting wall text that had been in place since 2011.
The combined effect is the thing worth sitting with. Any one of these — multilingual reach, faster updates, Q&A, visitor-level signal — would be a reasonable upgrade on its own. Together, they stop being an improved audio guide and start being a different shape of visitor interpretation. Procurement teams who evaluate AI guides against the old checklist (devices, battery life, queue time at the counter) end up measuring the wrong product.
Where it doesn't fit
This is the part vendors won't say clearly. An AI audio guide is the wrong buy in at least three situations.
If you already have a celebrated guide — Attenborough narrating the natural history collection, a literary figure walking visitors through a house museum — a generative system replacing it is not an upgrade. The voice is the product. In that case the AI guide sits alongside, handling languages you don't record in and questions the celebrity version can't answer. Not a replacement.
If your collection is small and your wall text is excellent, the case weakens. A 20-work single-room gallery doesn't need real-time narration. Visitors can read everything faster than they can open the guide. Save AI for places where visitors would otherwise be lost.
If your procurement timeline is "deploy and forget for ten years," the category will frustrate you. These platforms improve monthly. Inference costs keep falling, voice quality keeps closing on human parity, and feature surface keeps expanding. Museums that treat this as a one-time install miss the point. The value compounds with use — the tours you design, the personas you refine, the stop-level prompts you tune based on what visitors actually do.
The economics question, handled honestly
Legacy audio guide procurement was a capital expenditure. Scriptwriting, voice talent, translation into each language, hardware fleet, charging stations, cleaning, a counter with staff — a mid-sized museum could be out fifty to a hundred thousand before the doors opened, then running ongoing maintenance on a device fleet that started dying in year three.
AI audio guides compress the upfront side hard. Setup is typically in the low thousands. There is still design work — building out tours, personas, stop-level direction — but no scripts per stop, no recording sessions, no translations commissioned individually. The cost that used to sit in production shifts into inference: every visitor interaction runs a language model and a text-to-speech call, and those calls have a price.
The interesting part is how that cost is packaged. Platforms like Musa price on a per-interaction basis or a revenue share against a small visitor-facing fee. That changes the shape of the procurement conversation. Instead of committing a fixed sum against uncertain adoption, the museum pays in proportion to actual use. A guide nobody opens costs nearly nothing. A guide every visitor uses is covering its own cost through the visitor-facing fee behind the revenue split or the per-use charges it generates. The risk of a five-figure commitment to a system visitors ignore — the failure mode that has haunted audio guide procurement for twenty years — largely goes away.
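For a back-of-envelope feel, here is the comparison in a few lines of Python. Every figure is illustrative, assumed for the example, not a quote from any vendor:

```python
# Illustrative only: a fixed capex model versus per-interaction pricing
# under uncertain adoption. All numbers are assumptions.

def legacy_cost(capex=75_000, annual_maintenance=10_000, years=5):
    """Old model: pay up front, regardless of whether anyone uses the guide."""
    return capex + annual_maintenance * years

def usage_cost(visitors_per_year, adoption_rate, fee_per_use=0.40, years=5):
    """Per-interaction model: cost scales with actual use."""
    return visitors_per_year * adoption_rate * fee_per_use * years

fixed = legacy_cost()
low_adoption = usage_cost(visitors_per_year=100_000, adoption_rate=0.02)
high_adoption = usage_cost(visitors_per_year=100_000, adoption_rate=0.30)

print(fixed)          # 125000 whether the guide succeeds or not
print(low_adoption)   # 4000 if almost nobody opens the guide
print(high_adoption)  # 60000, and heavy use means fee revenue offsets it
```

The asymmetry is the argument: under the usage model, the expensive outcome is the one where the guide is demonstrably working.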
For a grounded comparison between these arrangements, our audio guide pricing models piece lays out the tradeoffs of subscription, usage, and revenue share side by side.
How to evaluate vendors without getting sold
A few questions that separate real AI audio guide platforms from branded chatbots:
"Show me a live demo on a museum that isn't yours." A staged demo on the vendor's sandbox collection tells you nothing. Ask to try a working installation at a real site, ideally one with content pressure — lots of stops, multiple languages, temporary exhibitions. If they can't point to one, they haven't shipped one.
"What happens when a visitor asks a question that goes beyond our data?" There are three acceptable answers and one unacceptable one. Acceptable: the system refuses, the system answers from general knowledge and labels it as such, or the system says "the museum hasn't spoken to that — here is the closest thing in your materials." Unacceptable: a confident answer with no provenance. Ask them to demonstrate the difference.
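For the technically minded, the three acceptable behaviours amount to a provenance-labelling decision. This is a hypothetical sketch of that logic, with invented names and an invented example, not any platform's code:

```python
# Hedged sketch: every answer carries a provenance label, and an
# out-of-scope question routes to one of the three acceptable policies.

def answer(question, museum_hits, policy="closest"):
    """Return (text, provenance). Never a confident answer with no label."""
    if museum_hits:
        return (f"Per the museum's materials: {museum_hits[0]}", "museum")
    if policy == "refuse":
        return ("I can only speak to this museum's collection.", "refusal")
    if policy == "general":
        return ("That isn't in the museum's materials, but generally "
                "speaking: ...", "general-knowledge")
    # policy == "closest": point at the nearest thing the museum has said
    return ("The museum hasn't spoken to that; the closest related material "
            "is the nearby wall text.", "closest-source")

text, provenance = answer("Who ground the pigment?", museum_hits=[])
print(provenance)  # 'closest-source' under the default policy
```

The unacceptable behaviour is the missing fourth branch: a fluent answer returned with the "museum" label when nothing in the museum's data supports it. That is what you ask the vendor to demonstrate they prevent.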
"How does a curator shape the voice?" The answer should involve multiple layers — institutional persona, tour-level narrative direction, per-stop instructions. If the answer is "pick a voice from a dropdown and upload your data," the system is running on defaults. That is the path to Wikipedia-reads-aloud narration, no matter how good the underlying model is.
"Demo it in a language I can evaluate, and a language my visitors speak that you probably haven't optimised for." Basque, Catalan, Welsh, Finnish. If the output is noticeably worse than the English demo, you've learned something the brochure wouldn't tell you.
"What does your analytics export look like?" If there isn't one, or if it's a PDF of pie charts, the platform isn't taking the data feedback loop seriously. You want raw, per-stop, per-question data that you can pipe into your own exhibition planning.
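Concretely, "raw, per-stop, per-question" means an export you can aggregate yourself. The shape below is hypothetical, with illustrative rows, but it is the kind of thing to ask for:

```python
# Hypothetical export shape: one row per visitor interaction. The
# schema and stop ids are invented for illustration.
from collections import Counter

rows = [
    {"stop": "roman-gallery-04", "event": "question", "text": "What is this pigment?"},
    {"stop": "roman-gallery-04", "event": "question", "text": "Who commissioned it?"},
    {"stop": "roman-gallery-09", "event": "abandon", "at_second": 28},
    {"stop": "roman-gallery-04", "event": "played"},
]

# Which stops generate curiosity, and which lose people early.
questions_per_stop = Counter(r["stop"] for r in rows if r["event"] == "question")
early_abandons = [r["stop"] for r in rows
                  if r["event"] == "abandon" and r["at_second"] < 30]

print(questions_per_stop.most_common(1))  # [('roman-gallery-04', 2)]
print(early_abandons)                     # ['roman-gallery-09']
```

If the vendor's export can't support a ten-line analysis like this, the "data feedback loop" in their pitch is a pie chart, not a planning tool.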
What to put in the RFP
If you're writing the procurement document, a handful of clauses earn their place. Our full audio guide RFP guide covers this properly, but the AI-specific asks are worth flagging here.
Grounding architecture. Require the vendor to describe, in plain language, how their system prevents fabrication. "We use GPT-X" isn't an architecture. Retrieval, prompt layering, and citation behaviour are.
Content ownership and portability. The tours you design, the personas you write, the instruction sets you refine — those are institutional assets. Make sure you own them and can take them with you if you switch vendors. This is where lock-in hides.
Language floor. Specify the languages you need at native quality, not the count of languages the platform supports. "40+ languages" is a marketing number. "Verified native-level output in the fifteen languages our visitors actually arrive in" is a procurement number.
Update cadence and versioning. How quickly can a curator push a change live? Is there a review workflow? Can you roll back? Can two curators work in parallel without stepping on each other?
Pricing alignment. Prefer models that tie vendor revenue to visitor use. Fixed annual fees regardless of adoption create perverse incentives — the vendor is paid whether or not the guide is any good.
Pilot clause. Build in a real-world pilot before the full contract. Not a sandbox, not a demo. A live installation with real visitors and a defined set of success metrics. We wrote up how to structure one in how to pilot an AI museum guide.
The decision, reduced
If your current guide is loved, the languages you need are covered, and rotating content isn't a pain point, there is no urgency. Run the RFP in 2027.
If any of those are broken — visitors asking for languages you don't have, temporary exhibitions shipping without interpretation, a hardware fleet that's a line item in three departments' budgets — an AI audio guide isn't a category to monitor. It is a live procurement option that now reliably outperforms the old model on the axes that caused you to open the project in the first place.
The next concrete step for most institutions isn't a contract. It's a pilot with real visitors in one gallery, scoped to a defined success metric — adoption rate, session length, revenue per visit, take your pick. The data from four weeks of live use will tell you more than any vendor deck, and it costs almost nothing to find out.