Teaching AI Assistants What Your Website Does: A Practical Guide to Retrieval Optimisation
A practical guide to making a website readable by AI assistants using robots.txt, sitemap.xml, llms.txt, static article mirrors, and a dedicated AI context file — without hiding claims or playing SEO games.
If you run a technical website and you haven't thought about how AI assistants see it, you're leaving money on the table — and probably confusing your own agents. I learned this the hard way when I pointed my Hermes agent at my own site and realised it had no reliable map of what mattered, what was current, or what it was allowed to do with the information.
Let me show you the stack I now run at Hard Interference. It's not complicated. It's not hype. It's a handful of public text files, static mirrors, and one policy decision that turn your site from a black box into something an AI assistant can actually use.
Why Bother?
AI assistants (ChatGPT, Claude, Gemini, your local Hermes agent) are getting better at reading websites. But they don't browse like you do. Many of them don't render JavaScript SPAs reliably, can't tell hash routes apart, and won't reach pages that need three clicks. They hit a URL, read the HTML (maybe), and move on. If your content doesn't exist as plain readable text at a stable URL, it might as well not exist.
This matters whether you want AI to cite your work, answer questions about your product, or execute tasks against your documentation. The question isn't "should I let AI crawlers in?" — they're coming either way. The question is how you make your content easy to find, easy to cite, and easy to understand, while keeping the boundary clear between retrieval and training.
Here's what I run. You can crib the whole thing. The simple user version is: once you are on Hard Interference, point your AI assistant at the site and ask it to read the AI hardening context. It should come back with a practical list of things to tighten, not a pile of vague security theatre.
1. The Robots Policy: Search vs Training Is Not the Same Thing
Start with robots.txt. This is where most sites mess up by doing one of two things: either blocking everything (so no AI can cite you), or allowing everything (so training crawlers get the same access as search bots).
Those are different functions and they deserve different rules.
User-agent: Googlebot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: GPTBot
Disallow: /
What I'm saying here: Google can index and surface my content. OpenAI's search bot can retrieve pages to answer questions. The ChatGPT-User agent — the one that fetches context when someone in a ChatGPT session includes a URL — is welcome. But GPTBot, the training crawler, gets told to go away.
This is an ethical position, not a technical hack. My public content is visible and citable, and I'm not cloaking anything: crawlers and humans are served the same pages. But I want the distinction to be explicit: retrieval permission is not training permission. If a company wants to train on my writing, they can negotiate that. Until then, my default is to opt out.
Put your sitemap URL at the bottom of robots.txt so crawlers find everything.
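Before deploying a policy like this, it can help to confirm that the rules say what you think they say. Here is a minimal sketch using Python's standard urllib.robotparser. The Sitemap line and the yoursite.com URLs are placeholders I've added for illustration, and the parser only tells you how a well-behaved crawler should read the file, not what any real crawler does.

```python
from urllib.robotparser import RobotFileParser

# The policy from above, plus a placeholder Sitemap line at the bottom.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://yoursite.com/sitemap.xml
"""

AGENTS = ["Googlebot", "OAI-SearchBot", "ChatGPT-User", "GPTBot"]


def check_policy(robots_txt: str, url: str) -> dict[str, bool]:
    """Return, per user agent, whether the rules allow fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AGENTS}


if __name__ == "__main__":
    verdicts = check_policy(ROBOTS_TXT, "https://yoursite.com/blog/some-article/")
    for agent, allowed in verdicts.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

If GPTBot shows up as allowed, or one of the retrieval agents as blocked, the file has a mistake worth fixing before it goes live.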
2. The Sitemap: Make It Complete
Your sitemap.xml should include every page you want AI to find — and that includes your machine-readable files, not just your human-facing pages. Mine includes the homepage, every blog post, every category landing page, plus llms.txt, llms-full.txt, and ai-hardening-context.txt. Each entry gets a priority and a lastmod date so crawlers know what changed.
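As a rough illustration, a sitemap that lists the machine-readable files next to the normal pages could be generated like this. It's a sketch, not my actual build step: the URLs, dates, and priority values are made up, and build_sitemap is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, date, float]]) -> str:
    """Render (url, lastmod, priority) tuples as a sitemap.xml string."""
    ET.register_namespace("", SITEMAP_NS)  # emit xmlns="..." without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod.isoformat()
        ET.SubElement(url, f"{{{SITEMAP_NS}}}priority").text = f"{priority:.1f}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder entries: human-facing pages and machine-readable files together.
    print(build_sitemap([
        ("https://yoursite.com/", date(2025, 1, 1), 1.0),
        ("https://yoursite.com/blog/some-article/", date(2025, 1, 1), 0.8),
        ("https://yoursite.com/llms.txt", date(2025, 1, 1), 0.8),
        ("https://yoursite.com/llms-full.txt", date(2025, 1, 1), 0.6),
        ("https://yoursite.com/ai-hardening-context.txt", date(2025, 1, 1), 0.8),
    ]))
```

Generating the file from the same list that drives your publishing step is what keeps lastmod honest.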
If you're running a Next.js SPA or any single-page app where routes are handled client-side, you have a problem: most AI crawlers execute JavaScript inconsistently. Fragment-based blog routes are especially bad: the part of the URL after the # is never sent to the server, so the crawler sees the same base URL for every page. The fix is static article mirrors, meaning actual HTML pages at real paths like /blog/my-article/. That's what I do. Every blog post is a static HTML page with a canonical URL, semantic HTML, and no JavaScript dependency for the content.
3. llms.txt: Your Site's Business Card for AI
The llms.txt format is brilliant because it's simple: a plain text file at the root of your domain that tells an AI what your site is and where to start. No HTML parsing, no JavaScript, no guessing.
Mine looks like this:
# Hard Interference
> Practical local AI, agent workflows, hardware ownership.
Site: https://hardinterference.ai
Tagline: Your Hardware. Your Rules.
## Best starting points
- [AI tool hardening context](https://hardinterference.ai/ai-hardening-context.txt)
- [Article about agent memory](https://hardinterference.ai/blog/069-AG-agent-memory-architecture-i-actually-run/)
- [Article about credential leaks](https://hardinterference.ai/blog/071-DB-ai-agent-credential-leaks/)
One file. Plain text. Links to the most important pages. An AI can read this in one request and understand exactly what the site is about. Every site with a public audience should have an llms.txt. It's not about gaming anything; it's about being clear.
I also maintain llms-full.txt, which includes summaries and excerpts of every article. That lets an assistant decide whether a page is relevant before fetching the full HTML. Saves tokens, saves time, helps the assistant give a better answer.
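If you'd rather generate llms.txt than edit it by hand, something like the following would do. This is a sketch under assumptions: render_llms_txt is a hypothetical helper, it adds blank lines between sections for Markdown readability, and it rejects relative links because an assistant reading the file in isolation has no base URL to resolve them against.

```python
from typing import Optional


def render_llms_txt(
    title: str,
    summary: str,
    site_url: str,
    links: list[tuple[str, str]],
    tagline: Optional[str] = None,
) -> str:
    """Render a small llms.txt: title, one-line summary, site info, key links."""
    lines = [f"# {title}", "", f"> {summary}", "", f"Site: {site_url}"]
    if tagline:
        lines.append(f"Tagline: {tagline}")
    lines += ["", "## Best starting points", ""]
    for label, url in links:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"link must be an absolute URL: {url!r}")
        lines.append(f"- [{label}]({url})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_llms_txt(
        title="Hard Interference",
        summary="Practical local AI, agent workflows, hardware ownership.",
        site_url="https://hardinterference.ai",
        tagline="Your Hardware. Your Rules.",
        links=[
            ("AI tool hardening context",
             "https://hardinterference.ai/ai-hardening-context.txt"),
        ],
    ))
```

Running it as part of the publish step is one way to avoid the file going stale.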
4. The AI Context File: Tell Assistants How to Behave
Here's the piece I haven't seen many sites do: a dedicated ai-hardening-context.txt that tells AI assistants how to interpret and use the site's content. Think of it as an instruction manual for the assistant.
It includes:
- Purpose: what the site is for and who it serves
- Behaviour rules: how the assistant should respond when a user asks about the site's content
- A default hardening checklist: prioritised actions grouped by urgency
- A suggested output format: how to structure a response so it's useful
- Boundaries: what the assistant should never do (don't ask for secrets, don't recommend committing tokens, don't make public changes without approval)
- A recommended reading path: which articles to read first
This file gets included in my sitemap and linked from llms.txt. When an agent fetches it, it understands the operating stance of the whole site — not just one page's content.
That gives me a simple user-facing instruction:
If you want your AI assistant to learn from Hard Interference, point it at
https://hardinterference.ai/ and ask it to read the AI hardening context before it gives advice.
That should produce a practical list of things to tighten: credentials, agents, API providers, spend limits, public repos, messaging channels, and deployment permissions. The detailed checklist belongs in a separate guide. This article is about the publishing layer that lets the assistant find that checklist reliably.
The key rule I enforce: never create hidden claims that only AI crawlers see. Machine-readable context should summarise visible public policy and guidance. No SEO tricks. No invisible assertions. Just clarity.
5. Static Article Mirrors: Crawler-Friendly URLs
If your content lives behind a client-rendered SPA, create static mirrors. Every one of my blog posts is a static HTML page at /blog/<slug>/. It has:
- A canonical URL
- Semantic HTML with proper heading hierarchy
- Meta tags for description, Open Graph, Twitter Card
- JSON-LD structured data for the article
- Inline CSS so it renders without JavaScript
An AI crawler hits the URL, reads the HTML, understands the structure, and can cite it. No JavaScript rendering pipeline required. No SPA gotchas.
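A quick way to see whether a mirror carries those signals without JavaScript is to parse the raw HTML and look for them. The sketch below uses only Python's standard html.parser and covers a subset of the list above (canonical link, meta description, JSON-LD, heading order). The class and function names are hypothetical, and the heading rule (exactly one h1, no skipped levels) is my own reading of "proper heading hierarchy".

```python
from html.parser import HTMLParser


class MirrorAudit(HTMLParser):
    """Collects the signals a static article mirror is expected to carry."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None
        self.has_description = False
        self.has_json_ld = False
        self.heading_levels: list[int] = []

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and attr.get("rel", "").lower() == "canonical":
            self.canonical = attr.get("href") or None
        elif tag == "meta" and attr.get("name", "").lower() == "description":
            self.has_description = bool(attr.get("content"))
        elif tag == "script" and attr.get("type", "").lower() == "application/ld+json":
            self.has_json_ld = True
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.heading_levels.append(int(tag[1]))


def audit_mirror(html: str) -> list[str]:
    """Return a list of problems found in the raw HTML (empty means it passed)."""
    audit = MirrorAudit()
    audit.feed(html)
    audit.close()
    problems = []
    if not audit.canonical:
        problems.append("missing canonical URL")
    if not audit.has_description:
        problems.append("missing meta description")
    if not audit.has_json_ld:
        problems.append("missing JSON-LD block")
    if audit.heading_levels.count(1) != 1:
        problems.append("expected exactly one <h1>")
    for previous, current in zip(audit.heading_levels, audit.heading_levels[1:]):
        if current > previous + 1:
            problems.append(f"heading jumps from h{previous} to h{current}")
    return problems
```

Feed it the body you get from a plain HTTP fetch of the article URL. A bare SPA shell fails every check, which is exactly the case this is meant to catch.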
Category pages work the same way: /category/ai-guides/, /category/model-benchmarking/, etc. Each one lists the articles in that category with excerpts, so an assistant can scan the whole category from one page.
6. Verification: Does It Actually Work?
Here's what I check after every change:
- curl -si https://yoursite.com/robots.txt should return 200 and the right directives
- curl -s https://yoursite.com/llms.txt should return readable plain text, no HTML
- curl -s https://yoursite.com/ai-hardening-context.txt should do the same
- curl -s https://yoursite.com/blog/some-article/ should return the full article, not a JS shell
- curl -s https://yoursite.com/sitemap.xml | grep 'ai-hardening' should confirm the context file is in the sitemap
If any of those fail or return something unexpected, the AI won't see what you want it to see. Fix it before you rely on it.
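If you'd rather script these checks than run them by hand, here is a rough Python equivalent. Treat it as a sketch: run_checks and http_fetch are hypothetical names, the "contains an h1" test is only a crude stand-in for "not a JS shell", and network errors other than HTTP status errors are left to raise.

```python
from typing import Callable, Optional
from urllib.error import HTTPError
from urllib.request import urlopen

Fetch = Callable[[str], tuple[int, str]]


def http_fetch(url: str) -> tuple[int, str]:
    """Fetch a URL and return (status, body); HTTP errors become a status code."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        return err.code, ""


def run_checks(base_url: str, article_path: str, fetch: Fetch = http_fetch) -> list[str]:
    """Run the five checks above and return a list of failures."""
    failures: list[str] = []

    def body_of(path: str) -> Optional[str]:
        status, body = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures.append(f"{path}: expected HTTP 200, got {status}")
            return None
        return body

    def expect(path: str, ok: Callable[[str], bool], message: str) -> None:
        body = body_of(path)
        if body is not None and not ok(body):
            failures.append(f"{path}: {message}")

    not_html = lambda body: "<html" not in body.lower()
    expect("/robots.txt", lambda b: "user-agent:" in b.lower(), "no User-agent directives found")
    expect("/llms.txt", not_html, "looks like HTML, not plain text")
    expect("/ai-hardening-context.txt", not_html, "looks like HTML, not plain text")
    expect(article_path, lambda b: "<h1" in b.lower(), "no <h1> found, possibly a JS shell")
    expect("/sitemap.xml", lambda b: "ai-hardening" in b, "context file not listed")
    return failures


if __name__ == "__main__":
    for failure in run_checks("https://yoursite.com", "/blog/some-article/"):
        print("FAIL", failure)
```

The fetch function is passed in so the checks can be exercised against canned responses before you point them at a live site.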
Caveats
A few things I've learned the hard way:
- Don't put private paths in your sitemap. The sitemap is public. If it lists an internal admin URL or an unpublished draft, that URL is discoverable.
- Keep llms.txt current. When you publish new content, update the file. An out-of-date llms.txt is worse than none: it tells the assistant the wrong story.
- llms-full.txt can get big. My full version is substantial because it includes summaries of every article. That's fine, but know that it increases the token cost for any assistant that fetches it.
- Static mirrors mean double maintenance if you have a dynamic site. I generate mine from the same Next.js build that serves the JS version. If you maintain them by hand, they'll drift.
- GPTBot disallow is an opt-out signal, not a guarantee. Some crawlers ignore robots.txt. The boundary I'm declaring is ethical and contractual, not technical. If you need technical enforcement, you need authentication, but that defeats the purpose of making content discoverable.
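The first caveat, private paths leaking into the sitemap, is easy to guard against mechanically. A minimal sketch, assuming a standard sitemap.xml: the prefixes below are invented examples, so substitute whatever counts as private on your own site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical path prefixes that should never appear in a public sitemap.
PRIVATE_PREFIXES = ("/admin", "/drafts", "/internal")


def private_urls(sitemap_xml: str, prefixes: tuple[str, ...] = PRIVATE_PREFIXES) -> list[str]:
    """Return every <loc> in the sitemap whose path starts with a private prefix."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if urlparse(url).path.startswith(prefixes)]
```

Failing the build whenever this returns a non-empty list is cheaper than discovering an indexed admin URL later.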
The Pattern
This isn't complicated. It's six deliberate choices:
- Write a robots.txt that distinguishes retrieval from training.
- Include every useful URL, including machine-readable files, in your sitemap.
- Write an llms.txt that summarises your site in plain text.
- Add an llms-full.txt for deeper context when an assistant needs it.
- Create static HTML mirrors of any content behind SPA routes.
- Write an AI context file that tells assistants how to behave, not just what the page says.
That last point is the difference between "AI can find my site" and "AI can do something useful after finding it". The context file is not a magic spell. It is a public instruction layer that says: here is what this site is for, here is what advice should look like, here are the boundaries, and here are the first actions a user should take.
Then verify with curl. Then iterate when your content changes.
I run this at Hard Interference and it means my own agent can find anything on my site in one request. That's the goal: not complexity, not SEO gaming, not hidden tricks. Just a site that an AI assistant can actually read. Your hardware. Your rules.
Found this useful?
👉 Follow Raf_VRS on X for more transparent AI build notes that put you in control of your hardware.
👉 Support the work: ko-fi.com/rafvrs