Advanced Guide

Updated June 21, 2026 | 13 min read

Multimodal Retrieval: How AI Search Reads and Cites Your Images, Tables, and Charts

By Digital Strategy Force

AI search engines do not see your images, tables, or charts. They retrieve the text around them: the alt attribute, the caption, the semantic table, the prose beside a figure. A chart whose numbers live only in pixels is, to the retriever, a blank rectangle it cannot cite.

The Modality Gap: Why AI Search Reads Around Your Visuals, Not Through Them

Multimodal retrieval is the process by which AI search converts a page's images, tables, and charts into text it can embed, rank, then cite. No major engine retrieves a raw picture; it retrieves either a text surrogate (the alt attribute, the figcaption, the semantic table, the prose beside a figure) or a multimodal embedding that has already translated the pixels into a vector. When that surrogate is missing, the visual carries zero retrievable signal, so the data inside it cannot be cited regardless of how authoritative it looks.

The confusion starts with a reasonable assumption. Models like ChatGPT and Gemini can clearly describe an image when you hand them one inside a prompt, so it feels as though AI search must see your visuals the same way. It does not, at least not where it counts. Google's own image guidance states that it uses the alt text along with computer vision algorithms and the contents of the page to understand the subject of an image, which means the words around the picture are doing the work the picture cannot. The retriever that decides which pages enter an answer runs on text and vectors, not on pixels.

That gap is where a great deal of expensive content quietly disappears. A pricing table rendered as a screenshot, an infographic whose numbers live only in the JPEG, a chart exported straight from a spreadsheet as a flat image: each one looks authoritative to a human reader, yet each arrives at the engine as a blank rectangle with nothing to lift. Digital Strategy Force calls the rule beneath this the Surrogate Principle: AI search cites the text it can read about a visual, never the visual itself. The path a visual must travel to satisfy that rule is the Multimodal Extraction Stack.

Three Modalities, Three Failure Modes

Modality	What AI Actually Reads	Lost When Pixels Only	The Fix
Image	Alt text, caption, surrounding prose	The subject, the data, any text baked in	Entity-rich alt plus a descriptive caption
Table	Row and column structure, header cells	Every relationship between the numbers	A semantic table with header scope
Chart	The data table or numbers in the prose	The values, trapped inside the rendered image	Publish the underlying data as text too

Source: Google Search Central, Image SEO (2026). Framework: Digital Strategy Force.

The DSF Multimodal Extraction Stack: The Five Layers Every Visual Must Clear

The DSF Multimodal Extraction Stack is the five layers a visual asset must clear, in order, before any AI engine can cite the fact inside it. The layers compose as a chain, not a sum: an asset that clears the first four but fails the fifth is still uncitable, because the weakest layer caps the whole. Read top to bottom, the Stack turns a vague worry ("will AI see my visuals?") into five concrete pass-or-fail gates you can audit one at a time.

Layer 1, Access. The asset and the text around it must be fetchable from the raw HTML, without a browser running JavaScript. A chart drawn into a canvas element, an image injected by a script after load, a figure hidden behind a lazy-load gate that never fires for a bot: each is invisible before the question of meaning is ever reached. Access is the floor, and a surprising share of visual content fails here for the same reason that a bot can read a page only partway through: the content it needs is rendered by a script the bot never runs.

Layer 2, Surrogate. This is the gate most pages fail. A machine-readable text stand-in must exist for the visual: an alt attribute, a figcaption, selectable rendered text, a real semantic table rather than a screenshot of one, or a data table sitting beside a chart. Without a surrogate, the asset reaches the engine as undifferentiated pixels. Everything above this layer assumes a surrogate is already present, which is why giving an asset words is almost always the highest-leverage fix on the entire Stack.

Layer 3, Binding. A surrogate only helps when it is bound to meaning. The caption has to name the entities the visual is about, the figure has to be referenced in the prose that surrounds it, and the table has to carry header cells with scope so the engine knows which number belongs to which row. Binding is what lets a retriever connect the visual to the query, the same discipline that makes well-structured content legible to a model rather than merely present.

Layer 4, Embedding. The bound surrogate then has to reach the retrieval vector space, either through a text embedding of its words or a native multimodal embedding that turns the picture itself into a vector. This is the layer where the visual finally competes for retrieval alongside ordinary prose, scored by the same passage-level similarity that ranks every other candidate. A clean surrogate lands close to the query; a vague one lands nowhere useful.

Layer 5, Attribution. The final gate asks whether the page is the canonical home for the visual fact. A chart of your own proprietary data, on your own domain, with a caption that claims it, is attributable; a stock chart re-hosted from a vendor, or a screenshot of someone else's table, is not, and the engine will name the original source instead of you. Attribution is what converts a readable visual into a cited one.

Because the gates are sequential, the Stack also tells you where to spend. Most visual assets on the web fail at Surrogate, the second layer, so the highest-leverage move is rarely a better chart or a fancier embedding; it is giving the asset words. The sophistication above Surrogate only earns anything once a surrogate exists, which is the order most teams reverse.
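The chain-not-sum logic can be sketched in a few lines of Python. This is a minimal illustration, not a DSF tool: the function name `first_failed_layer` and the pass/fail dictionary are hypothetical, and the pass/fail judgments themselves still come from a human audit.

```python
from typing import Optional

# The five layers, in the order an asset must clear them.
LAYERS = ["Access", "Surrogate", "Binding", "Embedding", "Attribution"]


def first_failed_layer(passed: dict[str, bool]) -> Optional[str]:
    """Return the first layer the asset fails, or None if it clears all five.

    A layer missing from `passed` is treated as a failure (an assumption
    made for this sketch, so unaudited layers never count as passes).
    """
    for layer in LAYERS:
        if not passed.get(layer, False):
            return layer
    return None


# A screenshot of a pricing table: fetchable, but with no text stand-in.
screenshot = {"Access": True, "Surrogate": False, "Binding": True,
              "Embedding": True, "Attribution": True}
print(first_failed_layer(screenshot))  # Surrogate
```

Because the function stops at the first failure, later passes never offset an earlier miss, which mirrors the way the weakest layer caps the whole asset.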

The DSF Multimodal Extraction Stack

LAYER 1

Access

The asset and its surrounding text are fetchable from raw HTML, with no JavaScript required to render them.

LAYER 2

Surrogate

A text stand-in exists: alt, figcaption, a semantic table, or a data table beside a chart. The gate most pages fail.

LAYER 3

Binding

The surrogate names the entities, the figure is referenced in prose, and the table carries header scope.

LAYER 4

Embedding

The surrogate reaches the vector space, via a text embedding or a native multimodal embedding of the image.

LAYER 5

Attribution

The page is the canonical home for the visual fact, so the engine can name you rather than the original source.

Framework: Digital Strategy Force Multimodal Extraction Stack.

That ordering, a surrogate before any sophistication, is the whole argument compressed into a sequence. It is worth stating in a form a content team can repeat back before every publish, because it overturns the instinct to reach first for a prettier visual when the engine has not yet been given anything to read.

"AI search never sees your chart. It reads the caption beneath it, the sentence beside it, then the table behind it. A visual whose data lives only in pixels is, to the retriever, a blank rectangle it cannot quote."
— Digital Strategy Force, Search Intelligence Division

How AI Reads an Image: Alt Text, Captions, and the Pixels It Cannot Parse

An AI engine reads an image through its surrogate first, and the most important surrogate is still the humble alt attribute. Google states that the alt text is the single most important attribute for providing metadata about an image, used together with computer vision and the surrounding page content. The catch is that most pages still neglect it. The 2025 Web Almanac measured a median of only 60 percent of images carrying an alt attribute at all, with a further 15 percent of mobile images carrying an alt value that is blank. The median informative image is, in other words, under-described or undescribed.
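A quick way to see where a single page stands is to count its images by alt status. The sketch below uses only Python's standard-library HTML parser; the class name `AltAudit` and the helper `audit_alt` are hypothetical, and the sample markup is invented for illustration. It cannot tell a decorative image from an informative one, so an empty alt is reported separately rather than flagged as an error.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Count <img> tags by alt status: described, empty, or missing."""

    def __init__(self):
        super().__init__()
        self.described = 0
        self.empty = 0
        self.missing = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        if "alt" not in attr_map:
            self.missing += 1
        elif (attr_map["alt"] or "").strip():
            self.described += 1
        else:
            # alt="" or whitespace only: fine for decoration, a gap otherwise.
            self.empty += 1


def audit_alt(html: str) -> dict:
    parser = AltAudit()
    parser.feed(html)
    parser.close()
    return {"described": parser.described,
            "empty": parser.empty,
            "missing": parser.missing}


sample = ('<img src="a.png" alt="Quarterly revenue by region, 2025">'
          '<img src="b.png" alt="">'
          '<img src="c.png">')
print(audit_alt(sample))
```

Run against the raw HTML of a page rather than the rendered DOM, the same count also hints at the Access layer, since images injected by script will not appear at all.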

The sharpest failure is text baked into the graphic. An infographic where the only copy of a statistic lives inside the JPEG, a quote rendered as a styled image, a diagram whose labels are pixels rather than characters: a human reads all of these instantly, while the retriever reads none of them. The accessibility standard codifies the remedy. WCAG Success Criterion 1.1.1 requires that all non-text content carry a text alternative serving the equivalent purpose, which is the same requirement AI extraction imposes for a different reason.

The fix for informative images is concrete. Write entity-rich alt text that names the subject in plain language rather than stuffing keywords, add a visible caption that states what the image shows, then declare the image with ImageObject structured data carrying a caption property. Decorative images should keep an empty alt so they are correctly ignored. The principle is the same one that governs structured data more broadly: if a fact matters, it has to exist as text the machine can read, not only as a picture a person can see, which is exactly what schema markup makes explicit to AI.
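The three-part fix described above (alt text, visible caption, ImageObject markup) can be generated from one set of inputs so the surrogates never drift apart. The following Python is a sketch under stated assumptions: the function name `figure_with_schema` and the example file path are hypothetical, and the JSON-LD uses only the schema.org ImageObject properties `contentUrl` and `caption`.

```python
import json
from html import escape


def figure_with_schema(src: str, alt: str, caption: str) -> str:
    """Build a <figure> with alt text and a caption, plus ImageObject JSON-LD."""
    figure = (
        "<figure>\n"
        f'  <img src="{escape(src)}" alt="{escape(alt)}">\n'
        f"  <figcaption>{escape(caption)}</figcaption>\n"
        "</figure>"
    )
    schema = {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": src,
        "caption": caption,
    }
    # Escape "</" so a caption can never close the script element early.
    payload = json.dumps(schema).replace("</", "<\\/")
    json_ld = f'<script type="application/ld+json">{payload}</script>'
    return figure + "\n" + json_ld


print(figure_with_schema(
    "/img/chart-accuracy.png",  # hypothetical path
    "Bar chart comparing model accuracy on ChartQA and ChartQAPro",
    "Chart-reading accuracy falls from 90.5% on ChartQA to 55.81% on ChartQAPro.",
))
```

Writing the caption once and emitting it in both the figcaption and the schema keeps the visible text and the structured data in agreement.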

What an Engine Reads From an Image

Alt text

The primary surrogate. A median 60 percent of images carry one, and many of those are blank.

Caption

A visible figcaption that names the subject, reinforced by an ImageObject caption in schema.

Surrounding prose

The sentences before and after the image, which bind it to the query and the page entities.

Sources: HTTP Archive Web Almanac (2025), Google Search Central (2026).

How AI Reads a Table: Semantic HTML vs a Picture of a Table

Tables are where the modality gap is widest, because the two ways of publishing them sit at opposite ends of extractability. A real HTML table, with header cells marked by scope, is among the most machine-readable structures on the web: the engine knows which value belongs to which row and column, so it can lift a single cell into an answer with its full meaning intact. A screenshot of a table is among the least readable structures on the web, because every one of those relationships is now locked inside an image.

Frontier vision models have narrowed that gap, but not closed it. Google reports that Gemini 3 Pro can accurately transcribe tables and reason across them inside long documents, which is real progress. Transcription is still lossy, though, and it is engine-dependent: a model that handles a clean two-column table may misread a merged-cell layout. A 2025 survey of multimodal RAG for document retrieval notes that OCR-based pipelines routinely lose the structural detail that gives a table its meaning. Relying on the engine to re-derive structure you could have published natively is a gamble with no upside.

The fix costs almost nothing and removes the gamble entirely. Publish tabular data as a semantic table, mark the header cells with scope, then reference the table in the prose so it is bound to the query it answers. The structure that makes a table accessible to a screen reader is the same structure that makes it extractable by an engine, which is why the cheapest accessibility win and the cheapest AI-visibility win are frequently the identical edit. A picture of a table should be reserved for when the table itself is the subject, never for delivering data you want cited.
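For teams that publish tables from a spreadsheet or CMS export, the semantic version can be generated rather than hand-written. A minimal Python sketch, assuming the first cell of each row is the row header; the function name `semantic_table` is hypothetical, and the example data is the benchmark table from this guide.

```python
from html import escape


def semantic_table(caption: str, columns: list, rows: list) -> str:
    """Render rows as an HTML table with scoped column and row headers."""
    lines = ["<table>",
             f"  <caption>{escape(caption)}</caption>",
             "  <thead>",
             "    <tr>"]
    for col in columns:
        lines.append(f'      <th scope="col">{escape(str(col))}</th>')
    lines += ["    </tr>", "  </thead>", "  <tbody>"]
    for head, *cells in rows:
        lines.append("    <tr>")
        # The first cell names the row, so it is a header with row scope.
        lines.append(f'      <th scope="row">{escape(str(head))}</th>')
        for cell in cells:
            lines.append(f"      <td>{escape(str(cell))}</td>")
        lines.append("    </tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


table_html = semantic_table(
    "Chart-reading accuracy by benchmark",
    ["Benchmark", "Accuracy"],
    [["ChartQA", "90.5%"], ["ChartQAPro", "55.81%"]],
)
print(table_html)
```

The `scope` attributes are what let both a screen reader and a parser tie "55.81%" to the ChartQAPro row and the Accuracy column without guessing.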

The Same Data, Two Levels of Extractability

Strong

A semantic <table>

Header cells with scope, real rows and columns. The engine lifts any single value with its full meaning, and a screen reader navigates it cleanly.

Invisible

A picture of a table

Every relationship between the numbers is trapped in pixels. The engine must OCR it, structure is lost, and the values cannot be cited reliably.

Sources: Google (2025), Multimodal RAG survey, arXiv (2025).

How AI Reads a Chart: The Data Trapped in Your PNG

Charts carry the sharpest version of the problem, because reading numbers out of pixels is genuinely hard even for the best models. On a clean benchmark, a frontier model looks fluent: Claude Sonnet 3.5 scores 90.5 percent on the original ChartQA test. On ChartQAPro, a 2025 benchmark built from more diverse and realistic charts, the same model scores 55.81 percent. That fall of nearly thirty-five points (34.69, to be exact) is the measure of how unreliable chart reading becomes the moment a chart looks like the ones brands actually publish.

Even the strongest models have a ceiling rather than a guarantee. Google reports that Gemini 3 Pro reaches 80.5 percent on the CharXiv reasoning benchmark, above the human baseline, which is a remarkable result on demanding scientific charts. It is still a reasoning score on a test set, not an assurance that your specific exported PNG, with its custom legend and its small annotations, is parsed correctly on the one query that matters. When a chart's numbers exist only inside the image, every engine is making an educated guess, and the guesses get worse as the chart gets more bespoke.

The remedy is the accessibility pattern again, applied for citation. WCAG guidance for a complex image such as a chart is to provide the actual data in a table and a short summary of the trend in text. Do exactly that: keep the chart for the human eye, then publish the underlying numbers as a semantic table or state the headline figure in the sentence beside it. The chart becomes the illustration; the data becomes the citation. A page that does this stops gambling on whether the engine can read its graphics, because the answer no longer depends on the pixels at all.
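Stating the headline figure in prose is easier to keep accurate when the sentence is derived from the same numbers that drive the chart. The Python below is a sketch only: the function name `trend_summary` and its sentence template are hypothetical, and it assumes the values are percentages.

```python
def trend_summary(metric: str, first: tuple, last: tuple) -> str:
    """Turn two labelled data points into one citable sentence.

    `first` and `last` are (label, value) pairs; values are assumed
    to be percentages for the purposes of this sketch.
    """
    (first_label, first_value), (last_label, last_value) = first, last
    change = last_value - first_value
    if change > 0:
        direction = "rises"
    elif change < 0:
        direction = "falls"
    else:
        direction = "stays flat"
    return (f"{metric} {direction} from {first_value:g} percent on {first_label} "
            f"to {last_value:g} percent on {last_label}, "
            f"a change of {abs(change):.2f} points.")


sentence = trend_summary("Accuracy", ("ChartQA", 90.5), ("ChartQAPro", 55.81))
print(sentence)
```

Placed in the paragraph beside the chart, a sentence like this gives the retriever the values as text, so the citation no longer depends on how well any engine reads the image.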

The Chart-Reading Cliff

ChartQA, clean academic charts: 90.5%

ChartQAPro, diverse realistic charts: 55.81%

Claude Sonnet 3.5 on the same task, two chart sets. The thirty-five point drop on realistic charts is the gap between a benchmark and the graphic you actually shipped.

Benchmark	Accuracy
ChartQA	90.5%
ChartQAPro	55.81%

Source: ChartQAPro, arXiv (2025).

Step back from any single modality and the measured picture is consistent across all three. The visual web is mostly unreadable to the systems now answering the questions, even as those systems keep improving and the retrieval gains for going visual keep climbing. The four figures below size that landscape in one view.

The State of the Visual Web

60%: median share of images carrying an alt attribute, with many of those left blank

55.81%: a frontier model reading realistic charts, down from 90.5 percent on clean ones

80.5%: Gemini 3 Pro on the scientific-chart reasoning benchmark, above the human baseline

+13%: higher precision (mAP@5) for direct multimodal embeddings over text-summary retrieval

Sources: HTTP Archive Web Almanac (2025), ChartQAPro, arXiv (2025), Google (2025), arXiv (2025).

Native Multimodal Embeddings: When the Engine Skips the Text Entirely

There is a frontier path where the engine bypasses the text surrogate and embeds the picture itself. A multimodal embedding turns an image plus its text into a single vector, so a query can match a visual directly, and a visual document retriever embeds a whole page image rather than its extracted words. This is production technology, not a lab demo. Cohere's Embed v4 produces a unified embedding from a mixed payload of text, images, and graphs in one document, while Jina's embeddings v4 handles visually rich documents without any OCR preprocessing at all.

This does not rescue a pixel-trapped page from the Stack; it raises the floor without removing it. Native multimodal retrieval still rewards clean structure and legible visuals, and the measured gains are real: one 2025 study found that direct multimodal-embedding retrieval beats summarizing an image into text first by 13 percent on mAP@5 and 11 percent on nDCG, because summarization loses information the picture carried. A separate benchmark across thousands of real PDF pages, UniDoc-Bench, found that multimodal text-image fusion consistently outperforms either modality alone. The engines are getting better at reading visuals, which raises the stakes for having something worth reading.

The practical takeaway is liberating rather than daunting. You do not build a multimodal embedding; the engines and their retrievers do it for you, with research like VLM2Vec-V2 pushing visual document retrieval forward every quarter. Your job is to hand whichever path the engine takes a clean asset: a legible chart with a real axis and a readable legend, a table with genuine structure, an image with an honest caption. Optimize the visual for a human and the surrogate for a machine, so both the text route and the native vision route land on your data instead of around it.

Embedding the Picture Beats Summarizing It First

Retrieval precision, mAP@5: +13%

Ranking quality, nDCG: +11%

Absolute improvement when an engine embeds the visual directly rather than converting it to a text summary first. Summarization drops detail the picture carried.

Metric	Absolute gain
mAP@5	+13%
nDCG	+11%

Source: Text-Based and Image-Based Retrieval in Multimodal RAG, arXiv (2025).
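For readers unfamiliar with the two metrics, both score a ranked result list against known relevance labels. The Python below is a generic sketch of one common formulation, not the cited study's evaluation code: AP@k is normalized by the number of relevant hits found in the top k, and the nDCG ideal ranking is computed from the labels of the list passed in. Both choices are assumptions, and other variants exist.

```python
import math


def average_precision_at_k(relevance: list, k: int = 5) -> float:
    """AP@k for one query; `relevance` is a ranked list of 0/1 flags."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: list, k: int = 5) -> float:
    """mAP@k: the mean of AP@k across queries."""
    return sum(average_precision_at_k(r, k) for r in runs) / len(runs)


def ndcg_at_k(gains: list, k: int = 5) -> float:
    """nDCG@k: discounted gain of the ranking divided by the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0


ranking = [1, 0, 1, 0, 0]  # relevant results at positions 1 and 3
print(round(average_precision_at_k(ranking), 3))  # 0.833
print(round(ndcg_at_k(ranking), 3))               # 0.92
```

Both metrics run from 0 to 1, so an absolute gain of 13 points on mAP@5 means the relevant visuals sit measurably higher in the top five results.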

Whether the engine reads your text surrogate or embeds the picture itself, the asset travels the same route from a visual to a citable passage. Tracing that route makes the failure points obvious, because a break at any stage ends the journey before a citation is possible.

From Pixel to Passage

STEP 1

A visual asset: an image, a table, or a chart on the page

↓

STEP 2

A surrogate is derived: alt, caption, OCR, a semantic table, or a native multimodal embedding

↓

STEP 3

The surrogate lands in the retrieval space as text or a vector, scored against the query

↓

STEP 4

The passage is lifted into the answer, with your page named as the source

DEAD END

Pixels only, no surrogate at Step 2: the asset never enters retrieval and cannot be cited

Framework: Digital Strategy Force Multimodal Extraction Stack.

The DSF Visual Extractability Scorecard: Rating Every Asset Strong, Partial, or Invisible

The DSF Visual Extractability Scorecard turns the Stack into a fast, repeatable audit. It rates every image, table, and chart on five checks, the same five layers expressed as questions: is there a text surrogate, is the structure machine-parseable, is the asset entity-bound, is the data free of the pixels, and is the page the attributable source. An asset that passes all five is Strong, an asset that passes some is Partial, and an asset that fails the surrogate check is Invisible no matter how good it looks.

A worked example shows how fast the diagnosis runs. A mid-market B2B software company shipped its pricing comparison as a single exported PNG, proud of the design. On the Scorecard it rated Invisible on three of five checks: no text surrogate, data trapped in pixels, no parseable structure. The team rebuilt it as a semantic table with a captioned figure and the headline numbers restated in the prose, changing nothing about the underlying data. Within weeks the pricing page began surfacing inside Gemini comparison answers, because its readability, not its content, had changed.

Using the Scorecard is a triage exercise, not a grade. Run it across your highest-value pages, then read the lowest check as the bottleneck, because influence is gated by the weakest layer rather than the average. In practice the lowest check is almost always Surrogate, which is good news: the fix is cheap, it doubles as an accessibility win, and it sits near the bottom of the Stack, where one edit unlocks every layer above it. The scorecard below is the version a team can run before publishing any visual-heavy page.

The DSF Visual Extractability Scorecard

Text surrogate

Strong: alt, caption, or a data table is present.

Invisible: pixels only, nothing to read.

Parseable structure

Strong: semantic table, header scope, real text.

Invisible: a flat screenshot of structure.

Entity-bound

Strong: caption names the subject, figure cited in prose.

At risk: an orphan image with no context.

Data not pixel-trapped

Strong: numbers exist as text beside the chart.

Invisible: values live only in the image.

Source-attributable

Strong: original data on your own canonical page.

At risk: a re-hosted stock visual.

The verdict

Five passes is Strong, some is Partial, a failed surrogate is Invisible.

Framework: Digital Strategy Force Visual Extractability Scorecard.
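The verdict rule is simple enough to encode for a spreadsheet export or a content inventory. The Python below is a sketch of that rule as stated in this guide; the check keys and the function name `rate_asset` are hypothetical labels, and the true/false answers still come from a human reviewer.

```python
# The five Scorecard checks, as hypothetical dictionary keys.
CHECKS = (
    "text_surrogate",
    "parseable_structure",
    "entity_bound",
    "data_not_pixel_trapped",
    "source_attributable",
)


def rate_asset(results: dict) -> str:
    """Apply the verdict: a failed surrogate is Invisible, five passes is Strong."""
    if not results.get("text_surrogate", False):
        return "Invisible"
    if all(results.get(check, False) for check in CHECKS):
        return "Strong"
    return "Partial"


# The pricing comparison from the worked example, before and after the rebuild.
pricing_png = {"text_surrogate": False, "parseable_structure": False,
               "entity_bound": True, "data_not_pixel_trapped": False,
               "source_attributable": True}
pricing_table = {check: True for check in CHECKS}

print(rate_asset(pricing_png))    # Invisible
print(rate_asset(pricing_table))  # Strong
```

Checking the surrogate first reflects the rule that no other pass can rescue an asset the engine has nothing to read from.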

The Modality You Forgot to Optimize

Most AEO work has gone into the text, because the text is where the discipline began. The visual layer was left to look good for humans, which is exactly why it is now the cheapest ground to win. Competitors have poured years into their prose while leaving the same blind spot, so the page that fixes its visual layer first does not merely catch up, it passes them on the exact assets buyers find most persuasive.

The numbers say so plainly: a median of only 60 percent of images carry alt text, a frontier model reads realistic charts correctly barely more than half the time, and the brands that publish their data as readable text rather than trapped pixels inherit the citations the rest forfeit by default.

The Surrogate Principle is the whole guide in one line: AI search cites the text it can read about a visual, never the visual itself. Engineering for it is not a redesign but a habit: giving every informative image a real caption, every table real structure, and every chart its underlying numbers in the prose. Do that, and the most impressive assets on your page stop being invisible to the systems that now decide what your buyers see. The chart you spent a day perfecting should be the one the answer is built from, not the blank rectangle the retriever skips.

FAQ — Multimodal Retrieval

How does AI search read an image it cannot see?

It reads a text surrogate, not the pixels. Google states it uses the alt text together with computer vision and the surrounding page content to understand an image, and the retriever ranks that text. An image with no alt, no caption, and no descriptive prose nearby gives the engine nothing to retrieve, so its subject and any data it holds cannot be cited.

Does alt text still matter for AI search, or only for accessibility?

It matters more for AI search than accessibility alone ever made it. Alt text is the primary surrogate an engine embeds for an image, yet the 2025 Web Almanac shows a median of only 60 percent of images carry one and 15 percent on mobile are blank. Digital Strategy Force treats informative alt text as a retrieval asset rather than a compliance checkbox, because the same edit earns both a citation and an accessible page.

Should I publish data as an HTML table or an image of a table?

Always a semantic table with header cells marked by scope. A real table is one of the most extractable structures on the web, while a screenshot of a table is one of the least, because the engine has to OCR it and OCR loses the row-column structure that makes the data meaningful. Reserve a picture of a table for when the table itself is the subject, never for delivering numbers you want cited.

Why do AI models get the numbers in my chart wrong?

Because reading data out of pixels is genuinely hard. The same frontier model that scores 90.5 percent on a clean chart benchmark scores 55.81 percent on a realistic one, so a bespoke exported chart is well within the range where the engine guesses. If your numbers exist only inside the image, every engine is guessing. Publish the underlying data as a table or state the headline figure in the sentence beside the chart.

What is a multimodal embedding, and do I need one on my site?

A multimodal embedding is a single vector that represents an image and text together, so a query can match a picture directly. You do not build one; the engines and their retrievers do, using models such as Cohere Embed v4 and Jina embeddings v4 that embed visually rich documents natively. Your job is to give those embeddings clean structure and legible visuals to lock onto, because direct multimodal retrieval still rewards readable assets.

How do I make one chart citable across ChatGPT, Gemini, and Perplexity at once?

Pair the chart with three surrogates every engine reads: a descriptive caption that names the entities, the underlying numbers in a semantic table, and one sentence of prose stating the headline figure. That covers the text route and the native vision route at the same time. Digital Strategy Force scores each asset on the Visual Extractability Scorecard before publication so a single figure clears every engine's retrieval rather than just one.

Next Steps — Multimodal Retrieval

The Stack is a diagnostic, so start by scoring. Run your visual-heavy pages through the five checks before deciding where to invest the effort.

▶Score your top 25 pages on the Visual Extractability Scorecard, tag every image, table, and chart Strong, Partial, or Invisible, then fix the failed-surrogate assets first.
▶Convert every image-of-a-table into a semantic table with header cells marked by scope, since structure is what makes a table both extractable and accessible.
▶Add the underlying data, as a real table or a one-line figure, for each chart whose numbers currently live only inside a rendered image.
▶Audit alt text on informative images first, writing entity-rich descriptions rather than keyword strings, then leave decorative images with an empty alt.
▶Re-reference every figure in the prose so each visual is bound to the passage it supports, the way well-structured pages earn the highest influence in AI answers.

Digital Strategy Force Answer Engine Optimization runs the Visual Extractability Scorecard across every image, table, and chart on a site, names the assets the retriever cannot read, then rebuilds them into citable surrogates. To make your visual layer machine-readable before your competitors' charts become the ones AI quotes, explore Answer Engine Optimization (AEO) with Digital Strategy Force.