
Total newbie to the LLM/AI field. AFAIK they are meant for text input and output, so if the training data had images and annotations, how did they get better than dedicated tech? And can we really decode how OCR works internally in LLMs, or is it another black-box problem?


Comments

Anaeijon · 284 points

I actually evaluated the feasibility of abusing multimodal models for OCR tasks.

Classic OCR is good and, above all, reliable. It tries to find the best possible fit for each individual character, rather than guessing what is most likely written on the page.

LLM ‘OCR’, on the other hand, works via image embeddings. LLMs don’t transcribe what’s on the page. They transform every input (not just images, but in this case images and text) into a high-dimensional representation of its content.

So, basically, LLMs create a high-dimensional space, and every piece of information (originally every sentence, word, paragraph, chapter, textbook, chat message or novel) can be represented as a dot in that space. On large media this process loses a lot of information, because it really just encodes a rough position in relation to other text. For example, a book and its perfect summary would end up with nearly exactly the same representation in this hyperdimensional space. That’s what LLMs use for everything they are good at.
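
To make the ‘dot in that space’ idea concrete, here is a minimal sketch using the sentence-transformers library; the model name and the example texts are just placeholders meant to show the intuition, not a real evaluation:

```python
# Minimal sketch: a text and its summary land close together in embedding space,
# while an unrelated text lands further away. Model name and texts are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chapter = (
    "The expedition left the harbour at dawn and fought heavy winds for three days "
    "before reaching the ice shelf, where half of the supplies were lost."
)
summary = "A ship battles storms for days and loses much of its cargo at the ice shelf."
unrelated = "The recipe calls for two eggs, flour, and a pinch of salt."

chapter_vec, summary_vec, unrelated_vec = model.encode([chapter, summary, unrelated])

print(float(util.cos_sim(chapter_vec, summary_vec)))    # high: roughly the same 'position'
print(float(util.cos_sim(chapter_vec, unrelated_vec)))  # much lower
```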

Want to translate a sentence from one language to another? Just embed the information into that high-dimensional space, drop the language information, and use statistical methods to find a text with exactly the same position (= same meaning) in another language. Want to summarize a complicated paper in simple words? Embed the paper and use statistical methods to find a paragraph with roughly the same position, just with some tonality information dropped. Want to answer a question? Embed the question and then find the statistically most likely sentence that sounds like a convincing answer with minimal distance to the embedded question.
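
All three of those examples boil down to the same operation: embed the query, then pick the candidate with the smallest distance. A rough sketch of that nearest-neighbour step, with candidate sentences and model name made up purely for illustration:

```python
# Rough sketch of "answering" as nearest-neighbour search in embedding space.
# Candidate sentences and model name are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "What is the capital of France?"
candidates = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The mitochondria is the powerhouse of the cell.",
]

q_vec = model.encode(question, normalize_embeddings=True)
c_vecs = model.encode(candidates, normalize_embeddings=True)

# On normalized vectors, cosine similarity is just a dot product; pick the closest candidate.
scores = c_vecs @ q_vec
print(candidates[int(np.argmax(scores))])
```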

That’s how LLMs work.

The important thing here is that this method is never perfectly reversible. Embedding information always loses some of it, and you can never (programmatically, mathematically) know whether the lost information was relevant. So, if you feed a novel into a really large embedding and then retrieve a novel that perfectly matches that embedding, it might be quite similar, but it will still be written very differently. Especially because the method of generating text is purely stochastic and will always favour more common phrases, words, sentences and tokens over the less common ones a good author would use.

So… that’s it for purely text-based LLMs.

Now how does multimodality work? Simply put: a multimodal LLM/transformer model is not just capable of embedding text of arbitrary length, it can also embed images into the same space. Now, an image that’s basically just a photo of text will likely end up at nearly exactly the same embedded position as the text itself would have, plus maybe some embedded meta information representing the fact that it looks like a scan, a photograph, or an article printed in a newspaper. If you embed a photo of a kangaroo wearing a Christmas sweater and then embed the text ‘a photo of a kangaroo wearing a Christmas sweater’, both will likely end up with nearly the same embedding.
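
You can see this shared text/image space in practice with a CLIP-style model (not an LLM itself, but the same idea of one embedding space for both modalities). A small sketch, with the image path and checkpoint as placeholders:

```python
# Sketch: embed an image and two captions into the same space with CLIP and compare.
# The image path and model checkpoint are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kangaroo_sweater.jpg")  # hypothetical photo
texts = [
    "a photo of a kangaroo wearing a Christmas sweater",
    "a scanned page of a tax form",
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher score = closer in the shared embedding space; the matching caption should win.
print(outputs.logits_per_image.softmax(dim=-1))
```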

Now, again, this method can’t be reversed perfectly. You might be able to generate a picture that has the same embedding (e.g. a picture with roughly the same text on roughly the same background in roughly the same font…), but it will never be even 99% correctly reversed unless it’s an extremely simple input image or was AI-generated in the first place.

So, when you use a multimodal ‘LLM’ for an OCR task, it basically goes like this: the multimodal model embeds the input image into a hyperdimensional space. This process drops a lot of information; for example, it will likely drop where exactly on the page a piece of information came from. After that, the LLM attempts to find a statistically likely text with roughly the same embedding, again dropping meta information like font, paper colour, shape, uncommon words or unusual ways of phrasing something. It will generate a text that is not 100% accurate, but a statistically likely representation of the input.
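
In practice, ‘LLM OCR’ usually just means prompting a vision model with the page image and asking for a transcription, roughly like this (model name, endpoint and image path are assumptions; any OpenAI-compatible vision model is called the same way):

```python
# Sketch of "OCR" with a multimodal chat model: send the page image, ask for text back.
# Model name and image path are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

with open("scanned_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe all text in this image exactly as written."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

# Whatever comes back is a statistically likely transcription, not a guaranteed one.
print(response.choices[0].message.content)
```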

So, if you take some easy, already stochastically flat generated text (= AI-generated/improved text), print it on paper, scan that paper and let the LLM retrieve the text again, it will likely produce a near-perfect match. When working with real, human inputs, though, this introduces other problems. For example, the initial embedding of the image probably drops errors, mistakes and complex information, as they either don’t fit into the embedding dimensions or are stochastically irrelevant. That process might therefore automatically correct typos, exchange words for more readable ones or even auto-translate some text segments. It will also forget where exactly in the original input a piece of information came from.

Overall, it basically behaves like a strong lossy compression algorithm, which would lead to issues similar to the big Xerox scandal about ten years ago. We surely want to avoid that.

The worst thing: using LLM methods alone, it’s impossible to detect these potential problems, because the output will always look more correct and statistically desirable than an actually correct, perfect OCR would.

Also, for real OCR, some information that gets ignored by LLMs is really relevant, for example where exactly each letter was in the original document. Imagine you OCR some scanned document and the OCR result is just the text from that document sitting invisibly in the top left corner, completely misaligned with the actual page. When you then go into the document to copy out some number or a name, you end up copying some other random text segment that happened to land in that position.
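
Classic OCR engines expose exactly this positional information. For example, Tesseract (via pytesseract) returns a bounding box per recognised word, roughly like this (the image path is a placeholder and a local Tesseract install is assumed):

```python
# Sketch: classic OCR keeps per-word positions, which LLM "OCR" throws away.
# Requires a local Tesseract install; the image path is a placeholder.
from PIL import Image
import pytesseract

data = pytesseract.image_to_data(
    Image.open("scanned_page.png"),
    output_type=pytesseract.Output.DICT,
)

for text, left, top, width, height in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip():
        print(f"{text!r} at x={left}, y={top}, w={width}, h={height}")
```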

Lastly, some tools that integrate LLMs with PDF readers add another trick on top to mimic the capability of determining where exactly some information came from: they simply run classic OCR alongside the image embedding. They (classically) OCR the document, often using really good proprietary methods, to get a list of all text blocks with their positions in the document. Then they embed this text information alongside the actual images, using the multimodal capabilities. So, when the model then responds to some question and actually quotes text from the picture, it likely cobbles that answer together from the positionally more accurate classic OCR and the extracted information.
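
A crude sketch of that hybrid trick, building on the two snippets above: run classic OCR first, then hand both the image and the positioned text blocks to the multimodal model. All names, the prompt format and the model are purely illustrative:

```python
# Sketch of the hybrid approach: classic OCR for positions, multimodal model for answering.
# Everything here (paths, model, prompt format) is illustrative only.
import base64
import json
from PIL import Image
from openai import OpenAI
import pytesseract

image_path = "scanned_page.png"  # placeholder
data = pytesseract.image_to_data(Image.open(image_path), output_type=pytesseract.Output.DICT)
blocks = [
    {"text": t, "x": x, "y": y}
    for t, x, y in zip(data["text"], data["left"], data["top"])
    if t.strip()
]

with open(image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here are OCR text blocks with pixel coordinates:\n"
                     + json.dumps(blocks)
                     + "\nUsing the image and these blocks, quote the invoice number and its position."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```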

So, in the end: Don’t use LLMs for OCR. They will look convincing in your tests, but on real tasks they will hallucinate a lot. And because hallucinations are always convincing, it will be hard or even impossible to detect the problems introduced by your method.

Edit: I oversimplified the embedding process a bit. Actually it happens in steps, where segments of an input also get their own embeddings. In theory, this could be used to get region-specific information. Please see this comment by Infinite-Cat007: https://www.reddit.com/r/LocalLLaMA/s/iV6DVLIorN


Infinite-Cat007 · 67 points

While your overall message about the risks of using LLMs for OCR is sound, I think there’s an important technical mischaracterization in how you describe the process. You seem to describe LLMs as creating a single high-dimensional embedding of the entire input, but this isn’t accurate for modern autoregressive LLMs, which typically use decoder-only architectures.

To my understanding, natively multimodal GPT models work something like this:

  1. Images are first tokenized into smaller patches
  2. These patches are individually embedded along with positional information
  3. During processing, attention mechanisms allow the model to selectively attend to relevant portions of the input based on the current context
  4. The model generates outputs token by token, using attention to reference back to relevant parts of the input as needed

This is fundamentally different from encoder or encoder-decoder architectures that might create a single comprehensive embedding. The limitations in OCR tasks come not from embedding the entire document at once, but rather from the quality and granularity of the image patch representations and the model’s ability to work with them effectively through its attention mechanisms.
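
For readers who want to see what steps 1 and 2 above look like, here is a heavily simplified, ViT-style sketch of the patching and positional-embedding step; all sizes are arbitrary toy values:

```python
# Simplified sketch of steps 1-2 above: cut an image into patches, project each patch
# to an embedding, and add positional information. Sizes are arbitrary toy values.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size, dim = 16, 768

# 1. "Tokenize" the image into 16x16 patches and 2. embed each patch.
to_patch_embedding = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
patches = to_patch_embedding(image)           # (1, dim, 14, 14)
patches = patches.flatten(2).transpose(1, 2)  # (1, 196, dim): one token per patch

# Add learned positional embeddings so the model knows where each patch came from.
pos_embedding = nn.Parameter(torch.randn(1, patches.shape[1], dim))
tokens = patches + pos_embedding

print(tokens.shape)  # torch.Size([1, 196, 768]) -> 196 image tokens for the transformer
```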

Anaeijon · 15 points

Yes, you are correct here. That’s a simplification I made and your points add valuable information.

Current LLMs don’t embed things into only one vector. They basically handle this recursively and generate embeddings for the whole thing, as well as for each of its parts and each part of those parts, down to an atomic level. So, for example, for text you get one embedding for the whole document, one for each chapter, one for each paragraph, one for each sentence and one for each word (yes, again, this is oversimplified). Later on, either the vector database or the model itself dynamically decides which of these chunks are actually useful for generating an answer.
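
In text-RAG terms, that multi-level chunking might look roughly like this; the splitting rules and model are deliberately crude and purely for illustration:

```python
# Naive sketch of multi-level embeddings: one vector for the document,
# one per paragraph, one per sentence. Splitting rules are deliberately crude.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = (
    "Classic OCR fits characters. It keeps positions.\n\n"
    "LLM OCR embeds the page. It rewrites a likely text."
)

index = {"document": model.encode(document)}
for p_i, paragraph in enumerate(document.split("\n\n")):
    index[f"paragraph_{p_i}"] = model.encode(paragraph)
    for s_i, sentence in enumerate(paragraph.split(". ")):
        index[f"paragraph_{p_i}_sentence_{s_i}"] = model.encode(sentence)

print(list(index.keys()))
```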

Now, applying this to images (and I’m not sure if it’s actually done this way, I personally haven’t seen it): you can obviously deconstruct the image into segments, similar to what a convolutional model would do. Then every segment, down to some small pixel region, could get its own embedding. That way, the model could actually predict which segment of the image a specific piece of information comes from, or where exactly some letter is.

As you wrote, it all comes down to the limit of how many patches the LLM can handle, which currently probably isn’t many.

I think, even with more granularity, this wouldn’t work well for OCR with the current, token-based generation of transformer models. But considering upcoming Byte Latent Transformer technology, an approach like that could actually match pixel regions to specific text bytes and handle OCR tasks accurately. The question is whether that still makes sense at all. All it would do in the end is classic OCR, just in a really complicated way with a lot of extra computation required.

dstrenz · 32 points

Or, to have an llm summarize that:

Using multimodal language models (LLMs) for Optical Character Recognition (OCR) is not recommended. Traditional OCR methods are reliable because they accurately fit each character without altering the original text. In contrast, LLMs use high-dimensional embeddings to represent both text and images, which leads to significant information loss and inaccuracies. This embedding process cannot be perfectly reversed, resulting in text that is statistically likely but not an exact transcription of the original. While LLM-based OCR might perform well with simple, AI-generated text, it struggles with real human inputs by introducing errors, correcting typos, and losing the original layout. Additionally, these models cannot detect the inaccuracies they introduce, making it difficult to identify and rectify issues. Some tools mitigate this by combining classic OCR with LLM embeddings, but overall, relying solely on LLMs for OCR can lead to unreliable and misleading results.

Anaeijon · 42 points

And that’s the perfect example:

After embedding and retrieving my long text, the LLM you used dropped a lot of intricate details which I had put in there to help a human reader with little prior knowledge understand the underlying principles and not come to wrong conclusions by filling gaps with assumptions.

It also introduced hallucinations, for example:

While LLM-based OCR might perform well with simple, AI-generated text, it struggles with real human inputs *by introducing errors*, correcting typos, and losing the original layout.

I never wrote, LLMs would introduce errors. It’s actually more likely they remove errors, which still is undesirable for an OCR system.

An overall better summary would be this comment by GHOST—1, lol.

TLDR of this comment using claude:

LLMs may oversimplify and alter text when summarizing.

Small-Fall-6500 · 8 points

I never wrote, LLMs would introduce errors. It’s actually more likely they remove errors, which still is undesirable for an OCR system.

I just thought it ironic that by “removing errors” it would technically be adding errors.

Weary_Long340 · 93 points

Recently I ran Qwen2-VL-72B-Instruct to extract hundreds of PDF docs scanned with CamScanner, removing parts here and there, including page numbers, signature boxes, the “Scanned with CamScanner” watermark, etc. I found OCR with a multimodal LLM useful, wickedly fast, and the output really acceptable.

It might be a different case if I needed exact, pure-text OCR; then I would use traditional OCR. Is there any OpenAI-compatible endpoint API for a traditional OCR framework? I already tried stepfun-ai/GOT-OCR2_0, which is good enough, but I would love to use it if there’s support for my current backend.
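
For reference, a Qwen2-VL workflow like the one described here is usually just the chat-with-image pattern pointed at a locally served, OpenAI-compatible endpoint (the server URL, model name, image path and prompt below are assumptions):

```python
# Sketch: the same chat-with-image pattern against a locally served Qwen2-VL,
# e.g. behind an OpenAI-compatible server such as vLLM. URL, model and path are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("camscanner_page.jpg", "rb") as f:  # placeholder scan
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the document text. Skip page numbers, signature boxes, "
                     "and the 'Scanned with CamScanner' watermark."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```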

blackkettle · 5 points

What a great summary. Thanks for sharing that!

Affectionate-Cap-600 · 4 points

that’s a really good explanation!! thank you for writing this!

just want to add one thing: given the cross-attention architecture that many of these LLMs use to ‘mix’ the image encoder’s data with the textual component, the prompt paired with the image is really relevant.

as an example of what you said: usually, positional elements will be lost, but if you mention them in the prompt, many models will focus on those elements, getting them right but missing other information that they wouldn’t have missed if they weren’t prompted to focus on positional elements.

drdailey · 3 points

I have done text extraction using OpenAI GPT-4o and compared it to Tesseract and Textract, and it wins. Presumably because my prompt tells it to pull the information directly and use context to fill in any text it can’t quite read. I have done this systematically with test image libraries and LLMs win, using various measures like cosine similarity etc. I must add I am unsure what OpenAI is doing behind the scenes. It could be using Textract and then using the model to fill in the blanks.
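
For anyone curious, the cosine-similarity comparison mentioned here can be as simple as embedding each OCR output together with the ground-truth transcription and comparing the vectors; the model name and example strings below are placeholders:

```python
# Sketch of scoring OCR outputs against a ground-truth transcription via embedding
# cosine similarity. Model name and example strings are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

ground_truth = "Invoice 4821, due 2024-03-01, total $1,250.00"
outputs = {
    "tesseract": "Invoice 4821, due 2024-03-01, total $1,250.00",
    "vision_llm": "Invoice 4821, due 2024-03-01, total $1,250.00 (paid)",  # plausible extra detail
}

gt_vec = model.encode(ground_truth)
for name, text in outputs.items():
    print(name, float(util.cos_sim(gt_vec, model.encode(text))))
```

Note that this kind of metric rewards fluent, plausible text, so a transcription that reads well but adds or changes details can still score highly, which is exactly the failure mode discussed further up in the thread.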