
OCR Text Recognition: A Complete Guide to Document and Scene Text Labeling

TjMakeBot Team · Technical Tutorial · 18 min read

Introduction: Why Does Your OCR Model Keep "Misreading"?

Optical Character Recognition (OCR) technology is nothing new — from the "scan to translate" feature on your phone to automatic parking gate systems, it's everywhere. But every engineer working on OCR model training knows that the flawless demo performance often falls apart in real-world scenarios.

Why? Because real-world text never sits neatly as black characters on white paper.

Uneven lighting, paper wrinkles, wildly varied fonts, complex background interference... all these factors confuse the model. And the deeper root cause often lies in training data quality. Based on our observations in the TjMakeBot community, over 70% of OCR model performance bottlenecks can ultimately be traced back to labeling data issues: bounding boxes not tight enough, special characters being ignored, chaotic handling of multilingual mixed text, and more.

This article won't discuss lofty algorithm theory. We'll only talk about the most practical "dirty work" — how to boost your OCR model's accuracy from 85% to 99% through high-quality data labeling. We'll share the hands-on experience and pitfall avoidance guide that the TjMakeBot team has distilled from processing millions of OCR data samples.

OCR Labeling Task Classification: Choosing the Right "Question Type" is Key

Before you start labeling, you must clarify exactly what your model needs to learn. Different business requirements correspond to entirely different labeling strategies — choose wrong, and all subsequent work heads in the wrong direction.

1. Text Detection: First Find "Where It Is"

Task essence: Think of it as a "spot the difference" game — you're only responsible for circling the text in the image, without caring what it says.

Labeling formats and use cases:

  • Bounding Box:

    • When to use: The vast majority of standard documents, such as scanned contracts, books, and reports.
    • Pitfall guide: Don't include line spacing! If two lines of text are too close together, the model easily merges them. Hug the text edges closely, but don't clip any strokes.
  • Rotated Box:

    • When to use: Casually photographed invoices, tilted business cards, shipping labels on conveyor belts.
    • Key detail: Beyond coordinates and dimensions, the angle must be accurately labeled. A common mistake is inconsistent angle definitions (e.g., clockwise vs. counterclockwise), causing the model's training loss to fail to converge.
  • Quadrilateral:

    • When to use: Perspective distortion from off-angle photography — for example, a billboard photographed from the side where originally rectangular text becomes trapezoidal.
    • Technique: Record the four vertex coordinates strictly in clockwise or counterclockwise order; otherwise, the polygon will "twist."
  • Polygon:

    • When to use: Curved text on logos, wavy artistic text on T-shirts.
    • Cost warning: This is the most time-consuming labeling method. Unless the business scenario absolutely requires it (e.g., artistic text recognition), try to approximate with quadrilaterals to balance cost.
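The vertex-ordering technique above can be sketched in a few lines. This is a minimal example, not any tool's official implementation: it sorts a quadrilateral's four points by angle around their centroid, which in image coordinates (y grows downward) traverses them clockwise on screen, then rotates the list so the top-left-most vertex comes first.

```python
import math

def order_quad_clockwise(points):
    """Order 4 (x, y) vertices clockwise, starting from the top-left.

    In image coordinates (y grows downward), ascending atan2 angle
    around the centroid traverses the points clockwise on screen.
    """
    cx = sum(p[0] for p in points) / 4.0
    cy = sum(p[1] for p in points) / 4.0
    pts = sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    # Rotate so the vertex nearest the image origin (top-left) comes first.
    start = min(range(4), key=lambda i: pts[i][0] + pts[i][1])
    return pts[start:] + pts[:start]
```

Normalizing every quadrilateral through a function like this before export is a cheap way to guarantee the "strictly clockwise" convention and avoid twisted polygons.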

2. Text Recognition: Understanding "What It Says"

Task essence: "Translating" cropped text images into computer-readable strings.

Deep dive:

  • Content transcription: What you see is what you get. But what about blurry, illegible characters? Never let annotators "guess"! Agree on a special symbol (such as ###) to represent unrecognizable characters — this is far better than forcing an incorrect character.
  • Language attributes: Don't assume all text is in one language. In cross-border e-commerce scenarios, a single label might contain Chinese, English, and Japanese simultaneously. Explicitly marking language types helps the model load the corresponding dictionary, dramatically improving accuracy.
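The two conventions above (an agreed unreadable-character placeholder, plus an explicit language tag) can be baked into the label schema itself. The record shape and language codes below are illustrative assumptions, not a fixed standard:

```python
UNREADABLE = "###"  # project-agreed placeholder for illegible characters

def make_text_label(text: str, lang: str) -> dict:
    """Build one recognition label with an explicit language attribute.

    'lang' uses ISO 639-1 codes plus 'mixed'; the allowed set and the
    record shape are assumptions for this sketch.
    """
    allowed = {"zh", "en", "ja", "mixed"}
    if lang not in allowed:
        raise ValueError(f"unknown language tag: {lang}")
    return {"text": text, "lang": lang, "unreadable": UNREADABLE in text}

make_text_label("Invoice No. ###42", "en")
# {'text': 'Invoice No. ###42', 'lang': 'en', 'unreadable': True}
```

Flagging `unreadable` at label time lets you exclude or down-weight those samples during training without re-reading every transcript.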

3. End-to-End OCR: All in One Step

Task essence: Detection and recognition in a single pass. This is currently the most mainstream training approach.

Labeling challenge: Complex data structures. You need to maintain both positional information (Bbox) and semantic information (Text) within a single JSON object. Practical advice: Establish strict data validation scripts. We frequently encounter "waste data" where "the box was drawn but the text wasn't entered" or "the text was entered but the box is missing" — these are "poison" during training.
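A validation script for the failure modes just described can be very small. The record shape below (`bbox` as `[x1, y1, x2, y2]` plus a `text` string) is one common but assumed convention; adapt the checks to your own schema:

```python
import json

def validate_ocr_records(records):
    """Flag end-to-end OCR labels where the box or the text is missing.

    Assumes records shaped like {"bbox": [x1, y1, x2, y2], "text": "..."}.
    Returns (index, reason) pairs for every bad record.
    """
    errors = []
    for i, rec in enumerate(records):
        bbox, text = rec.get("bbox"), rec.get("text")
        if not bbox or len(bbox) != 4:
            errors.append((i, "box missing or malformed"))
        elif bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
            errors.append((i, "degenerate box"))
        elif text is None or text.strip() == "":
            errors.append((i, "text not entered"))
    return errors

records = json.loads(
    '[{"bbox": [10, 10, 80, 30], "text": "TOTAL"},'
    ' {"bbox": [10, 40, 80, 60], "text": ""}]')
print(validate_ocr_records(records))  # [(1, 'text not entered')]
```

Running a check like this on every labeling batch, before the data ever reaches training, is far cheaper than discovering "poison" samples from a diverging loss curve.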

4. Document Layout Analysis: Understanding the "Structure"

Task essence: This isn't just about reading characters — it's about understanding the document's "skeleton."

Scenario examples:

  • Resume parsing: You need to tell the model that this bold, large text is "Work Experience" (a heading), and the smaller text below lists specific "Company Name" and "Position" (body text).
  • Table reconstruction: This is the hardest part. You need to not only box out cells but also clarify row and column relationships (Row/Col Span). If your business involves heavy report processing, you must develop extremely detailed table labeling specifications.
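For table reconstruction, the row/column relationships mentioned above are usually captured per cell. The schema below is purely illustrative (field names are assumptions), but it shows the minimum you need to rebuild a table with merged cells:

```python
def table_cell(text, row, col, row_span=1, col_span=1):
    """One reconstructed table cell (illustrative schema, not a standard).

    Row/col indices are 0-based; spans >= 1 express merged cells.
    """
    if row_span < 1 or col_span < 1:
        raise ValueError("spans must be >= 1")
    return {"text": text, "row": row, "col": col,
            "row_span": row_span, "col_span": col_span}
```

A "Total" cell spanning two columns in row 3 would then be `table_cell("Total", 3, 0, col_span=2)`; without the span fields, merged headers collapse into ambiguous duplicates.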

Scenario-Based Practice: Survival Rules for Different "Battlefields"

Scenario 1: Printed Documents — Deceptively Simple, Full of Traps

Printed documents may be neat, but their high density and varied layouts often lead to complacency.

  • Line-level vs. Word-level:
    • For large blocks of body text, line-level labeling offers the best cost-performance ratio.
    • But for invoice entries like "Amount: $100," if you need to extract key-value pairs, separating "Amount" and "$100" with word-level labeling is more beneficial for downstream information extraction models.
  • Table handling:
    • Table lines are often broken or even completely absent (invisible tables). When labeling, don't rely on visual lines — instead, define cell boundaries based on the logical alignment of content.

Scenario 2: Scene Text — Chaos is the Norm

The biggest enemy of text in natural scenes is the environment.

  • Lighting and shadows: Tree shadows blocking half the text — label it or not? The principle: If a human can recognize it at a glance with context, label it; if even a human would have to guess for a while, mark it as "blurry/unreadable." Don't force-train the model to be clairvoyant.
  • Curved text: Use multi-point labeling (Polygon) to fit the text's centerline or upper/lower contours. Remember, more points aren't always better — too many points cause labeling jitter. Typically 5-8 points are sufficient to describe a curve.
  • Vertical text: Common on signs and traditional couplets. Be sure to add a direction: vertical attribute field. Many models default to horizontal scanning — without special handling for vertical text, the recognition output will be garbled.
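The point-count limit and the direction attribute above can both be enforced at label-creation time. The field names and allowed values here are assumptions for the sketch, not a fixed schema:

```python
def scene_text_label(points, text, direction="horizontal"):
    """Build one scene-text label with an explicit direction attribute.

    Caps polygons at 8 points, per the guideline that more points invite
    labeling jitter; 4 points covers plain quadrilaterals.
    """
    if direction not in ("horizontal", "vertical", "curved"):
        raise ValueError(f"bad direction: {direction}")
    if not 4 <= len(points) <= 8:
        raise ValueError("use 4-8 polygon points; more invites jitter")
    return {"points": points, "text": text, "direction": direction}
```

Rejecting over-detailed polygons in the tool itself is more reliable than asking reviewers to count points by eye.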

Scenario 3: Handwritten Text — A Thousand Faces

Handwriting recognition is the "boss level" of OCR.

  • "What you see is what you get" principle: Annotators must overcome their "perfectionism." For example, if someone writes a character that looks like a different one based on the strokes, label what it looks like — or mark it as ambiguous. Don't use your own subjective judgment to "correct" the writer's mistakes, unless your task is specifically error correction.
  • Corrections and deletions:
    • If text has been crossed out, it typically means it's invalid.
    • Strategy 1: Don't label it at all — treat it as background noise.
    • Strategy 2 (recommended): Select it and mark it as ignore or void category. Telling the model "there's text here, but don't read it" works better than making the model "pretend it doesn't see it," effectively reducing false detection rates.
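Strategy 2 only pays off if the training pipeline actually honors the flag. A minimal sketch, assuming each label dict carries an `ignore` boolean (the field name is an assumption):

```python
def split_train_regions(labels):
    """Separate trainable regions from ignore regions (Strategy 2 above).

    Ignored boxes are kept, not dropped: the detector can mask them out
    of the loss so it is neither rewarded nor punished there.
    """
    train = [l for l in labels if not l.get("ignore", False)]
    ignore = [l for l in labels if l.get("ignore", False)]
    return train, ignore
```

The key design point is that ignore regions are returned, not discarded: dropping them reverts you to Strategy 1 and the model is penalized for "detecting" text that is genuinely there.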

Multilingual OCR Labeling: Breaking the Tower of Babel

Chinese Labeling Pitfalls

  • Character set explosion: There are thousands of commonly used Chinese characters, and rare characters are countless. Always unify the encoding format (UTF-8) and build a project-specific dictionary (Vocabulary). When encountering characters not in the dictionary, decide in advance whether to ignore them or replace with <UNK>.
  • Punctuation marks: The Chinese period and the English dot . may differ by only a few pixels. They're extremely easy to confuse in low-resolution images. We recommend establishing labeling conventions based on contextual language cues.
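The dictionary-plus-`<UNK>` policy above can be sketched in a few lines. This shows one possible out-of-vocabulary policy, not the only correct one:

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(transcripts, min_count=1):
    """Build a project dictionary from UTF-8 transcripts.

    Characters rarer than min_count are dropped; anything unseen maps
    to <UNK> at encode time.
    """
    counts = Counter(ch for t in transcripts for ch in t)
    chars = sorted(ch for ch, c in counts.items() if c >= min_count)
    return {ch: i for i, ch in enumerate([UNK] + chars)}

def encode(text, vocab):
    # Every character not in the dictionary becomes the <UNK> id.
    return [vocab.get(ch, vocab[UNK]) for ch in text]
```

Deciding the `<UNK>` policy before labeling starts means annotators and the training pipeline agree on what happens to rare characters; retrofitting it after a few hundred thousand transcripts is painful.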

English Labeling Details

  • Case sensitivity: PASSWORD and Password may be semantically identical, but in a password input field, they're completely different. Unless the business explicitly doesn't distinguish case, always label as-is.
  • Hyphenated line breaks: English typesetting often splits words across lines with hyphens (e.g., com-puter). When labeling, we recommend labeling the two lines separately but noting "split word" in the attributes, or merging during post-processing. Don't force a single large box across lines.
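The merge-during-post-processing option can be a simple pass over line-level transcripts. This is a heuristic sketch (a trailing hyphen followed by a lowercase continuation is assumed to be a split word), so expect it to need tuning for your corpus:

```python
def merge_hyphenated(lines):
    """Join hyphenated line breaks in line-level transcripts.

    Heuristic: a line ending in '-' whose successor starts with a
    lowercase letter is treated as one split word
    ('com-' + 'puter' -> 'computer').
    """
    out = []
    for line in lines:
        if out and out[-1].endswith("-") and line[:1].islower():
            out[-1] = out[-1][:-1] + line
        else:
            out.append(line)
    return out

merge_hyphenated(["the com-", "puter is on"])
# ['the computer is on']
```

Note the lowercase check: it avoids wrongly merging genuinely hyphen-terminated text followed by a new sentence or a proper noun, though it will still miss edge cases like hyphenated names split across lines.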

Handling Mixed Languages

  • Don't try to fragment a Chinese sentence with embedded English words. For example, "My iPhone is broken" — label it as a single unit. Only consider splitting when the language switch causes obvious layout changes (such as sudden font or size changes).

Tools and Workflow: Sharpen Your Axe Before Chopping Wood

Why You Need Professional Labeling Tools

Using paint software or generic object detection tools for OCR labeling is like digging a well with a spoon. Professional OCR labeling tools (like TjMakeBot's built-in tools) solve several core pain points:

  1. Instant snapping: One click, and four points automatically snap to text edges — no pixel-level fine-tuning needed.
  2. OCR pre-recognition: Run a foundation model first to auto-fill text content. Annotators only need to correct typos. This can boost efficiency by over 300%.
  3. Polygon editing: Supports Bezier curves or multi-point dragging, specifically designed for curved text.

An Efficient Labeling SOP (Standard Operating Procedure)

  1. Data cleaning: First remove images that are completely illegible or contain no text and no meaningful background.
  2. Pre-annotation: Run existing models over the data for a first pass.
  3. Human correction: This is the core step. Focus on checking missed detections (Recall) and false detections (Precision).
  4. Cross-review QA: Person A's labels are checked by Person B. For OCR projects, character-level accuracy requirements are extremely high — a single wrong digit could render a financial report completely unusable. We recommend double-blind review or a sampling rate of 10% or higher.
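Character-level accuracy in the QA step is usually measured as character error rate (CER): Levenshtein edit distance between reference and hypothesis, divided by the reference length. A self-contained implementation for spot-checking labeled batches:

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic program over prefix edit distances, row by row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

Comparing Person A's transcript against Person B's with `cer()` gives a quantitative disagreement score per sample, which makes the 10%+ sampling review far more systematic than eyeballing.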

Real-World Case Studies

Case 1: VAT Invoice Recognition — Precision is Money

  • Challenge: Invoice printing often has alignment shifts, causing text to overlap with table lines; tax stamps are frequently stamped over critical numbers.
  • Solutions:
    • For text overlapping lines, slightly expand the bounding box to include some table lines, teaching the model to be "interference-resistant."
    • For text obscured by stamps, if the human eye can see through it, label it; if it's completely illegible, discard it. Never let annotators reverse-calculate an obscured total from "unit price x quantity" — this teaches the model to "guess blindly."

Case 2: Street Storefront Signs — Battling Complex Environments

  • Challenge: Shop signs feature wildly varied fonts (calligraphy, neon lights) and are frequently obstructed (tree branches, power lines).
  • Solutions:
    • Introduce a "legibility grading" label (Legibility: High/Medium/Low). In early training, only let the model learn from High and Medium data. Once the model stabilizes, introduce Low data for Hard Example Mining.
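The two-stage curriculum above reduces to a filter over the legibility field. The field name and stage names below are assumptions for the sketch:

```python
def curriculum_split(samples, stage):
    """Select training samples by legibility grade.

    stage 'early' keeps High/Medium only; stage 'hard_mining' adds the
    Low-grade samples back in once the model has stabilized.
    """
    allowed = {"early": {"High", "Medium"},
               "hard_mining": {"High", "Medium", "Low"}}[stage]
    return [s for s in samples if s.get("legibility") in allowed]
```

Keeping the grade as a label attribute, rather than discarding Low data outright, is what makes the later hard-example-mining stage possible without re-labeling.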

Conclusion

OCR labeling isn't simple manual labor — it's the first and most critical step in teaching machines to "read."

Many times, when you feel the algorithm has hit a ceiling and parameter tuning isn't working, looking back at the data often reveals a breakthrough. A dataset with clear definitions, precise boundaries, and high consistency is the unsung hero behind every high-performance OCR model.

In today's rapidly evolving AI landscape, tools will change and models will change, but reverence for data quality never changes. We hope this guide becomes the foundation for building your powerful OCR system.


Want to experience a smarter OCR labeling workflow? Try TjMakeBot Online Labeling Platform Now ->