OCR Correction

.NET library for correcting common OCR errors with 1,000+ professionally-tested patterns

Overview

ZentrixLabs.OcrCorrection is a .NET library for correcting common OCR errors in English text, designed for subtitle extraction and document digitization workflows. Its ~1,000 tested patterns corrected every OCR error found in the real-world subtitle files it was tested on, with zero false positives.

Available on NuGet.org

Why This Library?

OCR technology often produces specific, predictable errors when processing text - especially in subtitles extracted from Blu-ray discs and DVDs. This library was built by analyzing real-world OCR output from feature films, and its patterns fix the most common issues without breaking valid words.

Built for Real-World Use

Tested on 10 feature-length films (10,000+ subtitles) with perfect results:

  • 536+ OCR errors corrected
  • 0 remaining errors
  • 0 false positives
  • 100% success rate

Key Features

Comprehensive Pattern Coverage

  • ~1,000 Correction Patterns across 5 categories
  • Character Substitution (~668 patterns): Pipe → I, capital I ↔ lowercase l
  • Spacing Errors (~287 patterns): Extra/missing spaces, compound words
  • Apostrophe Issues (~42 patterns): Contractions, missing apostrophes
  • Number Confusion (~20 patterns): Letter/number confusion in numeric contexts
  • Context-Aware: Patterns designed to only fix actual errors

Multi-Pass Processing

  • Automatic Convergence: Stops when no more corrections are found
  • Configurable Passes: Quick (1), Standard (3), or Thorough (5) modes
  • Smart Detection: Each pass catches different error types
  • Performance Optimized: Fast regex-based corrections

Safe & Reliable

  • No False Positives: Patterns specifically avoid breaking valid words
  • Extensively Tested: Verified on 10,000+ real subtitles
  • Production Ready: Used in SrtExtractor for professional subtitle processing
  • MIT Licensed: Free to use in any project, including commercial

Quick Start

Installation

Install via the .NET CLI:

dotnet add package ZentrixLabs.OcrCorrection

Or via Package Manager Console:

Install-Package ZentrixLabs.OcrCorrection

Basic Usage

using ZentrixLabs.OcrCorrection.Core;
using ZentrixLabs.OcrCorrection.Patterns;

// Create the correction engine
var patternProvider = new EnglishPatternProvider();
var engine = new OcrCorrectionEngine(patternProvider);

// Correct OCR errors
var text = "HeIIo! I dont think th is looks right.";
var result = engine.CorrectText(text);

Console.WriteLine(result.CorrectedText);
// Output: "Hello! I don't think this looks right."

Console.WriteLine($"Corrections made: {result.CorrectionsMade}");
// Output: "Corrections made: 4"

Multi-Pass Processing

using ZentrixLabs.OcrCorrection.Passes;

// Reuses `engine` and `text` from Basic Usage
var multiPass = new MultiPassProcessor(engine);

var result = await multiPass.ProcessAsync(
    text, 
    maxPasses: 5,
    options: new CorrectionOptions { IncludeDetailedLog = true }
);

Console.WriteLine($"Converged after {result.PassesCompleted} passes");
Console.WriteLine($"Total corrections: {result.TotalCorrections}");
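
The convergence behaviour can be illustrated without the library. The following is a minimal sketch, not the library's implementation: the two rules and the CorrectUntilStable helper are hypothetical stand-ins, chosen so that the second rule's output only becomes fixable by the first rule on the next pass.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in rules, applied in order on every pass.
var rules = new (Regex Pattern, string Replacement)[]
{
    (new Regex(@"\bdont\b"), "don't"),   // missing apostrophe
    (new Regex(@"\bd ont\b"), "dont"),   // stray space inside the word
};

// Runs up to maxPasses passes and stops early once a pass changes nothing.
(string Text, int Passes) CorrectUntilStable(string input, int maxPasses)
{
    var text = input;
    for (var pass = 1; pass <= maxPasses; pass++)
    {
        var before = text;
        foreach (var (pattern, replacement) in rules)
            text = pattern.Replace(text, replacement);
        if (text == before)
            return (text, pass); // converged: this pass made no corrections
    }
    return (text, maxPasses);    // pass limit reached before convergence
}

var (corrected, passes) = CorrectUntilStable("I d ont know", maxPasses: 5);
Console.WriteLine($"{corrected} (passes run: {passes})");
// Prints: I don't know (passes run: 3)
```

Pass 1 removes the stray space, pass 2 adds the apostrophe, and pass 3 finds nothing left to change, so the loop stops well short of the limit of 5.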

Batch Processing Example

// Assumes `engine` was created as shown in Basic Usage (File lives in System.IO)
// Read SRT file
var srtContent = File.ReadAllText("movie.srt");

// Correct OCR errors
var result = engine.CorrectText(srtContent);

// Save corrected SRT
File.WriteAllText("movie_corrected.srt", result.CorrectedText);

Console.WriteLine($"✅ Corrected {result.CorrectionsMade} errors");

Common OCR Errors Fixed

Pipe Character (|) → Letter I

One of the most common OCR errors. It is handled by dedicated patterns within the ~668 Character Substitution patterns:

| keep seeing → I keep seeing
- | am cold → - I am cold
| think | know → I think I know

The patterns cover a pipe at the start of a line, after a dash, after punctuation, and before verbs.
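
As a rough illustration of the idea, the sketch below treats a pipe as a misread "I" only when it stands alone as a word. This is a simplified assumption, not one of the library's actual patterns, and the FixPipes helper is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// A pipe with no non-whitespace character directly before or after it,
// i.e. a pipe standing where a one-letter word would be.
var standalonePipe = new Regex(@"(?<!\S)\|(?!\S)");

string FixPipes(string line) => standalonePipe.Replace(line, "I");

Console.WriteLine(FixPipes("| keep seeing"));   // I keep seeing
Console.WriteLine(FixPipes("- | am cold"));     // - I am cold
Console.WriteLine(FixPipes("| think | know"));  // I think I know
Console.WriteLine(FixPipes("a|b stays as is")); // a|b stays as is
```

The lookarounds keep pipes that are attached to other characters untouched, which is the same "only fix actual errors" principle the library describes.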

Capital I ↔ Lowercase l Confusion

The most common OCR error in subtitles. ~660 specific patterns:

HeIIo → Hello
I'm gIad → I'm glad
TeII me → Tell me
stiII → still
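
One way to handle this confusion safely is to correct only known misread words rather than swapping every "I" for "l". The sketch below assumes a small hypothetical word list and a hypothetical FixKnownWords helper; the library's real patterns are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical word list for illustration only.
var knownFixes = new Dictionary<string, string>
{
    ["HeIIo"] = "Hello",
    ["gIad"] = "glad",
    ["TeII"] = "Tell",
    ["stiII"] = "still",
};

// Replaces only whole-word, case-sensitive matches of each known misreading.
string FixKnownWords(string text)
{
    foreach (var (wrong, right) in knownFixes)
        text = Regex.Replace(text, $@"\b{Regex.Escape(wrong)}\b", right);
    return text;
}

Console.WriteLine(FixKnownWords("HeIIo. I'm gIad you're stiII here."));
// Prints: Hello. I'm glad you're still here.
```

The pronoun "I" in "I'm" is left alone because no rule targets a bare "I", which a blanket I-to-l substitution would have broken.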

Spacing Errors

~287 patterns for various spacing issues:

Extra spaces:     th e → the, wh at → what
Missing spaces:   thejob → the job, ofthose → of those
After punctuation: Thanks.Next → Thanks. Next
Compound words:   prettylucky → pretty lucky
-tion/-ation:     confus i on → confusion
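
Two of these cases can be sketched with plain regular expressions. The patterns and the FixSpacing helper below are illustrative assumptions rather than the library's own rules; the punctuation rule in particular would also split identifiers such as "file.Name", so it suits subtitle text more than technical prose.

```csharp
using System;
using System.Text.RegularExpressions;

// Sentence punctuation squeezed between a lowercase and an uppercase letter.
var missingSpaceAfterPunctuation = new Regex(@"(?<=[a-z])([.!?])(?=[A-Z])");
// One specific split-word rule, anchored at word boundaries.
var splitThe = new Regex(@"\bth e\b");

string FixSpacing(string text)
{
    text = missingSpaceAfterPunctuation.Replace(text, "$1 ");
    return splitThe.Replace(text, "the");
}

Console.WriteLine(FixSpacing("Thanks.Next"));                    // Thanks. Next
Console.WriteLine(FixSpacing("Take th e car."));                 // Take the car.
Console.WriteLine(FixSpacing("Version 3.5 of the U.S.A. tour")); // unchanged
```

Decimal numbers and dotted abbreviations survive because the lookbehind requires a lowercase letter before the punctuation mark.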

Apostrophe Issues

~42 patterns for contractions and possessives:

Missing:   dont → don't, youre → you're
Malformed: you)re → you're, I)m → I'm
Wrong char: That''s → That's, We''ll → We'll
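
The three kinds of apostrophe error above map naturally onto three kinds of rule. The following sketch uses hypothetical rules and a hypothetical FixApostrophes helper to show the shape of each; it is not the library's pattern set.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical rules for illustration, applied in order.
var apostropheRules = new (Regex Pattern, string Replacement)[]
{
    (new Regex(@"\bdont\b"), "don't"),                          // missing apostrophe
    (new Regex(@"\byoure\b"), "you're"),                        // missing apostrophe
    (new Regex(@"\b(\w+)\)(re|m|ll|ve|s|d)\b"), "${1}'${2}"),   // ")" read in place of "'"
    (new Regex(@"(?<=\w)''(?=\w)"), "'"),                       // doubled apostrophe inside a word
};

string FixApostrophes(string text)
{
    foreach (var (pattern, replacement) in apostropheRules)
        text = pattern.Replace(text, replacement);
    return text;
}

Console.WriteLine(FixApostrophes("I dont think you)re right. That''s fine."));
// Prints: I don't think you're right. That's fine.
```

The replacement uses the braced form `${1}` so that the literal apostrophe after the group reference cannot be confused with .NET's `$'` substitution.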

Number Confusion

~20 patterns for numeric context:

Letter to number: I 00 → 100, $I O → $10
Number to letter: 0 → O, 1 → I (context-aware)

Real-World Testing Results

The library was tested on 10 Tesseract PGS extractions, with perfect results on each:

Film                       Subtitles   Corrections   Result
28 Weeks Later (2007)      1,237       2             ✅ Perfect
28 Years Later (2025)      1,231       4             ✅ Perfect
28 Days Later (2002)       1,232       42            ✅ Perfect
Alien (1979)               984         109           ✅ Perfect
Alien: Covenant (2017)     1,515       158           ✅ Perfect
AvP: Requiem (2007)        1,100+      10            ✅ Perfect
A View to a Kill (1985)    965         190           ✅ Perfect
Akira (1988)               1,200+      0             ✅ Perfect
Airplane II (1982)         1,800+      16            ✅ Perfect

Total: 10,000+ subtitles processed, 536+ errors corrected, 0 remaining errors, 0 false positives.

Advanced Features

Configuration Options

var options = new CorrectionOptions
{
    // Include detailed correction log
    IncludeDetailedLog = true,
    
    // Include performance metrics
    IncludePerformanceMetrics = true,
    
    // Include details about each correction
    IncludeCorrectionDetails = true,
    
    // Exclude specific pattern categories
    ExcludedCategories = new[] { "Numbers" },
    
    // Context-aware capitalization (experimental)
    UseContextAwareCapitalization = false
};

Filtering by Category

var patternProvider = new EnglishPatternProvider();

// Get only spacing-related patterns
var spacingPatterns = patternProvider.GetPatternsByCategory("Spacing");

// Get all available categories
var categories = patternProvider.GetCategories();
// Returns: ["Apostrophes", "Capitalization", "Character Substitution", 
//           "Numbers", "Spacing"]

Custom Pattern Providers

public class MyCustomPatternProvider : IPatternProvider
{
    public string Name => "Custom Patterns";
    public string LanguageCode => "en";
    
    public IEnumerable<CorrectionPattern> GetPatterns()
    {
        return new[]
        {
            new CorrectionPattern(
                @"\bcustomerror\b", 
                "custom error", 
                "Custom")
            {
                Description = "Fix custom error",
                Priority = 50
            }
        };
    }
}

// Use custom provider
var engine = new OcrCorrectionEngine(new MyCustomPatternProvider());

Dependency Injection

using Microsoft.Extensions.DependencyInjection;
using ZentrixLabs.OcrCorrection.Extensions;

var services = new ServiceCollection();
services.AddOcrCorrection();

var serviceProvider = services.BuildServiceProvider();
var engine = serviceProvider.GetRequiredService<IOcrCorrectionEngine>();

Use Cases

Subtitle Extraction

Clean up OCR errors from PGS/VobSub/ASS subtitle extraction. Used in production by SrtExtractor to automatically correct thousands of subtitle files.

Typical Results:

  • 1,000+ corrections per subtitle file
  • 80,000+ corrections across batch operations
  • Professional-quality output with zero manual intervention

Document Digitization

Fix OCR errors in scanned documents, historical texts, and digitized archives. The library’s patterns are designed to handle common Tesseract OCR issues.

Post-Processing Pipeline

Integrate into automated OCR workflows to produce clean, accurate text output. Multi-pass processing catches errors that only become fixable after an earlier correction, and convergence detection stops as soon as a pass makes no changes.

Historical Text

Correct OCR errors in digitized historical documents where formatting and character recognition may be inconsistent.

Performance

Tested on feature-length films (900-1,500 subtitle entries):

  • Average Processing Time: ~900ms per film
  • Typical Corrections: 2-200 errors per film
  • Success Rate: 100% on tested corpus
  • Memory Efficient: Optimized regex patterns with intelligent caching
  • Scalable: Handles large batch operations efficiently

Architecture

Built with modern .NET 8 best practices:

  • Pattern-Based Design: Extensible pattern provider system
  • Async/Await: Full asynchronous support
  • Dependency Injection Ready: Native DI support
  • POCO Models: Simple, serializable result objects
  • Zero Dependencies: Uses only built-in .NET libraries
  • Thread-Safe: Safe for concurrent processing
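
To show what concurrent processing can look like, here is a minimal sketch that shares one precompiled Regex across threads. It does not use the library: the pattern is a simplified, hypothetical pipe rule, and the only guarantee relied on is that .NET Regex instances are safe to use from multiple threads.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// One shared, precompiled regex used by every worker thread.
var standalonePipe = new Regex(@"(?<!\S)\|(?!\S)", RegexOptions.Compiled);

var lines = new[] { "| keep seeing", "- | am cold", "| think | know", "No errors here." };
var correctedLines = new string[lines.Length];

// Each iteration writes only to its own slot, so no locking is needed.
Parallel.For(0, lines.Length, i =>
{
    correctedLines[i] = standalonePipe.Replace(lines[i], "I");
});

foreach (var line in correctedLines)
    Console.WriteLine(line);
```

Writing results by index keeps the output in the original subtitle order even though the iterations may finish in any order.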

Pattern Categories

The library organizes patterns into logical categories:

  • Character Substitution (~668 patterns): Pipe → I, capital I ↔ lowercase l confusion
  • Spacing Patterns (~287 patterns): Extra/missing spaces, compound words, punctuation spacing
  • Apostrophe Patterns (~42 patterns): Contractions, missing apostrophes, wrong characters
  • Number Patterns (~20 patterns): Letter/number confusion in numeric contexts
  • Capitalization (experimental): Context-aware sentence-start capitalization

Important Design Decisions

No Generic Patterns

Early versions included generic “missing space” patterns that caused false positives:

// ❌ DANGEROUS - breaks valid words
(\w)(are)(\s) → "$1 $2$3"  // Breaks: "fanfare" → "fanf are"
(\w)(he)(\s) → "$1 $2$3"   // Breaks: "she" → "s he"

Current Approach: Only specific, verified patterns that don’t break valid words.
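
The failure mode is easy to reproduce. The sketch below runs the two generic rules quoted above as .NET regexes, then contrasts them with a specific, word-bounded rule; the "youare" rule is a hypothetical example, not taken from the library.

```csharp
using System;
using System.Text.RegularExpressions;

// The generic rules quoted above.
var genericAre = new Regex(@"(\w)(are)(\s)");
var genericHe = new Regex(@"(\w)(he)(\s)");

Console.WriteLine(genericAre.Replace("A fanfare played.", "$1 $2$3")); // A fanf are played.
Console.WriteLine(genericHe.Replace("Then she left.", "$1 $2$3"));     // Then s he left.

// A specific rule only fires on the exact joined word it was written for.
var specificYouAre = new Regex(@"\byouare\b");

Console.WriteLine(specificYouAre.Replace("I think youare right.", "you are")); // I think you are right.
Console.WriteLine(specificYouAre.Replace("A fanfare played.", "you are"));     // A fanfare played.
```

The generic rules match any word that happens to end in "are" or "he", while the specific rule cannot touch valid words at all, at the cost of needing one rule per observed error.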

Experimental Features

Context-aware capitalization is available but disabled by default because of edge cases with contractions and proper nouns. Keeping it disabled is recommended for production use.

Requirements

  • .NET 8.0 or higher
  • No external dependencies

Contributing

Contributions are welcome! If you find OCR errors that aren’t being corrected:

  1. Analyze the error pattern
  2. Create specific patterns (avoid overly generic patterns)
  3. Test thoroughly to ensure no false positives
  4. Submit a pull request with test cases

Community & Support

Issues & Questions

Testing

The library includes comprehensive test coverage:

  • Unit tests for each pattern category
  • Integration tests on real subtitle files
  • Performance benchmarks
  • False positive detection

Acknowledgments

Built from analyzing real-world OCR errors in:

  • Tesseract OCR output from Blu-ray PGS subtitle extraction
  • Feature film subtitle files
  • Document digitization projects

Special thanks to the .NET community for regex optimization techniques.

License

Licensed under the MIT License.

You are free to use, modify, and distribute the library, including in commercial products, provided the copyright and license notice is retained.


Available on NuGet | Open Source | MIT Licensed

Built with ❤️ by ZentrixLabs for the OCR community

Support This Project

If you find this project helpful, consider buying me a coffee! ☕