In short: A scanned PDF is an image - the text in it is pixels, not characters. OCR turns it into real text you can search, copy, and edit. For Hebrew you need a tool adapted to RTL - a generic OCR tool will reverse the text.
When Your PDF Is a Scan
Not every PDF is a scan. There are two types:
Digital PDF - created directly from a computer (Word, Excel, InDesign, a government website). Text is stored as characters. Searchable and editable immediately.
Scanned PDF - a paper document that went through a scanner, or a document photographed with a phone. It's an image. The text doesn't exist - only pixels that look like letters.
How to Tell the Difference
Try to select text with your mouse:
- Selection works - digital PDF, OCR not needed
- Can't select anything - scan, needs OCR
- Ctrl+F finds nothing - scan
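The same check can be automated. The sketch below assumes you have already extracted the text of each page with a PDF library of your choice (the extraction step is not shown). The function name, the `page_texts` input, and the 20-character threshold are illustrative assumptions, not part of any specific tool.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is a scan from the text extracted from each page.

    page_texts: list of strings, one per page (hypothetical input produced
    by whatever PDF library you use).
    A page with fewer than min_chars_per_page non-space characters is
    treated as having no real text layer.
    """
    if not page_texts:
        return True  # nothing extractable at all
    empty_pages = sum(
        1 for text in page_texts
        if len("".join(text.split())) < min_chars_per_page
    )
    # Call it a scan when most pages carry no usable text
    return empty_pages > len(page_texts) / 2


digital = [
    "Lease agreement between the landlord and the tenant",
    "Clause 2: the monthly rent is due on the first of the month",
]
scanned = ["", " \n", ""]
print(looks_scanned(digital))  # False
print(looks_scanned(scanned))  # True
```

A majority vote is used rather than "any empty page" because digital PDFs often contain a blank or image-only page.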
Why Regular OCR Fails on Hebrew
Most OCR tools were built primarily for English: they assume left-to-right text, Latin letter shapes, and no marks around the letters. Hebrew breaks these assumptions in four ways:
1. RTL Direction (Right to Left)
An unadapted tool reads the letters of each word in reverse order. "Shalom" (שלום) becomes "molahS" (םולש). Across a full line - all words get reversed.
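A minimal sketch of this failure, using plain string reversal to simulate what the unadapted tool outputs. The function name is illustrative. Note that reversing each word again restores the original, which is roughly how such output can be repaired when the whole line is Hebrew.

```python
def reverse_each_word(line):
    """Simulate the RTL error: the letters of every word come out reversed."""
    return " ".join(word[::-1] for word in line.split(" "))


original = "שלום עולם"
broken = reverse_each_word(original)
print(broken)

# Applying the same reversal a second time restores the original line
print(reverse_each_word(broken) == original)  # True
```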
2. Letter Confusion
Some Hebrew letters look similar to each other. An OCR tool not trained on Hebrew confuses: ב/כ, ן/ו, ה/ח, מ/ס, and ג/ז.
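If you have to search text that a weak tool has already produced, one workaround is a search that tolerates these confusions. The sketch below builds a regular expression in which each confusable letter matches either member of its pair. The pairs are the ones listed above; the function name and the sample text are illustrative assumptions.

```python
import re

# Letter pairs that an untrained OCR tool tends to confuse
CONFUSABLE = ["בכ", "ןו", "הח", "מס", "גז"]


def tolerant_pattern(word):
    """Build a regex that matches the word even if confusable letters were swapped."""
    parts = []
    for ch in word:
        group = next((g for g in CONFUSABLE if ch in g), None)
        parts.append(f"[{group}]" if group else re.escape(ch))
    return re.compile("".join(parts))


# OCR output in which two letters of the word for "account" were misread
ocr_text = "מספר השבוו 12345"
print(bool(tolerant_pattern("חשבון").search(ocr_text)))  # True
```

The trade-off is more false positives: the pattern also matches real words that differ only by a confusable letter.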
3. Mixed-Direction Lines
In mixed text (Hebrew and numbers, Hebrew and English), an unadapted tool scatters the line segments in the wrong order.
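To see where the direction changes inside a line, you can split it into directional runs using the bidirectional class that Unicode assigns to every character. This is a simplified sketch, not the full Unicode Bidirectional Algorithm: spaces and punctuation are simply attached to the run before them, and digits are treated as left-to-right. The function names are illustrative.

```python
import unicodedata


def direction(ch):
    """Classify a character as 'rtl', 'ltr', or 'neutral' by its Unicode bidi class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):   # Hebrew and Arabic letters
        return "rtl"
    if bidi in ("L", "EN"):   # Latin letters and European digits
        return "ltr"
    return "neutral"          # spaces, punctuation, combining marks


def directional_runs(text):
    """Split text into (direction, substring) runs; neutrals join the preceding run."""
    runs = []
    for ch in text:
        d = direction(ch)
        if runs and d in ("neutral", runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([d, ch])
    return [(d, s.strip()) for d, s in runs if s.strip()]


print(directional_runs("חשבונית 2024 Invoice"))
# [('rtl', 'חשבונית'), ('ltr', '2024 Invoice')]
```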
4. Vowel Marks (Nikud)
Vowel marks (small dots and dashes placed around the letters, mostly beneath them) do not exist in Latin scripts. A tool that doesn't recognize them treats them as noise and loses accuracy.
The Make Searchable tool in Kovetz PDF was trained on Hebrew and handles all these challenges.
What Determines OCR Quality
Scan Quality - the Deciding Factor
| Scan Type | Estimated OCR Accuracy |
|---|---|
| Professional scan, 300+ DPI, straight | 95-99% |
| Home scanner, 200 DPI | 85-92% |
| Phone photo, good lighting | 80-90% |
| Phone photo, poor lighting | 60-75% |
| Old document, crumpled | 70-85% |
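To make these percentages concrete, the sketch below converts an accuracy figure into an expected number of wrong characters per page. It assumes the percentages are character-level accuracy and that a dense page holds about 2,000 characters; both are assumptions for illustration, and the sample accuracies are midpoints picked from the ranges in the table.

```python
def expected_errors(char_count, accuracy_percent):
    """Expected number of wrong characters at a given character-level accuracy."""
    return round(char_count * (100 - accuracy_percent) / 100)


PAGE_CHARS = 2000  # assumed size of a dense page

for label, accuracy in [
    ("Professional scan", 97),
    ("Home scanner", 88),
    ("Phone photo, poor lighting", 65),
]:
    print(label, expected_errors(PAGE_CHARS, accuracy))
# Professional scan 60
# Home scanner 240
# Phone photo, poor lighting 700
```

Even at 97%, that is dozens of characters to fix per page, which is why scan quality matters more than any setting in the tool.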
Factors That Improve Results
Before scanning:
- Make sure the document is flat and not crumpled
- Scan at 300 DPI minimum
- Even lighting, no shadows
After OCR:
- Check names, numbers, and dates - OCR tends to err there
- ID numbers and account numbers - always verify manually
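A small helper can speed up this review by listing every token that contains a digit, so amounts, dates, and ID numbers are collected in one place for checking against the scan. This is a sketch with an illustrative function name and sample text. It does not catch names, which still need a manual read.

```python
import re


def tokens_to_verify(text):
    """Return tokens containing at least one digit: amounts, dates, ID and account numbers."""
    tokens = (tok.strip(".,;:()") for tok in text.split())
    return [tok for tok in tokens if re.search(r"\d", tok)]


sample = "Invoice 10423 dated 12/03/2021, total 4,580.00 NIS, ID 012345678"
print(tokens_to_verify(sample))
# ['10423', '12/03/2021', '4,580.00', '012345678']
```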
Common Use Cases
Scanned Pay Stub
You received a pay stub that was scanned (not digitally generated). OCR turns it into searchable text you can copy from and submit to the bank as usable proof of income.
Old Lease Agreement
A lease contract from 2005 that only exists as a scan. OCR allows you to search for clauses, copy sections, and use the text in legal correspondence.
Government Form with Handwriting
Government forms filled in by hand - OCR can attempt recognition, but handwriting is the hardest challenge. Expect roughly 70-85% accuracy for neat, print-style handwriting and 40-65% for everyday handwriting. Always verify manually.
Medical Records
Scanned medical documents - OCR enables searching and browsing. Drug names and dosages - always verify manually.
Online OCR vs Desktop Software
There are three main categories of OCR solutions for Hebrew, each suited to a different use case:
Online OCR tools (Kovetz PDF and similar)
Pros:
- No installation, works from the browser
- Quick use for a single file
- Free for most routine cases
- Trained on Hebrew when it's an Israeli-built tool
Cons:
- Limited file size (usually 25-100MB)
- Requires upload (privacy concern for sensitive files)
- Fewer advanced processing options
Best for: routine usage - pay stubs, contracts, medical files. Most users fall here.
Professional desktop software
Pros:
- Batch processing of hundreds of files simultaneously
- Precise control over settings (black-and-white threshold, alignment, perspective correction)
- Local processing - the file never leaves the computer
- Support for rare formats
Cons:
- High annual cost (typically $50-200 per single license)
- Steep learning curve
- Requires a computer with reasonable resources
Best for: archives, libraries, academic institutions processing thousands of pages per month.
Built-in OCR in operating systems
Pros:
- Free, already installed
- Native integration with the file system
Cons:
- Weak Hebrew support
- Missing religious and specialized fonts
- No automatic correction of scan angles
Best for: English scans or basic modern Hebrew text. Not for vowelized texts or documents where every character must be right.
Hebrew with Nikud - A Special Challenge
Hebrew vowel marks (sheva, patach, hiriq, etc.) pose a unique challenge to OCR. The reasons:
1. Nikud sits around the letter (mostly below it, sometimes above or inside it) rather than being part of the letter shape - an OCR tool not designed for Hebrew sees nikud as "noise" and ignores or misreads it.
2. Same letters, different nikud - "סָפַר" (he counted) and "סֵפֶר" (book) share the same letters and differ only in their vowel marks. A good OCR tool distinguishes between nikud variants; a weak one merges them.
3. Lower accuracy overall - even in good OCR, accuracy on vowelized text is 75-85% vs 95%+ on unvowelized text.
What to do:
- If the goal is only search, you can accept wrong nikud - what matters is that the letters are right
- If the goal is precise learning (Tanakh, Siddur), verify nikud manually after OCR
- If the source isn't vowelized but you want nikud after OCR - use the Hebrew nikud tool separately
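For the search-only case, a common approach is to strip the vowel marks from both the OCR text and the query before comparing, so wrong nikud cannot block a match. The sketch below removes the combining marks in the Hebrew Unicode block (U+0591 to U+05C7) and keeps the letters and punctuation. The function name is illustrative.

```python
import unicodedata


def strip_nikud(text):
    """Remove Hebrew vowel points and cantillation marks, keeping letters and punctuation."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        # Only nonspacing marks (category Mn) inside the Hebrew points range are dropped
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )
    return unicodedata.normalize("NFC", kept)


# "he counted" and "book" reduce to the same three letters
print(strip_nikud("סָפַר") == strip_nikud("סֵפֶר"))  # True
```

The category check matters: the same range also contains punctuation such as the maqaf (the Hebrew hyphen), which should survive.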
OCR Accuracy by Scan Type - Detailed Comparison
Beyond the general table, there are significant gaps between source types:
Used book scan (creased, stained)
Expected accuracy: 60-78%. Failure causes: faded ink, coffee stains, curling corners. Improvement: before uploading - raise contrast in an image editor, crop the margins.
Fresh office document scan (laser printer)
Expected accuracy: 95-99%. The easiest case. Nothing to do in advance - upload and download.
Phone photo of a paper document
Expected accuracy: 80-90% (in good lighting). Sharp drop in bad lighting (50-70%). Improvement: use the "document" mode in your phone camera, or a dedicated scanning app.
Screenshot of a scanned PDF
Expected accuracy: 85-92%. The screenshot is already compressed, but usually at acceptable quality. Improvement: zoom to 100% in the document before taking the screenshot.
Neat, print-style handwriting (a professional scribe)
Expected accuracy: 70-85%. Improvement: use "handwriting" mode if the tool offers it, or type manually - sometimes faster than correcting.
Regular handwriting (yours, mine)
Expected accuracy: 40-65%. Usually not worth the effort - just type manually.
After OCR - What You Can Do
Search
Ctrl+F finds words. Useful for long contracts, medical records, legal documents.
Copy
You can select and copy text into Outlook, Word, Excel. Names, numbers, addresses.
Accessibility
Screen readers (for visually impaired users) can read a PDF with an OCR layer, but not a raw scanned PDF.
Editing
For actual content editing - convert to Word after OCR. The PDF to Word tool runs OCR and produces an editable Word file in a single process.
What OCR Doesn't Solve
Handwriting
Handwriting recognition is a separate, more complex field, and its accuracy is lower than that of printed-text OCR.
Password-Protected PDFs
A PDF that requires a password to open cannot be processed as-is - remove the protection first. The password removal tool can help if you have the password.
Complex Layouts
Complicated tables, angled text, multi-column layouts - OCR will recognize the letters, but the layout may scatter. For tables, consider conversion to Excel.
Very Low-Quality Scans
Very old, crumpled documents with stains - OCR will try but accuracy will be low. No tool can "invent" information that isn't readable in the scan.
How to Improve OCR Results Before Uploading
Before sending a document to an OCR tool, a few prep steps significantly increase accuracy:
Scan or photograph correctly
- Resolution - 300 DPI minimum for scanning. On a phone, use "document mode" in your camera app, not regular photo mode
- Lighting - even, no shadows. Natural overhead light is best
- Angle - keep the camera parallel to the document. Tilt distortion hurts OCR significantly
- Background - a clean background in a different color than the page helps with edge detection
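For a phone photo, you can estimate the effective resolution by dividing the pixel width of the page by its physical width. The sketch below assumes an A4 page (8.27 inches wide) that fills the full width of the cropped image; the function names are illustrative.

```python
A4_WIDTH_INCHES = 8.27  # 210 mm


def effective_dpi(pixels_across_page, page_width_inches=A4_WIDTH_INCHES):
    """Pixels per inch actually captured across the page width."""
    return pixels_across_page / page_width_inches


def meets_minimum(pixels_across_page, minimum_dpi=300):
    """Check the capture against the 300 DPI minimum recommended above."""
    return effective_dpi(pixels_across_page) >= minimum_dpi


print(round(effective_dpi(3000)), meets_minimum(3000))  # 363 True
print(round(effective_dpi(1200)), meets_minimum(1200))  # 145 False
```

If the page covers only half the frame, the effective DPI halves too, which is one reason cropping and filling the frame matter.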
Straighten the image before uploading
A tilted document (skew) is a common problem. Most modern OCR tools auto-correct, but not always perfectly. If your document is tilted by more than 5 degrees - straighten it manually in any image editor before uploading.
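To check a page against the 5-degree rule, pick two points along the same line of text (for example in an image editor that shows pixel coordinates) and compute the angle between them. This is a sketch with illustrative function names and sample coordinates.

```python
import math


def skew_degrees(x1, y1, x2, y2):
    """Tilt of a text line through two points, in degrees from horizontal."""
    return math.degrees(math.atan2(abs(y2 - y1), abs(x2 - x1)))


def needs_manual_straightening(angle, limit=5.0):
    """True when the tilt exceeds the limit that auto-correction handles reliably."""
    return angle > limit


# Two points on the same text line: 1000 px apart horizontally, 120 px vertically
angle = skew_degrees(100, 200, 1100, 320)
print(round(angle, 1))                    # 6.8
print(needs_manual_straightening(angle))  # True
```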
Create a clean image
If the page is photographed:
- Crop before uploading - leave only the page, no background around it
- Increase contrast if the text is very faint
- Convert to black-and-white if color isn't important (less noise)
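The last two steps work together, and order matters. The sketch below operates on a grayscale image represented as plain nested lists of values from 0 (black) to 255 (white), an assumption made to keep the example free of image libraries. It shows why faint text should be contrast-stretched before thresholding: with a fixed threshold, faint gray text is lost entirely.

```python
def stretch_contrast(gray_rows):
    """Linearly rescale pixels so the darkest becomes 0 and the lightest 255."""
    flat = [px for row in gray_rows for px in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in gray_rows]  # flat image, nothing to stretch
    return [[round((px - lo) * 255 / (hi - lo)) for px in row] for row in gray_rows]


def to_black_and_white(gray_rows, threshold=128):
    """Map each grayscale pixel to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# Faint text (150) on a light gray background (200)
faint = [[200, 150, 200],
         [150, 150, 200]]

print(to_black_and_white(faint))
# [[255, 255, 255], [255, 255, 255]]  - the text disappears
print(to_black_and_white(stretch_contrast(faint)))
# [[255, 0, 255], [0, 0, 255]]        - the text survives
```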
Why Hebrew OCR Is Less Accurate Than English - Regardless of the Tool
Even the best tool will achieve lower accuracy in Hebrew than in English, roughly 3-5 percentage points lower on average. The reasons:
- Training data - OCR models were trained on billions of English pages, but only millions in Hebrew. The models simply "saw" less Hebrew
- Less font exposure - OCR models learn from the fonts they were trained on, and they have seen a narrower range of Hebrew fonts, so performance drops on unfamiliar typefaces
- Linguistic context - advanced OCR uses word knowledge to correct mistakes. The Hebrew vocabulary built into these models is typically smaller than the English one, so fewer errors get corrected
- Connected handwriting - some Hebrew handwriting styles are more connected than English handwriting, making letter-level segmentation harder
The practical takeaway: always budget time for manually reviewing OCR results in Hebrew, especially for names, numbers, and dates.
Common OCR Mistakes
Mistake 1: accepting the result without verification
Even good OCR doesn't reach 100%. When the file is a contract, medical chart, or financial report - a single wrong digit changes everything. Invest 2-3 minutes checking names, dates, and numbers.
Mistake 2: scanning at low quality because "OCR will fix it"
OCR doesn't invent information that isn't readable in the scan. If the source is blurry, the result will be poor even in the best tool. Invest a moment in a quality scan instead of an endless correction effort afterward.
Mistake 3: aggressively compressing a PDF before OCR
Aggressive compression of a scanned PDF lowers image quality and hurts OCR accuracy. The correct workflow: scan → OCR → compress. Not the reverse.
Mistake 4: assuming text shown on screen is real text
Some PDFs display sharp, clean-looking text that is actually an image (scan). Quick check: select with the mouse. If you can't select - you need OCR.
Mistake 5: uploading a sensitive document to an unverified OCR site
Medical files, financial reports, legal contracts - all require care in choosing the tool. Check the tool's privacy policy: how long the file is kept, who has access, whether it's sent to third-party servers. For very sensitive texts, consider desktop software that processes locally.
Start Now
Have a scanned PDF you need to turn into searchable text? Make your PDF searchable here - trained on Hebrew, handles mixed Hebrew/English documents, and preserves the original layout of the file.
Frequently Asked Questions
What is OCR and when do I need it?
OCR (Optical Character Recognition) is a process that turns an image of text into searchable, editable text. You need it when you have a PDF that was scanned from a paper document, photographed with a phone, or created as an image - not as digital text.
How do I know if my PDF is a scan?
Simple test: try to select and copy text in the PDF. If you can't select anything - it's a scan. If text is not searchable (Ctrl+F finds nothing) - it's a scan.
Why does regular OCR fail on Hebrew?
Most OCR tools were built for English and Latin scripts. Hebrew is RTL (right-to-left), has several letters that look alike, and has optional vowel marks. An unadapted tool will reverse the letter order, scramble mixed-direction lines, and confuse similar-looking letters.
What accuracy can I expect from Hebrew OCR?
It depends on the scan. For a straight document scanned at 300 DPI or above in a clear font, good OCR can reach 95%+ accuracy. An old, crumpled, or phone-photographed document in poor lighting will give lower results.
Does OCR preserve the original layout?
The 'Make Searchable' process adds an invisible text layer over the original scan. The visual layout is preserved exactly - nothing moves. What changes: text becomes searchable, copyable, and highlightable.
Can OCR handle a document with both Hebrew and English?
Yes. The tool automatically detects and handles both languages in the same document. Hebrew RTL text and English LTR text each get their correct direction.
Can OCR read Hebrew vowel marks (nikud)?
Yes, a good Hebrew OCR tool recognizes vowel marks. But documents without vowel marks won't have them added after OCR - the tool recognizes what's in the scan.