Extract Text from Scanned PDFs

Scanned PDFs are images — you can't select or copy their text. OCR (Optical Character Recognition) reads the images and converts them to actual text. Our PDF OCR tool uses Tesseract.js, a proven OCR engine, running entirely in your browser.

How to Extract Text from a Scanned PDF

  1. Go to the PDF OCR tool.
  2. Upload your scanned PDF.
  3. Select the language of the text (default: English). Choosing the correct language improves accuracy significantly.
  4. Click Extract Text. Processing time depends on the number of pages and scan quality.
  5. The extracted text appears in the output area.
  6. Click Download .txt to save the extracted text as a plain text file.
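
For readers curious about what happens behind these steps, here is a minimal sketch of a page-by-page OCR loop. It is not the tool's actual code. The names `RecognizePage`, `extractText`, and `txtFileName` are hypothetical, and the recognizer is passed in as a parameter so the sketch does not depend on any particular OCR library. In a Tesseract.js setup, a worker's recognition call would play that role.

```typescript
// Hypothetical type: any function that turns one rendered page image into text.
type RecognizePage = (pageImage: Uint8Array, language: string) => Promise<string>;

// Run OCR one page at a time and join the results into a single plain-text document.
async function extractText(
  pageImages: Uint8Array[],
  language: string,
  recognize: RecognizePage,
): Promise<string> {
  const pageTexts: string[] = [];
  for (const image of pageImages) {
    // Pages are processed sequentially, so total time grows with the page count.
    const text = await recognize(image, language);
    pageTexts.push(text.trim());
  }
  // Separating pages with a form feed is an assumption, not the tool's documented format.
  return pageTexts.join("\n\f\n");
}

// Derive the name of the downloadable text file from the uploaded PDF's name.
function txtFileName(pdfName: string): string {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}
```

Because each page is recognized in turn, a 40-page scan takes roughly ten times as long as a 4-page scan of similar quality, which matches the note in step 4.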

Tips for Better OCR Accuracy

Use high-resolution scans

Scans at 300 DPI or higher give the best results. Lower-resolution scans produce blurry characters that OCR misreads.
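
To make the 300 DPI figure concrete, the sketch below converts a PDF page size into pixel dimensions at a given DPI. PDF page sizes are measured in points, and one point is 1/72 inch. The function name `renderSizeForDpi` is hypothetical, and the sketch assumes the page is rasterized to a bitmap before OCR, which this page does not state about the tool.

```typescript
// PDF page sizes are expressed in points; one point is 1/72 inch.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Pixel dimensions of a page of the given size (in points) at the target DPI.
function renderSizeForDpi(widthPt: number, heightPt: number, dpi: number): PixelSize {
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
const letterAt300 = renderSizeForDpi(612, 792, 300);
console.log(letterAt300.width, letterAt300.height); // 2550 3300
```

At 150 DPI the same page has half as many pixels in each direction, so every character is drawn with a quarter of the pixels.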

Ensure good contrast

Black text on white background is ideal. Low contrast, yellowed paper, or faded ink reduces accuracy.
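
The effect of contrast can be illustrated with a simple fixed-threshold binarization, which turns every pixel into pure black or pure white. This is only an illustrative sketch: OCR engines typically apply their own, more adaptive thresholding, and the function name `binarize` and the default threshold of 128 are assumptions chosen for the example.

```typescript
// Convert RGBA pixels (the layout used by canvas ImageData) to pure black and white.
function binarize(rgba: Uint8ClampedArray, threshold: number = 128): Uint8ClampedArray {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    // Perceived brightness using the common Rec. 601 luma weights.
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    const value = luma >= threshold ? 255 : 0;
    out[i] = value;     // red
    out[i + 1] = value; // green
    out[i + 2] = value; // blue
    out[i + 3] = 255;   // fully opaque
  }
  return out;
}
```

With this threshold, dark ink such as RGB (40, 40, 40) stays black, but faded ink around (150, 150, 150) ends up white and vanishes into the background. That is the kind of information loss low-contrast scans cause.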

Keep pages straight

Rotated or skewed pages significantly reduce accuracy. Scan pages flat and aligned.

Select the correct language

OCR uses language-specific character sets and dictionaries. Selecting the wrong language causes characters and words to be misidentified.

Frequently Asked Questions

What is the difference between a scanned PDF and a text-based PDF?

A text-based PDF was created digitally (from Word, Google Docs, or other software) and contains actual text data you can select and copy. A scanned PDF is a photo of a printed document: it contains images of text, not real text. You can tell the difference by trying to select text with your cursor. If you can highlight individual words, it's text-based. If the cursor selects the whole page as an image, it's scanned.
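
The same check can be approximated in code by counting how much selectable text a page contains. The sketch below is a rough heuristic, not something the tool is documented to do. `looksScanned` and `MIN_TEXT_CHARS` are hypothetical names, and the `TextItem` shape assumes a PDF library that returns text runs with a `str` field, as pdf.js text content does.

```typescript
// Shape of one text run, assuming a PDF library that exposes a `str` field per run.
interface TextItem {
  str: string;
}

// Hypothetical threshold: fewer visible characters than this counts as "no real text".
const MIN_TEXT_CHARS = 20;

// Returns true when a page has too little selectable text to be text-based.
function looksScanned(items: TextItem[], minChars: number = MIN_TEXT_CHARS): boolean {
  const visibleChars = items
    .map((item) => item.str.replace(/\s+/g, "").length) // ignore whitespace
    .reduce((sum, n) => sum + n, 0);
  return visibleChars < minChars;
}
```

One caveat: a scan that has already been through OCR may carry a hidden text layer, so this heuristic would classify it as text-based.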

What languages does the OCR support?

The OCR engine (Tesseract.js) supports 60+ languages including English, French, Spanish, German, Italian, Portuguese, Hindi, Arabic, Chinese (Simplified and Traditional), Japanese, Korean, and many others. Select your language before processing for best accuracy.
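
Under the hood, Tesseract identifies each language by a short code that names its trained data file. The sketch below maps the languages listed above to those codes. The lookup function `langCodeFor` is hypothetical, and falling back to English mirrors the tool's stated default rather than any documented behavior.

```typescript
// Tesseract language codes for the languages named above.
const TESSERACT_LANG_CODES = new Map<string, string>([
  ["English", "eng"],
  ["French", "fra"],
  ["Spanish", "spa"],
  ["German", "deu"],
  ["Italian", "ita"],
  ["Portuguese", "por"],
  ["Hindi", "hin"],
  ["Arabic", "ara"],
  ["Chinese (Simplified)", "chi_sim"],
  ["Chinese (Traditional)", "chi_tra"],
  ["Japanese", "jpn"],
  ["Korean", "kor"],
]);

// Look up the code for a display name, falling back to English (an assumption).
function langCodeFor(displayName: string): string {
  return TESSERACT_LANG_CODES.get(displayName) ?? "eng";
}

console.log(langCodeFor("German")); // deu
```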

How accurate is browser-based OCR?

Accuracy depends heavily on the scan quality. A clean, high-contrast scan at 300 DPI typically achieves 95–99% accuracy for printed text. Handwritten text, low-resolution scans, rotated pages, and documents with complex layouts reduce accuracy significantly.
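
One common way to put a number on OCR accuracy is to compare the output against a known-correct transcription, character by character. The sketch below uses edit distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into the other). It assumes the percentages above refer to character-level accuracy, which this page does not specify, and the function names are hypothetical.

```typescript
// Levenshtein edit distance between two strings, computed row by row.
function editDistance(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy: the share of the reference text that needed no correction.
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / reference.length);
}

// A classic OCR confusion: "m" misread as "rn".
console.log(editDistance("images", "irnages")); // 2
```

Read this way, 95% accuracy still means roughly one wrong character in every twenty, so even a clean scan is worth proofreading.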

Is my PDF sent to a server?

No. The OCR runs using Tesseract.js, a WebAssembly port of the Tesseract OCR engine. Processing happens entirely in your browser — your documents never leave your device.

Can I extract text from a PDF that has both text and scanned pages?

The PDF OCR tool processes the visual content of all pages. For text-based pages, the extracted text will be very accurate. For scanned pages, OCR accuracy depends on scan quality. If your PDF is entirely text-based, consider using the PDF to Word converter instead, which reads the actual text data rather than running OCR.
