NAME
PDF::OCR2 - extract all text and all image OCR from a PDF
SYNOPSIS
use PDF::OCR2;
my $p = PDF::OCR2->new('./path/to/file.pdf');
my $text_all = $p->text;    # scalar context: all pages in one string, joined with \f
my @text_pages = $p->text;  # list context: one element per page
DESCRIPTION
This is meant to replace PDF::OCR. The backend complexity of this process has been isolated in modules:
PDF::GetImages
PDF::Burst
Image::OCR::Tesseract
PDF::OCR2::Pages - in this distro.
Why not just modify PDF::OCR? The code hierarchy and interdependencies have been broken down so thoroughly, and the interface is so different, that a new module made more sense. PDF::OCR was OK, but it was messy, and this is a lot better.
METHODS
new()
Argument is the path to a PDF file.
text()
Takes no argument. In scalar context, returns the text of all pages as one string, joined with the page break (form feed, \f) character. In list context, returns the text of the pages, one page per element.
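The relationship between the two return forms can be sketched without a PDF at hand. The sub fake_text below is a hypothetical stand-in, not part of this distro and not how text() is implemented; it only mimics the documented behavior so that the example runs on its own. The page strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $p->text. It is NOT the module's code; it
# mimics only the documented contract: a list of pages in list context,
# one \f-joined string in scalar context.
sub fake_text {
    my @pages = ("First page text\n", "Second page text\n", "Third page text\n");
    return wantarray ? @pages : join("\f", @pages);
}

my $text_all   = fake_text();    # scalar context: one string
my @text_pages = fake_text();    # list context: one element per page

# Splitting the scalar form on the form feed recovers the per-page strings.
# The -1 limit keeps trailing empty pages, which split would otherwise drop.
my @recovered = split /\f/, $text_all, -1;

# tr/\f// with an empty replacement counts form feeds without changing them.
printf "%d pages, %d form feeds\n", scalar(@text_pages), ($text_all =~ tr/\f//);

my $n = 0;
for my $page (@recovered) {
    $n++;
    printf "page %d: %d characters\n", $n, length $page;
}
```

With a real object, the same split should apply to the scalar returned by $p->text, assuming the extracted page text does not itself contain a form feed. When per-page text is needed, calling text() in list context avoids that assumption.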
CAVEATS
This only works on POSIX systems.
BUGS
CRIT AND SUGGESTIONS
The AUTHOR is open to any suggestions and requests.
SEE ALSO
CAM::PDF PDF::API2 PDF::GetImages PDF::Burst PDF::OCR2::Page
REPLACES
PDF::OCR - deprecated
AUTHOR
Leo Charre leocharre at cpan dot org