NAME

PDF::OCR - extract the images from a pdf file and get their ocr content

SYNOPSIS

use strict;
use warnings;
use PDF::OCR;

my $abs_pdf = '/path/to/file.pdf';
my $p = PDF::OCR->new($abs_pdf);

# extract images, get array ref of paths
my $images = $p->abs_images;

# get ocr content for each image
for my $abs_image ( @$images ){
	my $content = $p->get_ocr($abs_image);
	print "image $abs_image had $content\n\n";
}

# get ocr content for all images as one scalar with pagebreaks
my $ocrs = $p->get_ocr;
print "$abs_pdf had [$ocrs]\n";

# get ocr content of all images as array ref
my @ocrs = @{ $p->get_ocr_arrayref };
print "$abs_pdf had [@ocrs]\n";

DESCRIPTION

The whole process does not change your original pdf in any way.

Please note that this module only gets text out of the images inside the pdf file. It does not check for genuine text inside the file, if any. For that, please see PDF::OCR::Thorough.

METHODS

new()

The argument is the path to the pdf file you want to run ocr on.

my $o = PDF::OCR->new('/path/to/file.pdf');

This copies the file to a temp file, so the original is not touched.

abs_images()

Returns an array ref of the abs paths to the images extracted from the pdf.

get_ocr()

The optional argument is the abs path of an image extracted from the pdf. Returns the ocr content of that image.

If no argument is given, the ocr contents of all images are concatenated and returned as one scalar. The contents are separated by pagebreak chars, which can be matched in a regex with \f.
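
As a rough sketch of working with the concatenated form, the scalar can be split back into one string per image on the pagebreak char. The sample string and the variable names here are made up for illustration; in real use the scalar would come from calling get_ocr with no argument.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the scalar returned by $p->get_ocr with no
# argument: the text of three images joined by pagebreak (form feed) chars.
my $ocrs = "text of first image\ftext of second image\ftext of third image";

# Split on the pagebreak char to recover one string per image.
my @pages = split /\f/, $ocrs;

my $i = 0;
for my $page (@pages) {
    $i++;
    print "image $i: $page\n";
}
```

If the pieces are wanted separately from the start, get_ocr_arrayref avoids the split altogether.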

get_ocr_arrayref()

Returns the ocr content of all images as an array ref.

cleanup()

Erases the temp file and all the image files that were extracted.

SEE ALSO

PDF::GetImages, Image::OCR::Tesseract

AUTHOR

Leo Charre leocharre at cpan dot org

COPYRIGHT

Copyright (c) 2007 Leo Charre. All rights reserved.

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.