NAME
Spreadsheet::XLSX::Reader::LibXML - Read xlsx spreadsheet files with LibXML
SYNOPSIS
The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder
#!/usr/bin/env perl
use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;
my $parser   = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );
if ( !defined $workbook ) {
    die $parser->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            next unless $cell;
            print "Row, Col = ($row, $col)\n";
            print "Value = ", $cell->value(), "\n";
            print "Unformatted = ", $cell->unformatted(), "\n";
            print "\n";
        }
    }
    last; # In order not to read all sheets
}
###########################
# SYNOPSIS Screen Output
# 01: Row, Col = (0, 0)
# 02: Value = Category
# 03: Unformatted = Category
# 04:
# 05: Row, Col = (0, 1)
# 06: Value = Total
# 07: Unformatted = Total
# 08:
# 09: Row, Col = (0, 2)
# 10: Value = Date
# 11: Unformatted = Date
# 12:
# 13: Row, Col = (1, 0)
# 14: Value = Red
# 15: Unformatted = Red
# 16:
# 17: Row, Col = (1, 1)
# 18: Value = 5
# 19: Unformatted = 5
# 20:
# 21: Row, Col = (1, 2)
# 22: Value = 2017-2-14 #(shows as 2/14/2017 in the sheet)
# 23: Unformatted = 41318
# 24:
# More intermediate rows ...
# 82:
# 83: Row, Col = (6, 2)
# 84: Value = 2016-2-6 #(shows as 2/6/2016 in the sheet)
# 85: Unformatted = 40944
###########################
DESCRIPTION
This is another module for parsing Excel 2007+ workbooks. The goal of this package is threefold. First, to produce output that matches what is visible in an Excel spreadsheet as closely as possible, while exposing the underlying settings from Excel. Second, to adhere as closely as is reasonable to the Spreadsheet::ParseExcel API (where that doesn't conflict with the first objective) so that less work is needed to integrate ParseExcel and this package. Third, to provide an XLSX sheet parser that is built on XML::LibXML. The other two primary options for XLSX parsing on CPAN use either a one-off XML parser (Spreadsheet::XLSX) or XML::Twig (Spreadsheet::ParseXLSX). In general, if either of them already works for you without issue then there is no reason to change to this package. I personally found some bugs and functionality boundaries in both that I wanted to improve. By the time I had educated myself enough to make improvement suggestions, including root-causing the bugs to either the XML parser or the reader logic, I had written this.
In the process of learning and building I also wrote some additional features for this parser that are not found in the Spreadsheet::ParseExcel package. For instance, in the SYNOPSIS the '$parser' and the '$workbook' are actually the same instance, since ->parse returns the object it was called on. You could combine both steps by calling new with the 'file_name' attribute set. Afterward it is still possible to call ->error on the instance. Another improvement (from my perspective) is date handling. This package allows for a simple pluggable custom output format that is more flexible than other options, and it handles dates older than 1-January-1900. I leveraged coercions from Type::Tiny to do this, but anything that follows that general format will work here. Additionally, this is a Moose-based package. As such it is designed to be (fairly) extensible by writing roles and adding them to this package rather than requiring that you extend the package to some new branch. Read the full documentation for all opportunities!
In the realm of extensibility, XML::LibXML has multiple ways to read an XML file but this release only has an XML::LibXML::Reader parser option. Future iterations could include a DOM parser option. Additionally this package does not (yet) provide the same access to the formatting elements provided in Spreadsheet::ParseExcel. That is on the longish and incomplete TODO list.
The package operates on the workbook with three primary tiers of classes. All other classes in this package are for architectural extensibility.
---> Workbook level (This class)
---> Worksheet level
---> Cell level - optional
Primary Methods
These are the primary ways to use this class. They can be used to open an .xlsx workbook. They are also ways to investigate information at the workbook level. For information on how to retrieve data from the worksheets see the Worksheet documentation. For additional workbook options see the Attributes section. The attributes section also documents all the methods used to adjust the attributes of this class.
new( %attributes )
Definition: This is the way to create an instance of this class. It can accept all of the Attributes, some of them, or none. If the instance is created with no arguments then another method (such as parse) is needed to open the xlsx file.
Accepts: the Attributes
Returns: An instance of this class
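The separate new and parse calls shown in the SYNOPSIS can be folded into a single constructor call. The following is a minimal sketch, assuming that has_file_name reflects a successful load as described under the file_name attribute. The helper name open_workbook is hypothetical and is not part of this package.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this package): build the workbook in one
# step and fail loudly if the file did not load.
sub open_workbook {
    my ( $class, $file ) = @_;

    # Passing file_name to new replaces the separate ->parse call
    my $workbook = $class->new( file_name => $file );

    # has_file_name is documented as the check that the workbook loaded;
    # ->error is still available on the instance when it did not
    die $workbook->error, "\n" unless $workbook->has_file_name;
    return $workbook;
}
```

With the package loaded, this would be called as open_workbook( 'Spreadsheet::XLSX::Reader::LibXML', 'TestBook.xlsx' ).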
parse( $file_name, $formatter )
Definition: This is a convenience method to match the Spreadsheet::ParseExcel equivalent. It only works if the file_name attribute was not set with ->new. It is one way to set the file_name and default_format_list attributes.
Accepts:
$file_name = of a valid xlsx file (required)
$formatter = see the 'default_format_list' attribute for valid options (optional)
Returns: the instance itself, with the xlsx file loaded, on success, or undef on failure
worksheet( $name )
Definition: This method will return an object to read values in the worksheet. If no value is passed to $name then the 'next' worksheet in physical order is returned. 'next' will NOT wrap around after the last sheet.
Accepts: the $name string representing the worksheet object you want to open
Returns: a Worksheet object with the ability to read the worksheet of that name. Or in 'next' mode it returns undef if past the last sheet
Example: using the implied 'next' worksheet;
while ( my $worksheet = $workbook->worksheet ) {
    print "Reading: " . $worksheet->name . "\n";
    # get the data needed from this worksheet
}
start_at_the_beginning
Definition: This restarts the 'next' worksheet at the first worksheet
Accepts: nothing
Returns: nothing
worksheets
Definition: This method will return all the worksheets in the workbook as a list, not as an array ref.
Accepts: nothing
Returns: an array of Worksheet objects with all the available worksheets in the array
worksheet_name( $Int )
Definition: This method returns the worksheet name for a given physical position in the workbook, counting from left to right. It counts from zero even if the workbook is set to count from one (count_from_zero = 0).
Accepts: integers
Returns: the worksheet name
Example: To return only worksheet positions 2 through 4
for my $x ( 2 .. 4 ) {
    my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
    # Read the worksheet here
}
worksheet_names
Definition: This method returns an array ref of the worksheet names in the workbook.
Accepts: nothing
Returns: an array ref
Example: Another way to parse a workbook without building all the sheets at once is;
for my $sheet_name ( @{ $workbook->worksheet_names } ) {
    my $worksheet = $workbook->worksheet( $sheet_name );
    # Read the worksheet here
}
number_of_sheets
Definition: This method returns the count of worksheets in the workbook
Accepts: nothing
Returns: an integer
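Since worksheet_name always counts positions from zero, number_of_sheets can be used to bound a positional walk over the workbook. The sketch below relies only on those two documented methods. The helper name sheet_positions is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each zero-based sheet position with its name,
# using only number_of_sheets and worksheet_name.
sub sheet_positions {
    my ($workbook) = @_;
    my $last = $workbook->number_of_sheets - 1;

    # 0 .. -1 is an empty range, so a workbook with no sheets yields nothing
    return map { [ $_, $workbook->worksheet_name($_) ] } 0 .. $last;
}
```

A caller could then print the pairs with: printf "%d: %s\n", @$_ for sheet_positions( $workbook );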
error
Definition: This returns the most recent error message logged by the package. This method is mostly relevant when an unexpected result is returned by some other method.
Accepts: nothing
Returns: an error string.
get_epoch_year
Definition: This returns the epoch year defined by the workbook.
Accepts: nothing
Returns: 1900 (= Windows) or 1904 (= Mac)
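The epoch year matters because the same unformatted serial number maps to different calendar dates in the two systems. The sketch below is illustrative only and is not the conversion code used by this package. The function name serial_to_date is hypothetical, and it relies on the core Time::Local and POSIX modules. It shows that the SYNOPSIS value 41318 corresponds to 14-February-2017 under the 1904 epoch.

```perl
use strict;
use warnings;
use Time::Local qw( timegm );
use POSIX qw( strftime );

# Illustrative only: convert an Excel serial day count to a date string for
# either epoch year. Hypothetical helper, not part of this package.
sub serial_to_date {
    my ( $serial, $epoch_year ) = @_;

    # 1904 system: day 0 is 1904-01-01.
    # 1900 system: day 1 is 1900-01-01 and Excel counts a fictitious
    # 1900-02-29, so 1899-12-30 works as day 0 for serials above 60.
    my $day_zero = $epoch_year == 1904
        ? timegm( 0, 0, 0, 1,  0,  1904 )
        : timegm( 0, 0, 0, 30, 11, 1899 );

    my @date = gmtime( $day_zero + int($serial) * 86_400 );
    return strftime( '%Y-%m-%d', @date );
}

print serial_to_date( 41318, 1904 ), "\n";    # 2017-02-14
print serial_to_date( 41318, 1900 ), "\n";    # 2013-02-13
```

The two calls differ by 1462 days, which is the fixed offset between the two date systems.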
parse_excel_format_string( $format_string )
Definition: This returns a Type::Tiny object with built in chained coercions to turn Excel Julian Dates into date strings.
Accepts: a custom $format_string complying with Excel definitions
Returns: a Type::Tiny object
Attributes
Data passed to new when creating an instance (parser). For modification of these attributes see the listed 'attribute methods'. For more information on attributes see Moose::Manual::Attributes.
error_inst
Definition: This attribute holds an 'error' object instance. It should have several methods for managing errors. Currently no error codes or error translation options are available but this should make implementation of that easier.
Default: a Spreadsheet::XLSX::Reader::LibXML::Error instance with the attributes set as;
( should_warn => 0 )
Range: The minimum list of methods to implement for your own instance is;
error set_error clear_error set_warnings if_warn
attribute methods: Methods provided to adjust this attribute
get_error_inst
Definition: returns this instance
error
Definition: Used to get the most recently logged error
set_error
Definition: used to set a new error string
clear_error
Definition: used to clear the current error string in this attribute
set_warnings
Definition: used to turn on or off real time warnings when errors are set
if_warn
Definition: a method mostly used to extend this package and see if warnings should be emitted.
file_name
Definition: This attribute holds the full file name and path for the xlsx file to be parsed.
Default: no default - this must be provided to read a file
Range: any unencrypted xlsx file that can be opened in Microsoft Excel
attribute methods: Methods provided to adjust this attribute
set_file_name
Definition: changes the file name that is set (this will reboot the workbook instance)
has_file_name
Definition: this is fundamentally a way to see if the workbook loaded correctly
file_creator
Definition: This holds the information stored in the Excel Metadata for who created the file originally. You shouldn't set this attribute yourself.
Default: the value from the file
Range: a string
attribute methods: Methods provided to adjust this attribute
creator
Definition: returns the name of the file creator
file_date_created
Definition: This holds the created date in the Excel Metadata for when the file was first built. You shouldn't set this attribute yourself.
Default: the value from the file
Range: a timestamp string (ISO-ish)
attribute methods: Methods provided to adjust this attribute
date_created
Definition: returns the date the file was created
file_modified_by
Definition: This holds the information stored in the Excel Metadata for who modified the file last. You shouldn't set this attribute yourself.
Default: the value from the file
Range: a string
attribute methods: Methods provided to adjust this attribute
modified_by
Definition: returns the user name of the person who last modified the file
file_date_modified
Definition: This holds the last modified date in the Excel Metadata for when the file was last changed. You shouldn't set this attribute yourself.
Default: the value from the file
Range: a timestamp string (ISO-ish)
attribute methods: Methods provided to adjust this attribute
date_modified
Definition: returns the date when the file was last modified
sheet_parser
Definition: This sets the way the .xlsx file is parsed. For now the only choice is 'reader'.
Default: 'reader'
Range: 'reader'
attribute methods: Methods provided to adjust this attribute
set_parser_type
Definition: the way to change the parser type
get_parser_type
Definition: returns the currently set parser type
count_from_zero
Definition: Excel spreadsheets count from 1. Spreadsheet::ParseExcel counts from zero. This allows you to choose either way.
Default: 1
Range: 1 = counting from zero like Spreadsheet::ParseExcel, 0 = counting from 1 like Excel
attribute methods: Methods provided to adjust this attribute
counting_from_zero
Definition: a way to check the current attribute setting
set_count_from_zero
Definition: a way to change the current attribute setting
file_boundary_flags
Definition: When you request data past the end of a row or past the bottom of the data this package can return 'EOR' or 'EOF' to indicate that state. This is especially helpful in 'while' loops. The other option is to return undef. That is problematic if some cells in your table are empty, since empty cells also return undef.
Default: 1
Range: 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when requesting a position that is out of bounds
attribute methods: Methods provided to adjust this attribute
boundary_flag_setting
Definition: a way to check the current attribute setting
change_boundary_flag
Definition: a way to change the current attribute setting
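The value of the boundary flags is that a loop can tell an empty cell (undef) apart from the end of a row or of the data. The sketch below assumes a 'next'-style iterator that yields 'EOR' at the end of each row and 'EOF' after the last one. The code ref $next is a hypothetical stand-in for whichever worksheet method supplies the next value, and collect_rows is not part of this package.

```perl
use strict;
use warnings;

# Sketch: drain a 'next'-style iterator into rows until 'EOF' appears.
# $next is a hypothetical code ref standing in for a worksheet method.
sub collect_rows {
    my ($next) = @_;
    my ( @rows, @row );
    while (1) {
        my $item = $next->();
        my $flag = ( defined $item && !ref $item ) ? $item : '';
        if ( $flag eq 'EOF' ) {
            push @rows, [@row] if @row;    # keep a partial final row
            last;
        }
        if ( $flag eq 'EOR' ) {
            push @rows, [@row];
            @row = ();
            next;
        }
        push @row, $item;    # undef here means an empty cell, not the end
    }
    return \@rows;
}
```

This only terminates when file_boundary_flags is on. With the flags off there is no 'EOF' to stop on, and a cell whose real content is the string 'EOR' or 'EOF' would be misread when plain values are returned rather than cell instances.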
empty_is_end
Definition: The Excel convention is to read the table left to right and top to bottom. Some tables have uneven columns from row to row. This allows the several methods that take 'next' values to wrap after the last element with data rather than going to the max column.
Default: 0
Range: 0 = treat all columns short of the max column for the sheet as being in the table, 1 = end each row after the last cell with data rather than going to the max sheet column
attribute methods: Methods provided to adjust this attribute
is_empty_the_end
Definition: a way to check the current attribute setting
set_empty_is_end
Definition: a way to set the current attribute setting
from_the_edge
Definition: Some data tables start in the top left corner. Others do not. I don't recommend that practice, but when acquiring data in the wild it is often good to adapt. This attribute sets whether the file is read from the top left edge of the sheet or from the top row with data, starting at the leftmost column with data.
Default: 1
Range: 1 = start from the top left corner of the sheet even if there is no data in the top row or leftmost column, 0 = set the minimum row and minimum column to be the first row and first column with data
attribute methods: Methods provided to adjust this attribute
set_from_the_edge
Definition: a way to set the current attribute setting
default_format_list
Definition: This is a departure from Spreadsheet::ParseExcel for two reasons. First, it doesn't use the same modules. Second, this accepts a role with two methods where ParseExcel accepts an object instance.
Default: Spreadsheet::XLSX::Reader::LibXML::FmtDefault
Range: a Moose role with the methods 'get_defined_excel_format' and 'change_output_encoding'. It should be noted that libxml2, which is the underlying code for XML::LibXML, always attempts to get the data into perl friendly strings. That means this should only tweak the data on the way out and does not affect the data on the way in.
attribute methods: Methods provided to adjust this attribute
get_default_format_list
Definition: a way to check the current attribute setting
set_default_format_list
Definition: a way to set the current attribute setting
format_string_parser
Definition: This is the interpreter that turns the Excel format string into a Type::Tiny coercion. If you don't like the output or the method you can write your own Moose role and add it here.
Default: Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
Range: a Moose role with the method 'parse_excel_format_string'
attribute methods: Methods provided to adjust this attribute
get_format_string_parser
Definition: a way to check the current attribute setting
set_format_string_parser
Definition: a way to set the current attribute setting
group_return_type
Definition: Traditionally ParseExcel returns a cell object with lots of methods to reveal information about the cell. In reality this is probably not used very much, so in the interest of simplicity you can choose what is returned: a cell object instance populated with the cell information, just the raw value in the cell, or the cell value formatted either the way the sheet specified or the way you specify. See the 'custom_formats' attribute for the Spreadsheet::XLSX::Reader::LibXML::Worksheet class to insert custom targeted formats for use with the parser. All empty cells return undef no matter what.
Default: instance
Range: instance = returns a populated Spreadsheet::XLSX::Reader::LibXML::Cell instance, unformatted = returns the raw value of the cell with no modifications, value = returns just the formatted value stored in the Excel cell
attribute methods: Methods provided to adjust this attribute
get_group_return_type
Definition: a way to check the current attribute setting
set_group_return_type
Definition: a way to set the current attribute setting
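Because group_return_type changes what a read returns, code that must work under any setting can normalize the result first. The sketch below assumes the Cell instance offers the value method used in the SYNOPSIS. The helper name display_text is hypothetical and is not part of this package.

```perl
use strict;
use warnings;
use Scalar::Util qw( blessed );

# Hypothetical helper: get display text from whatever a read returned,
# regardless of the group_return_type setting.
sub display_text {
    my ($result) = @_;
    return undef unless defined $result;         # empty cells are always undef
    return $result->value if blessed $result;    # 'instance': a Cell object
    return $result;                              # 'value' or 'unformatted'
}
```

In 'unformatted' mode the returned text is the raw stored value, so dates would come back as serial numbers rather than date strings.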
empty_return_type
Definition: Traditionally Spreadsheet::ParseExcel returns an empty string for cells with unique formatting but no stored value. It may be that the more accurate way of returning undef works better for you. This will turn that behaviour on. Note that if Excel stores an empty string, setting this attribute to 'undef_string' will still return the empty string!
Default: empty_string
Range: empty_string = populates the unformatted value with '' even if it is set to undef, undef_string = if Excel stores undef for an unformatted value it will return undef
attribute methods: Methods provided to adjust this attribute
get_empty_return_type
Definition: a way to check the current attribute setting
set_empty_return_type
Definition: a way to set the current attribute setting
BUILD / INSTALL from Source
1. Ensure that you have the libxml2 and libxml2-devel libraries installed using your favorite package installer
2. Download a compressed file with the code from your favorite source
3. Extract the code from the compressed file. If you are using tar this should work:
tar -zxvf Spreadsheet-XLSX-Reader-LibXML-v0.xx.tar.gz
4. Change (cd) into the extracted directory
5. Run the following
(For Windows, find which version of make was used to compile your perl)
perl -V:make
(For Windows, substitute that make program in the commands below, e.g. s/make/dmake/g)
>perl Makefile.PL
>make
>make test
>make install # As sudo/root
>make clean
SUPPORT
TODO
1. Build Alien::LibXML::Devel to load the libxml2-devel libraries from source and require that and Alien::LibXML in the build file, so that all requirements for XML::LibXML are met
(Both the libxml2 and libxml2-devel libraries are required for XML::LibXML)
2. Add a pivot table reader (Not just read the values from the sheet)
3. Add calc chain methods
4. Add more exposure to workbook formatting methods
5. Build a DOM parser alternative for the sheets
(Theoretically faster than the reader but uses more memory)
AUTHOR
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2014 by Jed Lund
DEPENDENCIES
Type::Tiny - 0.046
MooseX::ShortCut::BuildInstance - 1.026
SEE ALSO
Spreadsheet::ParseExcel - Excel 2003 and earlier
Spreadsheet::XLSX - 2007+
Spreadsheet::ParseXLSX - 2007+
All lines in this package that use Log::Shiras are commented out