NAME

Spreadsheet::XLSX::Reader::LibXML - Read xlsx spreadsheet files with LibXML

SYNOPSIS

The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder

use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;

my $parser   = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );

if ( !defined $workbook ) {
	die $parser->error(), "\n";
}

for my $worksheet ( $workbook->worksheets() ) {

	my ( $row_min, $row_max ) = $worksheet->row_range();
	my ( $col_min, $col_max ) = $worksheet->col_range();

	for my $row ( $row_min .. $row_max ) {
		for my $col ( $col_min .. $col_max ) {

			my $cell = $worksheet->get_cell( $row, $col );
			next unless $cell;

			print "Row, Col    = ($row, $col)\n";
			print "Value       = ", $cell->value(),       "\n";
			print "Unformatted = ", $cell->unformatted(), "\n";
			print "\n";
		}
	}
	last;# In order not to read all sheets
}

###########################
# SYNOPSIS Screen Output
# 01: Row, Col    = (0, 0)
# 02: Value       = Category
# 03: Unformatted = Category
# 04: 
# 05: Row, Col    = (0, 1)
# 06: Value       = Total
# 07: Unformatted = Total
# 08: 
# 09: Row, Col    = (0, 2)
# 10: Value       = Date
# 11: Unformatted = Date
# 12: 
# 13: Row, Col    = (1, 0)
# 14: Value       = Red
# 16: Unformatted = Red
# 17: 
# 18: Row, Col    = (1, 1)
# 19: Value       = 5
# 20: Unformatted = 5
# 21: 
# 22: Row, Col    = (1, 2)
# 23: Value       = 2017-2-14
# 24: Unformatted = 41318
# 25: 
# More intermediate rows ... 
# 82: 
# 83: Row, Col    = (6, 2)
# 84: Value       = 2016-2-6
# 85: Unformatted = 40944
###########################

DESCRIPTION

This is another module for parsing Excel 2007+ workbooks. It is designed to use XML::LibXML and in this iteration only has an <XML::LibXML::Reader> parser. Future iterations could include a DOM parser option. The goal of this package is three fold. First, adhere more closely to the Spreadsheet::ParseExcel API so that less work would be needed to integrate both of them. Second, to provide an XLSX sheet parser that is built on XML::LibXML. The other two primary options for XLSX parsing on CPAN use either a one-off XML parser or XML::Twig and it was difficult for me to tell if the bugs I encountered with them were associated with the XML parsers or just inherent to the reader itself. By the time I had educated myself enough to know the difference I had written this. Finally, I wanted to improve date handling so that custom format development and implementation would be easier. I leveraged coercions from Type::Coercion to do this but anything that follows that format will work here.

In the process of learning and building I also wrote some additional features for this parser that are not found in the Spreadsheet::ParseExcel package. Read the details below for more information. For instance in the SYNOPSIS the '$parser' and the '$workbook' are actually the same class. You could combine both steps by calling new with the 'file_name' attribute called out. Afterward it is still possible to call ->error on the instance. On the other hand this package does not yet provide the same access to the formatting elements provided in Spreadsheet::ParseExcel. That is on the longish and incomplete TODO list.

This is a Moose based package. As such it is designed to be (fairly) extensible. Some design departures in this module were driven by Excel's new organization of the XML format but others were Moose driven. .

Attributes

Data passed to new when creating an instance (parser). For modification of these attributes see the listed Methods of the instance. Note the parse method is just a convenience method and is included only for Spreadsheet::ParseExcel compatibility.

error_inst

Definition: This attribute holds an 'error' object instance. It should have has several methods for managing errors.

Currently no error codes or error translation options are available but this should make implementation of that easier.

Default an Spreadsheet::XLSX::Reader::LibXML::Error instance with the attributes set as;

( should_warn => 0 )

Range The minimum list of methods to implement for your own instance is;

error set_error clear_error set_warnings if_warn

file_name

Definition: This attribute holds the full file name and path for the xlsx file to be parsed.

Default no default - this must be provided to read a file

Range any unincrypted xlsx file that can be opened in Microsoft Excel

file_creator

Definition: This holds the information stored in the Excel Metadata for who created the file originally. You shouldn't set this attribute yourself.

Default the value from the file

Range A string

file_date_created

Definition: This holds the created date in the Excel Metadata for when the file was first built. You shouldn't set this attribute yourself.

Default the value from the file

Range A timestamp string (ISO ish)

file_modified_by

Definition: This holds the information stored in the Excel Metadata for who modified the file last. You shouldn't set this attribute yourself.

Default the value from the file

Range A string

file_date_modified

Definition: This holds the last modified date in the Excel Metadata for when the file was last changed. You shouldn't set this attribute yourself.

Default the value from the file

Range A timestamp string (ISO ish)

sheet_parser

Definition: This sets the way the .xlsx file is parsed. For now the only choice is 'reader'.

Default 'reader'

Range 'reader'

count_from_zero

Definition: Excel spreadsheets count from 1. Spreadsheet::ParseExcel counts from zero. This allows you to choose either way.

Default 1

Range 1 = counting from zero like Spreadsheet::ParseExcel, 0 = Counting from 1 lke Excel

file_boundary_flags

Definition: When you request data past the end of a row or past the bottom of the data this package can return 'EOR' or 'EOF' to indicate that state. This is especially helpful in 'while' loops. The other option is to return 'undef'. This is problematic if some cells in your table are empty which also returns undef.

Default 1

Range 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when requesting a position that is out of bounds

empty_is_end

Definition: The excel convention is to read the table left to right and top to bottom. Some tables have uneven columns from row to row. This allows the several methods that take 'next' values to wrap after the last element with data rather than going to the max column.

Default 0

Range 1 = treat all columns short of the max column for the sheet as being in the table, 0 = end each row after the last cell with data rather than going to the max sheet column

from_the_edge

Definition: Some data tables start in the top left corner. Others do not. I don't reccomend that practice but when aquiring data in the wild it is often good to adapt. This attribute sets whether the file reads from the top left edge or from the top row with data and starting from the leftmost column with data.

Default 1

Range 1 = treat the top left corner of the sheet even if there is no data in the top row or leftmost column, 0 = Set the minimum row and minimum columns to be the first row and first column with data

default_format_list

Definition: This is a departure from Spreadsheet::ParseExcel for two reasons. First, it doesn't use the same modules. Second, this accepts a role with two methods where ParseExcel accepts an object instance.

Default Spreadsheet::XLSX::Reader::LibXML::FmtDefault

Range a Moose role with the methods 'get_defined_excel_format' and 'change_output_encoding' it should be noted that libxml2 which is the underlying code for XML::LibXML allways attempts to get the data into perl friendly strings. That means this should only tweak the data on the way out and does not affect the data on the way in.

format_string_parser

Definition: This is the interpreter that turns the excel format string from format_string_parser into a Type::Tiny coercion. If you don't like the output or the method you can write your own Moose Role and add it here.

Default Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings

Range a Moose role with the method 'parse_excel_format_string'

group_return_type

Definition: Traditionally ParseExcel returns a cell object with lots of methods to reveal information about the cell. In reality this is probably not used very much so in the interest of simplifying you can get a cell object instance set to the cell information. Or you can just get the raw value in the cell or you can get the cell value formatted either the way the sheet specified or the way you specify. See the 'custom_formats' attribute for the Spreadsheet::XLSX::Reader::LibXML::Worksheet class to insert custom targeted formats for use with the parser. All empty cells return undef no matter what.

Default instance

Range instance = returns a populated Spreadsheet::XLSX::Reader::LibXML::Cell instance, unformatted = returns the raw value of the cell with no modifications, value = returns just the formatted value stored in the excel cell

Methods

These include methods to adjust attributes as well as providing a methods to load the xlsx file.

parse( $file_name, $formatter )

Definition: This is a convenience method to match the Spreadsheet::ParseExcel equivalent. it should behave the same way setting the error and returning undef for failure.

Accepts:$file_name = of a valid xlsx file ($required), $formatter = see the attribute default_format_list for valid options

Returns: itself when passing with the xlsx file loaded or undef for failure

worksheet worksheets get_error_inst error set_error clear_error set_warnings if_warn set_file_name has_file_name creator modified_by date_created date_modified set_parser_type get_parser_type counting_from_zero set_count_from_zero boundary_flag_setting change_boundary_flag set_empty_is_end is_empty_the_end set_from_the_edge set_default_format_list get_default_format_list set_format_string_parser get_format_string_parser get_group_return_type set_group_return_type get_epoch_year get_shared_string_position get_format_position get_worksheet_names number_of_sheets start_at_the_beginning in_the_list

To install Spreadsheet::XLSX::Reader::LibXML, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Spreadsheet::XLSX::Reader::LibXML

CPAN shell

perl -MCPAN -e shell
install Spreadsheet::XLSX::Reader::LibXML

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)

NAME

SYNOPSIS

DESCRIPTION

Attributes

error_inst

file_name

file_creator

file_date_created

file_modified_by

file_date_modified

sheet_parser

count_from_zero

file_boundary_flags

empty_is_end

from_the_edge

default_format_list

format_string_parser

group_return_type

Methods

parse( $file_name, $formatter )

Module Install Instructions