NAME
xml_grep2 - grep XML files looking for specific elements
SYNOPSIS
xml_grep2 [options] xpath_expression [FILE...]
DESCRIPTION
xml_grep2 is a grep-like utility for XML files.
It mimics grep as much as possible, with the major difference that the patterns are XPath expressions instead of regular expressions.
When the result of the grep is a list of XML nodes (i.e. no option that causes the output to be plain text is used), the output is normally a single XML document: results are wrapped in a single root element (xg2:result_set). When several files are grepped, the results are grouped by file, with each group wrapped in a single element (xg2:file) that has an attribute (xg2:filename) giving the name of the file.
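The wrapping described above can be pictured with a small sketch. It is written in Python with the standard library rather than with xml_grep2 itself, and only mimics the structure: the element and attribute names and the namespace URI come from this page, while the file names, the h1 results, and the helper functions wrap_results and make_h1 are hypothetical. The exact serialization produced by xml_grep2 may differ.

```python
import xml.etree.ElementTree as ET

# Namespace URI documented for the xg2 prefix on this page.
XG2_URI = "http://xmltwig.com/tools/xml_grep2/"
ET.register_namespace("xg2", XG2_URI)


def wrap_results(results_by_file):
    """Group matched elements by file under one root element (illustration only)."""
    root = ET.Element(f"{{{XG2_URI}}}result_set")
    for filename, nodes in results_by_file.items():
        # One xg2:file element per input file, carrying the file name.
        group = ET.SubElement(
            root,
            f"{{{XG2_URI}}}file",
            {f"{{{XG2_URI}}}filename": filename},
        )
        group.extend(nodes)
    return root


def make_h1(text):
    """Build a stand-in for a matched node (hypothetical sample data)."""
    h1 = ET.Element("h1")
    h1.text = text
    return h1


wrapped = wrap_results({
    "a.xhtml": [make_h1("First")],
    "b.xhtml": [make_h1("Second"), make_h1("Third")],
})
print(ET.tostring(wrapped, encoding="unicode"))
```

The printed document has a single xg2:result_set root with one xg2:file child per input file, which is the shape a consumer of multi-file results would need to expect.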
OPTIONS
- -c, --count
-
Suppress normal output; instead print a count of matching lines for each input file.
- --help
-
Display help message
- -f NUM, --format NUM
-
Format of the output XML.
The format parameter sets the indenting of the output. This parameter is expected to be an integer value that specifies how indentation should be applied. The format parameter can have three different values if it is used:
If NUM is 0, then the document is dumped as it was originally parsed.
If NUM is 1, xml_grep2 will add ignorable whitespace, so the content of the nodes is easier to read. Existing text nodes will not be altered.
If NUM is 2 (or higher), xml_grep2 will act as if NUM were 1, but it adds a leading and a trailing line break to each text node.
xml_grep2 uses a hardcoded indentation of 2 space characters per indentation level. This value cannot be altered at runtime.
- -g, --generate-empty-set
-
Generate an XML result (consisting of only the wrapper) even if no result has been found
- -H, --wrap, --with-filename
-
Force results for each file to be wrapped, even if only 1 file is grepped.
Results are normally wrapped by file only when 2 or more files are grepped.
When the -t, --text option is used, prints the filename for each match.
- -h, --nowrap, --no-filename
-
Suppress the wrapping of results by file, even if more than one file is grepped.
When the -t, --text option is used, suppress the prefixing of filenames on output when multiple files are searched.
- --html
-
Parses the input as HTML instead of XML
- -L, --files-without-matches
-
Suppress normal output; instead print the name of each input file from which no output would normally have been printed. Note that the file still needs to be parsed and loaded.
- -l, --files-with-matches
-
Suppress normal output; instead print the name of each input file from which output would normally have been printed. Note that the file still needs to be parsed and loaded.
- --label LABEL
-
Displays input actually coming from standard input as input coming from file LABEL. This is especially useful for tools like zgrep, e.g. gzip -cd foo.xml.gz | xml_grep2 --label=foo.xml something
- -M, --man
-
Display long help message
- -m NUM, --max-count NUM
-
Output only NUM matches per input file. Note that the file still needs to be parsed and loaded.
- -N PREFIX=URI, --define-ns PREFIX=URI
-
Defines a namespace mapping that can then be used in the XPath query.
This is the only way to query elements (or attributes) in the default namespace.
XML::LibXML::XPathContext needs to be installed for this option to be available.
Several -N, --define-ns options can be used.
- -n STRING, --namespace STRING
-
Change the default namespace prefix used for wrapping results. The default is xg2. Use an empty string (-n '') to remove the namespace altogether.
If a namespace (default or otherwise) is used, it is associated with the URI http://xmltwig.com/tools/xml_grep2/
- -o, --original-encoding
-
Output results in the original encoding of the file. Otherwise output is in UTF-8.
The exception to this is when the -v, --invert-match option is used, in which case the original encoding is used.
If the result is an XML document then the encoding will be the encoding of the first document with hits.
Note that when grepping files in various encodings, the result may well not be well-formed XML.
Without this argument all outputs are in UTF-8.
- -q, --quiet, --silent
-
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected. Also see the -s or --no-messages option.
When also using the -v or --invert-match option, the return status will be an error if all the document roots (or entire documents) have been matched.
- -R, -r, --recursive
-
Read all files under each directory, recursively
- --include PATTERN
-
Recurse in directories, searching only files matching PATTERN.
- --exclude PATTERN
-
Recurse in directories, skipping files matching PATTERN.
- -s, --no-messages
-
Suppress error messages about nonexistent or unreadable files.
- -t, --text-only
-
Return the result as text (using the XPath value of nodes). Results are stripped of newlines and output 1 per line.
Results are in the original encoding for the document.
- -V, --version
-
Print the version number of xml_grep2 to standard error. This version number should be included in all bug reports (see below).
- -v, --invert-match
-
Return the original document without the nodes matching the pattern argument. Note that in this mode documents are output in their original encoding.
- -x, --no-xml-wrap
-
Suppress the output of the XML wrapper around the XML result.
Useful, for example, when returning a collection of attribute nodes.
This option is activated by default when the -v option is used (use -X to force the XML wrapping in this case).
- -X, --xml-wrap
-
Forces the use of the XML wrapper around the output when -v is used.
Differences with grep
There are some differences in behaviour with grep that are worth being mentioned:
- files are always parsed and loaded in memory
-
This is inevitable due to the random-access nature of XPath.
- the file list is built before the grepping starts
-
This means that warnings about permission problems are reported all at once before the results are output.
BUGS, TODO
- namespace problems
-
When a namespace mapping is defined using the -N, --define-ns option, if this prefix is found in a document, even bound to a different namespace, it will match.
When a prefix is defined using the -N, --define-ns option, if the prefix is found in a file, then the one defined on the command line will not match for this file.
- Encoding
-
Avoid outputting characters outside of the basic ASCII range as numerical entities
Allow encoding conversions
- XML parsing errors
-
Deal better with malformed XML, probably through an option to skip malformed XML files without dying
- Be more compatible with grep
-
Do not build the list of files up front. Report bad links.
- package properly, more tests, more docs...
XPath
see http://www.w3.org/TR/xpath/ for the spec
see http://zvon.org/xxl/XPathTutorial/General/examples.html for a tutorial
EXAMPLES
- xml_grep2 //h1 index.xhtml
-
Extract h1 elements from index.xhtml. Do not forget the // or you will not get any result.
- xml_grep2 '//h1|//h2' index.xhtml
-
Extract h1 and h2 elements from index.xhtml. The expression needs to be quoted because the | is special for the shell.
- xml_grep2 -t -h -r --include '*.xml' '//RowAmount' /invoices/
-
Get the content (-t) of all RowAmount elements in .xml files in the invoices directory (and sub-directories).
The result will be a text stream with 1 result per line. The -h option suppresses the display of the file name at the beginning of each line.
- xml_grep2 -t -r -h --include '*.xml' '//@AmountCurrencyIdentifier' /invoices/
-
Get the value of all AmountCurrencyIdentifier attributes in .xml files in the invoices directory (and sub-directories). Piping this to sort -u will give you all the currencies used in the invoices.
- xml_grep2 -v '//p[@class="classified"]' secret.xml > pr.xml
-
Remove all p elements in the classified class from the file secret.xml.
- xml_grep2 -t -N d='http://purl.org/rss/1.0/' '//d:title' use.perl.org.rss.xml
-
Extract the text of the titles from the RSS feed for use.perl.org
As the title elements are in the default namespace, the only way to get them is to define a mapping between a prefix and the namespace URI, then to use it.
- GET http://xmltwig.com/index.html | ./xml_grep2 --html -t '//@href' | sort -u
-
Get the list of links in a web page
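The namespace example above (the -N d='http://purl.org/rss/1.0/' query) relies on a general XPath rule: an unprefixed name in an expression never matches an element that lives in a default namespace, so a prefix has to be bound to the namespace URI first. The sketch below shows that rule using Python's standard xml.etree.ElementTree, not xml_grep2 or XML::LibXML; ElementTree only supports a subset of XPath, and the feed content is made-up sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 1.0 fragment: channel, item and title sit in the
# default namespace http://purl.org/rss/1.0/ (no prefix in the source).
RSS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel><title>Sample feed</title></channel>
  <item><title>First story</title></item>
</rdf:RDF>"""

root = ET.fromstring(RSS)

# An unprefixed name only matches elements in no namespace,
# so this query finds nothing in the document above.
unprefixed = root.findall(".//title")

# Binding a prefix to the URI plays the role of -N d='http://purl.org/rss/1.0/'.
ns = {"d": "http://purl.org/rss/1.0/"}
prefixed = root.findall(".//d:title", ns)

titles = [t.text for t in prefixed]
print(len(unprefixed), titles)
```

The prefix chosen on the query side (d here) is arbitrary; only the URI it is bound to has to match the namespace declared in the document.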
REQUIREMENTS
Perl 5
libxml2
XML::LibXML
XML::LibXML::XPathContext (for the -N, --define-ns option)
Pod::Usage
Getopt::Long
File::Find::Rule
SEE ALSO
xml_grep, distributed with the XML::Twig Perl module, offers a less powerful but often more memory-efficient implementation of an XML grepper.
xsh (http://xsh.sourceforge.net) is an XML shell also based on libxml2 and XML::LibXML.
XMLStarlet (http://xmlstar.sourceforge.net/) is a set of tools to process XML, written in C and also based on libxml2.
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
AUTHOR
mirod <mirod@cpan.org>