NAME
Regexp::Common::debian - regexps for Debian specific strings
SYNOPSIS
    use Regexp::Common qw(debian);
    #TODO:
DESCRIPTION
Debian GNU/Linux, as a management system, validates, parses, and generates lots of data. For the sake of some other project I needed some kind of parser. Part of the Debian package management system, namely its generating part -- dpkg-deb, is written in Perl, but... The API is provided in source-code form -- no docs, no plans, "we are unstable". What morons. I needed something I could depend on. I'm not about code, I'm about API.
So I went my own way. I believe that the Perl way of doing such things is packing what is re-used, or intended for re-use, into a module. And if such a module is made anyway, why shouldn't I share it? (hmm, I've already told someone that...) So here we are -- Regexp::Common::debian (applause, thanks, thanks).
When choosing the API to provide I had two options:
- parsing
-
That would be a bunch of error-prone decisions -- pick a backbone parser, figure out the grammar, mix them, build an API, implement it... And as a net result, one more xDpkg:: namespace. I really would like to hear any reasons why.
- comparing
-
String on the left, regexp on the right, add {-keep}, and get an array of parsed-out parts. The other way: string on the left, regexp on the right, anchor it properly, and get a scalar indicating match/mismatch. The only deficiency I can see is that the result is an array, not a hash. Hard to argue. It seems I've committed a sin. I'll have to live with it.
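A minimal sketch of the two ways of comparing described above. To stay self-contained it does not load the module; $keep_re and $plain_re are hypothetical stand-ins for a {-keep} pattern and its non-capturing twin, and the "name_number" shape is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a capturing ({-keep}-like) pattern:
# a lowercase name, an underscore, and a dotted number.
my $keep_re  = qr/(([a-z]+)_([0-9.]+))/;
# The same shape without captures, for yes/no checks.
my $plain_re = qr/[a-z]+_[0-9.]+/;

# Way 1: string on the left, capturing regexp on the right;
# in list context the match returns the parsed-out parts.
sub parts {
    my ($string) = @_;
    return $string =~ $keep_re;
}

# Way 2: anchor the pattern yourself and get a plain true/false scalar.
sub is_valid {
    my ($string) = @_;
    return $string =~ /\A$plain_re\z/ ? 1 : 0;
}

my @got = parts('abc_1.2');
print "whole: $got[0], name: $got[1], number: $got[2]\n";
print is_valid('abc_1.2')  ? "match\n" : "mismatch\n";
print is_valid('abc_1.2!') ? "match\n" : "mismatch\n";
```

The first element of the list is the whole match, which mirrors the convention that $1 holds the whole matched string under {-keep}.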
As a backbone Regexp::Common was chosen. It has its own deficiencies, it's dead upstream, but I've failed to find any unhappy user (unsatisfied -- maybe, but unhappy -- no, sir). Maybe I didn't try hard enough. It provides a neat and rich interface, but...
{-keep} and {-i} are provided internally. It's OK with {-keep}, but {-i}... Look, Debian strings are almost all case-sensitive. When case shouldn't matter, case sensitivity is explicitly switched off by the template itself. So -- if you play with {-i}, don't blame me. (I'll experiment with an implicit qr/(?i:)/ after that release.)
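A small sketch of what "switched off by the template itself" can look like: case-insensitivity scoped with (?i:...) to one part of a pattern while the rest stays case-sensitive. The pattern $field_re and the helper field_value are hypothetical; how the module does this internally is not shown here.

```perl
use strict;
use warnings;

# Hypothetical field pattern: the field name ignores case (scoped (?i:...)),
# while the value after it stays case-sensitive.
my $field_re = qr/\A(?i:package):\s*([a-z0-9][a-z0-9+.-]+)\z/;

# Returns the captured value, or undef when the line does not match.
sub field_value {
    my ($line) = @_;
    return $line =~ $field_re ? $1 : undef;
}

for my $line ('PACKAGE: perl', 'Package: Perl') {
    my $value = field_value($line);
    print "$line => ", defined $value ? $value : 'no match', "\n";
}
```

Applying a blanket {-i} on top of such a pattern would make the value case-insensitive too, which is exactly what the warning above is about.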
(note) Regexp::Common::debian is very permissive in some cases (sometimes absurdly permissive). Hopefully, I've noted all such cases in the docs. For the next release I'm going to implement verification against all stuff found on site. That, hopefully, will allow stricter patterns while still accepting real life.
- $RE{debian}{package}
-
    'the-very.strange.package+name' =~ $RE{debian}{package}{-keep};
    print "package is $1";
This is a Debian package name. The rules are described in Section 5.6.7 of Debian Policy.
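For orientation, a hand-written sketch of the package name rules as I read Debian Policy (lowercase letters, digits, plus, minus, and period; at least two characters; alphanumeric first). $package_re and is_package_name are hypothetical names, and this is not the pattern the module ships, which may be more permissive.

```perl
use strict;
use warnings;

# Package name as I read Debian Policy: lowercase letters, digits,
# plus, minus and period; at least two characters; alphanumeric first.
my $package_re = qr/[a-z0-9][a-z0-9+.-]+/;

# Anchored check: true only when the whole string is a package name.
sub is_package_name {
    my ($name) = @_;
    return $name =~ /\A$package_re\z/ ? 1 : 0;
}

for my $name ('the-very.strange.package+name', 'x', '+plus', 'Perl') {
    printf "%-32s %s\n", $name, is_package_name($name) ? 'ok' : 'rejected';
}
```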
- $RE{debian}{version}
-
    '10:1+abc~rc.2-ALPHA-rc25+w~t.f' =~ $RE{debian}{version}{-keep};
    $2 eq '10' &&
      $3 eq '1+abc~rc.2-ALPHA' &&
      $4 eq 'rc25+w~t.f' or die;
This is a Debian version. The rules are described in Section 5.6.12 of Debian Policy.
- $1 is a debian_version
- $2 is an epoch
-
if any. Otherwise -- undef.
- $3 is an upstream_version
-
(caveat) A string like 0--1 will end up with $3 set to the weird 0- (hopefully, Debian won't degrade to such versions; though YMMV).
- $4 is a debian_revision
-
(bug) 0-1- will end up with $3 set to 0 and $4 set to 1 (such trailing hyphens will be missing in $1). 0- will end up with $4 undef.
(bug) Either I don't understand perlre or I didn't try hard enough. Anyway, I didn't find a way to parse a Debian version the way R::C requires in the context of perl5.8.8 (perl in stable, going to be oldstable). qr/(?|)/ saved perl5.10.0 (but see "R_C_d_version").
(caveat) The debian_revision is allowed to start with a non-digit. This is solely my reading of Debian Policy.
- R_C_d_version
-
    use Regexp::Common qw(debian);
    # though that works too
    # use Regexp::Common::debian;
    my $re = Regexp::Common::debian::R_C_d_version;
    $version =~ /^$re$/;
    $2 and print "has epoch\n";
    $3 || $5 || $6 || $8 and print "has upstream_version\n";
    $4 || $7 and print "has debian_revision\n";
    $3 && !$4 || !$3 && $4 or die;
    $6 && !$7 || !$6 && $7 or die;
    $3 && !$5 && !$6 && !$8 or die;
    $5 && !$6 && !$8 or die;
    $6 && !$8 or die;
That's a workaround for perl5.8.8 (read "$RE{debian}{version}" (look for (bug))). Look for (caveat) in "$RE{debian}{version}" -- those apply here too.
- $1 is debian_version again
- $2 is epoch always
- Either $3, or $5, or $6, or $8 is upstream_version
- Either $4 or $7 is debian_revision
That's the best that can be done with an RE (in the real world it's done the functional way). Sorry.
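For comparison, a minimal sketch of that "functional way": cut the epoch off at the first colon and the debian_revision at the last hyphen. The helper split_version is hypothetical and is not part of this module or of dpkg; it does no validation of the characters involved.

```perl
use strict;
use warnings;

# Split a Debian version into (epoch, upstream_version, debian_revision).
# Missing parts come back as undef.
sub split_version {
    my ($version) = @_;
    my ($epoch, $revision);
    my $upstream = $version;

    # The epoch is the number before the first colon, if there is one.
    if ($upstream =~ s/\A([0-9]+)://) {
        $epoch = $1;
    }
    # The debian_revision is everything after the last hyphen.
    my $hyphen = rindex $upstream, '-';
    if ($hyphen >= 0) {
        $revision = substr $upstream, $hyphen + 1;
        $upstream = substr $upstream, 0, $hyphen;
    }
    return ($epoch, $upstream, $revision);
}

my ($e, $u, $r) = split_version('10:1+abc~rc.2-ALPHA-rc25+w~t.f');
print "epoch=$e upstream=$u revision=$r\n";
```

Splitting at the last hyphen gives the same result for 0--1 as the caveat in "$RE{debian}{version}" describes: the upstream_version comes out as 0-.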
(bug) It always captures (that should be configurable with a setting like -keep). OTOH, look, within 2 years (or so) (as soon as perl5.10.0 becomes oldstable) that dirty piece will be dropped anyway.
(note) &R_C_d_version is an unexported function. It's a function because that follows the Regexp::Common way of providing regexps -- each time you get a new qr//, not a reference. It's unexported for an obvious reason.
- $RE{debian}{architecture}
-
    $arch =~ $RE{debian}{architecture}{-keep};
    $2 && ($3 || $4) and die;
    $3 && !$4 and die;
    $3 && $4 eq 'armel' and die;
    $2 and print "that's special: $2";
    $3 and print "OS is: $3";
    $4 and print "arch is: $4";
This is a Debian architecture. The rules are described in Section 5.6.8 of Debian Policy.
- $1 is some of Debian's architectures
- $2 is any special
-
Distinguishing special architectures (all, any, and source) and os-arch pairs is arguable. But I've decided that it would be good to separate all and e.g. i386 (which in turn is actually linux-i386).
- $3 is os
-
When !$3 && $4 is true then the undefined $3 actually means linux. Since $digits are read-only, yielding anything but undef here is impossible. More on that in Section 11.1 of Debian Policy.
- $4 is arch
-
Please note that there are architectures which are present only for the linux os (namely armel and lpia, at time of writing).
(caveat) Debian Policy by itself doesn't specify what os-arch pairs are valid (only specials are mentioned). In turn it relies on qx/dpkg-architecture -L/. In effect R::C::d can desynchronize; hopefully, that wouldn't stay unnoticed for too long.
- $RE{debian}{archive}{binary}
-
    'abc_1.2.3-512_all.deb' =~ $RE{debian}{archive}{binary}{-keep};
    print " package is -> $2";
    print " version is -> $3";
    print "architecture is -> $4";
This is a Debian binary archive (even if there's no binary file (in the -B sense) inside, it's called "binary" anyway). The naming convention isn't described in Debian Policy; instead it refers to the format understood by dpkg (Preface of Chapter 3). (Hopefully, someday there will be references here to the code inside the dpkg and dpkg-deb codebase that does those nasty things with composing package, version, and arch into filenames and decomposing them back.)
- $1 is deb-filename
-
That's the whole archive filename with the .deb suffix included.
- $2 is package
- $3 is version
-
There's a great deal of WTF. Filename: fields in *_Packages lack the epoch entirely. Archives in pool/ lack it too. Archives in /var/cache/apt/archives ... That seems to be apt-get specific (I don't have a reference to the code though). As a feature, $RE{d}{a}{binary} provides an epoch hack in filenames.
- $4 is architecture
-
(caveat) That would match a surprising source or any. Sorry. That'll improve in future. Actually it's even worse: an OS can prepend any arch or special.
For the sake of symmetry $RE{d}{a}{binary} has trailing anchor -- negative look-ahead for any character that can be found in version string.
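A rough sketch of the decomposition, assuming the plain package_version_architecture.deb shape with no underscores inside any field. The helper split_deb_filename is hypothetical; it knows nothing about the epoch hack or the trailing look-ahead mentioned above, and it does not validate the individual fields.

```perl
use strict;
use warnings;

# Split "package_version_architecture.deb" on underscores.
# Returns an empty list when the name does not have that shape.
sub split_deb_filename {
    my ($filename) = @_;
    if ($filename =~ /\A([^_]+)_([^_]+)_([^_]+)\.deb\z/) {
        return ($1, $2, $3);
    }
    return;
}

my ($package, $version, $arch) = split_deb_filename('abc_1.2.3-512_all.deb');
print "package=$package version=$version architecture=$arch\n";
```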
- $RE{debian}{archive}{source}
-
    'xyz_1-ab.25~6.orig.tar.gz' =~ $RE{debian}{archive}{source}{-keep};
    print "package is $2";
    # a Debian-native tarball must not have a hyphen in its version
    -1 != index($3, '-') && $4 eq 'tar' and die;
    $4 eq 'orig.tar' and print "there should be patch";
This is a Debian upstream (or Debian-native) source tarball. Naming source archives is outside Debian Policy, although
Section 5.6.21 mentions that "the exact forms of the filenames are described in" Section C.3.
Section C.3 points out that a source archive must be of the form package_upstream-version.orig.tar.gz.
Naming of Debian-native packages is left out completely.
dpkg-source(1) (1.14.23), in Section SOURCE PACKAGE FORMATS, mentions some bits of naming (Debian-native packages are left out too).
Welcome to the real life. $RE{d}{a}{source} knows only Format: 1.0 naming.
- $1 is tarball-filename
-
Since there's no suffix other than .gz, it's present only in $1.
- $2 is package
- $3 is version
- $4 is type
-
This can hold one of 2 strings: orig.tar (regular package) or tar (Debian-native package).
Since dot (.) is used as a separator and can occur in a version, the whole thing is implicitly anchored (negative look-ahead for a version-forming character; the idea is that 0.orig.tar.gz can be a very strange version) and the version itself is forced to be as short as possible.
- $RE{debian}{archive}{patch}
-
    'abc_0cba-12.diff.gz' =~ $RE{debian}{archive}{patch}{-keep};
    print "package is $2";
    -1 == index $3, '-' and die;
    print "debian revision is ", (split /-/, $3)[-1];
This is a "debianization diff" (Section C.3 of Debian Policy). Naming patches is outside Debian Policy; so we're back to guessing. There are rumors (or maybe trends) that Format 1.0 will be deprecated (or maybe obsoleted).
- $1 is patch-filename
-
Since there's no suffix other than .diff.gz, it's present only in $1.
- $2 is package
- $3 is version
-
(caveat) Consider this. A Debian-native package has no patch and no hyphen in its version. A regular package has a patch and must have a hyphen in its version. $RE{d}{a}{patch} is absolutely ignorant about that (we are about matching, not verifying, after all).
The very same considerations covered in the discussion trailing the $RE{d}{a}{source} entry apply to $RE{d}{a}{patch} as well (consider: 0.diff.gz can be a version).
- $RE{debian}{archive}{dsc}
-
    'abc_0cba-12.dsc' =~ $RE{debian}{archive}{dsc}{-keep};
    print "package is $2";
    print "version is $3";
This is a "Debian source control" file (Section 5.4 describes its contents but not its naming). Statistically based guessing, you know (someday I'll elaborate and point to the exact lines in the dpkg-dev bundle where it's in use (creating and parsing)).
- $1 is dsc-filename
-
As usual, since the only suffix can be .dsc it's present in $1 only.
- $2 is package
- $3 is version
Blah-blah referring to $RE{d}{a}{source} (consider: 0.dsc can be a version).
- $RE{debian}{archive}{changes}
-
    'abc_0cba-12.changes' =~ $RE{debian}{archive}{changes}{-keep};
    print "package is $2";
    print "version is $3";
This is a "Debian changes file" (Section 5.5 describes its contents but not its naming). Statistically based guessing, you know (someday I'll elaborate and point to the exact lines in the dpkg-dev bundle where it's in use (creating and parsing)) (should be a template).
- $RE{debian}{sourceslist}
-
    'deb file:/usr/local oldstable main contrib non-free' =~
      $RE{debian}{sourceslist}{-keep} and system "rm -rf $5" or die;
    ($4 eq 'http' || $4 eq 'rsh' || $4 eq 'ssh') &&
      !index $5, '//' or die;
    ($4 eq 'file' || $4 eq 'cdrom' || $4 eq 'copy') &&
      !index($5, '/') && index($5, '/', 1) > 1 or die;
    # either distribution ends with a slash, or there must be components
    !index(reverse($6), '/') || $7 or die;
This is one entry in the sources.list resource list. The format is described in the sources.list(5) man page (hence a chance for desynchronization).
- $1 is resource_entry
-
$RE{d}{sourceslist} is very permissive about what constitutes an entry, but you can bet on one thing -- the whole entry stays on one line.
- $2 is resource_type
-
That can be either deb or deb-src. An implicit negative look-behind for qr/\w/ is provided (so =deb is accepted, _deb is not; hey, #deb is accepted too! explicit anchoring at your option).
- $3 is uri
-
You think you know what URI is? Read below...
- $4 is scheme
-
Schemes that APT knows have nothing to do with sources.list(5) actually. The scheme that APT will use is some executable in /usr/lib/apt/methods (some of them are for transfer, some are not). sources.list(5) (of lenny) defines these:
The delimiting colon (:) isn't included here (although uri includes it).
- $5 is hier_path
-
The idea is that someday $RE{d}{sourceslist} would look behind at uri to decide if there should be an authority (the one delimited with //) or whether path_absolute would be enough. Right now that's not the case.
(bug) Any non-space sequence is hier_path.
That's very bad, but that's the way it's done right now. Look, parsing a URI is a task for a standalone pattern. It's not implemented; maybe someday some kind perlist will do that.
- $6 is distribution
-
Debian is full of surprises. Lots of surprises. You think you know what a distribution is, don't you? You missed. distribution can be a filesystem path. Since sources.list(5) doesn't mention space escaping techniques I assume spaces aren't allowed; so any non-space is allowed. You think that's overkill? You're obviously wrong (think $ARCH, sources.list(5) has more).
- $7 is component_list
-
In a misguided attempt not to make them too different from all that crowd, component_list is a space delimited list of non-spaces. If distribution ends with a slash (/), then component_list can be empty (I mean, maybe someday that will look behind too).
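The fields above can also be pulled apart without one big regexp. A minimal sketch, assuming whitespace-separated fields on a single line and no options block; split_sources_line is a hypothetical helper, not part of this module, and it is just as careless about the uri as the (bug) above admits.

```perl
use strict;
use warnings;

# Split one sources.list entry into its fields; returns a hash reference,
# or undef when the line does not look like an entry.
sub split_sources_line {
    my ($line) = @_;
    # split ' ' drops leading whitespace and splits on runs of whitespace.
    my ($type, $uri, $distribution, @components) = split ' ', $line;
    return undef unless defined $distribution;
    return undef unless $type eq 'deb' || $type eq 'deb-src';
    # scheme is everything before the first colon; the rest is hier_path.
    return undef unless $uri =~ /\A([^:]+):(.+)\z/;
    my ($scheme, $hier_path) = ($1, $2);
    # Components may be omitted only when distribution ends with a slash.
    return undef unless @components || $distribution =~ m{/\z};
    return {
        type         => $type,
        scheme       => $scheme,
        hier_path    => $hier_path,
        distribution => $distribution,
        components   => \@components,
    };
}

my $entry = split_sources_line('deb file:/usr/local oldstable main contrib non-free');
print "$entry->{type} $entry->{scheme} $entry->{hier_path} ",
      "$entry->{distribution} @{ $entry->{components} }\n";
```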
All that is quite messy. Can it be improved? Surely yes (even if we stay within Regexp::Common requirements) (think qr/(?|)/). And then we have one more v5.10.0 only regexp. Someday v5.10.0 will be oldstable...
- $RE{debian}{preferences}
-
    <<END_OF_PREFERENCES =~ $RE{debian}{preferences}{-keep} or die;
    Explanation: Stay updated!
    Package: perl
    Pin: version 5.10*
    Pin-Priority: 1001
    END_OF_PREFERENCES
    # copy the captures first: the match against $filter below resets $5
    my ($package, $context, $filter, $priority) = ($2, $3, $4, $5);
    $package eq 'perl' and print "good, we are looking for perl\n";
    $context eq 'version' and $filter =~ /^5\.10/ and
      print "good, we are looking for recent\n";
    $priority =~ /^\d+$/ && $priority > 1000 and
      print "good, we'll stay updated\n";
This is one entry in the preferences list. The good news is over, the bad news is below. I've failed to find a definition of an entry in preferences (still looking). apt_preferences(5) hints at what it looks like by providing examples. That's not enough; the behaviour of apt-cache policy doesn't lead to understanding either.
After some experimenting I've found that, in general, this is the Debian control file format, with some quirks. My problem isn't how to implement that post-processing with REs: my problem is what those quirks are! Either I figure out the format, or I release. So here we are -- some common case of an entry in preferences.
Shortly:
each entry consists of 3 stanzas (Package:, Pin:, Pin-Priority:);
the order matters, no intermediate stanzas allowed;
case doesn't matter (for both name and value of stanza (to some degree));
whatever has gone before Package: or came after Pin-Priority: (line-wise) is ignored;
apt-cache policy fails in one case -- Package: stanza has leading spaces;
misparsed values are ignored, thus invalidating the whole entry (but see below), thus the entry is ignored.
That's what $RE{debian}{preferences} does. More on each stanza below.
(bug) apt-cache policy will accept newlines -- those count as spaces in Debian control files, provided the continuation lines are properly indented. $RE{d}{preferences} accepts one-line stanzas only.
- $1 is a preferences_entry
-
That's the whole entry -- with all leading and trailing spaces, and an Easter egg. apt_preferences(5) invents something called Explanation: stanzas (they should go before Package:, with no empty lines in between). Since we are aware of that, the Explanation: sequence is provided in $1 (and it won't ever be $2 (1st, obvious compatibility reasons; 2nd, it's somewhat legalized since it's mentioned; 3rd, it can be easily dropped in case I find that useful)).
- $2 is a package_stanza
-
That's either * (star, match-any-string wildcard) or a space separated list of package names (a lone package name is a degenerate list). That is, if package_stanza is a list, then each (even if there's only one) non-space sequence is treated as a package name. apt-cache policy doesn't verify its input, so one can put anything here. Then those sequences will be matched literally against known package names.
(feature) Contrary to everything else in $RE{d}{preferences}, package names are case-sensitive.
(bug) apt-cache policy will silently accept a star among package names. Then, since no package name matches (there can't be a package named *), the star will be missing among pinned packages. $RE{d}{preferences} rejects such a string.
- $3 is a context_switch
-
The Pin: stanza is broken in two parts. That's the first one. The 3 acceptable strings are version, origin, and release. Bad news below.
- $4 is a context_filter
-
(bug) (what else?) What would be a correct input here depends on $3. $RE{d}{preferences} takes anything up to the next newline.
- $5 is a pin_priority_stanza
-
In $5 will be a sequence of decimal digits (yes, hexadecimals are rejected and octals aren't converted), optionally prepended with a + (plus) or - (minus) sign, up to a surprising . (dot). Any trailing decimals and dots (after the first one) will be ignored by apt-cache policy. So does $RE{d}{preferences}. The optional dot-decimal trailer will be missing in $5, but present in $1.
It's a mess, isn't it? Go figure.
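To make the mess a bit more tangible, a minimal sketch of the rules listed above: three one-line stanzas in fixed order, case-insensitive stanza names, no leading spaces before Package:, and a priority that stops at the first dot. parse_preference is a hypothetical helper written only from this description; it is not the module's pattern and has not been checked against apt-cache policy.

```perl
use strict;
use warnings;

# Parse one preferences entry; returns a hash reference or undef.
sub parse_preference {
    my ($text) = @_;
    return undef unless $text =~ m{
        ^ (?i:Package) : [ \t]* (\S(?:.*\S)?) [ \t]* \n
        (?i:Pin) : [ \t]* (?i:(version|origin|release)) [ \t]+ (\S(?:.*\S)?) [ \t]* \n
        (?i:Pin-Priority) : [ \t]* ([+-]?[0-9]+) (?:\.[0-9.]*)? [ \t]* (?=\n|\z)
    }mx;
    return {
        package  => $1,
        context  => lc($2),
        filter   => $3,
        priority => $4,
    };
}

my $text = <<'END_OF_PREFERENCES';
Explanation: Stay updated!
Package: perl
Pin: version 5.10*
Pin-Priority: 1001
END_OF_PREFERENCES

my $pref = parse_preference($text);
print "$pref->{package}: $pref->{context} $pref->{filter} at $pref->{priority}\n";
```

Returning a hash instead of numbered captures sidesteps the "array, not a hash" deficiency for this one case, at the price of one more helper.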
- $RE{debian}{changelog}
-
    <<'END_OF_CHANGELOG' =~ $RE{debian}{changelog}{-keep} or die;
    perl (6.0.0-1) unstable; urgency=high

      * Hourah!

     -- John Doe <doe@example.tld>  Thu, 01 Apr 2010 00:00:00 +0300
    END_OF_CHANGELOG
    print <<"END_OF_REPORT";
    package        : $2
    version        : $3
    in archive     : $4
    flags          : $5
    changes        : ${6}uploaded by    : $7
    acknowledgment : $8
    at time        : $9
    END_OF_REPORT
This is one entry in debian/changelog. The format is described in Section 4.4 of Debian Policy. In the real world, parsing of this file is done by a special Perl module (I'm not aware of implementations in other languages) or by dpkg-parsechangelog (of the dpkg-dev package (which in turn is a Perl script, again)).
There are 2 special Perl modules (namely: Debian::ParseChangelog, and, you knew it, Dpkg::Changelog). And now there is a 3rd one (how cute). The former two are read/write engines; $RE{debian}{changelog} is read-only, obviously. There's a point of desynchronization though.
Section 4.4.1 of Debian Policy makes provisions for injecting debian/changelog in different (alternate) format. To achieve that, one should provide suitable parser. At time of writing I'm unaware of such alternatives. (However, I'm aware of [489460@bugs.debian.org] (wishlist, pending, 2008-07-05); let's wait.)
- $1 is a changelog_entry
-
That's the whole entry with trailing newline and otherwise skipped empty lines. That trailing newline is the one terminating the last line; entry separating newlines are ignored by this regexp.
- $2 is a debian_package
-
That's a simplified version -- sequence of characters allowed in Debian package name.
- $3 is a debian_version
-
That's simplified too. For some weird reason debian_version should start with a number. Surrounding parentheses aren't included.
- $4 is a distributions
-
That's spaces, lowercase letters (a .. z), and hyphens (-) in any order, except that the first character should be a letter (weird). Space before the terminating semicolon is disallowed. The terminating semicolon isn't included.
- $5 is keys (or urgency, if you like)
-
(note) Debian Policy explicitly states that this field is supposed to be a comma (,) separated list of equals (=) separated key-value pairs. However the only known key is urgency. Maybe I'm too pessimistic, but despite the fact that the only key allowed is urgency, the whole key=value pair is put in $5 -- so you'd better be prepared and pick the key you're looking for (one day you can get a lot more).
(caveat) (v.0.1.5) I wasn't pessimistic enough. perl5.8.8 goes nuts sometimes looking for urgency (it happens to be an anchor) (namely: libcompress-zlib-perl_2.015-1) (perl5.10.0 is OK). In a misguided attempt to support oldstable (yes, it's oldstable already) $RE{d}{changelog} no longer looks for urgency, it looks for a sequence of lowercase letters. Sorry.
- $6 is changes
-
That invents concept of empty line.
(v.0.1.5) For $RE{d}{changelog} an "empty line" consists of any number of horizontal spaces (space (" ") or tab ("\t")) followed by a newline. OTOH, a "line" is 2 horizontal spaces, any non-space character, and anything up to the next newline (space counts as "anything" too). No space or 1 space followed by a non-space fails entirely (but watch for the trailing signature line). As requested by Debian Policy (or the stock parser), leading and trailing empty lines are ignored (they are included in $1 though).
(bug) Any sequence of 3 or more trailing horizontal spaces is included in $6. (Looking at the test-suite: handling of trailing empty lines by $RE{d}{changelog} is way broken.)
(note) (I can't say whether it's a bug or a feature) The recommended way of outlining changes is starting each subentry with a star (*), then adding at least one space to sub-subentries. $RE{d}{changelog} doesn't go that far.
(note) (I can't say whether it's a bug or a feature) The leading and trailing empty lines are said to be optional. However, one leading and one trailing empty line are present in each (decent?) entry in a Debian changelog file. $RE{d}{changelog} doesn't insist on that.
- $7 is a maintainer_name
-
$RE{d}{changelog} is very permissive about what a maintainer_name is (and what is it, actually?). $8 and $9 take care of themselves. The leading double-hyphen and space, and the separating space, aren't included.
- $8 is an email_address
-
That one (with option to maintainer_address) is subject to be processed with Regexp::Common::Email::Address (or not, under consideration). Anyway, right now it's a sequence of non-spaces surrounded by angle brackets. Surrounding brackets aren't included.
- $9 is a changelog_date
-
That one is subject to be processed with Regexp::Common::Time. Anyway, right now it's a sequence of RFC822-date forming characters, starting with a capital letter and terminated with a decimal number. Neither the leading double-space nor the trailing newline is included.
Pity on me.
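A line-by-line sketch of the two structured lines of an entry (the header and the trailer), written only from the description above. parse_header and parse_trailer are hypothetical helpers; they are not the module's pattern and not the stock parser, and the changes block in between is left alone.

```perl
use strict;
use warnings;

# Parse the first line of an entry:
#   package (version) distributions; key=value
sub parse_header {
    my ($line) = @_;
    return undef unless $line =~ m{
        \A ([a-z0-9][a-z0-9+.-]+)                 # package
        [ ] \( ([^()\s]+) \)                      # version, parentheses dropped
        [ ] ([a-z][a-z-]*(?:[ ][a-z][a-z-]*)*)    # distributions
        ; [ ]* (.*?) \s* \z                       # keys, e.g. urgency=high
    }x;
    return { package => $1, version => $2, distributions => $3, keys => $4 };
}

# Parse the trailer line: space, double hyphen, space, name,
# address in angle brackets, double space, date.
sub parse_trailer {
    my ($line) = @_;
    return undef
        unless $line =~ /\A -- (.+?) <([^<>\s]+)>  (\S.*?)\s*\z/;
    return { maintainer => $1, email => $2, date => $3 };
}

my $header  = parse_header('perl (6.0.0-1) unstable; urgency=high');
my $trailer = parse_trailer(
    ' -- John Doe <doe@example.tld>  Thu, 01 Apr 2010 00:00:00 +0300');
print "$header->{package} $header->{version} by $trailer->{maintainer}\n";
```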
BUGS AND CAVEATS
Grep this pod for (bug) and/or (caveat). They are all placed in the appropriate sections.
AUTHOR
Eric Pozharski, <whynot@cpan.org>
COPYRIGHT AND LICENSE
Copyright 2008, 2009 by Eric Pozharski
This library is free in the sense: AS-IS, NO-WARRANTY, HOPE-TO-BE-USEFUL. This library is released under LGPLv3.
SEE ALSO
Regexp::Common, http://www.debian.org/doc/debian-policy, sources.list(5), apt_preferences(5), dpkg-parsechangelog(1)