NAME
Mock::Data::Charset - Generator of strings from a set of characters
SYNOPSIS
# Export a handy alias for the constructor
use Mock::Data::Charset 'charset';
# Use perl's regex notation for [] charsets
my $charset = charset('A-Za-z');
... = charset('\p{alpha}\s\d');
... = charset(classes => ['digit']);
... = charset(ranges => ['a','z']);
... = charset(chars => ['a','e','i','o','u']);
# Test membership
charset('a-z')->contains('a') # true
charset('a-z')->count # 26
charset('\w')->count #
charset('\w')->count('ascii') #
# Iterate
my $charset= charset('a-z');
for (0 .. $charset->count-1) {
    my $ch= $charset->get_member($_);
}
# this one can be very expensive if the set is large:
for ($charset->members->@*) { ... }
# Generate random strings
my $str= $charset->generate($mockdata, 10); # 10 random chars from this charset
...= $charset->generate($mockdata, { min_codepoint => 1, max_codepoint => 127 }, 10);
...= $charset->generate($mockdata, { size => [5,10] }); # between 5 and 10 chars
...= $charset->generate($mockdata, { size => sub { 5 + int rand 5 } }); # same
DESCRIPTION
This generator is optimized for holding sets of Unicode characters. It behaves just like the Mock::Data::Set generator but it also lets you inspect the member codepoints, iterate the codepoints, and constrain the range of codepoints when generating strings.
CONSTRUCTOR
new
$charset= Mock::Data::Charset->new( %options );
$charset= charset( %options );
$charset= charset( $notation );
If you supply a single non-hashref argument to the constructor, it is assumed to be the "notation" string. Otherwise, the arguments are treated as key/value pairs. You may specify the members of the charset with one of the attributes notation, members, or member_invlist, or construct it from the following charset-building options:
- chars
  An arrayref of literal character values to include in the set.
- codepoints
  An arrayref of Unicode codepoint numbers.
- ranges
  ranges => [ ['a','z'], ['0', '9'] ],
  ranges => [ 'a', 'z', '0', '9' ],
  An arrayref holding start/end pairs of characters, optionally with inner arrayrefs for each start/end pair.
- codepoint_ranges
  Same as ranges but with codepoint numbers instead of characters.
- classes
  An arrayref of character class names recognized by perl (such as Posix or Unicode classes).
- negate
  Negate the membership of the charset as described by chars/ranges/classes. This applies to the charset-building options, but has no effect on attributes.
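The two accepted shapes of the ranges option can be illustrated with a small standalone sketch. It does not use Mock::Data::Charset; normalize_ranges is a hypothetical helper, and the validation it performs is an assumption rather than the module's documented behavior.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mock::Data::Charset): accept either
# [ ['a','z'], ['0','9'] ] or [ 'a','z','0','9' ] and return a list of
# [ $low_codepoint, $high_codepoint ] pairs.
sub normalize_ranges {
    my ($ranges) = @_;
    # Flatten any inner arrayrefs so both shapes become one flat list.
    my @items = map { ref($_) eq 'ARRAY' ? @$_ : $_ } @$ranges;
    die "ranges must hold start/end pairs\n" if @items % 2;
    my @pairs;
    while (my ($start, $end) = splice(@items, 0, 2)) {
        my ($lo, $hi) = (ord($start), ord($end));
        die "range start '$start' is after end '$end'\n" if $lo > $hi;
        push @pairs, [ $lo, $hi ];
    }
    return @pairs;
}

my @from_nested = normalize_ranges([ ['a','z'], ['0','9'] ]);
my @from_flat   = normalize_ranges([ 'a', 'z', '0', '9' ]);
print join(' ', map { "$_->[0]-$_->[1]" } @from_nested), "\n";   # 97-122 48-57
print join(' ', map { "$_->[0]-$_->[1]" } @from_flat), "\n";     # 97-122 48-57
```

Both calls produce the same codepoint pairs, which is why the flat and nested forms are interchangeable.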
The constructor may also be given any of the keys for "generate_opts", which will be moved into that attribute.
For convenience, you may export the "charset" in Mock::Data::Util which calls this constructor.
If you call new on an object, it carries over the following settings to the new object: max_codepoint, generator_opts, member_invlist (unless chars change).
ATTRIBUTES
notation
A Perl regex charset notation; the text that occurs between '[...]' in a regex. (Note that if you use backslash notations, like notation => '\w', you should either use a single-quoted string or escape them as "\\w".)
This returns the same string that was passed to the constructor, if you gave the constructor a regex-notation string instead of more specific attributes. If you did not, a generic-looking notation will be built on demand. Read-only.
min_codepoint
Minimum codepoint to be returned from the generator. Read/write. This is useful if you want to eliminate control characters (or maybe just NULs) in your output.
max_codepoint
Maximum Unicode codepoint to be considered. Read-only. If you are only interested in a subset of the Unicode character space, such as ASCII, you can set this to a value like 0x7F at construction time and speed up the calculations on the character set.
str_len
This determines the length of the string returned from generate if no length is specified to that function. This may be a plain integer, an arrayref of [$min,$max], or a coderef that returns an integer: sub { 5 + int rand 10 }.
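The three accepted forms of a length specification could be resolved along these lines. This is a sketch, not the module's code: resolve_len is a hypothetical name, and treating [$min,$max] as inclusive of both ends is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a length spec into a concrete integer.
sub resolve_len {
    my ($spec) = @_;
    # A coderef is called and is expected to return the length itself.
    return $spec->() if ref($spec) eq 'CODE';
    # An arrayref is a [min, max] pair; assumed inclusive of both ends.
    if (ref($spec) eq 'ARRAY') {
        my ($min, $max) = @$spec;
        return $min + int rand($max - $min + 1);
    }
    # Anything else is taken as a plain integer.
    return $spec;
}

printf "%d\n", resolve_len(8);                          # always 8
printf "%d\n", resolve_len([5, 10]);                    # an integer from 5 to 10
printf "%d\n", resolve_len(sub { 5 + int rand 10 });    # an integer from 5 to 14
```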
count
The number of members in the set. Read-only.
members
Returns an arrayref of each character in the set. Try not to use this attribute, as building it can be very expensive for common sets like [:alpha:] (100K members, tens of MB of RAM). Use "member_invlist" or "get_member" instead, when possible, or set "max_codepoint" to restrict the set to characters you care about.
Read-only.
member_invlist
Return an arrayref holding the "inversion list" describing the members of this set. An inversion list stores the first codepoint belonging to the set, followed by the next higher codepoint which does not belong to the set, followed by the next that does, etc. This data structure allows for efficient negation/inversion of the list.
You may write a new value to this attribute, but not modify the existing array.
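To make the inversion list layout concrete, here is a standalone sketch for the set [0-9a-z]. The helper names invlist_contains and invlist_count are hypothetical, the linear scan is chosen for readability (a binary search would suit large lists better), and the count helper assumes an even-length list, meaning every span is closed.

```perl
use strict;
use warnings;

# Inversion list for [0-9a-z]: in at 48, out at 58, in at 97, out at 123.
my @invlist = (ord('0'), ord('9') + 1, ord('a'), ord('z') + 1);

# Hypothetical helper: true if $codepoint belongs to the set.
sub invlist_contains {
    my ($invlist, $codepoint) = @_;
    # Count the boundaries at or below the codepoint; an odd count means "inside".
    my $crossed = grep { $_ <= $codepoint } @$invlist;
    return $crossed % 2 == 1;
}

# Hypothetical helper: number of members, summing the width of each "in" span.
sub invlist_count {
    my ($invlist) = @_;
    my $count = 0;
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        $count += $invlist->[$i + 1] - $invlist->[$i];
    }
    return $count;
}

printf "contains 'm': %s\n", invlist_contains(\@invlist, ord('m')) ? 'yes' : 'no';   # yes
printf "contains 'M': %s\n", invlist_contains(\@invlist, ord('M')) ? 'yes' : 'no';   # no
printf "count: %d\n", invlist_count(\@invlist);                                      # 36
```

Four integers describe all 36 members, which is the reason this representation is much cheaper than a full members array.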
METHODS
generate
$charset->generate($mockdata, $len);
$charset->generate($mockdata, \%options, $len);
$charset->generate($mockdata, \%options);
Generate a string of characters from this charset. The %options may override the following attributes: "min_codepoint", "max_codepoint" (but only smaller values), and "str_len". The default length is 1 character.
compile
Return a plain coderef that invokes "generate" on this object.
parse
my $parse_info= Mock::Data::Charset->parse('\dA-Z_');
# {
# codepoints => [ ord '_' ],
# codepoint_ranges => [ ord "A", ord "Z" ],
# classes => [ 'digit' ],
# }
This is a class method that accepts a Perl-regex-notation string for a charset and returns a hashref of the arguments that should be passed to the constructor.
This dies if it encounters a syntax error or any Perl feature that wasn't implemented.
get_member
my $char= $charset->get_member($offset);
Return the Nth character of the set, starting from 0. Returns undef for values greater than or equal to "count". You can use negative offsets to index from the end of the list, like in substr.
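Looking up the Nth member does not require expanding the whole set. The following sketch walks an inversion list span by span; invlist_get_member is a hypothetical helper, not the module's implementation, and it assumes an even-length (fully closed) inversion list.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character at $offset in the set described
# by $invlist, or undef when the offset is out of range.
sub invlist_get_member {
    my ($invlist, $offset) = @_;
    # Total member count, needed for negative offsets and bounds checks.
    my $count = 0;
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        $count += $invlist->[$i + 1] - $invlist->[$i];
    }
    $offset += $count if $offset < 0;    # index from the end, as substr does
    return undef if $offset < 0 || $offset >= $count;
    # Skip whole spans until the offset falls inside one.
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        my $span = $invlist->[$i + 1] - $invlist->[$i];
        return chr($invlist->[$i] + $offset) if $offset < $span;
        $offset -= $span;
    }
    return undef;
}

my @invlist = (ord('0'), ord('9') + 1, ord('a'), ord('z') + 1);   # [0-9a-z]
print invlist_get_member(\@invlist, 0), "\n";     # 0
print invlist_get_member(\@invlist, 10), "\n";    # a
print invlist_get_member(\@invlist, -1), "\n";    # z
print defined invlist_get_member(\@invlist, 36) ? "defined\n" : "undef\n";   # undef
```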
get_member_codepoint
Same as "get_member" but returns a codepoint integer instead of a character.
find_member
my ($offset, $ins_pos)= $charset->find_member($char);
Return the index of a character within the members list. If the character is not a member, this returns undef, but if you call it in list context the second element gives the position where it would be found if it were a member.
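The offset and insertion-position behavior can be sketched over an inversion list as follows. invlist_find_member is a hypothetical helper, and two details are assumptions: the insertion position is taken to be the number of members with a lower codepoint, and for a character that is a member only the offset is returned.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset of $char within the set, or
# (undef, $insert_position) in list context when it is not a member.
sub invlist_find_member {
    my ($invlist, $char) = @_;
    my $cp     = ord($char);
    my $before = 0;    # members seen so far with a codepoint lower than $cp
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        my ($start, $end) = @{$invlist}[$i, $i + 1];
        last if $cp < $start;                            # falls in a gap before this span
        return $before + ($cp - $start) if $cp < $end;   # inside this span
        $before += $end - $start;
    }
    return wantarray ? (undef, $before) : undef;
}

my @invlist = (ord('0'), ord('9') + 1, ord('a'), ord('z') + 1);   # [0-9a-z]
my ($offset, $ins_pos) = invlist_find_member(\@invlist, 'a');
print "a is at offset $offset\n";                   # a is at offset 10
($offset, $ins_pos) = invlist_find_member(\@invlist, 'A');
print "A would be inserted at $ins_pos\n";          # A would be inserted at 10
```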
negate
my $charset2= $charset->negate;
Return a new charset which contains exactly the characters that this one does not, up to the "max_codepoint" if defined.
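Negation is cheap on an inversion list because it only requires toggling a boundary at codepoint 0. The sketch below shows the idea on a raw list; invlist_negate is a hypothetical helper, and the way it clamps to a maximum codepoint is an assumption about the behavior, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: return a new inversion list describing the complement
# of $invlist, optionally limited to codepoints 0 .. $max_codepoint.
sub invlist_negate {
    my ($invlist, $max_codepoint) = @_;
    my @neg = @$invlist;
    # Toggling membership at codepoint 0 flips every span after it.
    if (@neg && $neg[0] == 0) { shift @neg } else { unshift @neg, 0 }
    if (defined $max_codepoint) {
        # Drop boundaries beyond the limit, then close a span left open.
        pop @neg while @neg && $neg[-1] > $max_codepoint;
        push @neg, $max_codepoint + 1 if @neg % 2;
    }
    return \@neg;
}

my @invlist = (ord('0'), ord('9') + 1, ord('a'), ord('z') + 1);   # [0-9a-z]
my $negated = invlist_negate(\@invlist, 0x7F);
print "@$negated\n";   # 0 48 58 97 123 128
```

Without a maximum codepoint the result has an odd number of entries, meaning the final span is open-ended.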
union
my $charset3= $charset1->union($charset2, ...);
Merge one or more charsets. The result contains every character of any set, but clamped to the max_codepoint of the current set.
The arguments may also be plain inversion list arrayrefs instead of charset objects.
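A union of inversion lists can be computed with a sweep over their boundaries. This is a sketch under stated assumptions, not the module's algorithm: invlist_union is a hypothetical helper that takes the maximum codepoint as its first argument (pass undef for no limit) followed by any number of inversion list arrayrefs.

```perl
use strict;
use warnings;

# Hypothetical helper: merge inversion lists into one, optionally clamped
# to codepoints 0 .. $max_codepoint.
sub invlist_union {
    my ($max_codepoint, @invlists) = @_;
    # Even positions enter a span (+1), odd positions leave it (-1).
    my @events;
    for my $list (@invlists) {
        push @events, [ $list->[$_], $_ % 2 ? -1 : +1 ] for 0 .. $#$list;
    }
    # Sort by codepoint; on ties, enter before leave so adjacent spans merge.
    @events = sort { $a->[0] <=> $b->[0] || $b->[1] <=> $a->[1] } @events;
    my ($depth, @union) = (0);
    for my $event (@events) {
        my ($cp, $delta) = @$event;
        # Record a boundary only when moving between "in no set" and "in some set".
        push @union, $cp
            if ($depth == 0 && $delta > 0) || ($depth == 1 && $delta < 0);
        $depth += $delta;
    }
    if (defined $max_codepoint) {
        pop @union while @union && $union[-1] > $max_codepoint;
        push @union, $max_codepoint + 1 if @union % 2;
    }
    return \@union;
}

my $digits = [ ord('0'), ord('9') + 1 ];
my $lower  = [ ord('a'), ord('z') + 1 ];
my $merged = invlist_union(0x7F, $digits, $lower);
print "@$merged\n";   # 48 58 97 123
```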
AUTHOR
Michael Conrad <mike@nrdvana.net>
VERSION
version 0.02
COPYRIGHT AND LICENSE
This software is copyright (c) 2021 by Michael Conrad.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.