NAME

DBIx::AutoUpgrade::NativeStrings - automatically upgrade Perl native strings to utf8 before sending them to the database

SYNOPSIS

use utf8;
use DBI;
use DBIx::AutoUpgrade::NativeStrings;
use Encode;

my $injector = DBIx::AutoUpgrade::NativeStrings->new(native => 'cp1252');
my $dbh = DBI->connect(@dbi_connection_params);
$injector->inject_callbacks($dbh);

# these strings are semantically equal, but have different internal representations
my $str_utf8   = "il était une bergère, elle vendait ses œufs en ¥, ça paie 5¾ ‰ de mieux qu’en €";
my $str_native = encode('cp1252', $str_utf8, Encode::LEAVE_SRC);

# Oracle example : check if strings passed to the database are equal
my $sql = "SELECT CASE WHEN ?=? THEN 'EQ' ELSE 'NE' END FROM DUAL";
my ($result) = $dbh->selectrow_array($sql, {}, $str_native, $str_utf8); # returns 'EQ'

DESCRIPTION

This module intercepts calls to DBI methods in order to automatically convert Perl native strings to utf8 strings before they reach the DBD driver.

There are two situations where it is useful :

  1. Some DBD drivers do not comply with this DBI specification :

      Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.

    For example, with DBD::Oracle v1.83 and a client charset set to AL32UTF8, native strings containing characters in the range 128 .. 255 are not converted to utf8 strings; characters in that range therefore become Unicode code points in the C1 control codes block, which have no graphical display and do not carry the intended meaning.

  2. Drivers that do attempt to comply with the DBI specification, such as DBD::SQLite or DBD::Pg, automatically upgrade native strings, but they assume that the native character set is iso-8859-1 (Latin-1). However, some platforms have different native character sets; in particular, the default "codepage" on Windows machines is Windows-1252, where code points in the range 128-159 are mapped to various graphical characters. So if your native strings assume Windows-1252 encoding, such characters will not be stored correctly within the database server.
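The difference described in the second situation can be seen with core Perl alone. The following is a minimal sketch, not code taken from the module: it compares Perl's default Latin-1 upgrade with an explicit cp1252 decoding of the byte 0x80, which stands for the euro sign in Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A native (byte) string holding the single byte 0x80,
# which is the euro sign in Windows-1252.
my $native = "\x80";

# Perl's default upgrade assumes Latin-1: byte 0x80 becomes U+0080, a C1 control.
my $as_latin1 = $native;
utf8::upgrade($as_latin1);

# Decoding as cp1252 yields the intended character, U+20AC.
my $as_cp1252 = decode('cp1252', $native);

printf "latin-1 upgrade : U+%04X\n", ord $as_latin1;   # U+0080
printf "cp1252 decode   : U+%04X\n", ord $as_cp1252;   # U+20AC
```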

With the present module, clients explicitly specify the native encoding at initialization time. From that, the module automatically converts native strings to their proper Unicode counterparts before sending them to the database.

Of course this only makes sense when the connection to the database is in Unicode mode. Each DBD driver has its own specific way of setting the character set used for the connection; so be sure to properly tune your DBD driver when using the present module.
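As a rough reminder of what "tuning the driver" can look like, here is a sketch listing connection attributes documented by a few common drivers. The hash name %unicode_attrs and the NLS_LANG value are illustrative assumptions; check each driver's documentation for the setting that matches your version.

```perl
use strict;
use warnings;

# Connection attributes that switch some common drivers to Unicode mode.
# The hash itself is only an illustration, it is not part of any API.
my %unicode_attrs = (
    SQLite => { sqlite_unicode       => 1 },
    Pg     => { pg_enable_utf8       => 1 },
    mysql  => { mysql_enable_utf8mb4 => 1 },
);

# DBD::Oracle takes its client character set from the NLS_LANG environment
# variable, which must be set before connecting (example value).
$ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8';

# Usage (DSN and credentials are placeholders):
# my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
#                        { RaiseError => 1, %{ $unicode_attrs{SQLite} } });
```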

METHODS

new

my $injector = DBIx::AutoUpgrade::NativeStrings->new(%options);

Constructor for a callback injector object. Options are :

native

The name of the native encoding. This should be either

  • a valid Perl encoding name, as listed in Encode::Encodings. Strings will be converted through "decode" in Encode;

  • the string 'locale', which will invoke Encode::Locale to automatically guess what the native encoding is;

  • the string 'default', which will use the default Perl upgrading mechanism through "utf8::upgrade" in utf8. This is the default value. It works well for latin-1 (iso-8859-1), but not for other native encodings.
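The three choices above can be pictured with a small dispatch routine. This is only a sketch of the idea: upgrade_native is a hypothetical helper, not the module's actual code, and the 'locale' branch assumes that Encode::Locale is installed (loading it registers the 'locale' encoding alias).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: upgrade one string in place according to the
# chosen 'native' option. Arguments: string, native option, optional CHECK.
sub upgrade_native {
    my (undef, $native, $check) = @_;
    return if !defined $_[0] || ref $_[0];   # only plain scalars are handled
    return if utf8::is_utf8($_[0]);          # already a utf8 string: nothing to do
    if ($native eq 'default') {
        utf8::upgrade($_[0]);                # Perl's own upgrade, Latin-1 semantics
    }
    else {
        require Encode::Locale if $native eq 'locale';
        $_[0] = decode($native, $_[0], $check);
    }
    return;
}

my $cp1252 = "\xE9t\xE9 \x80";               # "été €" as cp1252 bytes
upgrade_native($cp1252, 'cp1252');           # now "\x{E9}t\x{E9} \x{20AC}"

my $latin1 = "\xE9t\xE9";                    # "été" as Latin-1 bytes
upgrade_native($latin1, 'default');          # same characters, utf8 flag now set
```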

decode_check

A bitmask passed as third argument to "decode" in Encode (see "List of CHECK values" in Encode). Default is undef.
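To illustrate what such a bitmask does, the sketch below calls decode directly with Encode::FB_CROAK combined with Encode::LEAVE_SRC; the same value could presumably be given as decode_check. Byte 0x81 is unassigned in the Unicode Consortium's Windows-1252 table, but how your installed Encode treats it is worth verifying, so the example reports the outcome instead of assuming one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# CHECK bitmask: die on undecodable input, and leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Well-formed cp1252 input decodes normally.
my $ok = decode('cp1252', "caf\xE9", $check);
printf "decoded %d characters\n", length $ok;

# Byte 0x81 is unassigned in the Unicode Consortium's Windows-1252 table.
my $suspect = "caf\xE9 \x81";
my $result  = eval { decode('cp1252', $suspect, $check) };
print defined $result ? "0x81 was accepted\n" : "decode died: $@";
```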

debug

An optional coderef that will be called as $debug->($message). Default is undef. A simple debug coderef could be :

my $injector = DBIx::AutoUpgrade::NativeStrings->new(debug => sub {warn @_, "\n"});

dbh_methods

An optional arrayref containing the list of $dbh method names that will receive a callback. The default list is :

do
prepare
selectrow_array
selectrow_arrayref
selectrow_hashref
selectall_arrayref
selectall_array
selectall_hashref
selectcol_arrayref

sth_methods

An optional arrayref containing the list of $sth method names that will receive a callback. The default list is :

bind_param
bind_param_array
execute
execute_array

bind_type_is_string

An optional coderef that decides what to do with calls to the ternary form of "bind_param" in DBI, i.e.

$sth->bind_param($position, $value, $bind_type);

If $coderef->($bind_type) returns true, the $value is treated as a string and will be upgraded if needed, like arguments to other method calls; if the coderef returns false, the $value is left intact.

The default coderef returns true when the $bind_type is one of the DBI constants SQL_CHAR, SQL_VARCHAR, SQL_LONGVARCHAR, SQL_WLONGVARCHAR, SQL_WVARCHAR, SQL_WCHAR or SQL_CLOB.
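A predicate equivalent to the documented default could be written as follows. This is a sketch based on the list of constants above, not the module's own source; the names %is_string_type and $bind_type_is_string are chosen here for illustration. A coderef of this shape could be adapted and passed as the bind_type_is_string option.

```perl
use strict;
use warnings;
use DBI qw(:sql_types);

# Set of DBI type constants that the documentation lists as string types.
my %is_string_type = map { $_ => 1 } (
    SQL_CHAR, SQL_VARCHAR, SQL_LONGVARCHAR,
    SQL_WLONGVARCHAR, SQL_WVARCHAR, SQL_WCHAR, SQL_CLOB,
);

# Returns 1 if the bind type denotes a string, 0 otherwise.
my $bind_type_is_string = sub {
    my ($bind_type) = @_;
    return defined $bind_type && $is_string_type{$bind_type} ? 1 : 0;
};

printf "SQL_VARCHAR treated as string: %d\n", $bind_type_is_string->(SQL_VARCHAR);
printf "SQL_INTEGER treated as string: %d\n", $bind_type_is_string->(SQL_INTEGER);
```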

inject_callbacks

$injector->inject_callbacks($dbh);

Injects callbacks into the given database handle. If that handle already has callbacks for the same methods, the system will arrange for those other callbacks to be called after all string arguments have been upgraded to utf8.
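The chaining behaviour can be sketched without a database. In the code below, chain_callback is a hypothetical helper and the handle is a plain placeholder string; in real use the resulting coderef would sit in the handle's Callbacks attribute described in the DBI documentation. The sketch uses Perl's default upgrade for simplicity.

```perl
use strict;
use warnings;

# Hypothetical helper: build a callback that upgrades plain string arguments
# in place, then hands over to a previously installed callback, if any.
sub chain_callback {
    my ($previous) = @_;
    return sub {
        # $_[0] is the handle; the remaining elements alias the caller's arguments
        for my $i (1 .. $#_) {
            next if !defined $_[$i] || ref $_[$i];
            utf8::upgrade($_[$i]);
        }
        return $previous->(@_) if $previous;
        return;
    };
}

# A pre-existing callback that records what kind of string it receives.
my @seen;
my $existing = sub {
    push @seen, utf8::is_utf8($_[1]) ? 'utf8' : 'native';
    return;
};
my $callback = chain_callback($existing);

my $sql = "SELECT '\xE9'";                 # native string, no utf8 flag yet
$callback->('placeholder-handle', $sql);
print "existing callback saw a $seen[0] string\n";
```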

ARCHITECTURAL NOTES

Object-orientedness

Although I'm a big fan of Moose and its variants, the present module is implemented in POPO (Plain Old Perl Object) : since the object model is extremely simple, there was no reason to use a sophisticated object system.

Strings are modified in-place

String arguments to DBI methods are modified in-place. It is unlikely that this would affect your client program, but if it does, you need to make your own string copies before passing them to the DBI methods.
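The effect, and the copy-based workaround, can be sketched as follows. Here upgrade_in_place is a hypothetical stand-in for what an injected callback does to one argument, assuming native => 'cp1252'; passing $original directly instead of the copy would change $original itself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for what an injected callback does to one argument
# (native => 'cp1252' assumed): the aliased element of @_ is overwritten.
sub upgrade_in_place {
    $_[0] = decode('cp1252', $_[0]) unless utf8::is_utf8($_[0]);
    return;
}

my $original = "\x9C";         # "oe" ligature as a single cp1252 byte
my $for_dbi  = $original;      # private copy to hand to the DBI method
upgrade_in_place($for_dbi);

printf "original: U+%04X\n", ord $original;   # U+009C, still the raw byte
printf "copy    : U+%04X\n", ord $for_dbi;    # U+0153, the decoded character
```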

Possible redundancies

DBI does not precisely document which of its public methods call each other. For example, one would think that execute() internally calls bind_param(), but this does not seem to be the case. So, to be on the safe side, callbacks installed here make no assumptions about string transformations performed by other callbacks. There might be some redundancies, but it does no harm since strings are never upgraded twice.

Caveats

The bind_param_inout() method is not covered -- the client program must upgrade the strings itself if that method is used to send strings to the database.

AUTHOR

Laurent Dami, <dami at cpan.org>

COPYRIGHT AND LICENSE

Copyright 2023 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.