NAME

Text::CSV::Separator - Determine the field separator of a CSV file

SYNOPSIS

  use Text::CSV::Separator qw(get_separator);
  
  my @char_list = get_separator(
                                  path => $csv_path,
                                  exclude => $array1_ref, # optional
                                  include => $array2_ref, # optional
                               );

  
  my $char_count = @char_list;
  
  my $separator;
  if ($char_count == 1) {       # successful detection, we've got a winner
    $separator = $char_list[0];
    
  } elsif  ($char_count > 1) {  # several candidates passed the tests
    warning message or any other action
    
  } else {                      # none of the candidates passed the tests
    warning message or any other action
    
  }

DESCRIPTION

This module provides a fast detection of the field separator character (also called field delimiter) of a CSV file, or more generally, of a character separated text file (also called delimited text file), and returns it ready to use in a CSV parser (e.g., Text::CSV_XS, Tie::CSV_File, or Text::CSV::Simple). This may be useful to the vulnerable population of programmers who need to process automatically CSV files from different sources.

The default set of candidates contains the following characters: ',' ';' ':' '|' '\t'

The only required parameter is the CSV file path. Optionally, the user can specify characters to be excluded or included in the list of candidates.

The routine returns an array containing the list of candidates that passed the tests. If it succeeds, this array will contain only one value: the field separator we are looking for.

The technique used is based on the following principle:

  • For every line in the file, the number of instances of the separator character acting as separators must be an integer constant > 0 , although a line may also contain some instances of that character as escaped literal characters.

  • Most of the other candidates won't appear in a typical CSV line.

The candidates will be removed from the candidates list as soon as they miss a line.

This is the first test done to the CSV file. In most cases, it will detect the separator after processing the first few lines. In particular, if the file contains a header line, one line will probably be enough to get the job done. Processing will stop and return control to the caller as soon as the program reaches a status of 1 single candidate (or 0 candidates left).

If the routine cannot determine the separator in the first pass, it will do a second pass based on a heuristic technique: even if the other candidates appear in every line, their count will likely vary significantly in the different lines. So it measures the variability of the remaining candidates and returns the list of possible separators sorted by their likelihood, being the first array item the most probable separator. Since this is a rule of thumb, you can always create a CSV file that breaks this logic. Nevertheless, it will work correctly in many cases. The possibility of excluding some of the default candidates may help to resolve cases with several possible winners.

EXAMPLE

Consider the following scenario: Your program must process a batch of csv files, and you know that the separator could be a comma, a semicolon or a tab. You also know that one of the fields contains time values. This field will provide a fixed number of colons that could mislead the detection code. In this case, you should exclude the colon (and you can also exclude the other default candidate not considered, the pipe character):

  my @char_list = get_separator(path => $csv_path, exclude => [':', '|']);

  my $char_count = @char_list;
  
  my $separator;
  if ($char_count == 1) {       
    $separator = $char_list[0];
  } 
  ...

MOTIVATION

Despite the popularity of XML, the CSV file format is still widely used for data exchange between applications, because of its much lower overhead: it requires much less bandwidth and storage space than XML, and it also has a better performance under compression.

Unfortunately, there is no formal specification of the CSV format. The Microsoft Excel implementation is the most widely used and it has become a de facto standard, but the variations are almost endless.

One of the differences is the field separator character used. CSV stands for "comma-separated values", but most of the spreadsheet applications let the user select the field delimiter from a list of several different characters when saving or exporting data to a CSV file. Furthermore, in a Windows system, when you save a spreadsheet in Excel as a CSV file, Excel will use as the field delimiter the default list separator of your system's locale, which happens to be a semicolon for several European languages. You can even customize this setting and use the list separator you like. That's why this particular difference can make it really hard to process heterogeneous CSV files automatically.

This module can be used to determine the separator character of a delimited text file of any kind, but since the aforementioned ambiguity problems occur mainly in CSV files, I decided to use the Text::CSV:: namespace.

EXPORT

None by default.

SEE ALSO

There's another module in CPAN for this task, Text::CSV::DetectSeparator, which follows a different approach.

AUTHOR

Enrique Nell, <enell@cpan.org>

ACKNOWLEDGEMENTS

Many thanks to Xavier Noria for wise suggestions.

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Enrique Nell.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.