Sequence or Array

What is the point of a sequence anyway, why you might wanna use it, and in which cases you don't want to use? Let my try to explain.

Reading a file

Let's say we wanna read a file from disk. We assume it is some kind of format we maybe need to parse or process somehow. Then we use every line to create or generate some accumulated data from it. So overall we do.

  • Read the whole file into memory

  • Do a computation/transformation on every entry

  • Maybe do some kind of accumulation

A pure perl solution would look similar to this.

use File::Slurp;

my @file_content = read_file('...');
my @transform =
    grep { ... }
    map  { ... }
        @file_content;

my $accum = ...;
for my $t ( @transform ) {
    # do something with $t & $accum
}

This kind of code works but it also has some drawbacks.

Memory consumption

The whole file needs to be read into memory. When you have a relative small file this is fine. But this changes when you need to process very large files of several gigabytes in size or query millions of data from a database.

Execution Order

I consider this less of an issue as i am used to it now, but still sometimes it is annoying that the order of grep and map is swapped. The code first map some data, and then it does grep for filtering. Still the order we write or read is reversed. Sometimes i like this order better, sometimes not.

Inefficent

This code also can become inefficent for the CPU. Even though today we have SSD or NVME and disk throughput and access time has greatly improved, from the perspective of a CPU it is usually still waiting for data.

This means that reading a whole file into memory will probably just cause 1% of CPU usage and the processor will likely spin 99% idling doing nothing and waiting for data to arrive into memory.

Once all data are loaded into memory and accessible in the array, computing becomes very fast. But the overall speed of your program still can be slower because you still spent most time reading a file and waiting without your CPU doing anything.

Sq Array

Now let's assume we rewrite the pure perl code we have so far using Array. Because we still use Array it will still have the same disadvantages. But let's see how it will look using Sq.

use Sq;
use File::Slurp;

my $file_content = Array->bless([read_file('...')]);
my $accum =
    $file_content
        ->map(   sub($x) { ... })
        ->filter(sub($x) { ... })
        ->fold($state, sub($t,$state) {
            # do something with $t & $state
        });
Execution Order

One thing that changed is the order in which we write our code. Now it becomes clear that we first map every entry of an array. Then we filter out some elements and finally do a fold that transforms a whole array into something different.

More functions

Perl itself only provides map and grep as functions operating on Arrays, but the Array module shipped with Sq will have around 100 of common Array transformations you can use. Those are common problems you either have to write again and again in Perl yourself, or you just use Sq.

Code readability

Accessing transformations by a name also makes code better understandable. Without such pre-defined functions all you do is basically always again and again traversing arrays with a for loop usually to transform your data.

But the reader has to decipher the meaning of a for loop. For example calling min on an Array will be much more better understandable as trying to understand the for loop that tries to implement the selection of a minimum value.

Speed & Performance

When we talk about this topic very often people only care for the performance in the sense of how much CPU time something take. But programming is not all about the CPU. As a human I also care how much time I need to write and come up with a correct solution that works.

Using Sq can be faster as you just can access a library of common data-transformations, but this library heavily depends on the usage of anonymous functions and calling functions. In today age it seems that the cost of just calling a function is sometimes underestimated. Every function call needs to push values onto a stack, needs jump instructions and so on. Calling functions can become a serious overhead, especially when we have code in loops and process/iterate millions of items.

Still performance mustn't be bad. Functions in Sq are written to be as efficent as possible, usually trying to benchmark things, selection of proper algorithm and so on. So you can depend on having functions that are reasonable fast. The only overhead you usually pay is the price of calling lambda functions.

Maybe your hand written implementation could be faster, maybe not. It is still a good thing relying on written and tested solutions and writing just one line of code instead of trying to unroll every function call to 10+ lines of code again and again.

No forced paradigma

But consider the following thing. Sq does not try to enforce a paradigm on you, it tries to teach you just another paradigm that you maybe like because of it's advantages.

You can start using Sq, create code that is fast written for you, does what it should do and is maybe fast enough. If not, you maybe can Profile your code and always create the Pure Perl versions of those operations that needs to be faster because they are critical.

It is better to only unroll code at places where it really matters.

Line based Code

Now let's fix the problems we have with the first original pure Perl version. Instead of reading a whole file into memory we go through everything line based. Let's compare how the code changes.

We started with the following code.

use File::Slurp;

my @file_content = read_file('...');
my @transform =
    grep { ... }
    map  { ... }
        @file_content;

my $accum = ...;
for my $t ( @transform ) {
    # do something with $accum
}

now we want to iterate line by line and do the transformation, filtering and creating of $accum as efficent as possible. Now we have code like this.

my $fh = open my $fh, '<', '...';

my $accum = ...;
while ( my $line = <$fh> ) {
    # this is basically the map call
    my $transform = transform($line);

    # this is basically grep
    if ( $transform ... ) {
        # do something with $transform + $accum
    }
}

This kind of code fixes the problem we had with the first solution. That means.

Just line based

Instead of reading the whole file into memory we just process it line by line. We could easily process a 10 GiG file this way. It completely depends on $accum and what we create here and how big that data is if our process fails or not. But consider in the first version we would need the whole file into memory + $accum. In this solution it is just $accum + 1 line.

Execution Order

Execution Order is now in order, we first see that we transform, then we filter. But still this code suffers from what i have deescribed above about for loops being hard to read. It becomes a lot harder to decipher as a human reading code what our while loop actually does.

Efficent

This code can be a lot more efficent, especially when we read something from a file. Instead that the processor now waits a lot of time for data to arrive, we immediately execute some code and create/process data to create $accum instead of just waiting for data to arrive.

Reading something line based and directly doing something with it is a great way to speed up all kind of I/O processing, because while our code processes data our operating system is not sleeping. Usually the operating system starts to fill buffers so when a program ask for the next line it immediately can deliver that line without waiting.

This kind of buffers not only exists in different Perl IO Layers but also goes down back to the operating system, so the overal time our program needs to run will very likely become shorter. Instead of idling/waiting we now use that time to already process data.

Sequence based solution

But you know what I didn't liked at all? It's how different we need to write code. I like writing code with map and grep and some kind of helper functions from List::Util and so on. But when we need to write code that cannot read everything into memory at once than we cannot use those functions.

Then suddenly we have to unroll all those functions and write code that looks like primitive C code.

Now let's look what happens when we replace it with Seq.

Here again is the code we had using Array from Sq.

use Sq;
use File::Slurp;

my $file_content = Array->bless([read_file('...')]);
my $accum =
    $file_content
        ->map(   sub($x) { ... })
        ->filter(sub($x) { ... })
        ->fold($state, sub($t,$state) {
            # do something with $t & $state
        });

And here is the sequence based solution that don't need to read the whole file into memory all at once.

use Sq;

my $file_content = Sq->io->open_text('...');
my $accum =
    $file_content
        ->map(   sub($x) { ... })
        ->filter(sub($x) { ... })
        ->fold($state, sub($t,$state) {
            # do something with $t & $state
        });

So what changed? Well not much. Sq->io->open_text returns a sequence that iterates line by line through a file. But Seq has the same API* as Array. So the code looks exactly the same, but it has all the benefits of the Line Based Code.

It might be correct that Seq has some overhead and all that lambda calling makes it slower. But this is only compared to an Array. When we use Seq on a file, socket or any other kind of Slow IO then we just trade a little bit more CPU consumption.

But this is totally fine. The purpose of CPUs is to compute things, not idling around doing nothing.

What if Seq is not fast enough?

Well than you have to unroll the Seq code into pure perl code and try to eleminate any function calling. That's how you speed up everything as much as possible in Perl.

But i consider that using Seq or Sq in general will help you to write code faster, still get a performance gain in most scenarios and transition between different versions are easier. Like switching from Array to Seq.

When something is not fast enough you can Profile your code. Look at Devel::NYTProf for this. And then just unroll the Seq or Array code into pure perl.

But it helps having code written with Seq or Array and then turn it into a series of single steps. Especially when you have a test-suite, tested your abstract code and then re-write some functions for speed.

*Same API

It is not quite true that Seq and Array have the same API. Some operation only makes sense on the one or the other. For example Array provides functions like blitting and also some mutation that are not present on Seq and cannot really implemented for an iterative solution.

Same goes the other way around as Seq has the ability to handle infinite sequences that cannot be handled with an Array.

But this is not an disadvantage. Some operations only really makes sense on different kind of data-structures, when you need them, you can use them.

But most functions exists on both data-structures and you can easily swap between the data-structures. On top also consider that you always can use methods like to_array to transform a sequence to an array if needed, or use Seq->from_array to turn an Array efficently into a Seq.

So it is up to you to pick the right data-structure at the right place to create good performing programs.