Sequence or Array
What is the point of a sequence anyway, why might you want to use one, and in which cases don't you want to use it? Let me try to explain.
Reading a file
Let's say we want to read a file from disk. We assume it is in some kind of format we may need to parse or process somehow. Then we use every line to create or generate some accumulated data from it. So overall we:
Read the whole file into memory
Do a computation/transformation on every entry
Maybe do some kind of accumulation
A pure Perl solution would look similar to this:
use File::Slurp;

my @file_content = read_file('...');

my @transform =
    grep { ... }
    map  { ... }
    @file_content;

my $accum = ...;
for my $t ( @transform ) {
    # do something with $t & $accum
}
This kind of code works but it also has some drawbacks.
- Memory consumption:
  The whole file needs to be read into memory. When you have a relatively small file this is fine. But this changes when you need to process very large files of several gigabytes in size or query millions of rows from a database.

- Execution Order:
  I consider this less of an issue as I am used to it by now, but it is still sometimes annoying that the order of grep and map is swapped. The code first maps some data, and then it does grep for filtering. Still, the order in which we write or read it is reversed. Sometimes I like this order better, sometimes not. (See the sketch after this list.)

- Inefficient:
  This code can also become inefficient for the CPU. Even though today we have SSD or NVMe drives and disk throughput and access times have greatly improved, from the perspective of a CPU it is usually still waiting for data.

  This means that reading a whole file into memory will probably cause just 1% of CPU usage while the processor spins the other 99% idling, doing nothing and waiting for data to arrive in memory.

  Once all data is loaded into memory and accessible in the array, computing becomes very fast. But the overall speed of your program can still be slower, because you spent most of the time reading a file and waiting, without your CPU doing anything.
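Here is a small, self-contained sketch of that reversed reading order. The numbers and operations are made up purely for illustration:

# executes bottom to top: map doubles first, grep filters second,
# even though grep is written first
my @result =
    grep { $_ > 10 }    # step 2: keep only values above 10
    map  { $_ * 2 }     # step 1: double every number
    (1 .. 10);

# @result is (12, 14, 16, 18, 20)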
Sq Array
Now let's assume we rewrite the pure Perl code we have so far using Array. Because we still use an Array it will still have the same disadvantages. But let's see how it looks using Sq.
use Sq;
use File::Slurp;

my $file_content = Array->bless([read_file('...')]);

my $accum =
    $file_content
    ->map(   sub($x) { ... })
    ->filter(sub($x) { ... })
    ->fold($state, sub($t,$state) {
        # do something with $t & $state
    });
- Execution Order:
  One thing that changed is the order in which we write our code. Now it becomes clear that we first map every entry of an array. Then we filter out some elements and finally do a fold that transforms the whole array into something different.

- More functions:
  Perl itself only provides map and grep as functions operating on arrays, but the Array module shipped with Sq has around 100 common array transformations you can use. These are common problems you either have to solve again and again in Perl yourself, or you just use Sq.

- Code readability:
  Accessing transformations by name also makes code easier to understand. Without such pre-defined functions, all you basically do, again and again, is traverse arrays with a for loop to transform your data. But the reader has to decipher the meaning of every for loop. For example, calling min on an Array is much easier to understand than a for loop that implements the selection of a minimum value. (See the sketch after this list.)

- Speed & Performance:
  When we talk about this topic, people very often only care about performance in the sense of how much CPU time something takes. But programming is not all about the CPU. As a human I also care about how much time I need to write and come up with a correct solution that works.

  Using Sq can be faster to develop with, as you have access to a library of common data transformations, but this library heavily depends on the use of anonymous functions and on function calls. These days the cost of just calling a function seems to be underestimated. Every function call needs to push values onto a stack, needs jump instructions and so on. Calling functions can become a serious overhead, especially when we have code in loops and process/iterate millions of items.

  Still, performance doesn't have to be bad. Functions in Sq are written to be as efficient as possible, usually by benchmarking, picking proper algorithms and so on. So you can rely on having functions that are reasonably fast. The only overhead you usually pay is the price of calling lambda functions. Maybe your hand-written implementation could be faster, maybe not. It is still a good thing to rely on written and tested solutions and to write just one line of code instead of unrolling every function call into 10+ lines of code again and again.

- No forced paradigm:
  Sq does not try to enforce a paradigm on you; it tries to teach you another paradigm that you may come to like because of its advantages. You can start using Sq, create code that is fast to write, does what it should do and is maybe fast enough. If not, you can profile your code and create pure Perl versions of only those operations that need to be faster because they are critical. It is better to only unroll code at places where it really matters.
Line-based Code
Now let's fix the problems we had with the original pure Perl version. Instead of reading the whole file into memory we go through it line by line. Let's compare how the code changes.
We started with the following code:
use File::Slurp;

my @file_content = read_file('...');

my @transform =
    grep { ... }
    map  { ... }
    @file_content;

my $accum = ...;
for my $t ( @transform ) {
    # do something with $accum
}
Now we want to iterate line by line and do the transformation, filtering and creation of $accum as efficiently as possible. The code then looks like this:
open my $fh, '<', '...' or die "Cannot open file: $!";

my $accum = ...;
while ( my $line = <$fh> ) {
    # this is basically the map call
    my $transform = transform($line);

    # this is basically grep
    if ( $transform ... ) {
        # do something with $transform + $accum
    }
}
This kind of code fixes the problems we had with the first solution. That means:
- Just line based:
  Instead of reading the whole file into memory we just process it line by line. We could easily process a 10 GiB file this way. Whether our process fails or not depends entirely on $accum, on what we create here and how big that data becomes. But consider: in the first version we needed the whole file in memory plus $accum. In this solution it is just $accum plus one line.

- Execution Order:
  Execution order is now in order: we first see that we transform, then that we filter. But this code still suffers from what I described above about for loops being hard to read. It becomes a lot harder for a human reading the code to decipher what our while loop actually does.

- Efficient:
  This code can be a lot more efficient, especially when we read something from a file. Instead of the processor waiting a long time for data to arrive, we immediately execute some code and create/process data to build $accum instead of just waiting.

  Reading something line by line and directly doing something with it is a great way to speed up all kinds of I/O processing, because while our code processes data the operating system is not sleeping. Usually the operating system starts to fill buffers, so when a program asks for the next line it can deliver that line immediately, without waiting.

  These kinds of buffers exist not only in the different Perl IO layers but also go all the way down to the operating system, so the overall time our program needs to run will very likely become shorter. Instead of idling/waiting we now use that time to already process data.
Sequence-based solution
But you know what I didn't like at all? How differently we need to write the code. I like writing code with map and grep and helper functions from List::Util and so on. But when we need to write code that cannot read everything into memory at once, we cannot use those functions.
Then suddenly we have to unroll all of them and write code that looks like primitive C code.
Now let's look at what happens when we replace it with Seq.
Here again is the code we had using Array from Sq:
use Sq;
use File::Slurp;

my $file_content = Array->bless([read_file('...')]);

my $accum =
    $file_content
    ->map(   sub($x) { ... })
    ->filter(sub($x) { ... })
    ->fold($state, sub($t,$state) {
        # do something with $t & $state
    });
And here is the sequence-based solution that doesn't need to read the whole file into memory all at once:
use Sq;

my $file_content = Sq->io->open_text('...');

my $accum =
    $file_content
    ->map(   sub($x) { ... })
    ->filter(sub($x) { ... })
    ->fold($state, sub($t,$state) {
        # do something with $t & $state
    });
So what changed? Well, not much. Sq->io->open_text returns a sequence that iterates line by line through a file. But Seq has the same API* as Array. So the code looks exactly the same, but it has all the benefits of the line-based code.
It might be correct that Seq has some overhead and all that lambda calling makes it slower. But that is only in comparison to an Array. When we use Seq on a file, a socket or any other kind of slow IO, we just trade a little more CPU consumption.
And this is totally fine. The purpose of CPUs is to compute things, not to idle around doing nothing.
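To get a feeling for what that lambda-calling overhead means, here is a small sketch using the core Benchmark module. The data and the doubling operation are made up for illustration; it only demonstrates that one subroutine call per element costs measurably more than an inline block:

use Benchmark qw(cmpthese);

my @nums   = 1 .. 10_000;
my $double = sub { $_[0] * 2 };

cmpthese(-1, {
    # inline block: no extra subroutine call per element
    inline => sub { my @r = map { $_ * 2 } @nums },
    # lambda: one subroutine call per element
    lambda => sub { my @r = map { $double->($_) } @nums },
});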
What if Seq is not fast enough?
Well, then you have to unroll the Seq code into pure Perl code and try to eliminate any function calls. That's how you speed things up as much as possible in Perl.
But I consider that using Seq, or Sq in general, will help you to write code faster, still get a performance gain in most scenarios, and make the transition between different versions easier, like switching from Array to Seq.
When something is not fast enough you can profile your code; look at Devel::NYTProf for this. Then just unroll the Seq or Array code into pure Perl.
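As a quick reference, this is the usual command-line workflow for Devel::NYTProf; the script name is a placeholder:

# run the program under the profiler; writes nytprof.out
perl -d:NYTProf my_script.pl

# generate HTML reports from nytprof.out
nytprofhtml

# then open nytprof/index.html in a browser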
But it helps to have code written with Seq or Array first and only then turn it into a series of single steps, especially when you have a test suite, have tested your abstract code, and then re-write some functions for speed.
*Same API
It is not quite true that Seq and Array have the same API. Some operations only make sense on one or the other. For example, Array provides functions like blitting and also some mutating operations that are not present on Seq and cannot really be implemented for an iterative solution.
The same goes the other way around, as Seq has the ability to handle infinite sequences that cannot be represented with an Array.
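For illustration, here is a sketch of what working with an infinite sequence could look like. The unfold constructor and its signature are an assumption on my part (borrowed from F#); check the Sq documentation for the real API:

use Sq;

# ASSUMPTION: an F#-style unfold that returns a value and the
# next state on every step; the real Sq constructor may differ.
my $naturals = Seq->unfold(1, sub($state) {
    return $state, $state + 1;   # value, next state
});

# nothing is computed until we pull values out of the sequence
my $first_ten = $naturals->take(10)->to_array;   # [1 .. 10]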
But this is not a disadvantage. Some operations only really make sense on a particular kind of data structure; when you need them, you can use them.
But most functions exist on both data structures and you can easily swap between them. On top of that, consider that you can always use methods like to_array to transform a sequence into an array if needed, or use Seq->from_array to efficiently turn an Array into a Seq.
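As a small sketch of swapping between the two structures (the method names are the ones mentioned above; the file path is a placeholder):

use Sq;

# lazy, line-by-line sequence over a file
my $lines = Sq->io->open_text('...');

# materialize into an Array when random access is needed
my $array = $lines->to_array;

# and back: wrap the Array in a lazy Seq view again
my $seq = Seq->from_array($array);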
So it is up to you to pick the right data structure at the right place to create well-performing programs.