Context: Currently, my team has a Perl script that does many things, one of which is storing 1-byte hex values in a hash (an array yields similar results). The input data can range from hundreds of MB to tens of GB. At the moment, when we run a 1 GB input (1 billion entries), the script takes about 10 minutes and then errors out after using all 16 GB of my RAM. I'm told a 1 GB input can expand to nearly 300 GB.
We then wrote a comparable C program, and found that it takes a few minutes and uses only 1.1 GB.
I wrote the code below simply to test how C and Perl perform when writing 1 billion values. I'm finding that the Perl code takes around 186 seconds and >70 GB of memory to run, while the C code takes only two seconds and 1 GB. I use time and memusage to take the measurements.
Question: Is Perl actually this slow and bad at memory management, or am I missing something? The literature I've read online says that Perl should be slower because of the flexibility it provides, but not dramatically slower, since it is written in C.
Perl code example of memory usage:
use strict;
use warnings;
my @list;
for (my $i = 0; $i < 1_000_000_000; $i++) {
    $list[$i] = 1; # 1 is just to simulate some data.
}
print 'done';
C code:
#include <stdio.h>
#include <stdlib.h>

int main() {
    int size = 1000000000;
    unsigned char *data = malloc(size * sizeof(unsigned char));
    unsigned char byte = 'a';
    int address = 0;
    while (address < size) {
        data[address] = byte;
        address++;
    }
    printf("done %i.\n", address);
    return 0;
}
I also tried Python, which was worse than Perl in terms of speed.
data = []
d = format(231, '#04x')
address = 0
while address < 1000000000:
    data.append(d)
    address += 1
print("done")
while(1):
    continue  # keep the process alive (e.g. to observe memory usage)
Note: I haven't used a profiler yet, since the evaluation code is simple.
Because of these performance issues, I found a tool called SWIG that allows me to wrap C code and call it from Perl; however, I have some follow-up questions about it. :)
CodePudding user response:
An array (a scalar of type SVt_PVAV) takes 64 bytes on my system.
$ perl -Mv5.10 -MDevel::Size=size -e'my @a; say size( \@a );'
64
This includes the fields common to all variables (refcount, variable type, flags, pointer to the body), plus the fields specific to SVt_PVAV (total size, size used, pointer to the underlying array of pointers).
This doesn't include the actual pointers to the scalars it contains.
The size of a scalar that can only contain an integer (SVt_IV) is 24 bytes on my system.
$ perl -Mv5.10 -MDevel::Size=size -e'my $i = 1; say size( $i );'
24
This includes the fields common to all variables (refcount, variable type, flags, pointer to the body), plus the fields specific to SVt_IV (the integer).
So we're talking about 64 + 1,000,000,000 * ( 8 + 24 ) = 32e9 bytes. Plus intentional over-allocation of the array (to avoid having to realloc each time you add an element). Plus the overhead of 1,000,000,003 memory blocks. It's not unimaginable that this would take a total of 70e9 bytes.
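If you want to sanity-check those per-element numbers without allocating 32 GB, you can run the same kind of measurement on a much smaller array. Here is a minimal sketch, assuming Devel::Size is installed; it uses total_size, which (unlike size) does include the contained scalars:

use strict;
use warnings;
use v5.10;
use Devel::Size qw(total_size);

my @list;
$list[$_] = 1 for 0 .. 999_999;    # 1 million elements instead of 1 billion

my $bytes = total_size( \@list );  # includes the contained scalars
printf "%d bytes total, about %.1f bytes per element\n",
    $bytes, $bytes / scalar(@list);

The per-element figure it reports should be close to the 8 + 24 bytes estimated above (pointer in the array plus the SVt_IV body), with the difference coming from over-allocation.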
As for the speed, all these allocations add up. And of course, you're doing arithmetic on a scalar, not an int. This involves pointers, type checks and flag checks every single time you increment it.
There is a price to the convenience of variables which can hold data of any type, arrays that can be expanded at will, and automatic memory deallocation. But the benefits are also immense.
CodePudding user response:
It's not "bad at memory management", but this is a bad idea due to use of an inappropriate data structure. A Perl hash on a 64-bit system has very approximately 120 * n 120 * ⌈log2(n)⌉
bytes of overhead, in addition to the size of the keys and values stored. If you suppose your keys are 4 bytes, your values are 1 byte, and you have a billion of them, then your real information content is 5 gigabytes, and the overhead is 120 gigabytes (plus small change).
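As a rough check of that per-entry figure (the exact constants vary by Perl version and build), you can measure a much smaller hash with Devel::Size and extrapolate; a minimal sketch, assuming Devel::Size is installed:

use strict;
use warnings;
use Devel::Size qw(total_size);

my %h;
$h{ sprintf "%04x", $_ } = "x" for 0 .. 0xFFFF;   # 65,536 entries: 4-byte keys, 1-byte values

my $bytes = total_size( \%h );   # includes keys and values, not just the hash structure
printf "%d bytes total, about %.0f bytes per entry\n",
    $bytes, $bytes / scalar( keys %h );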
This overhead is used for the things that make Perl convenient to work with: dynamic typing, automatic reference counting, etc. And in many reasonable situations it doesn't cause any problem. If you store things on the order of a thousand bytes then the overhead is 10% instead of 2400%. If you only store a hundred small things, then you might not care that you're using 12kB to do so.
But if you're pushing the limits a little bit, then you need to be more creative in coming up with something that suits your application instead of a one-size-fits-all hash table. I can't give specific advice here because the right answer depends on details of what you're storing and how it needs to be accessed, beyond what you've given. It could be as simple as a single 1GB string accessed using substr, which will still only take up 1.0001GB even in Perl, or a radix tree approach that would use less space if the key space is sparse.
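For concreteness, here is a minimal sketch of the packed-string idea, assuming the key can be reduced to an integer index (mapping your real keys to indices is the part that depends on your data). The set_byte/get_byte helpers are just illustrative names:

use strict;
use warnings;

my $slots = 1_000_000;          # use 1_000_000_000 for the real thing (~1 GB)
my $data  = "\0" x $slots;      # one byte per slot, preallocated as a single string

sub set_byte { substr( $data, $_[0], 1 ) = chr( $_[1] ) }   # write one byte at an index
sub get_byte { ord substr( $data, $_[0], 1 ) }              # read one byte back

set_byte( 42, 0xE7 );
printf "slot 42 holds 0x%02x\n", get_byte( 42 );

With one byte per possible key, lookups are constant-time and the memory footprint is essentially the string itself plus a few dozen bytes of scalar overhead.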
Of course, if you have a working C version, you should feel free to use that, and you can call it from Perl.
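SWIG (mentioned in the question) is one route; another common one is the Inline::C module from CPAN. The following is only a hedged sketch of that approach, assuming Inline::C and a C compiler are available; fill_bytes() is a hypothetical stand-in, not your team's actual routine:

use strict;
use warnings;
use Inline C => <<'END_C';
#include <stdlib.h>

/* Stand-in for the real C routine: allocate n bytes, fill them, return the last value. */
int fill_bytes(int n) {
    unsigned char *data = malloc(n);
    if (data == NULL) return -1;
    for (int i = 0; i < n; i++) data[i] = 'a';
    int last = data[n - 1];
    free(data);
    return last;
}
END_C

print "last byte: ", fill_bytes(1_000_000), "\n";

Inline::C compiles the embedded C on first run and binds functions with simple signatures (like int fill_bytes(int)) so they can be called directly from Perl.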