Performance improvements when using template-haskell to deserialize a data type

Time:02-21

I'm compiling a Haskell executable that, on startup, reads about 50MB of data from the file system that has been serialized using the serialise package and then applies some transformations to it before continuing.

I'd like to improve the startup speed of the executable. In theory I could use Template Haskell to deserialize the files at compile time and emit the result as data constructor applications. But would this actually improve performance? If most of the time goes to calling the data constructors (that is, if the file I/O and deserialization are fast), then it wouldn't be worth it; if calling the data constructors is fast, then it may be.

Also, does GHC have any notion of compile-time evaluation for large data structures? I.e., if I have something of type [Foo] that is known at compile time and contains ~50MB of data, is there any way for the executable to contain it precompiled in whatever the Haskell equivalent of the stack is, or will it be lazily evaluated like everything else?

Thanks in advance for your help & advice!

CodePudding user response:

I'm pessimistic. You seem unlikely to save time on file I/O: if you deserialize 50MB worth of stuff at compile time, you have to bake that into the executable, and it will probably get about 50MB larger, assuming that the serialization format and GHC's format are both reasonably efficient encodings. Thus, loading the executable into memory will get slower, by about the amount of time you were previously spending on reading the data file.
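For reference, here is a minimal sketch of what "baking the data in" could look like, assuming the element type has a Lift instance. The name `table` and the tiny generated list are hypothetical stand-ins; in the real case the `runIO` action would read and deserialize the data file, and the `Foo` type with its `Lift` instance would have to live in a separate module because of GHC's stage restriction.

```haskell
{-# LANGUAGE TemplateHaskell #-}
module Main (main) where

import Language.Haskell.TH.Syntax (lift, runIO)

-- Hypothetical stand-in for the real data. The IO action runs at compile
-- time; here it just builds a small list, whereas the real version would
-- read and deserialize the file. `lift` then turns the resulting value
-- into an expression that is spliced into this module.
table :: [(Int, String)]
table = $(runIO (pure [(n, show n) | n <- [1 .. 5 :: Int]]) >>= lift)

main :: IO ()
main = print (length table)  -- prints 5
```

Note that the splice expands to an ordinary (and, for 50MB of data, enormous) list expression that GHC still has to compile, so compile times may suffer considerably.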

Likewise, GHC will have to deserialize whatever format it uses to bake the data into the executable. A program could avoid this if the in-memory data structure were identical to the on-disk representation, but I can't imagine that being the case, since the normal in-memory representation is rife with pointers. Here again, it seems likely that GHC's internal format is not much cheaper to deserialize than CBOR, so any costs you avoid by not reading the file, you will incur by making the executable slower to prepare.
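Before investing in any of this, it may be worth measuring where the startup time actually goes. The following is a rough sketch, not a proper benchmark, assuming the serialise, bytestring, and deepseq packages are available. The file name `sample.bin`, the `[Int]` payload, and the `map (* 2)` transformation are hypothetical placeholders for your real data and processing.

```haskell
module Main (main) where

import Codec.Serialise (deserialise, writeFileSerialise)
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import GHC.Clock (getMonotonicTime)
import Text.Printf (printf)

-- | Run an action and return its result with the elapsed wall-clock seconds.
timed :: IO a -> IO (a, Double)
timed act = do
  t0 <- getMonotonicTime
  x  <- act
  t1 <- getMonotonicTime
  pure (x, t1 - t0)

main :: IO ()
main = do
  let path = "sample.bin"  -- hypothetical file name
  -- Write some sample data so the sketch runs on its own.
  writeFileSerialise path ([1 .. 1000000] :: [Int])

  -- Phase 1: raw file I/O (a strict read, so all bytes are in memory).
  (bytes, tRead) <- timed (BS.readFile path)

  -- Phase 2: decoding, forced to normal form so laziness cannot defer it.
  (xs, tDecode) <- timed
    (evaluate (force (deserialise (BL.fromStrict bytes) :: [Int])))

  -- Phase 3: placeholder for the transformations applied after loading.
  (total, tTransform) <- timed (evaluate (sum (map (* 2) xs)))

  printf "read: %.3fs  decode: %.3fs  transform: %.3fs  (checksum %d)\n"
    tRead tDecode tTransform total
```

If the read and decode phases turn out to be a small fraction of the total, embedding the data at compile time cannot help much, whatever GHC does with it.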
