Why is the Avro or Parquet format faster than CSV?


I have read articles saying that CSV is slow and a poor fit for large datasets, but I can't understand what it is about Avro/Parquet internally that makes them faster than CSV for larger datasets.

Thanks in advance.

CodePudding user response:

The ordering of preferred data formats (in a Hadoop context) is typically ORC, Parquet, Avro, SequenceFile, then PlainText.

The primary reason against CSV is that it is just text: the dataset is larger because every value is stored as characters in the file encoding (UTF-8, for example), there is no type information or schema associated with the data, and the text must be parsed every time it is deserialized. For example, a boolean field really only needs one bit in binary, but in CSV you have to store the full bytes of "true", "false", or the characters "0"/"1", which even as ASCII is still a full 8 bits per character.
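As a rough illustration (not part of the original answer), here is a minimal sketch assuming pandas and pyarrow are installed; the file names are just placeholders. The same data written as CSV and as Parquet shows the size and parsing difference directly:

```python
import os

import pandas as pd

# A small frame with a boolean and an integer column.
# In CSV these become the literal strings "True"/"False" and digit
# characters; Parquet stores them as typed binary values instead.
df = pd.DataFrame({
    "flag": [True, False] * 500_000,
    "value": range(1_000_000),
})

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet")          # needs pyarrow or fastparquet

print("csv bytes:    ", os.path.getsize("sample.csv"))
print("parquet bytes:", os.path.getsize("sample.parquet"))

# The CSV has to be re-parsed and its types re-inferred on every read;
# the Parquet file carries its schema and types alongside the data.
```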

ORC and Parquet, on the other hand, maintain type information, store data column by column, and support predicate push-down for faster analytics (like an RDBMS).
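For example, with pyarrow (one possible reader, not mentioned in the original answer), column pruning and predicate push-down might look like the following; `sample.parquet` is the illustrative file written above:

```python
import pyarrow.parquet as pq

# Column pruning: only the requested column is read from disk.
table = pq.read_table("sample.parquet", columns=["value"])

# Predicate push-down: row groups whose min/max statistics cannot
# match the filter are skipped instead of being decoded.
filtered = pq.read_table(
    "sample.parquet",
    columns=["flag", "value"],
    filters=[("value", ">", 900_000)],
)
print(filtered.num_rows)
```

With CSV, the only option is to read and parse every byte of every row and then throw away what you don't need.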

Avro is a row-based format, intended primarily for network transfer rather than long-term storage, and it can easily be converted to Parquet. Since it is still typed and binary, it consumes less space than CSV and is still faster to process than plain text.
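A minimal sketch of the row-based idea, assuming the fastavro library (one of several Avro libraries; the schema and file names here are made up for illustration):

```python
from fastavro import writer, reader

# The schema is written once into the file header; each record is
# then encoded as compact binary fields, not as repeated key/value text.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "active", "type": "boolean"},
    ],
}

records = [{"id": i, "active": i % 2 == 0} for i in range(1000)]

with open("events.avro", "wb") as out:
    writer(out, schema, records)

# Records come back fully typed; no string parsing is involved.
with open("events.avro", "rb") as fo:
    for rec in reader(fo):
        pass  # each rec is a dict with Python int/bool values
```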

SequenceFiles are a middle-ground for Hadoop, but aren't widely supported by other tooling.

CodePudding user response:

Binary data is always faster to process than the same textual representation. Avro sends data over the wire in binary format, and the field names (keys) are omitted from each record, making the packet size smaller. Hence, Avro is a good fit for larger datasets.
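To make the "binary, keys omitted" point concrete, here is a small sketch, again assuming fastavro (the schema and values are invented for illustration); the Avro wire encoding of a record is compared against its JSON text equivalent:

```python
import io
import json

from fastavro import schemaless_writer

schema = {
    "type": "record",
    "name": "Reading",
    "fields": [
        {"name": "sensor_id", "type": "long"},
        {"name": "ok", "type": "boolean"},
        {"name": "temperature", "type": "double"},
    ],
}

record = {"sensor_id": 123456, "ok": True, "temperature": 21.5}

# Avro wire format: field names are omitted (the schema is agreed on
# separately) and values are binary-encoded (varint long, 1-byte
# boolean, 8-byte double).
buf = io.BytesIO()
schemaless_writer(buf, schema, record)

print("avro bytes:", len(buf.getvalue()))
print("json bytes:", len(json.dumps(record).encode("utf-8")))
```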
