I would like to know why importing a data set took longer than expected. I'm using pandas and I have the same file in two versions, one XLSX and the other CSV. Why is the CSV file faster to load?
CodePudding user response:
In general, CSV files are much less complicated than .xlsx files. A CSV is "raw data", while an xlsx file also stores formatting information such as fonts, colors, and other cell settings. I'm no expert, but a CSV file has far less for a parser to work through, so it will usually be faster to read.
CodePudding user response:
CSV is an acronym for "comma separated values." A CSV is literally lines of values separated by a delimiter such as a comma, tab, or semicolon.
person,age,fav_animal
bob,20,cat
mary,16,duck
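To illustrate how little work that format demands, here is a minimal sketch that parses the three sample lines above with Python's standard `csv` module. The variable names (`sample`, `rows`, `ages`) are illustrative only. `pandas.read_csv` does the same job on a file, with its own optimized parser.

```python
import csv
import io

# The three sample lines from above, held in memory instead of a file.
sample = "person,age,fav_animal\nbob,20,cat\nmary,16,duck\n"

# DictReader treats the first line as the header and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(sample)))

# Every value arrives as a string: a CSV carries no type or formatting info,
# so any conversion (here, age to int) is up to the reader.
ages = [int(row["age"]) for row in rows]

print(rows[0])
print(ages)
```

Splitting on line breaks and delimiters is essentially the whole job, which is a large part of why CSV loads quickly.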
XLSX is a complicated format: a ZIP archive of XML files, with a specification that is over 1000 pages long. Parsers have to decompress the archive, parse and validate the XML, and extract the relevant objects.
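One way to see where the extra work comes from: an .xlsx file is a ZIP container holding several XML parts, and a reader has to open and interpret more than one of them before it gets to the cell values. The sketch below lists the parts of such a package with the standard `zipfile` module. The helper name `list_package_parts` is hypothetical, and the in-memory archive is a mock-up that only imitates the typical layout; it is not a valid workbook. Pointing the helper at a real .xlsx path should show a similar, longer listing.

```python
import io
import zipfile


def list_package_parts(source):
    """Return the member names of a ZIP-based package such as an .xlsx file."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Mock-up archive (assumption: imitates a typical .xlsx layout, NOT a valid workbook).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/sharedStrings.xml", "<sst/>")
    archive.writestr("xl/styles.xml", "<styleSheet/>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
buffer.seek(0)

parts = list_package_parts(buffer)
for name in parts:
    print(name)
```

In a real workbook the cell text often lives in the shared strings part while the sheet part holds references to it, so reading a single sheet already involves decompressing and parsing several XML documents.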
Parsing CSV is faster than reading XLSX partly because the format is rudimentary, but that isn't the only reason. Binary formats designed for data in general, or even for specific classes of data, such as HDF or Parquet, are typically faster to parse than CSV as well as more space efficient. XLSX, by contrast, is designed for spreadsheets and carries the complexity they require.