I have learnt that the Parquet file format stores a lot of metadata and uses various compression schemes to store data efficiently, in terms of both size and query speed.
It may also generate multiple files from a single input, for example from a pandas DataFrame.
Now I have a large CSV file that I want to convert to Parquet. Naively, I would remove the header (storing it elsewhere for later), split the file into chunks of n lines, and then convert each chunk to Parquet (here in Python):
import pyarrow.csv
import pyarrow.parquet
table = pyarrow.csv.read_csv(fileName)
pyarrow.parquet.write_table(table, fileName.replace('.csv', '.parquet'))
I guess the method doesn't matter much. From what I can see, at least with a small test data set and no extra configuration, I get one Parquet file per CSV file (1:1).
For now that is all I need, since I am not running queries on the whole logical data set. I use the raw files as input to a further cleaning step that is convenient to do in the CSV format. And I haven't tried reading the files back yet...
Do I at least have to re-add the header to each CSV chunk?
Is this as straightforward as I think, or am I missing something?
CodePudding user response:
When you create a Parquet dataset from multiple files, all of the files should have a matching schema. In your case, when you split the CSV file into chunks and write one Parquet file per chunk, each chunk has to include the CSV header. Otherwise the first data row of the chunk is read as its header and the resulting files will not share the same column names.
Note that Parquet is a compressed, columnar format, so the Parquet data will usually be much smaller than the CSV data. On top of that, applications that read Parquet files usually prefer a few large files over many small ones.
CodePudding user response:
An easy way to write a partitioned parquet file is with dask.dataframe. You could even read in the data with dask.dataframe.read_csv and then you don't have to do any conversion:
import dask.dataframe
# here, the block size determines the partition boundaries, which are
# preserved in the parquet output (a directory with one file per partition).
# So a 5 GB file would be written as roughly 50 partitions:
df = dask.dataframe.read_csv(fileName, blocksize="100MB")
df.to_parquet(fileName.replace(".csv", ".parquet"))