Dealing with a large number of small JSON files using PySpark


I have around 376K JSON files under a directory in S3. The files are about 2.5 KB each and contain only a single record per file. When I tried to load the entire directory with the code below via a Glue ETL job with 20 workers:

spark.read.json("path")  

It just didn't run; the job timed out after 5 hours. So I wrote and ran a shell script to merge the records from these files into a single file, but when I tried to load that file, it displayed only a single record. The merged file is 980 MB. The same approach worked fine when I tested locally with 4 records merged into a single file; it displayed 4 records as expected.

I used the command below to append the JSON records from the different files into a single file:

for f in Agent/*.txt; do cat ${f} >> merged.json;done;  

The data doesn't contain any nested JSON. I even tried the multiLine option, but it didn't work. So what can be done in this case? My guess is that, after merging, the records are not being treated as separate records, which is what causes the issue. I even tried head -n 10 to display the first 10 lines, but it just seems to run forever.
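For reference, this is roughly how I am reading the merged file (the S3 path below is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default reader: expects JSON Lines, i.e. one complete JSON record per line
df = spark.read.json("s3://my-bucket/Agent/merged.json")

# With the multiLine option: parses one multi-line JSON document (object or array) per file
df_multiline = spark.read.option("multiLine", True).json("s3://my-bucket/Agent/merged.json")

df.show(10)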

CodePudding user response:

I have run into trouble in the past working with thousands of small files. In my case they were CSV files, not JSON. One of the things I did to debug was to create a for loop and load smaller batches, then combine all the DataFrames together. During each iteration I would call an action to force execution, log the progress to get an idea of whether it was moving forward, and monitor how much it slowed down as the job progressed.
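A rough sketch of that batching idea, adapted to the JSON case; the boto3 listing, bucket name, prefix and batch size here are just assumptions for illustration:

from functools import reduce

import boto3
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# List the object keys under the prefix (bucket and prefix are placeholders)
s3 = boto3.client("s3")
keys = [
    obj["Key"]
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="Agent/")
    for obj in page.get("Contents", [])
]

batch_size = 10000
frames = []
for i in range(0, len(keys), batch_size):
    batch_paths = ["s3://my-bucket/" + k for k in keys[i:i + batch_size]]
    df = spark.read.json(batch_paths)   # read.json accepts a list of paths
    df.count()                          # call an action to force execution of this batch
    print(f"loaded files {i} to {i + len(batch_paths)} of {len(keys)}")
    frames.append(df)

# Combine all the batch DataFrames together
combined = reduce(DataFrame.unionByName, frames)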

CodePudding user response:

The problem was with the shell script I was using to merge the small files. After the merge, the records weren't delimited properly, so they weren't treated as separate records.

Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell command that merges the large number of records into one file:

find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt  

Later on, I was able to load the JSON records as expected.
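Note that jq -s wraps all of the input objects into a single top-level JSON array, so reading the merged file back in Spark would look roughly like the sketch below (the path is a placeholder; the multiLine option is what allows a file whose top level is a JSON array to be parsed into one row per element):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets Spark parse a file containing a single JSON array of records
df = spark.read.option("multiLine", True).json("s3://my-bucket/output.txt")
df.count()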
