I have around 376K JSON files under a directory in S3. These files are 2.5 KB each and contain only a single record per file. When I tried to load the entire directory with the code below in a Glue ETL job with 20 workers:
spark.read.json("path")
It just didn't run; it timed out after 5 hours. So I wrote and ran a shell script to merge the records from these files into a single file, but when I tried to load the merged file, it displayed only a single record. The merged file size is 980 MB. It worked fine when I tested locally with 4 records: after merging those 4 records into a single file, it displayed 4 records as expected.
I used the command below to append the JSON records from the different files into a single file:
for f in Agent/*.txt; do cat ${f} >> merged.json;done;
The data doesn't contain any nested JSON. I also tried the multiline option, but it didn't work. So, what can be done in this case? My guess is that after the merge the records are not being treated as separate records, which is causing the issue. I even tried head -n 10 to display the first 10 lines, but it just hangs as if in an infinite loop.
CodePudding user response:
I have run into trouble in the past working with thousands of small files. In my case they were CSV files, not JSON. One of the things I did to debug was to create a for loop, load smaller batches, and then combine all the data frames together. During each iteration I would call an action to force execution, log the progress to get an idea of whether it was making headway, and monitor how much it slowed down as the job progressed.
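For what it's worth, here is a minimal PySpark sketch of that batching idea for the JSON case. The bucket name, path pattern, batch size, and the count() action are assumptions for illustration, not the original job's code:

import logging
from functools import reduce
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()
logging.basicConfig(level=logging.INFO)

# Hypothetical list of input paths; in practice you would list the S3 prefix
# (e.g. with boto3) instead of hard-coding it.
all_paths = ["s3://my-bucket/Agent/record-%06d.txt" % i for i in range(376000)]

batch_size = 10000  # assumed batch size; tune for the cluster
frames = []
for start in range(0, len(all_paths), batch_size):
    batch = all_paths[start:start + batch_size]
    df = spark.read.json(batch)  # read.json accepts a list of paths
    df.count()                   # action that forces execution of this batch
    logging.info("loaded files %d to %d", start, start + len(batch))
    frames.append(df)

# Combine all the per-batch data frames into one.
merged = reduce(DataFrame.unionByName, frames)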
CodePudding user response:
The problem was with the shell script I was using to merge the many small files. After the merge, the records weren't aligned properly, so they weren't treated as separate records.
Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell script that merges a large number of records into one file:
find . -name '*.txt' -exec cat {} + | jq -s '.' > output.txt
Later on, I was able to load the JSON records as expected.
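Not part of the original answer, but for completeness: since jq -s '.' slurps all of the objects into a single pretty-printed JSON array, the merged file needs to be read with the multiLine option so that Spark parses the whole array and returns one row per element (the path below is hypothetical):

df = spark.read.option("multiLine", "true").json("s3://my-bucket/Agent/output.txt")  # hypothetical path
df.show(10)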