Split a large JSON file using jq and awk

I have a large file called Metadata_01.json.

It consists of blocks that follow this structure:

[
 {
  "Participant_id": "P04_00001",
  "no_of_people": "Multiple",
  "apparent_gender": "F",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Side View",
  "indoor_outdoor_env": "Indoors",
  "lighting_condition": "Bright",
  "Occluded": 1,
  "category": "Two Person",
  "camera_movement": "Still",
  "action": "No action",
  "indoor_outdoor_in_moving_car_or_train": "Indoor",
  "daytime_nighttime": "Nighttime"
 },
 {
  "Participant_id": "P04_00002",
  "no_of_people": "Single",
  "apparent_gender": "M",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Frontal View",
  "indoor_outdoor_env": "Outdoors",
  "lighting_condition": "Bright",
  "Occluded": "None",
  "category": "Animals",
  "camera_movement": "Still",
  "action": "Small action",
  "indoor_outdoor_in_moving_car_or_train": "Outdoor",
  "daytime_nighttime": "Daytime"
 },

And so on... thousands of them.

I am using the following command:

jq -cr '.[]' Metadata_01.json | awk '{print > (NR ".json")}'

And it's kinda doing the expected work.

From the large file structured as shown above, I am getting lots of files named after the line number (1.json, 2.json, and so on), each holding one object on a single line.

Instead of those results, I need each JSON file to be named after its "Participant_id" (e.g. P04_00002.json), and I want each file to keep the pretty-printed JSON structure, like this:

{
  "Participant_id": "P04_00002",
  "no_of_people": "Single",
  "apparent_gender": "M",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Frontal View",
  "indoor_outdoor_env": "Outdoors",
  "lighting_condition": "Bright",
  "Occluded": "None",
  "category": "Animals",
  "camera_movement": "Still",
  "action": "Small action",
  "indoor_outdoor_in_moving_car_or_train": "Outdoor",
  "daytime_nighttime": "Daytime"
}

What adjustments should I make to the command above to achieve this? Or maybe there's an easier way to do this? Thank you!

CodePudding user response：

I would recommend using PowerShell, since working with objects tends to be easier overall. PowerShell has a ConvertFrom-Json cmdlet that converts the text into a PowerShell object, letting you reference its properties via dot notation (.Participant_id). Then you just have to convert each object back to JSON and write it out. Here I use New-Item to create the file with the output, but piping to Out-File would work as well.

$json = Get-Content -Path '.\Metadata_01.json' -Raw | ConvertFrom-Json 
foreach ($json_object in $json)
{
    New-Item -Path ".\Desktop\" -Name "$($json_object.Participant_id).json" -Value (ConvertTo-Json -InputObject $json_object) -ItemType 'File' -Force
}

The issue I can see you running into is a lack of memory: given the size of that file, this example loads all of it into a variable first. There are ways around that, but this is for demonstration purposes.

CodePudding user response：

What adjustments should I make ...?

I'd go with:

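jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk '
  NR%2==1 {id=$1; next}                           # odd lines: the Participant_id
  {fn = "id." id ".json"; print > fn; close(fn)}  # even lines: the object; close to limit open files
'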

One potential disadvantage of the above is that the output files will not be pretty-printed, but that can be dealt with in a number of ways, e.g. by getting awk to call jq.
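As a rough sketch of the "awk calls jq" idea, awk can pipe each compact object into a separate jq process that pretty-prints it into the target file. The sample file name Metadata_sample.json and its two shortened records are made up for illustration, the output files are named after the bare Participant_id (without the "id." prefix) as the question asked, and the sketch assumes the ids contain only characters that are safe in a shell command and a file name.

```shell
# Hypothetical two-record sample so the sketch can be run on its own.
cat > Metadata_sample.json <<'EOF'
[
 {"Participant_id": "P04_00001", "no_of_people": "Multiple"},
 {"Participant_id": "P04_00002", "no_of_people": "Single"}
]
EOF

jq -cr '.[] | (.Participant_id, .)' Metadata_sample.json | awk '
  # Odd lines carry the id: remember it and move on.
  NR % 2 == 1 { id = $0; next }
  # Even lines carry the compact object: hand it to jq for pretty-printing.
  {
    cmd = "jq . > \"" id ".json\""
    print | cmd
    close(cmd)   # wait for jq to finish and release the pipe
  }
'
```

This starts one jq process per record, so it trades speed for pretty-printed output; for thousands of records that is usually tolerable, but it will be noticeably slower than the plain awk redirect.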

"Big Data"

Of course, if the input file is too large for jq to handle comfortably (for example, if even jq empty is too slow on it), then you will want to consider alternatives such as jq's --stream option, jstream, or my own jm. For example, if you want the JSON to be pretty-printed in each file:

while read -r json
do
   fn=$(jq -r .Participant_id <<< "$json")
   <<< "$json" jq . > "id.$fn.json"
done < <(jm Metadata_01.json)
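If jm is not available, the same loop can probably be fed by jq's own streaming parser instead. The sketch below uses the fromstream(1 | truncate_stream(inputs)) idiom to emit the top-level array elements one at a time without loading the whole array first. The sample file name Metadata_stream_sample.json and its contents are made up for illustration, and the output files are named after the bare Participant_id.

```shell
# Hypothetical two-record sample so the sketch can be run on its own.
cat > Metadata_stream_sample.json <<'EOF'
[
 {"Participant_id": "P05_00001", "category": "Two Person"},
 {"Participant_id": "P05_00002", "category": "Animals"}
]
EOF

# -n plus inputs lets fromstream consume the stream events itself;
# truncating by one level strips the outer array, leaving one object per line.
jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' Metadata_stream_sample.json |
while IFS= read -r json
do
   fn=$(printf '%s\n' "$json" | jq -r '.Participant_id')
   printf '%s\n' "$json" | jq . > "$fn.json"
done
```

Streaming keeps memory use low but is slower than the regular parser, so it is mainly worth it when the plain '.[]' approach does not fit in memory.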