I have a very large JSONL file (several million lines).
I want to sort this file on a given value without loading it entirely into RAM.
Do you have a solution to suggest?
I had a look at jq and its sort_by filter, but I don't think the file is streamed.
Extra notes:
- The order within a group does not matter.
- Getting one output file per username would also be fine, if the method requires splitting the file.
Example:
Here is a dummy example of what my input file looks like:
{"username": "user1", "email": "email1", "value": "10"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user3", "email": "email1", "value": "40"}
Here is the output I would like:
{"username": "user1", "email": "email1", "value": "10"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user3", "email": "email1", "value": "40"}
CodePudding user response:
One approach is to transform the document into lines that can be sorted by a tool that works within limited memory, such as the Unix sort command-line utility.
You can use the following:
jq -r '"\( .username )\u0000\( tojson )"' a.json |
sort |
jq -Rc '. / "\u0000" | .[-1] | fromjson'
For the provided input, the above produces the following output:
{"username":"user1","email":"email1","value":"10"}
{"username":"user1","email":"email1","value":"40"}
{"username":"user1","email":"email1","value":"5"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user3","email":"email1","value":"40"}
{"username":"user3","email":"email3","value":"15"}
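If jq or sort is not available, the same idea (extract the key, sort in bounded pieces, write the original lines back out) can be sketched in Python as a chunked external merge sort. This is only an illustration, not what the pipeline above does internally: the function name sort_jsonl_by_key and the chunk_lines parameter are made up for this example, and it assumes every line is a JSON object whose key field is a string.

```python
import heapq
import json
import os
import tempfile


def sort_jsonl_by_key(src_path, dst_path, key="username", chunk_lines=100_000):
    """Sort a JSONL file by one field, holding at most chunk_lines records in memory."""
    chunk_paths = []

    def flush(chunk):
        # Sort one chunk in memory and write it to its own temporary file.
        chunk.sort(key=lambda pair: pair[0])
        fd, path = tempfile.mkstemp(suffix=".jsonl")
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            for _, line in chunk:
                out.write(line)
        chunk_paths.append(path)

    chunk = []
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            # Normalise the line ending so a final line without "\n" is handled too.
            chunk.append((json.loads(line)[key], line.rstrip("\n") + "\n"))
            if len(chunk) >= chunk_lines:
                flush(chunk)
                chunk = []
    if chunk:
        flush(chunk)

    # Merge the sorted chunk files lazily: only one pending line per chunk is in memory.
    files = [open(p, encoding="utf-8") for p in chunk_paths]
    try:
        with open(dst_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*files, key=lambda l: json.loads(l)[key]))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Calling sort_jsonl_by_key("a.json", "sorted.json") would group the lines by username. A smaller chunk_lines lowers peak memory at the cost of more temporary files, which all have to be open at once during the merge.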
Along the same lines, you could produce a TSV (jq -r '"\( .username )\t\( tojson )"') that you could load into a database. Then a simple SQL query extracts the sorted JSON documents.
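As a rough sketch of the database route, here is one way it could look with Python's built-in sqlite3 module, skipping the intermediate TSV and inserting the (key, document) pairs directly. The table name docs, the column names k and doc, and the function name sorted_lines_via_sqlite are invented for this example. It assumes every line is a JSON object with a string key field, and it relies on SQLite to do the ordering rather than on Python holding the rows in memory.

```python
import json
import sqlite3


def sorted_lines_via_sqlite(src_path, db_path, key="username"):
    """Load (key, raw JSON line) pairs into SQLite and yield the lines in key order."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("DROP TABLE IF EXISTS docs")
        con.execute("CREATE TABLE docs (k TEXT, doc TEXT)")
        with open(src_path, encoding="utf-8") as src:
            # The generator is consumed row by row, so the input file is streamed.
            rows = (
                (json.loads(line)[key], line.rstrip("\n"))
                for line in src
                if line.strip()
            )
            con.executemany("INSERT INTO docs (k, doc) VALUES (?, ?)", rows)
        con.commit()
        # The "simple SQL query": the database returns the documents ordered by key.
        for (doc,) in con.execute("SELECT doc FROM docs ORDER BY k"):
            yield doc
    finally:
        con.close()
```

For example, iterating over sorted_lines_via_sqlite("a.json", "docs.db") and printing each item would emit the original JSON lines grouped by username. Using an on-disk database file rather than ":memory:" is what keeps the data out of RAM.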