How to sort a jsonl file by value with the lowest RAM consumption?

Time:03-15

I have a very large JSONL file (several million lines).
I want to sort this file on a given value without loading it entirely into RAM.
Do you have a solution to suggest?

I had a look at jq's sort_by, but it operates on an array, so I think the whole file has to be loaded rather than streamed.

Extra notes:

  • The order within a group does not matter.
  • One output file per username would also be fine, if the method requires splitting the file.

Example:

Here is a dummy example of what my input file looks like:

{"username": "user1", "email": "email1", "value": "10"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user3", "email": "email1", "value": "40"}

Here is the output I would like:

{"username": "user1", "email": "email1", "value": "10"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user3", "email": "email1", "value": "40"}

CodePudding user response:

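One approach is to transform each document into a line that can be sorted by a tool designed to work within limited memory, such as the Unix sort command-line utility, and then strip the sort key off again afterwards.

You can use the following, which prefixes each document with its username and a NUL separator, sorts the lines, then drops the prefix and parses the JSON back: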

jq -r '"\( .username )\u0000\( tojson )"' a.json |
sort |
jq -Rc '. / "\u0000" | .[-1] | fromjson'

For the provided input, the above produces the following output:

{"username":"user1","email":"email1","value":"10"}
{"username":"user1","email":"email1","value":"40"}
{"username":"user1","email":"email1","value":"5"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user3","email":"email1","value":"40"}
{"username":"user3","email":"email3","value":"15"}

Along the same lines, you could produce a TSV (jq -r '"\( .username )\t\( tojson )"') and load it into a database. Extracting the sorted JSON documents is then a simple SQL query that orders by the username column.
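
Since the question says that one output per username would also be acceptable, the same TSV stream could instead be split in a single pass with no sort at all. The following is a rough sketch, not a tested production script: the helper name split_jsonl_by_username and the output directory argument are hypothetical, and it assumes every username is safe to use as a file name (no "/", tab, or newline).

```shell
# Sketch: write one JSONL file per username in a single streaming pass.
# Assumes usernames are safe as file names (no "/", tab, or newline).
split_jsonl_by_username() {
  input=$1
  outdir=$2
  mkdir -p "$outdir"
  jq -r '"\(.username)\t\(tojson)"' "$input" |
    awk -F '\t' -v dir="$outdir" '{
      file = dir "/" $1 ".jsonl"
      # Everything after the first tab is the original JSON document.
      print substr($0, length($1) + 2) >> file
      # Close after each write so only one output file is open at a time.
      close(file)
    }'
}
```

For example, split_jsonl_by_username a.json out would create out/user1.jsonl, out/user2.jsonl, and so on. Because the files are opened in append mode, the output directory should be empty before a run. Closing the file after every line keeps the number of open descriptors at one, at the cost of speed when there are few distinct usernames.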
