I have a really large JSONL file (14 GiB, 12 million lines) that contains records of lightning strokes around the world. What I am trying to do is refine this dataset so that only lightning strokes that occurred in Germany remain at the end. Each line of the file contains one "strokes" list that holds many JSON objects (one per lightning strike). It looks like this:
"strokes": [
{
"time": 1624230617044,
"lat": 64.298728,
"lon": 44.536694,
"src": 2,
"srv": 2,
"id": 42243883,
"del": 1887,
"dev": 1941
}, #... other items
]
As you can see, one cannot tell from a record in which country the lightning strike occurred, since only latitude and longitude are given. One has to use the reverse_geocoder
library to map a particular set of coordinates to its country. The method can be used as follows:
rg.search(strike_location) # strike_location is a (lat, lon) tuple holding the GPS coordinates
The output of the method is a JSON object that holds the ISO code of the country, 'DE' for Germany.
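For reference, a minimal usage sketch (this assumes reverse_geocoder's documented behaviour, where rg.search returns a list of result dicts whose 'cc' field holds the ISO country code; the sample coordinates are just an example):

import reverse_geocoder as rg

strike_location = (48.137, 11.575)   # (lat, lon) tuple, here a point in Munich
result = rg.search(strike_location)  # list with one result dict per coordinate
print(result[0]['cc'])               # prints 'DE' for a location inside Germany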
One way to achieve this would be to iterate through the file line by line and filter the data, but it turns out that one query to the reverse_geocoder library needs approx. 1.5 s to complete, which makes this approach really slow.
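For illustration, here is roughly what that slow line-by-line approach looks like (a sketch only: the file names are placeholders and I assume each line parses as a JSON object containing the "strokes" list shown above):

import json
import reverse_geocoder as rg

with open('strokes.jsonl') as src, open('strokes_germany.jsonl', 'w') as dst:
    for line in src:
        record = json.loads(line)
        # keep only the strokes whose coordinates resolve to Germany
        german = [s for s in record['strokes']
                  if rg.search((s['lat'], s['lon']))[0]['cc'] == 'DE']
        if german:
            record['strokes'] = german
            dst.write(json.dumps(record) + '\n')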
The other approach I've thought of consists of splitting the file into parts and assigning each part to one particular process, let's say 16 parts, so I would create 16 processes, since I have 16 CPUs on my machine. If such an approach is possible, how could it be done? Or if you have some way to improve the first approach, it would help me a lot.
CodePudding user response:
Firstly, I assume that the location lookup is a network request. Doing these in parallel doesn't require much CPU at all, because the work is not limited by your CPU but by how fast you can send the requests, have them processed by the service, and get the responses back. So, in short, the number of useful parallel requests has nothing to do with your number of CPU cores.
Now, for parallel requests to a web service, you don't need multiple processes. It's enough to use e.g. threads, but you could perhaps even get away without doing the threading yourself, if the module making the requests provides parallel request support.
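For example, a minimal thread-pool sketch (assuming the lookup is indeed a blocking, network-style call; lookup_cc and the small sample list are hypothetical placeholders for the real data and the real call):

import reverse_geocoder as rg
from concurrent.futures import ThreadPoolExecutor

# sample stroke records; in practice these come from the parsed JSONL file
strokes = [{'lat': 48.137, 'lon': 11.575}, {'lat': 64.298728, 'lon': 44.536694}]

def lookup_cc(stroke):
    # stand-in for whatever call actually performs the lookup
    return rg.search((stroke['lat'], stroke['lon']))[0]['cc']

# the worker count is tuned to what the service tolerates, not to CPU cores
with ThreadPoolExecutor(max_workers=32) as pool:
    codes = list(pool.map(lookup_cc, strokes))

print(codes)   # e.g. ['DE', 'RU']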
That said, Germany is geographically pretty restricted. Doing a simple filter by min/max latitude and longitude will probably reduce the number of candidates a lot already. Consider doing that, especially if the service doing the geomapping is free, because you don't want to abuse their resources!
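A rough sketch of such a pre-filter (the bounding-box values are approximate and deliberately generous; they only discard obvious non-candidates before the expensive lookup):

# approximate bounding box around Germany; it over-includes some neighbouring
# territory, so it is only a cheap pre-filter, not the final decision
LAT_MIN, LAT_MAX = 47.2, 55.1
LON_MIN, LON_MAX = 5.8, 15.1

def maybe_in_germany(stroke):
    return LAT_MIN <= stroke['lat'] <= LAT_MAX and LON_MIN <= stroke['lon'] <= LON_MAX

# sample records; in practice this is the list parsed from the file
strokes = [{'lat': 48.137, 'lon': 11.575}, {'lat': 64.298728, 'lon': 44.536694}]
candidates = [s for s in strokes if maybe_in_germany(s)]   # only these need the real lookup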
CodePudding user response:
I think it would be useful to use Dask https://docs.dask.org/en/stable/
Just use the dask.dataframe.read_json method.
Also, Dask supports multi-processing, and even though it is a large file, you don't have to worry because Dask uses lazy loading.
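A minimal sketch of what that could look like (the file name is a placeholder; I assume the lines parse as records, and the blocksize merely asks Dask to split the file into chunks it can process in parallel):

import dask.dataframe as dd

# read the line-delimited JSON file in ~128 MiB partitions;
# nothing is loaded into memory until a computation is triggered
df = dd.read_json('strokes.jsonl', lines=True, blocksize='128 MiB')

print(df.npartitions)   # number of lazy partitions
print(df.head())        # only computes enough to show the first few rows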