Objective:
I have thousands of data dumps whose format, after unzipping, is one long string containing ~150K JSON objects separated by '\n':
big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'
Each JSON object contains dozens of useless keys like garbage, but my objective is only to sum the qty for each mineral:
result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
How to reproduce:
import json
import random

minerals = ['gold', 'silver', 'adamantium']

# json.dumps produces valid JSON (double quotes); str() on a dict would
# produce single-quoted text that json/ujson cannot parse back.
big_string = '\n'.join([
    json.dumps({'mineral': random.choice(minerals),
                'qty': random.randint(1, 1000),
                'garbage': random.randint(1, 666),
                'other_garbage': random.randint(-10, 10)})
    for _ in range(150000)
])

def solution(big_string):
    # Show me your move
    return dict()  # or pd.DataFrame()
My current solution (which I find slower than expected):
- Splitting the string using the '\n' separator, with a yield generator (see https://stackoverflow.com/a/9770397/4974431)
- Loading each line as JSON using the ujson library (supposed to be faster than the standard json lib)
- Accessing only the values needed: 'mineral' and 'qty'
- Doing the aggregation using pandas
Which gives:
import ujson
import re
import pandas as pd

# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
def my_solution(big_string):
    useful_fields = ['mineral', 'qty']
    filtered_data = []
    for line in lines(big_string, sep="\n"):
        line = ujson.loads(line)
        filtered_data.append([line[field] for field in useful_fields])
    result = pd.DataFrame(filtered_data, columns=useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()
Any improvement, even by 25%, would be great because I have thousands of these dumps to process!
CodePudding user response:
I must confess: I'm going to use a library of mine - convtools (github). The solution below relies on iterating an io.StringIO object, which splits lines by '\n' itself, so the input is processed as a stream without additional allocations.
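For reference, iterating a StringIO yields one line at a time, so no list of all 150K lines is ever materialized. A minimal sketch, independent of convtools:

from io import StringIO

# File-like iteration: each step yields one line (trailing '\n' included),
# without allocating a list of all lines up front.
for line in StringIO('{"a": 1}\n{"a": 2}\n'):
    print(line.rstrip('\n'))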
import random
from io import StringIO

import ujson
from convtools import conversion as c

minerals = ["gold", "silver", "adamantium"]
big_string = "\n".join(
    [
        ujson.dumps(
            {
                "mineral": random.choice(minerals),
                "qty": random.randint(1, 1000),
                "garbage": random.randint(1, 777),
                "other_garbage": random.randint(-10, 10),
            }
        )
        for _ in range(150000)
    ]
)
# define a conversion and generate an ad hoc converter
converter = (
    c.iter(c.this.pipe(ujson.loads))
    .pipe(
        c.group_by(c.item("mineral")).aggregate(
            {
                "mineral": c.item("mineral"),
                "qty": c.ReduceFuncs.Sum(c.item("qty")),
            }
        )
    )
    .gen_converter()
)
# let's check
"""
In [48]: converter(StringIO(big_string))
Out[48]:
[{'mineral': 'silver', 'qty': 24954551},
{'mineral': 'adamantium', 'qty': 25048483},
{'mineral': 'gold', 'qty': 24975201}]
In [50]: OPs_solution(big_string)
Out[50]:
mineral qty
0 adamantium 25048483
1 gold 24975201
2 silver 24954551
"""
Let's profile:
In [53]: %timeit OPs_solution(big_string)
339 ms ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [54]: %timeit converter(StringIO(big_string))
93.2 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
After looking at where the time is spent, it appears 90% of the total time is spent in the generator.
Changing it to the one from https://stackoverflow.com/a/59071238/4974431 gave an 85% time improvement.
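For illustration, here is a minimal sketch of a str.find-based line generator in that spirit (an assumption on my part: it is not necessarily the exact code from the linked answer):

def lines_fast(string, sep="\n"):
    # Hypothetical illustration: scan with str.find instead of a regex,
    # yielding one line at a time without building an intermediate list.
    start = 0
    while True:
        end = string.find(sep, start)
        if end == -1:
            yield string[start:]
            return
        yield string[start:end]
        start = end + len(sep)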