Maximal efficiency to parse millions of JSON objects stored as a (very long) string

Objective:

I have thousands of data dumps whose format, after unzipping, is one long string containing 150K JSON objects separated by '\n'.

big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'

Each JSON object contains dozens of useless keys like 'garbage', but my objective is only to sum the qty for each mineral:

result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
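
As a reference point, here is a minimal pure-Python sketch of that aggregation (illustration only; the function name is mine and it assumes each line is valid JSON):

import json
from collections import defaultdict

def sum_qty_per_mineral(big_string):
    # Sum 'qty' per 'mineral', ignoring every other key
    totals = defaultdict(int)
    for line in big_string.split('\n'):
        record = json.loads(line)
        totals[record['mineral']] += record['qty']
    return dict(totals)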

How to reproduce:

import json
import random

minerals = ['gold', 'silver', 'adamantium']

# Use json.dumps (rather than str(dict)) so each line is valid JSON
big_string = '\n'.join([
    json.dumps({'mineral': random.choice(minerals),
                'qty': random.randint(1, 1000),
                'garbage': random.randint(1, 666),
                'other_garbage': random.randint(-10, 10)})
    for _ in range(150000)
])

def solution(big_string):
    # Show me your move
    return dict() # or pd.DataFrame()

My current solution (which I find slower than expected):

  • Splitting the string on the '\n' separator, with a yield-based generator (see https://stackoverflow.com/a/9770397/4974431)
  • Parsing each line with the ujson library (supposed to be faster than the standard json lib)
  • Accessing only the values needed: 'mineral' and 'qty'
  • Doing the aggregation with pandas

Which gives:

import ujson
import re
import pandas as pd

# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep=='':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

def my_solution(big_string):

    useful_fields = ['mineral', 'qty']
    filtered_data = []
    
    for line in lines(big_string, sep="\n"):
        line = ujson.loads(line)
        filtered_data.append([line[field] for field in useful_fields])

    result = pd.DataFrame(filtered_data, columns = useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()

Any improvement, even by 25%, would be great, because I have thousands of these to process!

CodePudding user response:

  1. I must confess: I'm going to use a library of mine - convtools (github)

  2. Rely on iterating io.StringIO, which splits the lines by \n itself (see the short illustration after this list)

  3. Process it as a stream, without additional allocations
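
A tiny illustration of point 2 (not part of the original answer): iterating an io.StringIO yields one line per '\n', keeping the trailing newline, so no manual splitting is needed.

from io import StringIO

for line in StringIO('{"qty": 2}\n{"qty": 4}'):
    print(repr(line))
# prints '{"qty": 2}\n' then '{"qty": 4}'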

import random
from io import StringIO
import ujson
from convtools import conversion as c

minerals = ["gold", "silver", "adamantium"]

big_string = str(
    "\n".join(
        [
            ujson.dumps(
                {
                    "mineral": random.choice(minerals),
                    "qty": random.randint(1, 1000),
                    "garbage": random.randint(1, 777),
                    "other_garbage": random.randint(-10, 10),
                }
            )
            for _ in range(150000)
        ]
    )
)

# define a conversion and generate ad hoc converter
converter = (
    c.iter(c.this.pipe(ujson.loads))
    .pipe(
        c.group_by(c.item("mineral")).aggregate(
            {
                "mineral": c.item("mineral"),
                "qty": c.ReduceFuncs.Sum(c.item("qty")),
            }
        )
    )
    .gen_converter()
)

# let's check
"""
In [48]: converter(StringIO(big_string))
Out[48]:
[{'mineral': 'silver', 'qty': 24954551},
 {'mineral': 'adamantium', 'qty': 25048483},
 {'mineral': 'gold', 'qty': 24975201}]

In [50]: OPs_solution(big_string)
Out[50]:
      mineral       qty
0  adamantium  25048483
1        gold  24975201
2      silver  24954551
"""

Let's profile:

In [53]: %timeit OPs_solution(big_string)
339 ms ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [54]: %timeit converter(StringIO(big_string))
93.2 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

After looking at where the time is spent, it appears 90% of the total time goes to the line-splitting generator.

Switching it to the approach from https://stackoverflow.com/a/59071238/4974431 gave an 85% time improvement.
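
For illustration only, here is a sketch of a regex-free splitter in that spirit (not necessarily the exact code from the linked answer), which avoids running a regex over every character:

def lines_fast(string, sep="\n"):
    # Yield the substrings between separators using str.find,
    # without building an intermediate list or using a regex
    start = 0
    while True:
        end = string.find(sep, start)
        if end == -1:
            yield string[start:]
            return
        yield string[start:end]
        start = end + len(sep)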
