Maximal efficiency to parse millions of JSON objects stored as a (very long) string

Objective:

I have thousands of data dumps whose format, after unzipping, is one long string containing 150K JSON objects separated by '\n'.

big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'

Each JSON object contains dozens of useless keys like 'garbage', but my objective is only to sum the qty for each mineral:

result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
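
As a reference point, here is a minimal pure-Python sketch of that aggregation (illustration only; the function name is mine and it assumes each line is valid JSON):

import json
from collections import defaultdict

def sum_qty_per_mineral(big_string):
    # Sum 'qty' per 'mineral', ignoring every other key
    totals = defaultdict(int)
    for line in big_string.split('\n'):
        record = json.loads(line)
        totals[record['mineral']] += record['qty']
    return dict(totals)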

How to reproduce:

import json
import random

minerals = ['gold', 'silver', 'adamantium']

# Use json.dumps (rather than str(dict)) so each line is valid JSON
big_string = '\n'.join([
    json.dumps({'mineral': random.choice(minerals),
                'qty': random.randint(1, 1000),
                'garbage': random.randint(1, 666),
                'other_garbage': random.randint(-10, 10)})
    for _ in range(150000)
])

def solution(big_string):
    # Show me your move
    return dict() # or pd.DataFrame()

My current solution (which I find slower than expected):

  • Splitting the string on the '\n' separator, with a yield-based generator (see https://stackoverflow.com/a/9770397/4974431)
  • Parsing each line with the ujson library (supposed to be faster than the standard json lib)
  • Accessing only the values needed: 'mineral' and 'qty'
  • Doing the aggregation with pandas

Which gives:

import ujson
import re
import pandas as pd

# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep=='':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

def my_solution(big_string):

    useful_fields = ['mineral', 'qty']
    filtered_data = []
    
    for line in lines(big_string, sep="\n"):
        line = ujson.loads(line)
        filtered_data.append([line[field] for field in useful_fields])

    result = pd.DataFrame(filtered_data, columns = useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()

Any improvement, even by 25%, would be great, because I have thousands of these to process!

CodePudding user response:

  1. I must confess: I'm going to use a library of mine - convtools (github)

  2. Rely on iterating io.StringIO, which splits the lines by \n itself (see the short illustration after this list)

  3. Process it as a stream, without additional allocations
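
A tiny illustration of point 2 (not part of the original answer): iterating an io.StringIO yields one line per '\n', keeping the trailing newline, so no manual splitting is needed.

from io import StringIO

for line in StringIO('{"qty": 2}\n{"qty": 4}'):
    print(repr(line))
# prints '{"qty": 2}\n' then '{"qty": 4}'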

import random
from io import StringIO
import ujson
from convtools import conversion as c

minerals = ["gold", "silver", "adamantium"]

big_string = str(
    "\n".join(
        [
            ujson.dumps(
                {
                    "mineral": random.choice(minerals),
                    "qty": random.randint(1, 1000),
                    "garbage": random.randint(1, 777),
                    "other_garbage": random.randint(-10, 10),
                }
            )
            for _ in range(150000)
        ]
    )
)

# define a conversion and generate ad hoc converter
converter = (
    c.iter(c.this.pipe(ujson.loads))
    .pipe(
        c.group_by(c.item("mineral")).aggregate(
            {
                "mineral": c.item("mineral"),
                "qty": c.ReduceFuncs.Sum(c.item("qty")),
            }
        )
    )
    .gen_converter()
)

# let's check
"""
In [48]: converter(StringIO(big_string))
Out[48]:
[{'mineral': 'silver', 'qty': 24954551},
 {'mineral': 'adamantium', 'qty': 25048483},
 {'mineral': 'gold', 'qty': 24975201}]

In [50]: OPs_solution(big_string)
Out[50]:
      mineral       qty
0  adamantium  25048483
1        gold  24975201
2      silver  24954551
"""

Let's profile:

In [53]: %timeit OPs_solution(big_string)
339 ms ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [54]: %timeit converter(StringIO(big_string))
93.2 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

After looking at where the time is spent, it appears 90% of the total time goes to the line-splitting generator.

Switching it to the approach from https://stackoverflow.com/a/59071238/4974431 gave an 85% time improvement.
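
For illustration only, here is a sketch of a regex-free splitter in that spirit (not necessarily the exact code from the linked answer), which avoids running a regex over every character:

def lines_fast(string, sep="\n"):
    # Yield the substrings between separators using str.find,
    # without building an intermediate list or using a regex
    start = 0
    while True:
        end = string.find(sep, start)
        if end == -1:
            yield string[start:]
            return
        yield string[start:end]
        start = end + len(sep)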
