Home > Back-end >  Any faster way to get pandas dataframe from a nested list
Any faster way to get pandas dataframe from a nested list

Time:10-12

Below is the working snippet of nested list to dataframe: . But, my actual list is huge. There is 1 M entry and for each entry there is data worth 1000 days. Is there any way to do the same faster?

list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'], 
                        'credit_score':[800, 890, 895],
                        'spend':[1500, 25000, 2400], 
                        'average_spend':5000}},
               {'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'],
                                   'credit_score':[2800, 390, 8900],
                                   'spend':[15000, 5000, 400], 
                                   'average_spend':3000}}]

df = pd.DataFrame(list_sample)
normalized_data = pd.json_normalize(df['data'])
df = normalized_data.apply(pd.Series.explode).reset_index()
df = df[['date', 'credit_score', 'spend']]

CodePudding user response:

couple of things that might help:

  1. numpy is usually faster than pandas
  2. json_normalize and concat are heavy, try to avoid or use once
  3. run speeds of different methods on small sample does not necessarily correlate to run speeds on a bigger sample

saying that, here is my suggestion (apparently it is somewhat faster on current example but, as said, this does not mean too much):

values = np.vstack([np.mat((x['data']['date'], x['data']['credit_score'], x['data']['spend'])).T for x in list_sample])
df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend']) 

or even better:

values = np.hstack([(x['data']['date'], x['data']['credit_score'], x['data']['spend']) for x in list_sample])
df = pd.DataFrame(values.T, columns=['date', 'credit_score', 'spend'])     

CodePudding user response:

For improve performance use list comprehension - ouput depends of data, so best test with real data:

values = [y for x in list_sample for y in 
          (zip(x['data']['date'], x['data']['credit_score'], x['data']['spend']))]
df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend']) 

Sample 1M:

list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'] * 100, 
                        'credit_score':[800, 890, 895] * 100,
                        'spend':[1500, 25000, 2400] * 100, 
                        'average_spend':5000}},
               {'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'] * 100,
                                   'credit_score':[2800, 390, 8900] * 100,
                                   'spend':[15000, 5000, 400] * 100, 
                                   'average_spend':3000}}]*1000


In [227]: %%timeit
     ...: values = [y  for x in list_sample for y in (zip(x['data']['date'], x['data']['credit_score'], x['data']['spend']))]
     ...: df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend'])
     ...: 
     ...: 
387 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [228]: %%timeit
     ...: values = np.vstack([np.mat((x['data']['date'], x['data']['credit_score'], x['data']['spend'])).T for x in list_sample])
     ...: df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend'])
     ...: 
820 ms ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [229]: %%timeit
     ...: values = np.hstack([(x['data']['date'], x['data']['credit_score'], x['data']['spend']) for x in list_sample])
     ...: df = pd.DataFrame(values.T, columns=['date', 'credit_score', 'spend'])
     ...: 
818 ms ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

Update. I now multiplied the initial list by 1M:

list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'] * 100, 
                        'credit_score':[800, 890, 895] * 100,
                        'spend':[1500, 25000, 2400] * 100, 
                        'average_spend':5000}},
               {'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'] * 100,
                                   'credit_score':[2800, 390, 8900] * 100,
                                   'spend':[15000, 5000, 400] * 100, 
                                   'average_spend':3000}}]*10000

And updated my code:

%%timeit
df2 = pd.DataFrame.from_dict([x["data"] for x in list_sample])
df2.explode(["date", "credit_score", "spend"])

Time:

872.9 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Update2: If the list is multiplied like:

list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'], 
                        'credit_score':[800, 890, 895],
                        'spend':[1500, 25000, 2400], 
                        'average_spend':5000}},
               {'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'],
                                   'credit_score':[2800, 390, 8900],
                                   'spend':[15000, 5000, 400], 
                                   'average_spend':3000}}]*1000000

The time will be:

4.38 s ± 32.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Related