Below is the working snippet of nested list to dataframe: . But, my actual list is huge. There is 1 M entry and for each entry there is data worth 1000 days. Is there any way to do the same faster?
list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'],
'credit_score':[800, 890, 895],
'spend':[1500, 25000, 2400],
'average_spend':5000}},
{'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'],
'credit_score':[2800, 390, 8900],
'spend':[15000, 5000, 400],
'average_spend':3000}}]
df = pd.DataFrame(list_sample)
normalized_data = pd.json_normalize(df['data'])
df = normalized_data.apply(pd.Series.explode).reset_index()
df = df[['date', 'credit_score', 'spend']]
CodePudding user response:
couple of things that might help:
- numpy is usually faster than pandas
- json_normalize and concat are heavy, try to avoid or use once
- run speeds of different methods on small sample does not necessarily correlate to run speeds on a bigger sample
saying that, here is my suggestion (apparently it is somewhat faster on current example but, as said, this does not mean too much):
values = np.vstack([np.mat((x['data']['date'], x['data']['credit_score'], x['data']['spend'])).T for x in list_sample])
df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend'])
or even better:
values = np.hstack([(x['data']['date'], x['data']['credit_score'], x['data']['spend']) for x in list_sample])
df = pd.DataFrame(values.T, columns=['date', 'credit_score', 'spend'])
CodePudding user response:
For improve performance use list comprehension
- ouput depends of data, so best test with real data:
values = [y for x in list_sample for y in
(zip(x['data']['date'], x['data']['credit_score'], x['data']['spend']))]
df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend'])
Sample 1M:
list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'] * 100,
'credit_score':[800, 890, 895] * 100,
'spend':[1500, 25000, 2400] * 100,
'average_spend':5000}},
{'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'] * 100,
'credit_score':[2800, 390, 8900] * 100,
'spend':[15000, 5000, 400] * 100,
'average_spend':3000}}]*1000
In [227]: %%timeit
...: values = [y for x in list_sample for y in (zip(x['data']['date'], x['data']['credit_score'], x['data']['spend']))]
...: df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend'])
...:
...:
387 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [228]: %%timeit
...: values = np.vstack([np.mat((x['data']['date'], x['data']['credit_score'], x['data']['spend'])).T for x in list_sample])
...: df = pd.DataFrame(values, columns=['date', 'credit_score', 'spend'])
...:
820 ms ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [229]: %%timeit
...: values = np.hstack([(x['data']['date'], x['data']['credit_score'], x['data']['spend']) for x in list_sample])
...: df = pd.DataFrame(values.T, columns=['date', 'credit_score', 'spend'])
...:
818 ms ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
Update. I now multiplied the initial list by 1M:
list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'] * 100,
'credit_score':[800, 890, 895] * 100,
'spend':[1500, 25000, 2400] * 100,
'average_spend':5000}},
{'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'] * 100,
'credit_score':[2800, 390, 8900] * 100,
'spend':[15000, 5000, 400] * 100,
'average_spend':3000}}]*10000
And updated my code:
%%timeit
df2 = pd.DataFrame.from_dict([x["data"] for x in list_sample])
df2.explode(["date", "credit_score", "spend"])
Time:
872.9 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Update2: If the list is multiplied like:
list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'],
'credit_score':[800, 890, 895],
'spend':[1500, 25000, 2400],
'average_spend':5000}},
{'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'],
'credit_score':[2800, 390, 8900],
'spend':[15000, 5000, 400],
'average_spend':3000}}]*1000000
The time will be:
4.38 s ± 32.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)