I have a nested list of len ~ 1M. Snippet below. I wish to convert this to dataframe as shown below.
import pandas as pd
list_of_data = [{'id':1, 'name': 'A', 'results':{'test_date':['2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13'], 'Score': [12, 23, 23, 12, 11, 13, 13]}},
{'id':2, 'name': 'B', 'results':{'test_date':['2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13'], 'Score': [12, 23, 23, 12, 11, 13, 13]}}]
Expected dataframe:
df = pd.DataFrame({'id':[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], 'test_date':['2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13', '2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13'], 'Score': [12, 23, 23, 12, 11, 13, 13, 12, 23, 23, 12, 11, 13, 13]})
Is it possible to do this?
CodePudding user response:
Use list comprehension and pass to DataFrame constructor:
L = [(x['id'], y, z) for x in list_of_data for y, z in
zip(x['results']['test_date'], x['results']['Score'])]
df = pd.DataFrame(L, columns=['id','test_date','Score'])
print (df)
id test_date Score
0 1 2020-06-29 12
1 1 2020-07-02 23
2 1 2020-07-05 23
3 1 2020-07-09 12
4 1 2020-07-10 11
5 1 2020-07-11 13
6 1 2020-07-13 13
7 2 2020-06-29 12
8 2 2020-07-02 23
9 2 2020-07-05 23
10 2 2020-07-09 12
11 2 2020-07-10 11
12 2 2020-07-11 13
13 2 2020-07-13 13
Or json_normalize
with DataFrame.explode
, rename
columns and filter columns by list:
df = (pd.json_normalize(list_of_data)
.explode(['results.test_date','results.Score'])
.rename(columns={'results.test_date':'test_date','results.Score':'Score'})
[['id','test_date','Score']])
CodePudding user response:
One option is with defaultdict
:
from collections import defaultdict
box = defaultdict(list)
for entry in list_of_data:
dates = entry['results']['test_date']
scores = entry['results']['Score']
ids = [entry['id']] * len(dates)
box['id'].extend(ids)
box['test_date'].extend(dates)
box['Score'].extend(scores)
pd.DataFrame(box)
id test_date Score
0 1 2020-06-29 12
1 1 2020-07-02 23
2 1 2020-07-05 23
3 1 2020-07-09 12
4 1 2020-07-10 11
5 1 2020-07-11 13
6 1 2020-07-13 13
7 2 2020-06-29 12
8 2 2020-07-02 23
9 2 2020-07-05 23
10 2 2020-07-09 12
11 2 2020-07-10 11
12 2 2020-07-11 13
13 2 2020-07-13 13