Home > Blockchain >  Convert complex list to dataframe in a vectorized way
Convert complex list to dataframe in a vectorized way

Time:10-07

I have a nested list of len ~ 1M. Snippet below. I wish to convert this to dataframe as shown below.

import pandas as pd

list_of_data = [{'id':1, 'name': 'A', 'results':{'test_date':['2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13'], 'Score': [12, 23, 23, 12, 11, 13, 13]}},
{'id':2, 'name': 'B', 'results':{'test_date':['2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13'], 'Score': [12, 23, 23, 12, 11, 13, 13]}}]

Expected dataframe:

df = pd.DataFrame({'id':[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], 'test_date':['2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13', '2020-06-29', '2020-07-02', '2020-07-05', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-13'], 'Score': [12, 23, 23, 12, 11, 13, 13, 12, 23, 23, 12, 11, 13, 13]})

Is it possible to do this?

CodePudding user response:

Use list comprehension and pass to DataFrame constructor:

L = [(x['id'], y, z) for x in list_of_data for y, z in 
                                    zip(x['results']['test_date'], x['results']['Score'])]
 
df = pd.DataFrame(L, columns=['id','test_date','Score'])   
print (df)
    id   test_date  Score
0    1  2020-06-29     12
1    1  2020-07-02     23
2    1  2020-07-05     23
3    1  2020-07-09     12
4    1  2020-07-10     11
5    1  2020-07-11     13
6    1  2020-07-13     13
7    2  2020-06-29     12
8    2  2020-07-02     23
9    2  2020-07-05     23
10   2  2020-07-09     12
11   2  2020-07-10     11
12   2  2020-07-11     13
13   2  2020-07-13     13

Or json_normalize with DataFrame.explode, rename columns and filter columns by list:

df = (pd.json_normalize(list_of_data)
        .explode(['results.test_date','results.Score'])
        .rename(columns={'results.test_date':'test_date','results.Score':'Score'})
        [['id','test_date','Score']])

CodePudding user response:

One option is with defaultdict:

from collections import defaultdict
box = defaultdict(list)
for entry in list_of_data:
    dates = entry['results']['test_date']
    scores = entry['results']['Score']
    ids = [entry['id']] * len(dates)
    box['id'].extend(ids)
    box['test_date'].extend(dates)
    box['Score'].extend(scores)

pd.DataFrame(box) 
    id   test_date  Score
0    1  2020-06-29     12
1    1  2020-07-02     23
2    1  2020-07-05     23
3    1  2020-07-09     12
4    1  2020-07-10     11
5    1  2020-07-11     13
6    1  2020-07-13     13
7    2  2020-06-29     12
8    2  2020-07-02     23
9    2  2020-07-05     23
10   2  2020-07-09     12
11   2  2020-07-10     11
12   2  2020-07-11     13
13   2  2020-07-13     13

  • Related