Home > front end >  Expand embedded list of dictionnaries in a DataFrame as new columns of the DataFrame
Expand embedded list of dictionnaries in a DataFrame as new columns of the DataFrame

Time:05-09

I have a Pandas DataFrame looking like:

import pandas as pd
print(pd.__version__)

df0 = pd.DataFrame([
 [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
             {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
             {'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
             {'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
 [13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
             {'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
 ])

Each of the list in the column 2 is having the exact same keys (dst, object and admin).

There can be between 0 (empty []) and 100 lists for each row of the df0.

I wish I could expand the df0 DataFrame to look like this:

columns = ['id', 'name', 'dst', 'object', 'admin']

df_wanted
Out[416]: 
     id name  dst  object                admin
    12  None  925 'Lok. Certification'   'valid'
    12  None  935 'Lok. Administration'  'true'
    12  None  944 'Lok. Customer'        'false'
    12  None  945 'Lok. Customer'        'false'
    12  None  955 'Lok. Certification-C' 'invalid'
    12  None  956 'Lok. Certification'   'valid'
    13   wXB  987 'Lok. Fral_heater'     'valid'
    13   wXB  986 'Lok. Fral_cond.'      'valid'
    ...

Notice that the two first columns, id and name, are replicated along the rows to fit the number of elements within their list.

(The dst column must be cast to an int using .astype(int) at the end.)

How could I achieve that?

Info:

Python 3.10.4
pd.__version__
'1.4.2'

CodePudding user response:

You can explode the column first, then convert the dictionaries to columns:

df0 = df0.explode(2, ignore_index=True)    
df0 = pd.concat([df0, df0[2].apply(pd.Series)], axis=1).drop(columns=2)

CodePudding user response:

The other answers both work well. Instead of building Series objects line by line, you could build a single DataFrame object and join back to the original. This one should be a bit faster.

df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))

Output:

    0     1  dst                object    admin
0  12  None  925    Lok. Certification    valid
1  12  None  935   Lok. Administration     true
2  12  None  944         Lok. Customer    false
3  12  None  945         Lok. Customer    false
4  12  None  954  Lok. Certification-C  invalid
5  12  None  956    Lok. Certification    valid
6  13   wXB  986           Fral_heater    valid
7  13   wXB  987            Fral_cond.    valid

Benchmark:

>>> %timeit df1 = df0.explode(2, ignore_index=True); df1 = pd.concat([df1, df1[2].apply(pd.Series)], axis=1).drop(columns=2)
8.4 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))
4.8 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response:

I'd recommend expanding the column into a new df, then joining back to the main df:

# Copied from above
df = pd.DataFrame([
 [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
             {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
             {'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
             {'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
 [13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
             {'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
 ])

# Set the names of the columns
df.columns = ['id', 'name', 'object']

 # Create a new df from the column
df_tmp = df['object'].explode().apply(pd.Series)

# Join to original
df = pd.concat([df[['id', 'name']], df_tmp], axis=1).reset_index(drop=True)

# Result:
|    |   id | name   |   dst | object               | admin   |
|---:|-----:|:-------|------:|:---------------------|:--------|
|  0 |   12 |        |   925 | Lok. Certification   | valid   |
|  1 |   12 |        |   935 | Lok. Administration  | true    |
|  2 |   12 |        |   944 | Lok. Customer        | false   |
|  3 |   12 |        |   945 | Lok. Customer        | false   |
|  4 |   12 |        |   954 | Lok. Certification-C | invalid |
|  5 |   12 |        |   956 | Lok. Certification   | valid   |
|  6 |   13 | wXB    |   986 | Fral_heater          | valid   |
|  7 |   13 | wXB    |   987 | Fral_cond.           | valid   |
  • Related