Parse data from a pandas column containing dicts to a new column

I am trying to parse data from a pandas column containing dicts into a new column. However, I get a ValueError when I attempt the following.

import numpy as np
import pandas as pd

d = pd.DataFrame({
    'id': [0, 1, 2],
    'str': [{'a': '1'}, {'a': '2'}, np.nan]
})

d['new_col'] = d.apply(lambda x: d['str'].str['a'] if pd.notnull(x) else x, axis=1)

Traceback:

/Applications/Anaconda/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1525     @final
   1526     def __nonzero__(self):
-> 1527         raise ValueError(
   1528             f"The truth value of a {type(self).__name__} is ambiguous. "
   1529             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response：

d.apply(lambda x: d['str'].str['a'] if pd.notnull(x) else x, axis=1)

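With axis=1, apply calls the function once per row of the DataFrame d, so x is a whole row (a pd.Series), not the dictionary (or NaN value) stored in the 'str' column. pd.notnull(x) therefore returns a boolean Series, and using that Series as the condition of the if expression raises the error, because the truth value of a Series is ambiguous. Note also that d['str'].str['a'] inside the lambda refers to the entire column, not to the current row's value.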
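To see this for yourself, here is a minimal sketch that reproduces the failure outside of apply. The variable names row, mask and message are only illustrative; d.iloc[0] stands in for the x that apply passes when axis=1.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    'id': [0, 1, 2],
    'str': [{'a': '1'}, {'a': '2'}, np.nan],
})

# With axis=1, the function receives each row as a Series.
row = d.iloc[0]
mask = pd.notnull(row)          # element-wise check, returns a boolean Series
print(type(row).__name__)
print(type(mask).__name__)

# Using that Series as an `if` condition asks for one truth value, which raises.
message = None
try:
    if mask:
        pass
except ValueError as err:
    message = str(err)
    print(message)
```

The message printed at the end should be the same "truth value of a Series is ambiguous" error as in the traceback.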

Try this instead:

d['new_col'] = d['str'].apply(lambda x: x['a'] if pd.notnull(x) else x)

Output:

>>> d

   id         str new_col
0   0  {'a': '1'}       1
1   1  {'a': '2'}       2
2   2         NaN     NaN

CodePudding user response：

Don't use apply, which will be slow on large datasets; use the str accessor instead:

d['new_col'] = d['str'].str['a']
# or
d['new_col'] = d['str'].str.get('a')

Output:

   id         str new_col
0   0  {'a': '1'}       1
1   1  {'a': '2'}       2
2   2         NaN     NaN
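
As a quick sanity check, the sketch below compares the apply-based version with the str accessor version on the sample data from the question. The names via_apply, via_accessor, same_values and same_missing are only illustrative, and the check only covers this small frame, not every possible input.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    'id': [0, 1, 2],
    'str': [{'a': '1'}, {'a': '2'}, np.nan],
})

# Element-wise apply on the column (first answer).
via_apply = d['str'].apply(lambda x: x['a'] if pd.notnull(x) else x)

# Lookup through the str accessor (second answer).
via_accessor = d['str'].str.get('a')

# Compare the extracted values and the positions of the missing entries.
same_values = via_apply.dropna().tolist() == via_accessor.dropna().tolist()
same_missing = via_apply.isna().tolist() == via_accessor.isna().tolist()
print(same_values, same_missing)
```

If both flags come out True, either result can be assigned to d['new_col'].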