pandas: cannot set column with substring extracted from other column

Time:06-08

I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.

Here is some example code that illustrates the problem I am facing:

import pandas as pd

data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'}
]

df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')

print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))

The output I get from the above code is as follows:

df:
  type   base_col  derived_col
0    A    key=val          NaN
1    B  other_val          NaN
2    A    key=val          NaN
3    B  other_val          NaN
mask:
0     True
1    False
2     True
3    False
Name: type, dtype: bool
extraction:
     0
0  val
2  val

The boolean mask is as I expect, and the substrings extracted from the subset of rows (indexes 0 and 2) are also as I expect, yet the new derived_col comes out as all NaN. I would expect derived_col to contain 'val' for indexes 0 and 2, and NaN for the other two rows.

Please clarify what I am getting wrong here. Thanks!

CodePudding user response:

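You should assign a Series, not a DataFrame. With its default expand=True, str.extract returns a DataFrame with one column per capture group (here a single column labeled 0), so select that column with [0]: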

mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]

df
Out[449]: 
  type   base_col derived_col
0    A    key=val         val
1    B  other_val         NaN
2    A    key=val         val
3    B  other_val         NaN 
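
As an alternative to indexing with [0], a minimal sketch that passes expand=False to str.extract, which returns a Series when the pattern has exactly one capture group. It reuses the question's sample data; the variable name `extracted` is only illustrative and not from the original post.

```python
import pandas as pd

data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
]

df = pd.DataFrame(data)
mask = df['type'] == 'A'

# expand=False with a single capture group yields a Series, not a DataFrame
extracted = df.loc[mask, 'base_col'].str.extract(r'key=(.*)', expand=False)

# The Series is aligned on the row index, so only the masked rows are filled
df.loc[mask, 'derived_col'] = extracted

print(type(extracted))
print(df)
```

Selecting the rows and the column in a single df.loc[mask, 'base_col'] call also avoids the chained df[mask]['base_col'] lookup, although either form should work for reading.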