Where col1
contains <
followed by any number of spaces and any digits, I want to set col2
to the digits.
Here's a sample dataframe:
d = {'col1': ['62', '4 < s', '5<77', '< 10 '], 'col2': [3, 4, None, 9]}
df = pd.DataFrame(data=d)
col1 col2
0 62 3.0
1 4 < s 4.0
2 5<77 NaN
3 < 10 9.0
This code correctly extracts the digit I want after the pattern:
df['col1'].str.extract(r'<\s*(\d )')
0 NaN
1 NaN
2 77
3 10
However, when I try to set col2
to that line in a loc
statement, the outcome is unexpected. Where the condition is true, it sets col2
to null.
df.loc[(df['col1'].str.contains(r'<\s*\d ')),
'col2'] = df['col1'].str.extract(r'<\s*(\d )')
col1 col2
0 62 3.0
1 4 < s 4.0
2 5<77 NaN
3 < 10 NaN
What's more confusing is in my use case, col2
will be made of strings. When I make this change and run the same code, I get a ValueError:
d = {'col1': ['62', '4 < s', '5<77', '< 10 '], 'col2': ['3', '4', None, '9']}
df = pd.DataFrame(data=d)
df.loc[(df['col1'].str.contains(r'<\s*\d ')),
'col2'] = df['col1'].str.extract(r'<\s*(\d )')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/4w/zhsn7cr924bdcsncw_23jrf80000gn/T/ipykernel_88905/3527715470.py in <module>
----> 1 df.loc[(df['col1'].str.contains(r'<\s*\d ')),
2 'col2'] = df['col1'].str.extract(r'<\s*(\d )')
3 df
~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
721
722 iloc = self if self.name == "iloc" else self.obj.iloc
--> 723 iloc._setitem_with_indexer(indexer, value, self.name)
724
725 def _validate_key(self, key, axis: int):
~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1730 self._setitem_with_indexer_split_path(indexer, value, name)
1731 else:
-> 1732 self._setitem_single_block(indexer, value, name)
1733
1734 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1960
1961 elif isinstance(value, ABCDataFrame) and name != "iloc":
-> 1962 value = self._align_frame(indexer, value)
1963
1964 # check for chained assignment
~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in _align_frame(self, indexer, df)
2199 return val
2200
-> 2201 raise ValueError("Incompatible indexer with DataFrame")
2202
2203
ValueError: Incompatible indexer with DataFrame
CodePudding user response:
.str.extract
returns a dataframe, you may need extract the first column from it
df.loc[m, 'col2'] = df['col1'].str.extract(r'<\s*(\d )')[0]
CodePudding user response:
There is a simple way to keep original column values when Series.str.extract
does not find a match: use the .fillna(df[your_new_col_name])
.
You can use
df['col2'] = df['col1'].str.extract(r'<\s*(\d )')[0].fillna(df['col2'])
See a Pandas test:
import pandas as pd
d = {'col1': ['62', '4 < s', '5<77', '< 10 '], 'col2': [3, 4, None, 9]}
df = pd.DataFrame(data=d)
df['col2'] = df['col1'].str.extract(r'<\s*(\d )')[0].fillna(df['col2'])
Output:
>>> df
col1 col2
0 62 3.0
1 4 < s 4.0
2 5<77 77
3 < 10 10
Notes
type(df['col1'].str.extract(r'<\s*(\d )'))
outputs<class 'pandas.core.frame.DataFrame'>
, you need to extract the first column value by using the[0]
index.fillna(df['col2'])
fills the NA values with the original value fromcol2
.