Errors when set column in df to extracted pattern given condition-CodePudding

Where col1 contains < followed by any number of spaces and any digits, I want to set col2 to the digits.

Here's a sample dataframe:

d = {'col1': ['62', '4 < s', '5<77', '< 10 '], 'col2': [3, 4, None, 9]}
df = pd.DataFrame(data=d)


col1        col2
0   62      3.0
1   4 < s   4.0
2   5<77    NaN
3   < 10    9.0

This code correctly extracts the digit I want after the pattern:

df['col1'].str.extract(r'<\s*(\d )')

0   NaN
1   NaN
2   77
3   10

However, when I try to set col2 to that line in a loc statement, the outcome is unexpected. Where the condition is true, it sets col2 to null.

df.loc[(df['col1'].str.contains(r'<\s*\d ')),
       'col2'] = df['col1'].str.extract(r'<\s*(\d )')

    col1    col2
0   62      3.0
1   4 < s   4.0
2   5<77    NaN
3   < 10    NaN

What's more confusing is in my use case, col2 will be made of strings. When I make this change and run the same code, I get a ValueError:

d = {'col1': ['62', '4 < s', '5<77', '< 10 '], 'col2': ['3', '4', None, '9']}
df = pd.DataFrame(data=d)
df.loc[(df['col1'].str.contains(r'<\s*\d ')),
       'col2'] = df['col1'].str.extract(r'<\s*(\d )')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/4w/zhsn7cr924bdcsncw_23jrf80000gn/T/ipykernel_88905/3527715470.py in <module>
----> 1 df.loc[(df['col1'].str.contains(r'<\s*\d ')),
      2        'col2'] = df['col1'].str.extract(r'<\s*(\d )')
      3 df

~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    721 
    722         iloc = self if self.name == "iloc" else self.obj.iloc
--> 723         iloc._setitem_with_indexer(indexer, value, self.name)
    724 
    725     def _validate_key(self, key, axis: int):

~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1730             self._setitem_with_indexer_split_path(indexer, value, name)
   1731         else:
-> 1732             self._setitem_single_block(indexer, value, name)
   1733 
   1734     def _setitem_with_indexer_split_path(self, indexer, value, name: str):

~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
   1960 
   1961         elif isinstance(value, ABCDataFrame) and name != "iloc":
-> 1962             value = self._align_frame(indexer, value)
   1963 
   1964         # check for chained assignment

~/.pyenv/versions/3.9.4/envs/venv-slp2/lib/python3.9/site-packages/pandas/core/indexing.py in _align_frame(self, indexer, df)
   2199             return val
   2200 
-> 2201         raise ValueError("Incompatible indexer with DataFrame")
   2202 
   2203 

ValueError: Incompatible indexer with DataFrame

CodePudding user response：

.str.extract returns a dataframe, you may need extract the first column from it

df.loc[m, 'col2'] = df['col1'].str.extract(r'<\s*(\d )')[0]

CodePudding user response：

There is a simple way to keep original column values when Series.str.extract does not find a match: use the .fillna(df[your_new_col_name]).

You can use

df['col2'] = df['col1'].str.extract(r'<\s*(\d )')[0].fillna(df['col2'])

See a Pandas test:

import pandas as pd
d = {'col1': ['62', '4 < s', '5<77', '< 10 '], 'col2': [3, 4, None, 9]}
df = pd.DataFrame(data=d)
df['col2'] = df['col1'].str.extract(r'<\s*(\d )')[0].fillna(df['col2'])

Output:

>>> df
    col1 col2
0     62  3.0
1  4 < s  4.0
2   5<77   77
3  < 10    10

Notes

type(df['col1'].str.extract(r'<\s*(\d )')) outputs <class 'pandas.core.frame.DataFrame'>, you need to extract the first column value by using the [0] index
.fillna(df['col2']) fills the NA values with the original value from col2.