I have a column that I am trying to clean. The data looks like this:
Wherever a value matches the pattern 'x-y year', I want to keep only the 'x' value. Any other value should stay as it is.
Using str.extract('(.{,2}(-))') returns NaN for all the other rows.
CodePudding user response:
This solution compiles the regex once and then applies the compiled pattern to each row.
The lambda relies on the walrus operator :=, which requires Python 3.8 or later.
It assumes that your second column is named col2.
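import re
# Capture the x in "x-y year"; match() anchors at the start of the string
pattern = re.compile(r"(\d+)-\d+ year")
# Non-string cells such as NaN are passed through unchanged
df["col2"] = df["col2"].map(lambda x: m[1] if isinstance(x, str) and (m := pattern.match(x)) else x)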
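For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. The sample rows and the col1 name are made up for illustration, and the helper name keep_lower_bound is hypothetical; only col2 comes from the assumption above.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data; "col2" is the assumed column name.
df = pd.DataFrame(
    {"col1": [1778, 961, 2141], "col2": ["3-5 year", np.nan, "h 3 year"]}
)

# Compile once, reuse for every row.
pattern = re.compile(r"(\d+)-\d+ year")


def keep_lower_bound(value):
    """Return the x of an 'x-y year' string, otherwise the value unchanged."""
    # Non-string cells (for example NaN) cannot be matched, so pass them through.
    if not isinstance(value, str):
        return value
    m = pattern.match(value)
    return m[1] if m else value


df["col2"] = df["col2"].map(keep_lower_bound)
```

A named function instead of a lambda avoids the walrus operator, so this version should also run on Python versions older than 3.8.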
CodePudding user response:
You want series.str.replace(), I believe.
Does this give you the desired output?
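import numpy as np
import pandas as pd
df = pd.DataFrame.from_records([[1778, '3-5 year'], [961, np.nan], [2141, 'h 3 year']], columns=['a','b'])
# Replace each "x-y year" match with its first capture group (the x value)
repl = lambda m: m.group(1)
df.b = df.b.str.replace(r'(\d+)-\d+\syear', repl, regex=True)
df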
which takes the original df:
a b
0 1778 3-5 year
1 961 NaN
2 2141 h 3 year
and gives the output:
a b
0 1778 3
1 961 NaN
2 2141 h 3 year
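If you would rather stay with str.extract as in the question, one possible workaround is to fill the non-matching rows back in from the original column. This is only a sketch on the same sample values; the names s, extracted and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample values as column b above.
s = pd.Series(["3-5 year", np.nan, "h 3 year"])

# str.extract yields NaN wherever the pattern does not match;
# expand=False returns a Series for a single capture group.
extracted = s.str.extract(r"^(\d+)-\d+\syear", expand=False)

# Fall back to the original value for rows that did not match.
result = extracted.fillna(s)
```

Rows that were NaN to begin with stay NaN, because the fallback value is NaN as well.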