Need to extract data from a column, if a particular character exists, extracting the substring befor

Time:11-17

I've got a column which I am trying to clean, the data is like this:

[image of the sample column data not preserved]

Wherever a value follows the pattern 'x-y year', I want to keep only the 'x' value. Any other value should stay as it is.

Using str.extract('(.{,2}(-))') returns NaN for all the other rows.

CodePudding user response:

This solution compiles the regex once and then applies the compiled pattern to each row. The lambda relies on the walrus operator (:=), which requires Python 3.8 or later. It assumes that your second column is named col2.

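import re

# Capture the digits before the hyphen in values such as "3-5 year".
pattern = re.compile(r"(\d+)-\d+ year")
# Non-string values (e.g. NaN) and non-matching strings are returned unchanged.
df["col2"] = df["col2"].map(lambda x: m[1] if isinstance(x, str) and (m := pattern.match(x)) else x)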
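
As a quick check of the approach above, here is a minimal self-contained sketch. The sample data and the helper name first_of_range are made up for illustration; only the column name col2 comes from the answer. It needs Python 3.8 or later because of the := operator.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name "col2" comes from the answer.
df = pd.DataFrame({"col2": ["3-5 year", "10-12 year", "h 3  year", np.nan]})

# Capture the digits before the hyphen in values such as "3-5 year".
pattern = re.compile(r"(\d+)-\d+ year")


def first_of_range(value):
    """Return the x of an 'x-y year' string, or the value unchanged."""
    # Non-string values such as NaN are passed through untouched.
    if isinstance(value, str) and (m := pattern.match(value)):
        return m[1]
    return value


df["col2"] = df["col2"].map(first_of_range)
print(df["col2"].tolist())
```

Writing the logic as a named function instead of a lambda is only a readability choice; the matching behaviour is the same.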

CodePudding user response:

You want series.str.replace(), I believe.

Does this give you the desired output?

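import numpy as np
import pandas as pd

df = pd.DataFrame.from_records([[1778, '3-5 year'], [961, np.nan], [2141, 'h 3  year']], columns=['a','b'])

# Replace each "x-y year" match with its first captured group (the x value).
repl = lambda m: m.group(1)
df.b = df.b.str.replace(r'(\d+)-\d+\syear', repl, regex=True)
df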

which takes the original df:

      a          b
0  1778   3-5 year
1   961        NaN
2  2141  h 3  year

and gives the output:

      a          b
0  1778          3
1   961        NaN
2  2141  h 3  year
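
For reference, the str.extract attempt from the question can probably be made to work as well: extract returns NaN wherever the pattern does not match, so filling those gaps from the original column keeps the other rows as they were. A rough sketch, reusing the sample frame from this answer; the variable name extracted is hypothetical, and the pattern is anchored at the start of the string, which is an assumption about the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame.from_records(
    [[1778, '3-5 year'], [961, np.nan], [2141, 'h 3  year']],
    columns=['a', 'b'],
)

# expand=False returns a Series; rows without an "x-y year" match become NaN.
# The leading ^ anchors the match at the start of the string (an assumption).
extracted = df['b'].str.extract(r'^(\d+)-\d+\syear', expand=False)

# Fall back to the original value wherever nothing was extracted.
df['b'] = extracted.fillna(df['b'])
print(df)
```

Rows that were NaN to begin with stay NaN, because there is nothing in the original column to fall back on.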