The "released" column in my df has dates like "June 20, 1980 (United States)". The "year" column in the same df has only years, like "1980".
But in some rows, the year in the "year" column does not match the year in the "released" column for the same row.
I used the code below to create a new "year" column:
df['yearcorrect'] = df['released'].astype(str).str[6:11]
df
This runs but does not return the year. It returns fragments like "198", "19", etc.
I hope the problem is clear enough. Kindly help.
CodePudding user response:
As suggested in the comment - split the string by whitespace and take the third item. Assuming that your data looks like this:
import pandas as pd
df = pd.DataFrame({'released': ['June 20, 1980 (United States)','May 18, 1970 (Sweden)']})
df['year'] = df['released'].apply(lambda x: x.split()[2])
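To illustrate why the fixed slice from the question misses while splitting works, here is a small plain-Python sketch. The third sample string is made up for illustration and is not from the question's data; the point is only that month names and day numbers vary in length, so the year does not sit at a fixed character position.

```python
# Sample strings: the first two come from the thread, the third is a
# hypothetical example with a longer month name and a one-digit day.
samples = [
    "June 20, 1980 (United States)",
    "May 18, 1970 (Sweden)",
    "December 5, 1980 (Japan)",
]

# Fixed character positions, as in the question: the slice lands on a
# different part of each string because the text before the year varies.
fixed_slices = [s[6:11] for s in samples]

# Whitespace split: the year is the third token as long as the string
# follows the "Month day, year (Country)" pattern.
years_by_split = [s.split()[2] for s in samples]

for s, sliced, year in zip(samples, fixed_slices, years_by_split):
    print(f"{s!r}: slice={sliced!r} split={year!r}")
```

Note that the split approach still assumes every row follows the "Month day, year" pattern; a row with a different layout would return the wrong token or raise an IndexError.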
CodePudding user response:
Assuming that your "released" column always has the same structure, you can use:
import re
df['yearcorrect'] = df.apply(lambda x: re.sub('[^0-9]', "", x.released.split(",")[1]), axis = 1)
which extracts only the numeric characters in the part of the string after the comma.
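If the structure is not guaranteed, one possible alternative is to pull out the first standalone four-digit number with the vectorized str.extract, which avoids a row-wise apply. This is only a sketch: the sample data below (including the mismatching "year" value and the row without a month and day) is invented for illustration, and the pattern assumes the first four-digit number in the string is the release year.

```python
import pandas as pd

# Hypothetical data: the "year" values and the third row are assumptions
# made for this example, not taken from the question's real dataset.
df = pd.DataFrame({
    "released": [
        "June 20, 1980 (United States)",
        "May 18, 1970 (Sweden)",
        "1982 (Japan)",
    ],
    "year": [1980, 1971, 1982],
})

# Extract the first standalone 4-digit number; expand=False returns a Series.
# Rows with no match would get NaN instead of raising an error.
df["yearcorrect"] = (
    df["released"].astype(str).str.extract(r"\b(\d{4})\b", expand=False)
)

# Rows where the original "year" column disagrees with the extracted year.
mismatch = df[df["year"].astype(str) != df["yearcorrect"]]
print(mismatch)
```

The extracted values are strings; convert them with pd.to_numeric if you need numbers for further comparison.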