I have a DataFrame with a column of strings. From each row I want to extract one part of the string, the year, and assign it to a new column. My problem is isolating that part of the string. For example, from 'TON GFR 2018 N' I want to isolate 18, not 2018. For this string, either of the following works:
new_data['Year'] = pd.DataFrame([str(ele[1])[:2] for ele in list(new_data['Name'].str.split('20'))])
new_data['Year'] = new_data['Name'].str.split('20').str[1]
new_data['Year'] = new_data['Year'].str[:2]
However, I also encounter names such as 'TON RO20 2018 N' or 'TON 2020 N', and then it does not work. The number of spaces also differs between rows, so counting the spaces in the string does not work either.
Any smart solutions to my problem?
CodePudding user response:
Use .str.extract() to match a 4-digit string starting with 20 and capture its last 2 digits, as follows:
new_data['Year'] = new_data['Name'].str.extract(r'20(\d\d)')
If you want to ensure the 4-digit string is not part of a longer string or number, you can enclose the pattern in the regex metacharacter \b (word boundary), as follows:
new_data['Year'] = new_data['Name'].str.extract(r'\b20(\d\d)\b')
Demo
Input data:
print(new_data)
              Name
0   TON GFR 2018 N
1  TON RO20 2018 N
2       TON 2020 N
Result:
print(new_data)
              Name Year
0   TON GFR 2018 N   18
1  TON RO20 2018 N   18
2       TON 2020 N   20
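To compare the two patterns side by side, here is a minimal self-contained sketch. The sample frame reuses the names from the question; the irregularly spaced row and the 'TON R2045 2018 N' row are hypothetical additions for illustration, not taken from the asker's data. It also passes expand=False, which makes .str.extract() return a Series rather than a one-column DataFrame when the pattern has a single capture group.

```python
import pandas as pd

new_data = pd.DataFrame({
    "Name": [
        "TON GFR 2018 N",
        "TON RO20 2018 N",
        "TON   2020  N",      # irregular spacing (hypothetical row)
        "TON R2045 2018 N",   # hypothetical: 20xx embedded in a longer token
    ]
})

# Without word boundaries: the first "20" followed by two digits wins,
# even when it sits inside a longer token such as "R2045".
loose = new_data["Name"].str.extract(r"20(\d\d)", expand=False)

# With word boundaries: only a standalone 4-digit token 20xx matches.
strict = new_data["Name"].str.extract(r"\b20(\d\d)\b", expand=False)

new_data["Year"] = strict

print(loose.tolist())   # ['18', '18', '20', '45']
print(strict.tolist())  # ['18', '18', '20', '18']
```

The extracted values are strings; rows with no match would come back as NaN, so converting to integers may need extra handling depending on the data.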
CodePudding user response:
If the year is always at the same distance from the end of the string, you could use:
new_data["Year"] = new_data["Name"].str.slice(start=-4, stop=-2)
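A short sketch of how the fixed-offset slice behaves, assuming every name ends in a single space followed by one character. The last sample value, with two spaces before the final letter, is a hypothetical case added to show where the assumption breaks.

```python
import pandas as pd

names = pd.Series([
    "TON GFR 2018 N",
    "TON 2020 N",
    "TON 2020  N",   # hypothetical: two spaces before the final letter
])

# Take the two characters at positions -4 and -3, counted from the end.
sliced = names.str.slice(start=-4, stop=-2)

print(sliced.tolist())  # ['18', '20', '0 ']
```

Since the question mentions a varying number of spaces between rows, the slice is only safe if the spacing at the end of the name is known to be fixed.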