I have a pandas dataframe in which one of the columns holds strings. From that column I only want the leading words that come before a date (the date is also part of the string). The problem is that I don't know how many words there are in front of the date.
The strings in the column look like the following:
word1 word2 word3 02/08/2022 XXX XXX XXX
word1 04/09/2019 XXX XXX XXX
word1 word2 word3 word4 10/12/2021 XXX XXX XXX
word1 word2 30/11/2022 XXX XXX XXX
So I want only:
word1 word2 word3
word1
word1 word2 word3 word4
word1 word2
The 'XXX' stands for the words after the date; I do not know in advance how many of them there are.
Can someone help me with this problem?
CodePudding user response:
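import re
example_string = 'word1 word2 word3 02/08/2022 XXX XXX XXX'
# Find the first date-like token (digits/digits/digits)
match = re.search(r'(\d+/\d+/\d+)', example_string)
# Keep everything before the date and drop the trailing space
desired_string = example_string.split(match.group(1))[0].strip()
print(desired_string)
output: word1 word2 word3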
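The snippet above works on a single string and raises an AttributeError when the string contains no date, because re.search then returns None. A minimal sketch of a reusable version follows; the helper name words_before_date and the choice to return None for rows without a date are assumptions for illustration, not part of the original answer.

```python
import re

# Date-like token: digits/digits/digits, as in the answer above.
DATE_PATTERN = re.compile(r"\d+/\d+/\d+")


def words_before_date(text):
    """Return the text in front of the first date, or None if there is no date."""
    match = DATE_PATTERN.search(text)
    if match is None:
        # Assumption: rows without a date yield None instead of raising.
        return None
    # Slice up to where the date starts and drop the trailing space.
    return text[:match.start()].rstrip()


examples = [
    "word1 word2 word3 02/08/2022 XXX XXX XXX",
    "word1 04/09/2019 XXX XXX XXX",
    "no date in this row",
]
results = [words_before_date(s) for s in examples]
print(results)
```

On a dataframe the same helper could be applied with df['col'].apply(words_before_date), although the vectorised string methods in the other answers are usually the more idiomatic choice.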
CodePudding user response:
You can use str.extract with a capture group. This returns only the text in front of the date, so there is no need to split the string and throw the rest away:
df['words'] = df['col'].str.extract(r'(.*)\s+\d{2}/\d{2}/\d{4}', expand=False)
output:
   col                                             words
0  word1 word2 word3 02/08/2022 XXX XXX XXX        word1 word2 word3
1  word1 04/09/2019 XXX XXX XXX                    word1
2  word1 word2 word3 word4 10/12/2021 XXX XXX XXX  word1 word2 word3 word4
3  word1 word2 30/11/2022 XXX XXX XXX              word1 word2
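One detail worth checking with this approach is what happens when a row contains more than one date or none at all. The sketch below compares the greedy group (.*) with a lazy group (.*?); the rows and the column names greedy and lazy are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical rows: one with two dates, one regular row, one without a date.
df = pd.DataFrame({"col": [
    "word1 word2 03/01/2020 XXX 04/02/2021 XXX",
    "word1 04/09/2019 XXX XXX XXX",
    "no date in this row",
]})

# Greedy group: captures as much as possible, so it runs up to the LAST date.
pattern_greedy = r"(.*)\s+\d{2}/\d{2}/\d{4}"
# Lazy group: captures as little as possible, so it stops at the FIRST date.
pattern_lazy = r"(.*?)\s+\d{2}/\d{2}/\d{4}"

df["greedy"] = df["col"].str.extract(pattern_greedy, expand=False)
df["lazy"] = df["col"].str.extract(pattern_lazy, expand=False)

print(df[["greedy", "lazy"]])
```

For rows with exactly one date, as in the question, both patterns give the same result. Rows without a date come back as NaN in both columns.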
CodePudding user response:
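We can use Series.str.split with a regex pattern and keep the first piece:
import pandas as pd
s = pd.Series(["word1 word2 word3 02/08/2022 XXX XXX XXX", "word1 04/09/2019 XXX XXX XXX"])
# Split on the date (and any whitespace in front of it), then take the part before it
s.str.split(r"\s*\d{2}/\d{2}/\d{4}").str[0]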
0 word1 word2 word3
1 word1
dtype: object