Get some words from a string until there is a fixed pattern in pandas

Time:03-07

I have a pandas DataFrame in which one of the columns holds strings. I only want the words from that column that come before a date (also in string form). The problem is that I don't know how many words there are in front of the date.

The rows of the column look like the following:

word1 word2 word3 02/08/2022 XXX XXX XXX
word1 04/09/2019 XXX XXX XXX
word1 word2 word3 word4 10/12/2021 XXX XXX XXX
word1 word2 30/11/2022 XXX XXX XXX

So I want only:

word1 word2 word3
word1
word1 word2 word3 word4
word1 word2

Each 'XXX' stands for a word; I do not know in advance how many of them follow the date.

Can someone help me with this problem?

CodePudding user response:

import re

example_string = 'word1 word2 word3 02/08/2022 XXX XXX XXX'
match = re.search(r'(\d+/\d+/\d+)', example_string)

desired_string = example_string.split(match.group(1))[0]

output: word1 word2 word3
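
To apply the same idea to a whole column, one option is to wrap the search in a small helper and pass it to Series.apply. The sketch below is only an illustration: the column name `col`, the helper name `words_before_date`, and the choice to return the text unchanged when no date is found are all assumptions, not part of the answer above.

```python
import re
import pandas as pd

# Same date pattern as in the answer above, compiled once for reuse.
DATE_RE = re.compile(r"\d+/\d+/\d+")

def words_before_date(text):
    """Hypothetical helper: return the text in front of the first date."""
    match = DATE_RE.search(text)
    if match is None:
        # Assumption: rows without a date are returned unchanged.
        return text.strip()
    # Slice up to the start of the date and drop the trailing space.
    return text[:match.start()].strip()

df = pd.DataFrame({"col": [
    "word1 word2 word3 02/08/2022 XXX XXX XXX",
    "word1 04/09/2019 XXX XXX XXX",
    "no date here",
]})
df["words"] = df["col"].apply(words_before_date)
```

Checking for `None` matters because `re.search` returns `None` when the pattern is absent, and calling `match.group(1)` on it would raise an AttributeError.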

CodePudding user response:

You can use str.extract with a capture group; it returns only the text in front of the date, so nothing after the date needs to be split off:

df['words'] = df['col'].str.extract(r'(.*)\s+\d{2}/\d{2}/\d{4}', expand=False)

output:

                                              col                    words
0        word1 word2 word3 02/08/2022 XXX XXX XXX        word1 word2 word3
1                    word1 04/09/2019 XXX XXX XXX                    word1
2  word1 word2 word3 word4 10/12/2021 XXX XXX XXX  word1 word2 word3 word4
3              word1 word2 30/11/2022 XXX XXX XXX              word1 word2
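
One detail that may matter: a greedy `(.*)` captures up to the last date in the string, so if a row can contain more than one date, a lazy `(.*?)` is probably what you want. The following self-contained sketch compares the two; the second sample row with two dates is made up for illustration and is not from the question.

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "word1 word2 word3 02/08/2022 XXX XXX XXX",
    "word1 04/09/2019 XXX 01/01/2020 XXX",  # hypothetical row with two dates
]})

date = r"\s+\d{2}/\d{2}/\d{4}"

# Greedy group: backtracks from the end, so it stops at the LAST date.
greedy = df["col"].str.extract(r"(.*)" + date, expand=False)

# Lazy group anchored at the start: stops at the FIRST date.
lazy = df["col"].str.extract(r"^(.*?)" + date, expand=False)
```

For rows with a single date both patterns give the same result. Rows without any date come back as NaN from str.extract.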

CodePudding user response:

We can use Series.str.split with a regex pattern and keep the first piece:

import pandas as pd

s = pd.Series(["word1 word2 word3 02/08/2022 XXX XXX XXX", "word1 04/09/2019 XXX XXX XXX"])

s.str.split(r"\d{2}/\d{2}/\d{4}").str[0]

0    word1 word2 word3 
1                word1 
dtype: object
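
Note that the split result keeps the space that sat in front of the date. A possible refinement, shown here as a sketch rather than as part of the original answer, is to split only once with `n=1` and strip the leftover whitespace:

```python
import pandas as pd

s = pd.Series([
    "word1 word2 word3 02/08/2022 XXX XXX XXX",
    "word1 04/09/2019 XXX XXX XXX",
])

# n=1 splits at the first date only; str.strip() removes the trailing space.
result = s.str.split(r"\d{2}/\d{2}/\d{4}", n=1).str[0].str.strip()
```

Because the pattern is longer than one character, pandas treats it as a regular expression by default.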