Remove everything after second caret regex and apply to pandas dataframe column

I have a dataframe with a column that looks like this:

0         EIAB^EIAB^6
1           8W^W844^A
2           8W^W844^A
3           8W^W858^A
4           8W^W844^A
             ...     
826136    EIAB^EIAB^6
826137    SICU^6124^A
826138    SICU^6124^A
826139    SICU^6128^A
826140    SICU^6128^A

I just want to keep everything before the second caret, e.g. 8W^W844. What regex would I use in Python? Similarly, PACU^SPAC^06 would become PACU^SPAC. I also need to apply it to the whole column.

I tried r'[\\^].+$' since I thought it would take the last caret and everything after, but it didn't work.

CodePudding user response：

You can negate the character group to match everything except ^ and put it in a match group. You don't need to escape the ^ inside the character group, but you do need to escape the one outside it.

re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)

The same pattern works on a pandas dataframe. Match the entire string so that all of it is replaced with match group 1.

df.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)
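
To restrict the replacement to a single column rather than the whole frame, something like the following should work. This is only a sketch: the column name `unit` and the sample values are assumptions for illustration, not taken from the asker's actual dataframe.

```python
import pandas as pd

# Hypothetical column name "unit"; sample values copied from the question.
df = pd.DataFrame({"unit": ["EIAB^EIAB^6", "8W^W844^A", "SICU^6124^A", "PACU^SPAC^06"]})

# Capture the first two caret-separated fields, consume the rest with .*,
# and replace the whole match with the captured group.
df["unit"] = df["unit"].str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

print(df["unit"].tolist())
```

Values that contain fewer than two carets are left unchanged, since the trailing `.*` then matches nothing.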

CodePudding user response：

I don't think regex is really necessary here. Just slice the string up to the position of the second caret:

>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'

Explanation: str.find accepts an optional second argument giving the index at which to start the search. Passing the position just after the first caret makes it return the position of the second caret.
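
One caveat: str.find returns -1 when nothing is found, so for a value with fewer than two carets the one-line slice becomes s[:-1] and silently drops the last character. A minimal sketch of a guarded version follows; the helper name `before_second_caret` is made up for this example.

```python
def before_second_caret(s: str) -> str:
    """Return the part of s before the second '^'.

    If s contains fewer than two carets, return it unchanged.
    """
    first = s.find("^")
    if first == -1:
        return s  # no caret at all
    second = s.find("^", first + 1)  # start searching just after the first caret
    if second == -1:
        return s  # only one caret
    return s[:second]


print(before_second_caret("PACU^SPAC^06"))
print(before_second_caret("8W^W844"))
```

To run it over a whole column, the function can be passed to Series.map, e.g. df["unit"].map(before_second_caret), where "unit" is again an assumed column name.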