Remove leading strings in a dataframe

Time:02-25

I am studying someone else's code that works on a DataFrame df, and I have run into a problem similar to this one, where the values are joined together without any separator:

Names
--------
NurseJohn
SoldierJohn
TeacherJohn
DriverJohn
CEOJohn

How can I remove the words before John?

It can be removed like this, but I don't understand how it works:

df['Names'] = df['Names'].str.replace(".*(?=John)", "", regex=True)

Can someone explain what happens in (".*(?=John)", "", regex=True)? Also, is there another, more straightforward way to do this?

CodePudding user response:

Actually, the regex pattern you should have used is:

.*(?=John$)

This pattern says to match all content, greedily, up to the text John at the very end of each string in the Names column. Note that the lookahead (?=John$) does not consume John. It only asserts that John follows at the point where the match stops, so John itself is never replaced.
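To make the lookahead concrete, here is a minimal sketch using Python's standard re module on the names from the question. The variable names and the extra "NurseJohnson" value are illustrative assumptions, added only to show what the $ anchor changes.

```python
import re

names = ["NurseJohn", "SoldierJohn", "TeacherJohn", "DriverJohn", "CEOJohn"]

# .*        : any characters, as many as possible (greedy)
# (?=John$) : lookahead; "John" must follow and end the string, but is not consumed
anchored = re.compile(r".*(?=John$)")

cleaned = [anchored.sub("", name) for name in names]
print(cleaned)  # ['John', 'John', 'John', 'John', 'John']

# Hypothetical extra value: "John" is present but not at the end of the string.
without_anchor = re.sub(r".*(?=John)", "", "NurseJohnson")
with_anchor = anchored.sub("", "NurseJohnson")
print(without_anchor)  # Johnson
print(with_anchor)     # NurseJohnson (no match, so the string is unchanged)
```

Because the lookahead only checks what follows, the matched (and replaced) text is just the prefix, which is why John survives the replacement.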

Your updated code:

df["Names"] = df["Names"].str.replace(r'.*(?=John$)', '', regex=True)

CodePudding user response:

Yeah, so you're using regex. Regex (short for regular expression) is a tool that every language I've worked with uses to search strings (text). Here the regex matches everything before "John", and that match is then replaced with "", which is an empty string.

so to read it from left to right:

  1. select the DataFrame column 'Names'
  2. for each string in the column, replace everything (.*) before "John" with an empty string (""), using regex
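On the question of a more straightforward way: one option that avoids regex entirely is to find the last occurrence of "John" and slice from there. This is only a sketch; keep_from_last is a hypothetical helper name, and it roughly mirrors the unanchored pattern .*(?=John) rather than the $-anchored one.

```python
def keep_from_last(text: str, keyword: str = "John") -> str:
    """Return text starting at the last occurrence of keyword.

    If keyword does not occur, the text is returned unchanged.
    """
    index = text.rfind(keyword)  # -1 when keyword is absent
    return text if index == -1 else text[index:]


names = ["NurseJohn", "SoldierJohn", "TeacherJohn", "DriverJohn", "CEOJohn"]
cleaned = [keep_from_last(name) for name in names]
print(cleaned)  # ['John', 'John', 'John', 'John', 'John']

# Assumption: with pandas, the same helper could be applied per element, e.g.
# df["Names"] = df["Names"].map(keep_from_last)
```

The regex version is shorter, but the slicing version spells out each step, which may be easier to follow when reading unfamiliar code.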