Remove everything after multiple symbols within a column in pandas

Time:06-16

I have a dataframe such as

COL1 COL2
A    Name=canis_lupus3099 HHYUIO jj6§è7
B    Name=bomba009 JJIJJ;HHJKN
C    Name=Test_test788_eheh;NHHhh
D    Name=UYEYEHJ0909EEHH:HEEH Jk G

How can I use a regex to keep only the Name=something part of COL2 and remove everything after the first space or punctuation symbol (e.g. ; or :)?

I should then get:

COL1 COL2
A    Name=canis_lupus3099
B    Name=bomba009
C    Name=Test_test788_eheh
D    Name=UYEYEHJ0909EEHH

I thought of using something like tab['COL2'].str.replace()

CodePudding user response:

You can use str.extract:

df['COL2'] = df['COL2'].str.extract(r'^(Name=[^\s;:]+)')

Alternative:

# everything until the first space or ; or :
df['COL2'] = df['COL2'].str.extract(r'^(.*?)(?=[\s;:])')

output:

  COL1                    COL2
0    A    Name=canis_lupus3099
1    B           Name=bomba009
2    C  Name=Test_test788_eheh
3    D    Name=UYEYEHJ0909EEHH
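
A self-contained sketch of the str.extract approach, assuming the sample data from the question and that the only delimiters are whitespace, ; and :. Passing expand=False is an optional choice here so that a Series comes back for the single capture group.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D"],
    "COL2": [
        "Name=canis_lupus3099 HHYUIO jj6§è7",
        "Name=bomba009 JJIJJ;HHJKN",
        "Name=Test_test788_eheh;NHHhh",
        "Name=UYEYEHJ0909EEHH:HEEH Jk G",
    ],
})

# Keep "Name=" plus every character up to the first whitespace, ';' or ':'.
df["COL2"] = df["COL2"].str.extract(r"^(Name=[^\s;:]+)", expand=False)
print(df)
```

Rows that do not start with Name= would become NaN with this pattern, so it is worth checking for missing values afterwards.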

CodePudding user response:

Try split:

# split once at the first space, ; or : (a plain " " would miss rows like C)
df[['COL2','COL3']] = df['COL2'].str.split(r'[\s;:]', n=1, expand=True)

After that you can delete COL3.
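
A minimal sketch of the split-then-drop idea, assuming the question's sample data and treating whitespace, ; and : as the delimiters. A pattern longer than one character is interpreted by str.split as a regular expression.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D"],
    "COL2": [
        "Name=canis_lupus3099 HHYUIO jj6§è7",
        "Name=bomba009 JJIJJ;HHJKN",
        "Name=Test_test788_eheh;NHHhh",
        "Name=UYEYEHJ0909EEHH:HEEH Jk G",
    ],
})

# Split each value once, at the first whitespace, ';' or ':'.
df[["COL2", "COL3"]] = df["COL2"].str.split(r"[\s;:]", n=1, expand=True)

# COL3 only holds the discarded remainder, so drop it.
df = df.drop(columns="COL3")
print(df)
```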

CodePudding user response:

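import re
def remove(df, col, symbols):
    # Drop everything from the first occurrence of any character in `symbols`, e.g. ' ;:'.
    df[col] = df[col].str.replace(f'[{re.escape(symbols)}].*', '', regex=True)
    return df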
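
For comparison, the str.replace idea from the question can also be written directly as one regex substitution. This is a sketch that assumes the question's sample data and the delimiters whitespace, ; and :. The pattern matches the first delimiter together with everything after it and replaces that with an empty string.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D"],
    "COL2": [
        "Name=canis_lupus3099 HHYUIO jj6§è7",
        "Name=bomba009 JJIJJ;HHJKN",
        "Name=Test_test788_eheh;NHHhh",
        "Name=UYEYEHJ0909EEHH:HEEH Jk G",
    ],
})

# Delete the first whitespace, ';' or ':' and everything that follows it.
df["COL2"] = df["COL2"].str.replace(r"[\s;:].*", "", regex=True)
print(df)
```

Unlike str.extract, this leaves values without any delimiter unchanged instead of turning them into NaN.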

CodePudding user response:

You can use str.extract with \w+ :

  1. ^ - start of string
  2. \w+ - one or more letters/digits/underscores
>>> df['COL2'].str.extract(r'^(Name=\w+)')
1   Name=canis_lupus3099
2   Name=bomba009
3   Name=Test_test788_eheh
4   Name=UYEYEHJ0909EEHH
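
One difference between the two kinds of pattern is worth keeping in mind: \w+ stops at any character that is not a letter, digit or underscore, whereas a negated class such as [^\s;:]+ stops only at the listed delimiters. The sketch below illustrates this with a hypothetical value containing a hyphen, which does not appear in the question's data.

```python
import pandas as pd

# Hypothetical value with a hyphen, not taken from the question.
s = pd.Series(["Name=canis-lupus3099;HHYUIO"])

# \w+ ends at the hyphen because '-' is not a word character.
word_chars = s.str.extract(r"^(Name=\w+)", expand=False)

# The negated class ends only at whitespace, ';' or ':'.
until_delim = s.str.extract(r"^(Name=[^\s;:]+)", expand=False)

print(word_chars[0])   # Name=canis
print(until_delim[0])  # Name=canis-lupus3099
```

For the names shown in the question both patterns give the same result, since those names contain only word characters.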

CodePudding user response:

Another way is to use a positive lookbehind on the start of the string and extract Name= plus the word characters that follow it.

df['COL2'].str.strip().str.extract(r'((?<=^)Name=\w+)')