I have a dataframe such as
COL1 COL2
A Name=canis_lupus3099 HHYUIO jj6§è7
B Name=bomba009 JJIJJ;HHJKN
C Name=Test_test788_eheh;NHHhh
D Name=UYEYEHJ0909EEHH:HEEH Jk G
How can I use a regex to keep only the Name=something part of COL2 and remove everything after the first space or punctuation symbol (e.g. ; or :)?
I should then get:
COL1 COL2
A Name=canis_lupus3099
B Name=bomba009
C Name=Test_test788_eheh
D Name=UYEYEHJ0909EEHH
I thought of using something like tab['COL2'].str.replace()
CodePudding user response:
You can use str.extract:
df['COL2'] = df['COL2'].str.extract(r'^(Name=[^\s;:]+)')
Alternative:
# everything until the first space or ; or :
df['COL2'] = df['COL2'].str.extract(r'^(.*?)(?=[\s;:])')
output:
COL1 COL2
0 A Name=canis_lupus3099
1 B Name=bomba009
2 C Name=Test_test788_eheh
3 D Name=UYEYEHJ0909EEHH
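For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the question's table by hand (the DataFrame construction is my assumption about how the data is laid out) and passes expand=False so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample data copied from the question; the construction itself is an assumption.
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D"],
    "COL2": [
        "Name=canis_lupus3099 HHYUIO jj6§è7",
        "Name=bomba009 JJIJJ;HHJKN",
        "Name=Test_test788_eheh;NHHhh",
        "Name=UYEYEHJ0909EEHH:HEEH Jk G",
    ],
})

# Capture "Name=" followed by one or more characters that are not
# whitespace, ';' or ':'. expand=False returns a Series for a single group.
df["COL2"] = df["COL2"].str.extract(r"^(Name=[^\s;:]+)", expand=False)

print(df)
```

Rows that do not start with Name= would end up as NaN, so it may be worth checking for missing values afterwards.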
CodePudding user response:
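Try split, splitting once on the first whitespace character, ; or : (a plain space alone would miss row C, which has no space):
df[['COL2','COL3']] = df['COL2'].str.split(r'[\s;:]', n=1, expand=True)
After that you can delete COL3.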
CodePudding user response:
def remove(df, col, symbol):
    # symbol is treated as a regex pattern; matches are replaced with ''
    df[col] = df[col].str.replace(symbol, '', regex=True)
    return df
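Since the question mentions str.replace, here is one way a replace-based approach could be applied to this data. It is only a sketch: the helper name strip_after_delimiter and the pattern are my own choices, not something from the answers above. The pattern matches from the first whitespace, ; or : through the end of the string, and the match is replaced with an empty string.

```python
import pandas as pd

def strip_after_delimiter(series, pattern=r"[\s;:].*$"):
    """Drop everything from the first whitespace, ';' or ':' onward.

    Hypothetical helper for illustration; `pattern` is treated as a regex.
    """
    return series.str.replace(pattern, "", regex=True)

# Sample values copied from the question.
col2 = pd.Series([
    "Name=canis_lupus3099 HHYUIO jj6§è7",
    "Name=bomba009 JJIJJ;HHJKN",
    "Name=Test_test788_eheh;NHHhh",
    "Name=UYEYEHJ0909EEHH:HEEH Jk G",
])

cleaned = strip_after_delimiter(col2)
print(cleaned.tolist())
```

Unlike str.extract, this leaves strings without any delimiter unchanged instead of turning them into NaN.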
CodePudding user response:
You can use str.extract with \w+:
^ - start of string
\w+ - one or more letters/digits/underscores
>>> df['COL2'].str.extract(r'^(Name=\w+)')
1 Name=canis_lupus3099
2 Name=bomba009
3 Name=Test_test788_eheh
4 Name=UYEYEHJ0909EEHH
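One caveat with \w+: it stops at any character that is not a letter, digit or underscore, not only at whitespace, ; or :. The sketch below uses Python's re module on a made-up value (the hyphen and dot in the name are hypothetical, they do not occur in the question's data) to show where the two patterns would differ.

```python
import re

# Hypothetical value: the name contains '-' and '.', unlike the question's data.
sample = "Name=canis-lupus.3099 extra;stuff"

# \w+ stops at the first non-word character, here the hyphen.
word_match = re.match(r"^(Name=\w+)", sample).group(1)

# [^\s;:]+ only stops at whitespace, ';' or ':'.
class_match = re.match(r"^(Name=[^\s;:]+)", sample).group(1)

print(word_match)   # Name=canis
print(class_match)  # Name=canis-lupus.3099
```

For the data shown in the question both patterns give the same result, so the choice only matters if names can contain other punctuation.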
CodePudding user response:
Another option is a positive lookbehind anchored at the start of the string, applied after stripping surrounding whitespace:
df['COL2'].str.strip().str.extract(r'((?<=^)Name=\w+)')