Given a dataframe A
that looks like this:
id | information |
---|---|
001 | Yellow, in town, John |
002 | Green, home, Lia 33 |
003 | Yellow, garden, Peter2543 |
004 | Red, 23 garden, 004 John891 |
005 | Red, home, 245Sarah |
006 | Red 2, park 28, 67 Luke |
007 | Purple 03, to the beach, Mary Rose 9855 |
... | ... |
I want to create a new column called name
by extracting the name from information
, without numbers. That is:
id | information | name |
---|---|---|
001 | Yellow, in town, John | John |
002 | Green, wardrobe, home, Lia 33 | Lia |
003 | Yellow, garden, Peter2543 | Peter |
004 | Red, 23 garden, 004 John891 | John |
005 | Red, hat, home, 245Sarah | Sarah |
006 | Red 2, park 28, 67 Luke | Luke |
007 | Purple 03, to the beach, Mary Rose 9855 | Mary Rose |
... | ... | ... |
Notice that:
- the name in
information
is always after the last comma - the name may have numbers after and/or before the name
- these numbers can be separated by a white space or not
- the name can be two words (i.e. Mary Rose)
If I do:
A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)
it returns everything after the last comma (i.e: John, Lia 33, Peter 2543,...). But I need to only get the name.
I guess I have to use re.split()
instead but I cannot figure out which should be the regex expression...
CodePudding user response:
You can use
import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,\s*(?:\d \s*)?([^\d,] ?)(?:\s*\d )?$', expand=False)
Output:
>>> df['information'].str.extract(r'.*,\s*(?:\d \s*)?([^\d,] ?)(?:\s*\d )?$', expand=False)
0 John
1 Lia
2 Peter
3 John
4 Sarah
5 Luke
6 Mary Rose
Name: information, dtype: object
Details:
.*,
- any zero or more chars other than line break chars as many as possible, and then a,
char\s*
- zero or more whitespaces(?:\d \s*)?
- an optional sequence of one or more digits and then zero or more whitespaces([^\d,] ?)
- Group 1: one or more chars other than digits and comma, as few as possible(?:\s*\d )?
- an optional sequence of zero or more whitespaces and then one or more digits$
- end of string.