i have this dataframe with 1 column in python and i need to extract out the 2 numbers that come after a "#" character or the string "door". Example of this would be
String_column
#12-123,456
mom101, door 101, pop10
i only want the 2 numbers that come after the # sign or the word door. how would i go about doing this. This is what i currently have but i think this only takes in the numbers that come after the # key
import pandas as pd
df = pd.read_csv(data.csv)
df['qwerty'] = df.string_column.str.extract(
r'(?<=#)(\d )', expand=False
).fillna(0).astype(int)
CodePudding user response:
You can use df.loc
combined with apply
which will get all the indexes that are true.
Here is a simple example
In [5]: df= pd.DataFrame({'String_column':['not useful', 'door useful1', '! useful 2', 'not useful']})
In [6]: df
Out[6]:
String_column
0 not useful
1 door useful1
2 ! useful 2
3 not useful
Now using our function
In [7]: df.loc[df['String_column'].apply(lambda x: True if x.startswith('!') or x.startswith('door') else False)]
Out[7]:
String_column
1 door useful1
2 ! useful 2
We used startswith to match all our conditions to get the useful values that starts with '!' or 'door'.
CodePudding user response:
IIUC, you can use a non capturing group to list your different options (#
or door\s*
):
df['num'] = (df['String_column'].str.extract(r'(?:#|door\s*)(\d )', expand=False)
.fillna(0).astype(int)
)
output:
String_column num
0 #12-123,456 12
1 mom101, door 101, pop10 101