I am trying to extract string from dataframe.
Raw data (dataframe)
Column1:
CA_1,CA_2,CA_3,BA_1,BA_2,BA_3,CA_4,BA_12,CA_6
Expected Output
Column1:
CA_1,CA_2,CA_3,CA_4,CA_6
How to produce the expected output? I tried pandas extract,extractall,contains but Im unable to produce the expected output.
CodePudding user response:
Looks like you want to remove all items that do not start with CA_
?
I would use str.replace
with a regex:
df['Column1'].str.replace(r'(?:(?!CA_)\b\w ,?)', '', regex=True)
Alternative with split
/explode
/groupby.agg
:
(df['Column1'].str.split(',').explode()
.loc[lambda s: s.str.startswith('CA_')]
.groupby(level=0).agg(','.join)
)
output:
0 CA_1,CA_2,CA_3,CA_4,CA_6
Name: Column1, dtype: object
CodePudding user response:
We could use str.replace
here:
df["Column1"] = df["Column1"].str.replace(r'\b(?!CA_)[A-Z] _\d \b,?', '', regex=True)
Here is a regex demo showing that the replacement logic is working.
CodePudding user response:
Try something like this:
data = {
'col1': ['CA_1','CA_2','CA_3','BA_1','BA_2','BA_3','CA_4','BA_12','CA_6']
}
df = pd.DataFrame(data)
res = [dt for dt in data['col1'] if dt.startswith('CA_')]
print(res)
Result:
['CA_1', 'CA_2', 'CA_3', 'CA_4', 'CA_6']
CodePudding user response:
Considering that OP's dataframe looks like the following
df = pd.DataFrame({'Column1': ['CA_1', 'CA_2', 'CA_3', 'BA_1', 'BA_2', 'BA_3', 'CA_4', 'BA_12', 'CA_6']})
[Out]:
Column1
0 CA_1
1 CA_2
2 CA_3
3 BA_1
4 BA_2
5 BA_3
6 CA_4
7 BA_12
And assuming that OP wants the dataframe to have only the cells that contain the string CA
, one can use str.contains()
as follows
df_new = df[df['Column1'].str.contains('CA')]
[Out]:
Column1
0 CA_1
1 CA_2
2 CA_3
6 CA_4
8 CA_6
Alternatively, one can use str.startswith()
df_new = df[df['Column1'].str.startswith('CA')]
[Out]:
Column1
0 CA_1
1 CA_2
2 CA_3
6 CA_4
8 CA_6