Extract string from Pandas/Python-CodePudding

I am trying to extract string from dataframe.

Raw data (dataframe)
Column1:
CA_1,CA_2,CA_3,BA_1,BA_2,BA_3,CA_4,BA_12,CA_6

Expected Output
Column1:
CA_1,CA_2,CA_3,CA_4,CA_6

How to produce the expected output? I tried pandas extract,extractall,contains but Im unable to produce the expected output.

CodePudding user response：

Looks like you want to remove all items that do not start with CA_?

I would use str.replace with a regex:

df['Column1'].str.replace(r'(?:(?!CA_)\b\w ,?)', '', regex=True)

Alternative with split/explode/groupby.agg:

(df['Column1'].str.split(',').explode()
 .loc[lambda s: s.str.startswith('CA_')]
 .groupby(level=0).agg(','.join)
)

output:

0    CA_1,CA_2,CA_3,CA_4,CA_6
Name: Column1, dtype: object

CodePudding user response：

We could use str.replace here:

df["Column1"] = df["Column1"].str.replace(r'\b(?!CA_)[A-Z] _\d \b,?', '', regex=True)

Here is a regex demo showing that the replacement logic is working.

CodePudding user response：

Try something like this:

data = {
    'col1': ['CA_1','CA_2','CA_3','BA_1','BA_2','BA_3','CA_4','BA_12','CA_6']
}

df = pd.DataFrame(data)
res = [dt for dt in data['col1'] if dt.startswith('CA_')]
print(res)

Result:

['CA_1', 'CA_2', 'CA_3', 'CA_4', 'CA_6']

CodePudding user response：

Considering that OP's dataframe looks like the following

df = pd.DataFrame({'Column1': ['CA_1', 'CA_2', 'CA_3', 'BA_1', 'BA_2', 'BA_3', 'CA_4', 'BA_12', 'CA_6']})

[Out]:

  Column1
0    CA_1
1    CA_2
2    CA_3
3    BA_1
4    BA_2
5    BA_3
6    CA_4
7   BA_12

And assuming that OP wants the dataframe to have only the cells that contain the string CA, one can use str.contains() as follows

df_new = df[df['Column1'].str.contains('CA')]

[Out]:
  Column1
0    CA_1
1    CA_2
2    CA_3
6    CA_4
8    CA_6

Alternatively, one can use str.startswith()

df_new = df[df['Column1'].str.startswith('CA')]

[Out]:
  Column1
0    CA_1
1    CA_2
2    CA_3
6    CA_4
8    CA_6