I'm trying to subset (retrieve a set of rows from) a Python pandas data frame. My plan is to use DataFrame.filter with a regex string to identify the columns of interest, and then subset the rows based on the values in those columns.
For example, this is my mock data frame:
id status status_drug_use drugA drugA_use drugB drugB_use
0 1 analgesic 0 None 1 hypertensive
1 0 analgesic 1 analgesic 1 hypertensive
2 0 analgesic 1 hypertensive 0 None
3 1 analgesic 0 None 1 analgesic
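For reference, here is a minimal sketch of how this mock frame could be built (treating "None" as a real null is my assumption; it could equally be the literal string):

import pandas as pd

# Mock data from the table above; None stands in for the "None" cells
df = pd.DataFrame({
    'id':              [0, 1, 2, 3],
    'status':          [1, 0, 0, 1],
    'status_drug_use': ['analgesic', 'analgesic', 'analgesic', 'analgesic'],
    'drugA':           [0, 1, 1, 0],
    'drugA_use':       [None, 'analgesic', 'hypertensive', None],
    'drugB':           [1, 1, 0, 1],
    'drugB_use':       ['hypertensive', 'hypertensive', None, 'analgesic'],
})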
I would like all rows in which the value in drugA_use or drugB_use matches the value in status_drug_use. As per the example, this would return the two rows:
id status status_drug_use drugA drugA_use drugB drugB_use
1 0 analgesic 1 analgesic 1 hypertensive
3 1 analgesic 0 None 1 analgesic
There are a few column name conventions to stick with:
- status_drug_use is always there.
- The matching columns (drugA_use and drugB_use) always follow the template <ANYTHING>_use (see the sketch below).
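This naming is why I thought filter with a regex could pick out the relevant columns, something like the sketch below (untested, the regex is a guess):

# Select every "<ANYTHING>_use" column except status_drug_use itself
use_cols = df.filter(regex=r'_use$').columns.difference(['status_drug_use'])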
Alteration
There is a second scenario, in which I would like to compare a user-defined string, e.g. analgesic, against the two columns drugA_use and drugB_use, rather than using the content of status_drug_use.
CodePudding user response:
Here's a way to do what you've asked:
df2 = (
    df.assign(all_use=df.apply(
        lambda x: list(x[[col for col in df.columns
                          if col.endswith('_use') and col != 'status_drug_use']]),
        axis=1))
    .explode('all_use')
    .query('status_drug_use == all_use')
    .drop_duplicates()
    .drop(columns='all_use')
)
Input:
id status status_drug_use drugA drugA_use drugB drugB_use
0 0 1 analgesic 0 None 1 hypertensive
1 1 0 analgesic 1 analgesic 1 hypertensive
2 2 0 analgesic 1 hypertensive 0 None
3 3 1 analgesic 0 None 1 analgesic
Output:
id status status_drug_use drugA drugA_use drugB drugB_use
1 1 0 analgesic 1 analgesic 1 hypertensive
3 3 1 analgesic 0 None 1 analgesic
Explanation:
- find the subset of all columns ending in _use (excluding status_drug_use)
- add a column named all_use whose value for a given row is a list of the values in the columns ending in _use
- use explode() to add rows such that for each original row, there are now multiple rows, one for each of the values in all_use for the original row
- use query() to select only rows where status_drug_use matches the value in all_use
- use drop_duplicates() to eliminate rows in case there were multiple matches for any rows in the original dataframe (for example, if both drugA_use and drugB_use contained "analgesic" and so did status_drug_use)
- drop the column all_use as we no longer need it.
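If it helps to see the intermediate frames, here is the same pipeline broken into named steps (just a sketch; the intermediate variable names are mine):

# Columns ending in _use, excluding status_drug_use
use_cols = [c for c in df.columns if c.endswith('_use') and c != 'status_drug_use']

with_list = df.assign(all_use=df.apply(lambda row: [row[c] for c in use_cols], axis=1))
one_per_value = with_list.explode('all_use')                  # one row per _use value
matched = one_per_value.query('status_drug_use == all_use')   # keep rows that agree
df2 = matched.drop_duplicates().drop(columns='all_use')       # dedupe, drop the helper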
UPDATE: Addressing OP's question in a comment: 'Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"?'
You can do this by holding the user-defined query string in a variable (call it user_defined_str) and changing the contents of query(): replace the column name status_drug_use with the variable name prefixed by @, i.e. @user_defined_str (see the query() documentation for more detail).
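In isolation, the @ mechanism looks like this (a quick sketch against a single column; the variable name is arbitrary):

chosen = 'analgesic'
df.query('drugB_use == @chosen')   # @chosen refers to the Python variable, not a column

Applied to the full pipeline: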
user_defined_str = 'analgesic'
df3 = (
    df.assign(all_use=df.apply(
        lambda x: list(x[[col for col in df.columns
                          if col.endswith('_use') and col != 'status_drug_use']]),
        axis=1))
    .explode('all_use')
    .query('@user_defined_str == all_use')
    .drop_duplicates()
    .drop(columns='all_use')
)
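As a footnote, since the question originally mentioned filter with a regex: a boolean-mask variant along those lines (my own sketch, not required for the solution above) could look like this:

# Pick the "<ANYTHING>_use" columns with a regex, excluding status_drug_use
use_cols = df.filter(regex=r'_use$').columns.difference(['status_drug_use'])

# Scenario 1: row-wise comparison against status_drug_use
df2_alt = df[df[use_cols].eq(df['status_drug_use'], axis=0).any(axis=1)]

# Scenario 2: comparison against a user-defined string
user_defined_str = 'analgesic'
df3_alt = df[df[use_cols].eq(user_defined_str).any(axis=1)]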