I'm trying to subset (retrieve a set of rows from) a Python pandas data frame. My plan is to use DataFrame.filter with a regex string to identify the columns of interest, and then subset the rows based on the values in those columns.
For example, this is my mock data frame:
id status status_drug_use drugA drugA_use drugB drugB_use
0 1 analgesic 0 None 1 hypertensive
1 0 analgesic 1 analgesic 1 hypertensive
2 0 analgesic 1 hypertensive 0 None
3 1 analgesic 0 None 1 analgesic
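For reference, here is a minimal sketch of how this mock frame could be built (treating "None" as a real null is my assumption; it could equally be the literal string):

import pandas as pd

# Mock data from the table above; None stands in for the "None" cells
df = pd.DataFrame({
    'id':              [0, 1, 2, 3],
    'status':          [1, 0, 0, 1],
    'status_drug_use': ['analgesic', 'analgesic', 'analgesic', 'analgesic'],
    'drugA':           [0, 1, 1, 0],
    'drugA_use':       [None, 'analgesic', 'hypertensive', None],
    'drugB':           [1, 1, 0, 1],
    'drugB_use':       ['hypertensive', 'hypertensive', None, 'analgesic'],
})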
I would like all rows in which the value in drugA_use or drugB_use matches the value in status_drug_use. As per the example, this would return the two rows:
id status status_drug_use drugA drugA_use drugB drugB_use
1 0 analgesic 1 analgesic 1 hypertensive
3 1 analgesic 0 None 1 analgesic
There are a few column name conventions to stick with:
- status_drug_use is always there.
- The matching columns (drugA_use and drugB_use) always follow the template <ANYTHING>_use (see the sketch below).
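This naming is why I thought filter with a regex could pick out the relevant columns, something like the sketch below (untested, the regex is a guess):

# Select every "<ANYTHING>_use" column except status_drug_use itself
use_cols = df.filter(regex=r'_use$').columns.difference(['status_drug_use'])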
Alteration
There is a second scenario, in which I would like to compare a user-defined string, e.g. analgesic, against the two columns drugA_use and drugB_use, rather than using the content of status_drug_use.
CodePudding user response:
Here's a way to do what you've asked:
df2 = (
    df.assign(all_use=df.apply(
        lambda x: list(x[[col for col in df.columns
                          if col.endswith('_use') and col != 'status_drug_use']]),
        axis=1))
    .explode('all_use')
    .query('status_drug_use == all_use')
    .drop_duplicates()
    .drop(columns='all_use')
)
Input:
id status status_drug_use drugA drugA_use drugB drugB_use
0 0 1 analgesic 0 None 1 hypertensive
1 1 0 analgesic 1 analgesic 1 hypertensive
2 2 0 analgesic 1 hypertensive 0 None
3 3 1 analgesic 0 None 1 analgesic
Output:
id status status_drug_use drugA drugA_use drugB drugB_use
1 1 0 analgesic 1 analgesic 1 hypertensive
3 3 1 analgesic 0 None 1 analgesic
Explanation:
- find the subset of all columns ending in _use (excluding status_drug_use)
- add a column named all_use whose value for a given row is a list of the values in the columns ending in _use
- use explode() to add rows such that for each original row, there are now multiple rows, one for each of the values in all_use for the original row
- use query() to select only rows where status_drug_use matches the value in all_use
- use drop_duplicates() to eliminate rows in case there were multiple matches for any rows in the original dataframe (for example, if both drugA_use and drugB_use contained "analgesic" and so did status_drug_use)
- drop the column all_use as we no longer need it.
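If it helps to see the intermediate frames, here is the same pipeline broken into named steps (just a sketch; the intermediate variable names are mine):

# Columns ending in _use, excluding status_drug_use
use_cols = [c for c in df.columns if c.endswith('_use') and c != 'status_drug_use']

with_list = df.assign(all_use=df.apply(lambda row: [row[c] for c in use_cols], axis=1))
one_per_value = with_list.explode('all_use')                  # one row per _use value
matched = one_per_value.query('status_drug_use == all_use')   # keep rows that agree
df2 = matched.drop_duplicates().drop(columns='all_use')       # dedupe, drop the helper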
UPDATE: Addressing OP's question in a comment: 'Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"?'
You can do this by holding the user-defined query string in a variable (call it user_defined_str) and changing the contents of query(): replace the column name status_drug_use with the variable name prefixed by @, i.e. @user_defined_str (see the query() documentation for more detail).
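In isolation, the @ mechanism looks like this (a quick sketch against a single column; the variable name is arbitrary):

chosen = 'analgesic'
df.query('drugB_use == @chosen')   # @chosen refers to the Python variable, not a column

Applied to the full pipeline: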
user_defined_str = 'analgesic'
df3 = (
    df.assign(all_use=df.apply(
        lambda x: list(x[[col for col in df.columns
                          if col.endswith('_use') and col != 'status_drug_use']]),
        axis=1))
    .explode('all_use')
    .query('@user_defined_str == all_use')
    .drop_duplicates()
    .drop(columns='all_use')
)
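As a footnote, since the question originally mentioned filter with a regex: a boolean-mask variant along those lines (my own sketch, not required for the solution above) could look like this:

# Pick the "<ANYTHING>_use" columns with a regex, excluding status_drug_use
use_cols = df.filter(regex=r'_use$').columns.difference(['status_drug_use'])

# Scenario 1: row-wise comparison against status_drug_use
df2_alt = df[df[use_cols].eq(df['status_drug_use'], axis=0).any(axis=1)]

# Scenario 2: comparison against a user-defined string
user_defined_str = 'analgesic'
df3_alt = df[df[use_cols].eq(user_defined_str).any(axis=1)]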