I want to transform my dataset such way:
Input:
id | drug1_drug | drug2_drug | drug3_drug | age |
---|---|---|---|---|
1 | 0 | 1 | 0 | 33 |
1 | 0 | 1 | 0 | 33 |
1 | 0 | 1 | 0 | 33 |
2 | 0 | 1 | 1 | 45 |
2 | 0 | 1 | 1 | 45 |
3 | 1 | 1 | 0 | 66 |
3 | 1 | 1 | 0 | 66 |
3 | 1 | 1 | 0 | 66 |
4 | 1 | 0 | 1 | 28 |
4 | 1 | 0 | 1 | 28 |
5 | 1 | 0 | 0 | 87 |
5 | 1 | 0 | 0 | 87 |
Output:
id | therapy |
---|---|
1 | Age:33,Therapy: drug2 |
2 | Age:45, Therapy:drug2,drug3 |
3 | Age:66, Therapy:drug1,drug2 |
4 | Age:28, Therapy:drug1,drug3 |
5 | Age:87, Therapy:drug1 |
Column names with _drug are constant for a particular id. And i need if it '1' put the name of the drug to the new table. And i need that each "drug" will be on the new line in a table. Iam not so good at pandas but i want to fix it, not to ask questions on stackoverflow)). I would be very grateful if you can advise me some courses for pandas library to understand how to use functions and to transform datasets!
CodePudding user response:
You can reshape with melt
, then filter the values with query
, groupby
ti to aggregate the unique drugs, and finally concatenate the strings to form the output:
df2 = (df.copy()
.melt(['id', 'age'], var_name='drug') # reshape to long format
.query('value == 1') # filter values
.assign(drug=lambda d: d['drug'] # remove "_drug"
.str.replace('_drug', '',
regex=False))
.groupby(['id', 'age'])['drug'] # aggregate unique
.agg(lambda g: ','.join(dict.fromkeys(g))) # drugs per id
.reset_index()
# concatenate strings
.assign(therapy=lambda d: 'Age: ' d.pop('age').astype(str)
', Therapy: ' d.pop('drug'))
)
output:
id therapy
0 1 Age: 33, Therapy: drug2
1 2 Age: 45, Therapy: drug2,drug3
2 3 Age: 66, Therapy: drug1,drug2
3 4 Age: 28, Therapy: drug1,drug3
4 5 Age: 87, Therapy: drug1
If you want to learn pandas, read the doc for all the functions used above and test the provided examples. That should be a great startup!
CodePudding user response:
As mentionned above, you ca start with any pandas course in youtube (there are many decent ones!) and you can also check the pandas official documentation. You should practice a lot and allow yourself to make errors, it ll be easier through time. For your problem, here's a simple answer:
import pandas as pd
# dont mind this, its just to have a mock example of your data.
df = pd.read_csv('data.csv', sep='\t', index_col=0)
# generating the present drugs combinations
drugs = [' '.join(['drug1'*el[0], 'drug2'*el[1], 'drug3'*el[2]]) for el in df[['drug1_drug', 'drug2_drug', 'drug3_drug']].values]
# creating the new df column with the desired output
df['therapy'] = [f"Age:{age},Therapy: {drug}" for age, drug in zip(df['age'], drugs)]
# printing the result
print(df)
output:
drug1_drug drug2_drug drug3_drug age therapy
id
1 0 1 0 33 Age:33,Therapy: drug2
1 0 1 0 33 Age:33,Therapy: drug2
1 0 1 0 33 Age:33,Therapy: drug2
2 0 1 1 45 Age:45,Therapy: drug2 drug3
2 0 1 1 45 Age:45,Therapy: drug2 drug3
3 1 1 0 66 Age:66,Therapy: drug1 drug2
3 1 1 0 66 Age:66,Therapy: drug1 drug2