I need to transfom the dataset and learn to do it by myself-CodePudding

I want to transform my dataset such way:

Input:

id	drug1_drug	drug2_drug	drug3_drug	age
1	0	1	0	33
1	0	1	0	33
1	0	1	0	33
2	0	1	1	45
2	0	1	1	45
3	1	1	0	66
3	1	1	0	66
3	1	1	0	66
4	1	0	1	28
4	1	0	1	28
5	1	0	0	87
5	1	0	0	87

Output:

id	therapy
1	Age:33,Therapy: drug2
2	Age:45, Therapy:drug2,drug3
3	Age:66, Therapy:drug1,drug2
4	Age:28, Therapy:drug1,drug3
5	Age:87, Therapy:drug1

Column names with _drug are constant for a particular id. And i need if it '1' put the name of the drug to the new table. And i need that each "drug" will be on the new line in a table. Iam not so good at pandas but i want to fix it, not to ask questions on stackoverflow)). I would be very grateful if you can advise me some courses for pandas library to understand how to use functions and to transform datasets!

CodePudding user response：

You can reshape with melt, then filter the values with query, groupby ti to aggregate the unique drugs, and finally concatenate the strings to form the output:

df2 = (df.copy()
   .melt(['id', 'age'], var_name='drug')  # reshape to long format
   .query('value == 1')              # filter values
   .assign(drug=lambda d: d['drug']       # remove "_drug"
                          .str.replace('_drug', '',
                                       regex=False))
   .groupby(['id', 'age'])['drug']             # aggregate unique
   .agg(lambda g: ','.join(dict.fromkeys(g)))  # drugs per id
   .reset_index()
   # concatenate strings
   .assign(therapy=lambda d: 'Age: ' d.pop('age').astype(str)
                             ', Therapy: ' d.pop('drug'))
)

output:

   id                        therapy
0   1        Age: 33, Therapy: drug2
1   2  Age: 45, Therapy: drug2,drug3
2   3  Age: 66, Therapy: drug1,drug2
3   4  Age: 28, Therapy: drug1,drug3
4   5        Age: 87, Therapy: drug1

If you want to learn pandas, read the doc for all the functions used above and test the provided examples. That should be a great startup!

CodePudding user response：

As mentionned above, you ca start with any pandas course in youtube (there are many decent ones!) and you can also check the pandas official documentation. You should practice a lot and allow yourself to make errors, it ll be easier through time. For your problem, here's a simple answer:

import pandas as pd

# dont mind this, its just to have a mock example of your data.
df = pd.read_csv('data.csv', sep='\t', index_col=0)

# generating the present  drugs combinations 
drugs = [' '.join(['drug1'*el[0], 'drug2'*el[1], 'drug3'*el[2]]) for el in df[['drug1_drug', 'drug2_drug', 'drug3_drug']].values]

# creating the new df column with the desired output
df['therapy'] = [f"Age:{age},Therapy: {drug}" for age, drug in zip(df['age'], drugs)]

# printing the result
print(df)

output:

    drug1_drug  drug2_drug  drug3_drug  age                       therapy
id                                                                       
1            0           1           0   33       Age:33,Therapy:  drug2 
1            0           1           0   33       Age:33,Therapy:  drug2 
1            0           1           0   33       Age:33,Therapy:  drug2 
2            0           1           1   45  Age:45,Therapy:  drug2 drug3
2            0           1           1   45  Age:45,Therapy:  drug2 drug3
3            1           1           0   66  Age:66,Therapy: drug1 drug2 
3            1           1           0   66  Age:66,Therapy: drug1 drug2

id	drug1_drug	drug2_drug	drug3_drug	age
1	0	1	0	33
1	0	1	0	33
1	0	1	0	33
2	0	1	1	45
2	0	1	1	45
3	1	1	0	66
3	1	1	0	66
3	1	1	0	66
4	1	0	1	28
4	1	0	1	28
5	1	0	0	87
5	1	0	0	87

id	drug1_drug	drug2_drug	drug3_drug	age
1	0	1	0	33
1	0	1	0	33
1	0	1	0	33
2	0	1	1	45
2	0	1	1	45
3	1	1	0	66
3	1	1	0	66
3	1	1	0	66
4	1	0	1	28
4	1	0	1	28
5	1	0	0	87
5	1	0	0	87

id	drug1_drug	drug2_drug	drug3_drug	age
1	0	1	0	33
1	0	1	0	33
1	0	1	0	33
2	0	1	1	45
2	0	1	1	45
3	1	1	0	66
3	1	1	0	66
3	1	1	0	66
4	1	0	1	28
4	1	0	1	28
5	1	0	0	87
5	1	0	0	87