Home > Blockchain >  I need to transfom the dataset and learn to do it by myself
I need to transfom the dataset and learn to do it by myself

Time:08-27

I want to transform my dataset such way:

Input:

id drug1_drug drug2_drug drug3_drug age
1 0 1 0 33
1 0 1 0 33
1 0 1 0 33
2 0 1 1 45
2 0 1 1 45
3 1 1 0 66
3 1 1 0 66
3 1 1 0 66
4 1 0 1 28
4 1 0 1 28
5 1 0 0 87
5 1 0 0 87

Output:

id therapy
1 Age:33,Therapy: drug2
2 Age:45, Therapy:drug2,drug3
3 Age:66, Therapy:drug1,drug2
4 Age:28, Therapy:drug1,drug3
5 Age:87, Therapy:drug1

Column names with _drug are constant for a particular id. And i need if it '1' put the name of the drug to the new table. And i need that each "drug" will be on the new line in a table. Iam not so good at pandas but i want to fix it, not to ask questions on stackoverflow)). I would be very grateful if you can advise me some courses for pandas library to understand how to use functions and to transform datasets!

CodePudding user response:

You can reshape with melt, then filter the values with query, groupby ti to aggregate the unique drugs, and finally concatenate the strings to form the output:

df2 = (df.copy()
   .melt(['id', 'age'], var_name='drug')  # reshape to long format
   .query('value == 1')              # filter values
   .assign(drug=lambda d: d['drug']       # remove "_drug"
                          .str.replace('_drug', '',
                                       regex=False))
   .groupby(['id', 'age'])['drug']             # aggregate unique
   .agg(lambda g: ','.join(dict.fromkeys(g)))  # drugs per id
   .reset_index()
   # concatenate strings
   .assign(therapy=lambda d: 'Age: ' d.pop('age').astype(str)
                             ', Therapy: ' d.pop('drug'))
)

output:

   id                        therapy
0   1        Age: 33, Therapy: drug2
1   2  Age: 45, Therapy: drug2,drug3
2   3  Age: 66, Therapy: drug1,drug2
3   4  Age: 28, Therapy: drug1,drug3
4   5        Age: 87, Therapy: drug1

If you want to learn pandas, read the doc for all the functions used above and test the provided examples. That should be a great startup!

CodePudding user response:

As mentionned above, you ca start with any pandas course in youtube (there are many decent ones!) and you can also check the pandas official documentation. You should practice a lot and allow yourself to make errors, it ll be easier through time. For your problem, here's a simple answer:

import pandas as pd

# dont mind this, its just to have a mock example of your data.
df = pd.read_csv('data.csv', sep='\t', index_col=0)

# generating the present  drugs combinations 
drugs = [' '.join(['drug1'*el[0], 'drug2'*el[1], 'drug3'*el[2]]) for el in df[['drug1_drug', 'drug2_drug', 'drug3_drug']].values]

# creating the new df column with the desired output
df['therapy'] = [f"Age:{age},Therapy: {drug}" for age, drug in zip(df['age'], drugs)]

# printing the result
print(df)

output:

    drug1_drug  drug2_drug  drug3_drug  age                       therapy
id                                                                       
1            0           1           0   33       Age:33,Therapy:  drug2 
1            0           1           0   33       Age:33,Therapy:  drug2 
1            0           1           0   33       Age:33,Therapy:  drug2 
2            0           1           1   45  Age:45,Therapy:  drug2 drug3
2            0           1           1   45  Age:45,Therapy:  drug2 drug3
3            1           1           0   66  Age:66,Therapy: drug1 drug2 
3            1           1           0   66  Age:66,Therapy: drug1 drug2 
  • Related