Create new dataframes for each unique value in a column in pandas

I have a presence/absence dataframe that looks like this (the real one is much larger; I have reduced it for this question):

annotations   factor1   factor2   factor3   Class
heroine       1         0         1         OPIOID_TYPE
he smokes     0         1         0         OTHER_DRUG_USE
heroin        1         0         1         OPIOID_TYPE

What I would like to do is create a new dataframe for each unique value in 'Class'. In each new dataframe, the 'Class' column is replaced by a column named after that value, which records presence (1) or absence (0) of that class in each row.

In other words:

annotations   factor1   factor2   factor3   OPIOID_TYPE
heroine       1         0         1         1
he smokes     0         1         0         0
heroin        1         0         1         1

and:

annotations   factor1   factor2   factor3   OTHER_DRUG_USE
heroine       1         0         1         0
he smokes     0         1         0         1
heroin        1         0         1         0

In reality, my dataframe is much larger with 2289 rows and 1273 columns and exactly 23 unique values in 'Class' for a total of 23 new dataframes.

I assume a loop would work here, but I have limited experience with looping in Python.

CodePudding user response：

You can iterate over your Class values:

dfs = {}
for klass in df['Class'].unique():
    dfs[klass] = df.assign(**{klass: df['Class'].eq(klass).astype(int)}) \
                   .drop(columns='Class')

Now you have a dict indexed by Class values:

>>> dfs.keys()
dict_keys(['OPIOID_TYPE', 'OTHER_DRUG_USE'])

>>> dfs['OPIOID_TYPE']
  annotations  factor1  factor2  factor3  OPIOID_TYPE
0     heroine        1        0        1            1
1   he smokes        0        1        0            0
2      heroin        1        0        1            1

>>> dfs['OTHER_DRUG_USE']
  annotations  factor1  factor2  factor3  OTHER_DRUG_USE
0     heroine        1        0        1               0
1   he smokes        0        1        0               1
2      heroin        1        0        1               0
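
For reference, here is a self-contained sketch of the same idea that rebuilds the sample data from the question and wraps the loop in a helper. The function name split_by_class is hypothetical, and the sketch assumes the class labels are strings that do not collide with existing column names:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "annotations": ["heroine", "he smokes", "heroin"],
    "factor1": [1, 0, 1],
    "factor2": [0, 1, 0],
    "factor3": [1, 0, 1],
    "Class": ["OPIOID_TYPE", "OTHER_DRUG_USE", "OPIOID_TYPE"],
})


def split_by_class(frame, column="Class"):
    """Return {class value: copy of frame with a 0/1 indicator column}.

    Hypothetical helper; assumes the values in `column` are strings
    that are not already used as column names.
    """
    base = frame.drop(columns=column)  # drop once, reuse for every class
    return {
        value: base.assign(**{value: frame[column].eq(value).astype(int)})
        for value in frame[column].unique()
    }


dfs = split_by_class(df)
print(dfs["OTHER_DRUG_USE"])
```

Because drop and assign both return new objects, the original df keeps its 'Class' column, so the helper can be called again with a different column.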

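The following is strongly discouraged.

If you really want separate Python variables, you can use locals() to create them dynamically. This only works at module level (for example in a script, the REPL, or a notebook cell), where locals() returns the same dict as globals(); inside a function, writing to locals() does not create local variables: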

for idx, klass in enumerate(df['Class'].unique(), 1):
    print(f"df{idx} is for '{klass}' class")
    locals()[f"df{idx}"] = df.assign(**{klass: df['Class'].eq(klass).astype(int)}) \
                             .drop(columns='Class')

# Output:
df1 is for 'OPIOID_TYPE' class
df2 is for 'OTHER_DRUG_USE' class

The resulting variables:

>>> df1
  annotations  factor1  factor2  factor3  OPIOID_TYPE
0     heroine        1        0        1            1
1   he smokes        0        1        0            0
2      heroin        1        0        1            1

>>> df2
  annotations  factor1  factor2  factor3  OTHER_DRUG_USE
0     heroine        1        0        1               0
1   he smokes        0        1        0               1
2      heroin        1        0        1               0
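
One reason to avoid this pattern: it silently stops working once the code moves into a function. The sketch below is a minimal illustration of CPython behaviour, not part of the original answer; the function name try_dynamic_local and the placeholder value are made up for the example:

```python
def try_dynamic_local():
    # Attempt to create a local variable named "df1" through locals().
    locals()["df1"] = "placeholder"
    try:
        # "df1" is never assigned in this function, so the compiler treats
        # it as a global lookup, and no global with that name exists.
        return df1
    except NameError:
        return None


result = try_dynamic_local()
print(result)  # None: the write to locals() did not create a variable
```

A dict keyed by class name, as in the first part of this answer, behaves the same way at module level and inside functions.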

CodePudding user response：

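We can use str.get_dummies to build one indicator column per class and save the resulting dataframes in a dict. Note that df.pop removes 'Class' from df in place: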

s = df.pop('Class').str.get_dummies()
d = {x : df.join(s[[x]]) for x in s}

Example output:

d['OPIOID_TYPE']
Out[43]:
  annotations  factor1  factor2  factor3  OPIOID_TYPE
0     heroine        1        0        1            1
1   he smokes        0        1        0            0
2      heroin        1        0        1            1

d['OTHER_DRUG_USE']
Out[44]:
  annotations  factor1  factor2  factor3  OTHER_DRUG_USE
0     heroine        1        0        1               0
1   he smokes        0        1        0               1
2      heroin        1        0        1               0
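
If the original dataframe should stay intact, a variant along these lines may help. It is a sketch under the assumption that each row carries exactly one class label, and it swaps in pd.get_dummies with drop in place of str.get_dummies with pop; the sample data is copied from the question:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "annotations": ["heroine", "he smokes", "heroin"],
    "factor1": [1, 0, 1],
    "factor2": [0, 1, 0],
    "factor3": [1, 0, 1],
    "Class": ["OPIOID_TYPE", "OTHER_DRUG_USE", "OPIOID_TYPE"],
})

# One 0/1 column per class label; df itself is not modified.
dummies = pd.get_dummies(df["Class"], dtype=int)
base = df.drop(columns="Class")

# Join each indicator column onto the shared feature columns.
d = {label: base.join(dummies[[label]]) for label in dummies.columns}

print(d["OPIOID_TYPE"])
```

With 23 classes this computes all indicator columns in one call, and each join only adds a single column to the shared base.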