I am trying to stack this table based on ID column but only considering columns [A-D] where the value is 1 and not 0.
Current df:
ID | A | B | C | D |
---|---|---|---|---|
1 | 1 | 0 | 0 | 1 |
3 | 0 | 1 | 0 | 1 |
7 | 1 | 0 | 1 | 1 |
8 | 1 | 0 | 0 | 0 |
What I want:
ID | LETTER |
---|---|
1 | A |
1 | D |
3 | B |
3 | D |
7 | A |
7 | C |
7 | D |
8 | A |
The following code works but I need a more efficient solution as I have a df with 93434 rows x 12377 columns.
stacked_df = df.set_index('ID').stack().reset_index(name='has_letter').rename(columns={'level_1':'LETTER'})
stacked_df = stacked_df[stacked_df['has_letter']==1].reset_index(drop=True)
stacked_df.drop(['has_letter'], axis=1, inplace=True)
Thank you in advance!
CodePudding user response:
You can mask the non-1 values and stack
to remove the NaNs:
df2 = df.rename_axis(columns='LETTERS').set_index('ID')
stacked_df = (df2.where(df2.eq(1)).stack()
.reset_index().iloc[:,:2]
)
Output:
ID LETTERS
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
CodePudding user response:
Try:
print(
df.set_index("ID")
.apply(lambda x: x.index[x == 1], axis=1)
.reset_index()
.explode(0)
.rename(columns={0: "LETTERS"})
)
Prints:
ID LETTERS
0 1 A
0 1 D
1 3 B
1 3 D
2 7 A
2 7 C
2 7 D
3 8 A
Or:
x = df.set_index("ID").stack()
print(
x[x == 1]
.reset_index()
.drop(columns=0)
.rename(columns={"level_1": "LETTER"})
)
Prints:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
CodePudding user response:
Try this
(df.set_index('ID').dot(df.columns[1:]) # use inner product of column names and values
.apply(list) # separate each letter
.explode() # explode each list
.reset_index(name='LETTER') # reset index for df
)
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A