Melt dataframe based on condition


import pandas as pd

d = {'key': [1, 2, 3], 'a': [True, True, False], 'b': [False, False, True]}
df = pd.DataFrame(d)

My current melt step is:

df2 = df.melt(id_vars=['key'], var_name = 'letter', value_name = 'Bool')
df2 = df2.query('Bool == True')

Is there a way to incorporate that 'True' condition in the melt function itself? As I continue to add entries to my df, and since I have hundreds of columns, I assume it's much less costly to pull only the values I need instead of melting the entire df and then filtering. Any ideas?

CodePudding user response:

Use pd.melt, replacing False with NaN first and then dropping the NaN rows with dropna():

import numpy as np

pd.melt(df.replace(False, np.nan), id_vars=['key'],
        var_name='letter', value_name='Bool').dropna()



    key letter  Bool
0    1      a  True
1    2      a  True
5    3      b  True

CodePudding user response:

You can filter the non-key columns first, melt the results, and concatenate the melted rows back together. See the following:

import pandas as pd
import numpy as np
import time

d = {'key': [1, 2, 3], 'a': [True, True, False], 'b': [False, False, True]}
df = pd.DataFrame(d)

start_time = time.time()
key_column_name = 'key'
filtered_frame = None
for letter in [s for s in df.columns if s != key_column_name]:
    # Row positions where this column is True.
    true_booleans = np.nonzero(df[letter].values)[0]
    # Melt only those rows, keeping just the key and the current column.
    melted_df = (df.iloc[true_booleans][[key_column_name, letter]]
                   .reset_index(drop=True)
                   .melt(id_vars=[key_column_name], var_name='letter', value_name='Bool'))
    if filtered_frame is None:
        filtered_frame = melted_df
    else:
        filtered_frame = pd.concat((filtered_frame, melted_df), axis=0)
end_time = time.time()
print(filtered_frame, '\n\n', end_time - start_time, 'seconds!')

Output

   key letter  Bool
0    1      a  True
1    2      a  True
0    3      b  True 

 0.011133432388305664 seconds!

Compared to your code, it is slower on this small example (your version takes 0.008090734481811523 seconds!), but as the number of rows increases I would expect the approach above to be more efficient. Looking forward to the results.
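
If you want to check the scaling claim yourself, here is a minimal timing sketch; the frame size, sparsity, and column names are made up for illustration, and the numbers will vary by machine:

import time

import numpy as np
import pandas as pd

# Hypothetical wide frame: 100_000 rows, 50 boolean columns, ~10% True.
rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 50
df = pd.DataFrame(rng.random((n_rows, n_cols)) < 0.1,
                  columns=[f'c{i}' for i in range(n_cols)])
df.insert(0, 'key', range(n_rows))

# Baseline: melt everything, then filter.
start = time.time()
df2 = df.melt(id_vars=['key'], var_name='letter', value_name='Bool')
df2 = df2.loc[df2['Bool']]
print('melt + filter:', time.time() - start, 'seconds')

# Per-column loop: select the True rows first, melt only those.
start = time.time()
parts = []
for letter in [c for c in df.columns if c != 'key']:
    true_rows = np.nonzero(df[letter].values)[0]
    parts.append(df.iloc[true_rows][['key', letter]]
                   .melt(id_vars=['key'], var_name='letter', value_name='Bool'))
filtered = pd.concat(parts, axis=0)
print('loop + concat:', time.time() - start, 'seconds')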

CodePudding user response:

The melt-and-filter step is efficient though; I'd probably stick with loc instead of query, especially if your data is not that large (< 200_000 rows).
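
With the example data above, the loc version of that filter would be just:

df2 = df.melt(id_vars=['key'], var_name='letter', value_name='Bool')
df2 = df2.loc[df2['Bool']]  # plain boolean mask; no string expression to parse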

Another option is to skip melt, use numpy, and build a new dataframe:

box = df.iloc[:, 1:]  # the boolean columns
len_df = len(df)
# Tile the column names into a (rows, columns) grid, then keep the name
# wherever the grid is True; this assumes exactly one True per row.
letters = np.tile(box.columns, (len_df, 1))[box]
pd.DataFrame({'key': df.key.array,
              'letter': letters,
              'Bool': [True] * len_df})

   key letter  Bool
0    1      a  True
1    2      a  True
2    3      b  True
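
Note that the letters construction above assumes exactly one True per row, which holds for the example data. As a sketch, assuming rows could instead contain zero or several True values, np.nonzero over the boolean block avoids that assumption:

rows, cols = np.nonzero(box.to_numpy())  # row/column positions of every True
pd.DataFrame({'key': df['key'].to_numpy()[rows],
              'letter': box.columns.to_numpy()[cols],
              'Bool': True})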

CodePudding user response:

melt moves column data and stacks it vertically, resulting in two new columns: one holding the name of the column being stacked (the variable) and one holding its values.

d = {'key': [1, 2, 3], 'a': [True, True, False], 'b': [False, False, True],
     'c': ['Bachelor', 'Masters', 'Doctorate']}
df = pd.DataFrame(d)

df2 = df.melt(id_vars=['key'], var_name='letter', value_name='Boolean')

# Drop the rows where the melted value is False.
df2 = df2.drop(df2.index[df2['Boolean'] == False])

print(df2)

Output

   key letter    Boolean
0    1      a       True
1    2      a       True
5    3      b       True
6    1      c   Bachelor
7    2      c    Masters
8    3      c  Doctorate