import pandas as pd

d = {'key': [1, 2, 3], 'a': [True, True, False], 'b': [False, False, True]}
df = pd.DataFrame(d)
Current melt function is:
df2 = df.melt(id_vars=['key'], var_name='letter', value_name='Bool')
df2 = df2.query('Bool == True')
Is there a way to incorporate that 'True' condition into the melt step itself? As I continue to add entries to my df and I have hundreds of columns, I assume it's much less costly to pull only the values I need instead of melting the entire df and then filtering. Any ideas?
CodePudding user response:
One option is to replace False with np.nan before melting, then drop those rows with dropna():
pd.melt(df.replace(False, np.nan), id_vars=['key'], var_name='letter', value_name='Bool').dropna()
   key letter  Bool
0    1      a  True
1    2      a  True
5    3      b  True
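One caveat with the replace(False, np.nan) trick: mixing NaN into a boolean column upcasts it, so the value column no longer has bool dtype after the melt. If you need it back, cast after dropping. A minimal sketch of the same pipeline with the cast added:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3],
                   'a': [True, True, False],
                   'b': [False, False, True]})

out = (pd.melt(df.replace(False, np.nan),  # False -> NaN so dropna() removes it
               id_vars=['key'], var_name='letter', value_name='Bool')
       .dropna()
       .astype({'Bool': bool}))  # restore a boolean dtype after the upcast
print(out)
```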
CodePudding user response:
You can filter the non-key columns first, melt the results, and concat the melted rows back together. See the following:
import pandas as pd
import numpy as np
import time

d = {'key': [1, 2, 3], 'a': [True, True, False], 'b': [False, False, True]}
df = pd.DataFrame(d)

start_time = time.time()
key_column_name = 'key'
filtered_frame = None
for letter in [s for s in list(df.columns) if s != key_column_name]:
    true_booleans = np.nonzero(df[letter].values)[0]
    melted_df = (df.iloc[true_booleans][[key_column_name, letter]]
                   .reset_index(drop=True)
                   .melt(id_vars=[key_column_name], var_name='letter', value_name='Bool'))
    if filtered_frame is None:
        filtered_frame = melted_df
    else:
        filtered_frame = pd.concat((filtered_frame, melted_df), axis=0)
end_time = time.time()

print(filtered_frame, '\n\n', end_time - start_time, 'seconds!')
Output
   key letter  Bool
0    1      a  True
1    2      a  True
0    3      b  True

0.011133432388305664 seconds!
Compared to your code it is slower on this tiny frame (your version takes 0.008090734481811523 seconds), but as the rows increase I would expect this approach to become more efficient. Looking forward to the results.
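Whether the per-column loop actually wins will depend on the shape and sparsity of the real data, so it is worth measuring on something closer to it. A quick benchmark sketch of the plain melt-then-filter baseline; the row/column counts and the 10% True density here are arbitrary placeholders:

```python
import time
import numpy as np
import pandas as pd

# Build a wide boolean frame; sizes are made up for the benchmark.
rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 50
df = pd.DataFrame(rng.random((n_rows, n_cols)) < 0.1,
                  columns=[f'c{i}' for i in range(n_cols)])
df.insert(0, 'key', range(n_rows))

start = time.time()
out = df.melt(id_vars=['key'], var_name='letter', value_name='Bool')
out = out.loc[out['Bool']]  # boolean-mask filter instead of query
print(f'melt + filter: {time.time() - start:.3f}s, {len(out)} True rows')
```

Swapping in the loop-and-concat version over the same frame gives a like-for-like comparison.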
CodePudding user response:
The melt and filter step is efficient though; I'd probably stick with loc instead of query, especially if your data is not that large (< 200_000 rows). Another option is to skip melt, use numpy, and build a new dataframe:
box = df.iloc[:, 1:]
len_df = len(df)
letters = np.tile(box.columns, (len_df, 1))[box]
pd.DataFrame({'key': df.key.array,
              'letter': letters,
              'Bool': [True] * len_df})
   key letter  Bool
0    1      a  True
1    2      a  True
2    3      b  True
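Note that `[True] * len_df` assumes exactly one True per row, which happens to hold for the example data. If rows can contain zero or several True values, np.nonzero on the boolean block yields one output row per True instead; a sketch of that generalization:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3],
                   'a': [True, True, False],
                   'b': [False, False, True]})

box = df.iloc[:, 1:]
rows, cols = np.nonzero(box.to_numpy())  # row/column position of every True
out = pd.DataFrame({'key': df['key'].to_numpy()[rows],
                    'letter': box.columns.to_numpy()[cols],
                    'Bool': True})
print(out)
```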
CodePudding user response:
melt moves column data and stacks it vertically, resulting in two new columns: one holding the variable name of the column being stacked and one holding its value.
d = {'key': [1, 2, 3], 'a': [True, True, False], 'b': [False, False, True], 'c': ['Bachelor', 'Masters', 'Doctorate']}
df = pd.DataFrame(d)
df2 = df.melt(id_vars=['key'], var_name='letter', value_name='Boolean')
df2 = df2.drop(df2.index[df2['Boolean'] == False])
print(df2)
output
   key letter    Boolean
0    1      a       True
1    2      a       True
5    3      b       True
6    1      c   Bachelor
7    2      c    Masters
8    3      c  Doctorate