Binary DataFrame to dictionary-CodePudding

Hi I am trying to convert a dataframe into a dictionary using python. The dataframe is similar as:

data_quim = {'Type':[1,2,3],'BCN':['X',np.nan,np.nan],'MAD':[np.nan,'X',np.nan],'MAT':['X',np.nan,'X']}

df = pd.DataFrame(data_quim)
df

   Type     BCN MAD MAT

0   1        X  NaN X
1   2       NaN X   NaN
2   3       NaN NaN X

I want to get a dictionary with this form:

{1:['BCN','MAT],
2:['MAD],
3:['MAT]}

I have tried to function to_dict(), with setting first 'Type' to index, but cannot get exactly what I want. Any thoughts?

CodePudding user response：

You could use:

(df.set_index('Type').stack()
   .reset_index(1)
   ['level_1'].groupby(level=0)
   .apply(list)
   .to_dict()
)

output:

{1: ['BCN', 'MAT'], 2: ['MAD'], 3: ['MAT']}

Or, using apply:

(df.set_index('Type')
   .apply(lambda s: list(s.dropna().index), axis=1)
   .to_dict()
)

CodePudding user response：

I really like @mozway's second answer for this specific problem but apply sometimes isn't fastest. Since pandas is so flexible, you could also do this with pd.melt which is a very flexible way for transforming wide data to narrow data and lean on groupby. This would be more helpful if you need to do something with more conditions/processing before putting it in the list (what if you have 'X','Y','NaN', and want to do something different with 'X' vs 'Y'):

df.melt(id_vars='Type')\
  .dropna()\
  .groupby('Type')\
  .agg({'variable': list})\
  .to_dict()['variable']

Or if you wanted to get out of pandas as quick as possible for some reason:

{
    row[0]: [
        label 
        for label, indicator in zip(df.columns[1:], row[1:]) 
        if indicator == 'X'
    ] 
    for row in df.values
}

The big picture is that this is a good problem for mapping the original wide table to a key-value pair for each 'X', and then reducing by key all of the values into a list. Since your rows are already grouped by key, I really like @mozway's second suggestion. My second suggestion is a way to do it outside of pandas that would also work well if you instead needed to work with a streaming iterator for each row.