How to keep track of which column an element came from in another pandas dataframe

I have the following df:

import numpy as np
import pandas as pd
df = pd.DataFrame({"call 1": ['debit card', 'bond', np.nan],
                   "call 2": ['credit card', 'mortgage', 'spending limit'],
                   "call 3": ['payment limit', np.nan, np.nan]})

which is:

       call 1          call 2         call 3
0  debit card     credit card  payment limit
1        bond        mortgage            NaN
2         NaN  spending limit            NaN

I then did some clustering, which produced a new df:

dfc = pd.DataFrame( {'cluster 1': ['payment limit', 'spending limit'],
 'cluster 2': ['debit card', 'credit card'],
 'cluster 3': [ 'bond', 'mortgage']})

        cluster 1    cluster 2 cluster 3
0   payment limit   debit card      bond
1  spending limit  credit card  mortgage

Now, for each word in dfc, I want to know which column of df it comes from; for example, payment limit is originally from call 3. In other words, how can I build a new df from these two dataframes so that I get:

print(pd.DataFrame( {'cluster 1': [{'call 3': 'payment limit'}, {'call 2':'spending limit'}],
 'cluster 2': [{'call 1':'debit card'}, {'call 2':'credit card'}],
 'cluster 3': [ {'call 1':'bond'}, {'call 2':'mortgage'}]}))

CodePudding user response：

dfc.applymap(lambda x: df[df.eq(x)].dropna(how='all').dropna(axis=1).to_dict('records')[0])

Output:

                      cluster 1                  cluster 2               cluster 3
0   {'call 3': 'payment limit'}   {'call 1': 'debit card'}      {'call 1': 'bond'}
1  {'call 2': 'spending limit'}  {'call 2': 'credit card'}  {'call 2': 'mortgage'}
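
The one-liner above is dense, so here is a step-by-step sketch of the same idea. The helper name find_origin is made up for illustration, and the sketch assumes every value in dfc occurs exactly once in df (a value that is missing from df would raise an IndexError at the final [0]). It applies Series.map column by column instead of applymap, since DataFrame.applymap is deprecated in recent pandas releases in favor of DataFrame.map.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"call 1": ['debit card', 'bond', np.nan],
                   "call 2": ['credit card', 'mortgage', 'spending limit'],
                   "call 3": ['payment limit', np.nan, np.nan]})
dfc = pd.DataFrame({'cluster 1': ['payment limit', 'spending limit'],
                    'cluster 2': ['debit card', 'credit card'],
                    'cluster 3': ['bond', 'mortgage']})

def find_origin(value):
    # Hypothetical helper: return {source column: value} for one cell of dfc.
    mask = df.eq(value)            # True only where df holds this value
    hit = df[mask]                 # every other cell becomes NaN
    hit = hit.dropna(how='all')    # keep only the row(s) containing the value
    hit = hit.dropna(axis=1)       # keep only the column(s) without NaN
    return hit.to_dict('records')[0]

# Apply the helper to every cell, one column at a time.
result = dfc.apply(lambda col: col.map(find_origin))
print(result)
```

Each call scans the whole of df, which is fine for small frames but may get slow if dfc has many cells.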

CodePudding user response：

We can build a lookup dictionary that maps each value in the first dataframe to the column it came from. Then, in the second dataframe, we replace every value found in the lookup dictionary with its column name and pair that with the original value:

lookup_dict = {}
look_df = df.T

for col in look_df.columns:
    lookup_dict.update(dict(zip(look_df[col], look_df.index)))

pd.concat([dfc.replace(lookup_dict), dfc]).astype(str).groupby(level=0).agg(tuple)

Output:

                  cluster 1              cluster 2           cluster 3
0   (call 3, payment limit)   (call 1, debit card)      (call 1, bond)
1  (call 2, spending limit)  (call 2, credit card)  (call 2, mortgage)
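
A variation on the lookup idea, as a rough sketch: reshape df to long form with melt, drop the NaN rows so they never end up as dictionary keys, and build the value-to-column mapping in one step. The names long_df and origin are only illustrative, and the sketch assumes each value appears in a single column of df (with duplicates, the last column seen would win).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"call 1": ['debit card', 'bond', np.nan],
                   "call 2": ['credit card', 'mortgage', 'spending limit'],
                   "call 3": ['payment limit', np.nan, np.nan]})
dfc = pd.DataFrame({'cluster 1': ['payment limit', 'spending limit'],
                    'cluster 2': ['debit card', 'credit card'],
                    'cluster 3': ['bond', 'mortgage']})

# Long form: one row per (call, item) pair, with the NaN items removed.
long_df = df.melt(var_name='call', value_name='item').dropna(subset=['item'])

# Map each item to the call column it came from.
origin = dict(zip(long_df['item'], long_df['call']))

# Pair every value in dfc with its source column (None if it is not in df).
result = dfc.apply(lambda col: col.map(lambda v: (origin.get(v), v)))
print(result)
```

Because the lookup is built once, each cell of dfc costs only a dictionary access instead of a scan over df.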