Why is pd.crosstab not giving the expected output in pandas?

I have two DataFrames, which I am calling df1 and df2.

df1 has the columns KPI and Context and looks like this:

    KPI                                                 Context 
0   Does the company have a policy in place to man...   Anti-Bribery Policy\nBroadridge does not toler...   
1   Does the company have a supplier code of conduct?   Vendor Code of Conduct Our vendors play an imp...   
2   Does the company have a grievance/complaint ha...   If you ever have a question or wish to report ...   
3   Does the company have a human rights policy ?   Human Rights Statement of Commitment Broadridg...   
4   Does the company have a policies consistent wi...   Anti-Bribery Policy\nBroadridge does not toler...

df2 has a single column, 'Keyword'.

df2:

    Keyword
0   1.5 degree
1   1.5°
2   2 degree
3   2°
4   accident

I want to build another DataFrame from these two: for each KPI in df1, count how many times each value of the 'Keyword' column of df2 occurs in the 'Context' text.

For this I used pd.crosstab(), but I suspect it is not giving me the expected output.

Here's what I have tried so far:

new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())

new_df looks like this:

   KPI                                                       1.5 degree  1.5°  \
0  Does the Supplier code of conduct cover one or...         NaN   NaN   
1  Does the companies have sites/operations locat...         NaN   NaN   
2  Does the company have a due diligence process ...         NaN   NaN   
3  Does the company have a grievance/complaint ha...         NaN   NaN   
4  Does the company have a grievance/complaint ha...         NaN   NaN   

   2 degree  2°          accident  
0       NaN NaN           NaN              
1       NaN NaN           NaN             
2       NaN NaN           NaN          
3       1.0 NaN           NaN              
4       NaN NaN           NaN

The expected output is something like this:

    KPI                                             1.5 degree  1.5°  2 degree  2°  accident
0   Does the company have a policy in place to man  44          2     3         5   9

What exactly am I missing? Please let me know, thanks!

CodePudding user response：

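There are several problems. First, explode works on list-like values (for example, the result of a split), not on plain strings, so here it leaves both columns unchanged. Second, to extract the keywords from Context you need Series.str.findall. Third, crosstab should be given columns from the same DataFrame, not from two different ones, because it pairs the two inputs row by row: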

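import re

import pandas as pd

# One alternation of all keywords, escaped and wrapped in word boundaries.
# Caveat: \b needs a word character on one side, so a keyword ending in a
# symbol such as "°" only matches when a letter, digit or underscore follows.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])

# List of every keyword found in each Context (case-insensitive).
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)

# One row per found keyword, then count keywords per KPI.
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])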
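
To try the idea end to end, here is a minimal sketch on made-up data. The KPI names and Context strings are hypothetical, not the real df1. It also makes three choices that are assumptions on my side rather than part of the approach above: lookarounds instead of \b, so that keywords ending in "°" are found when followed by a space; lower-casing the matches, which assumes the keywords in df2 are lower-case; and a final reindex, so that keywords and KPIs with no match still appear with a count of 0.

```python
import re

import pandas as pd

# Hypothetical miniature versions of df1 and df2 (not the real data).
df1 = pd.DataFrame({
    "KPI": ["kpi A", "kpi B", "kpi C"],
    "Context": [
        "Limit warming to 2 degree. An Accident is still an accident.",
        "The 1.5° target and the 2° target.",
        "No keyword here.",
    ],
})
df2 = pd.DataFrame({"Keyword": ["1.5 degree", "1.5°", "2 degree", "2°", "accident"]})

# Longest keywords first, so the alternation prefers the longer candidate.
keywords = sorted(df2["Keyword"], key=len, reverse=True)

# Lookarounds instead of \b (an assumption about the desired matching):
# the keyword must not be directly preceded or followed by a word character.
pat = "|".join(r"(?<!\w){}(?!\w)".format(re.escape(k)) for k in keywords)

# One list of matches per row, lower-cased so that "Accident" and
# "accident" end up in the same column.
matches = df1["Context"].str.findall(pat, flags=re.I)
df1["new"] = matches.apply(lambda found: [m.lower() for m in found])

# One row per match; a row with no match becomes NaN and is ignored by crosstab.
exploded = df1.explode("new", ignore_index=True)
counts = pd.crosstab(exploded["KPI"], exploded["new"])

# Bring back every KPI and every keyword, with 0 where nothing was found.
counts = counts.reindex(
    index=df1["KPI"].unique().tolist(),
    columns=df2["Keyword"].tolist(),
    fill_value=0,
)
print(counts)
```

With this sample, "kpi C" has no keyword in its Context, so it only shows up because of the reindex step. Note that with the lookarounds a text like "2°C" would not count as "2°"; if it should, the pattern needs to be adapted.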