count topics in each cluster and divide by a total number of topics for each student in python-CodePudding

I have two dataframes (df1, df2), df1 containing the student name, topic preference of each student, and df_topics containing topics.

Here is a sample input dataframe:

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'Name':['student 1', 'student 2', 'student 3', 'student 4'],
            'topics':['algebra; atom; geometry; evolution; food safety',
                      'chemical reaction; linear algebra; Probability; quantum',
                      'botany; electricity; mechanics',
                      'Statistics; botany; number theory; atom; evolution; Probability']})

   df2 = pd.DataFrame({'topics':['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                              'evolution', 'botany', 'quantum'],    
                    'cluster':[0, 0, 0, 1, 1, 2, 2, 3]
                   })

I want to represent each student by k (here k=4) dimensional binary vector, i.e. if the student has topics in df2, count topics are in each cluster and divide by a total number of topics for this student. for example, student 1 has two topics in cluster 0, algebra and geometry, which divide by 5 (total number of topics for student 1), we get 0.4 and etc.

the results should look like this :

CodePudding user response：

There is probably a better way to achieve what you want but it should work:

out = (
    df1.assign(topics=df1['topics'].str.split('; ')).explode('topics')
       .merge(df2, how='left', on='topics')
       .assign(size=lambda x: x.groupby('Name')['cluster'].transform('size'),
               count=lambda x: x.groupby(['Name', 'cluster'])['cluster'].transform('count'),
               ratio=lambda x: x['count'] / x['size'])
       .query('cluster.notna()').astype({'cluster': int})
       .drop_duplicates(['Name', 'cluster'])
       .pivot_table('ratio', 'Name', 'cluster', aggfunc='sum', fill_value=0, margins=True)
       .rename_axis(index=None, columns=None)
)

Output:

>>> out
                  0         1         2     3       All
student 1  0.400000  0.200000  0.200000  0.00  0.800000
student 2  0.250000  0.250000  0.000000  0.25  0.750000
student 3  0.000000  0.000000  0.333333  0.00  0.333333
student 4  0.166667  0.166667  0.333333  0.00  0.666667
All        0.816667  0.616667  0.866667  0.25  2.550000

CodePudding user response：

Explode df1 topics into one row per topic per student

  df1 =df1.assign(topics=df1['topics'].str.split(';')).explode('topics')#Explode each topic into a row

Map df2 topics and clusters to df1

df1= df1.assign(topics=df1['topics'].str.strip().map(dict(zip(df2['topics'].str.strip(), df2['cluster'].astype(str)))).fillna('5'))

Compute the total count per row. You will notice I filled NaN with 5 to ensure where nothing was mapped ias also counted. Compute ratio per row

new =pd.crosstab(df1.Name, df1.topics).apply(lambda x:x/x.sum(),axis=1).drop(columns=['5'])#, margins_name="count").agg(lambda x:round(x/x['count'],1), axis=1).drop(columns=['5','count'])

#Sum row wise and column wise

#new['All'],new.loc['All']= new.sum(1),new.sum(0)

new['All']= new.sum(1)
new.loc['All']=new.sum(0)
print(new)

Output

topics            0         1         2     3       All
Name                                                   
student 1  0.400000  0.200000  0.200000  0.00  0.800000
student 2  0.250000  0.250000  0.000000  0.25  0.750000
student 3  0.000000  0.000000  0.333333  0.00  0.333333
student 4  0.166667  0.166667  0.333333  0.00  0.666667
All        0.816667  0.616667  0.866667  0.25  2.550000

CodePudding user response：

Here's an answer that delegates most of the work to a function named cluster() called via apply() for the student dataframe:

import pandas as pd
import numpy as np
    
df1 = pd.DataFrame({'Name':['student 1', 'student 2', 'student 3', 'student 4'],
            'topics':['algebra; atom; geometry; evolution; food safety',
                      'chemical reaction; linear algebra; Probability; quantum',
                      'botany; electricity; mechanics',
                      'Statistics; botany; number theory; atom; evolution; Probability']})

df2 = pd.DataFrame({'topics':['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                              'evolution', 'botany', 'quantum'],    
                    'cluster':[0, 0, 0, 1, 1, 2, 2, 3]
                   })
print(df1)
print('\n', df2)

clusterNames = df2['cluster'].unique().tolist()   ['All']
clusters = [set(df2[df2['cluster'] == c]['topics']) for c in clusterNames[:-1]]
def cluster(x):
    studentTopics = set(top.strip() for top in x['topics'].split(';'))
    clusterPct = [len(clusterTopics & studentTopics) / len(studentTopics) for clusterTopics in clusters]
    clusterPct  = [sum(clusterPct)]
    return clusterPct
df1[clusterNames] = df1.apply(cluster, axis=1).tolist()
df1 = pd.concat([df1.drop('topics', axis=1), pd.DataFrame([['All']   df1[clusterNames].sum(axis=0).tolist()], columns=['Name']   clusterNames)])
print('\n', df1)

Output:

        Name                                             topics
0  student 1    algebra; atom; geometry; evolution; food safety
1  student 2  chemical reaction; linear algebra; Probability...
2  student 3                     botany; electricity; mechanics
3  student 4  Statistics; botany; number theory; atom; evolu...

               topics  cluster
0            algebra        0
1        Probability        0
2           geometry        0
3               atom        1
4  chemical reaction        1
5          evolution        2
6             botany        2
7            quantum        3

         Name         0         1         2     3       All
0  student 1  0.400000  0.200000  0.200000  0.00  0.800000
1  student 2  0.250000  0.250000  0.000000  0.25  0.750000
2  student 3  0.000000  0.000000  0.333333  0.00  0.333333
3  student 4  0.166667  0.166667  0.333333  0.00  0.666667
0        All  0.816667  0.616667  0.866667  0.25  2.550000