Is there a better way to summerize a table information instead of using iloc and append method with-CodePudding

I have a table of grant application numbers and their corresponding disciplines given by cell value 1.

import pandas as pd
import numpy as np
data={'Application number':[0,1,2,3,4,5,6,7,8,9],
      'Physics':[0,0,1,0,0,0,0,1,1,0],
     'Chemistry':[1,0,0,0,0,0,0,0,0,1],
     'Biology':[0,1,0,1,0,0,1,0,0,0],
     'Mathematics':[0,0,0,0,1,1,0,0,0,0]}

#creation of dataframe
df=pd.DataFrame(data)

#column counting all disciplines per grant
df['All_Discipline_count']=df.loc[:,'Physics' : 'Mathematics'].sum(axis=1)

df.head(10)

I would like to summarize discipline list and discipline count per grant application. I do that using iloc and multiple nested loops.

# Creation of resulting dataframe
dfA = pd.DataFrame(columns = ['Application number', 'Discipline_list', 'All_Discipline_count'])

# Pay attention to how iloc a cell selects. 'Application number' is zeroth column. 
i=0 #starts from oth row
j=1 #starts from 1st column
Aanvraag_nummer=0
k=df.columns.get_loc("All_Discipline_count") #column number where the All_Discipline_count is
l=len(df.index)#number of rows
for i in range (0,l):
    Discipline_count=0 #introducing zero discipline count
    Discipline_list=" " #introducing empty discipline list
    for j in range (1,k): #counting columns of disciplines
        if (df.iloc[i,j]==1) & (Discipline_count<df.iloc[i,k]): #if the given cell has 1 as value
            Discipline_list=Discipline_list  df.columns[j] #adds a column name to discipline list
            Discipline_count =1 #counts the number of disciplines with 1 as value
            if Discipline_count==df.iloc[i,k]:#if all disciplines are counted
                Aanvraag_nummer=df.iloc[i,0]
                new_row = {'Application number':Aanvraag_nummer, 'Discipline_list':Discipline_list, 'All_Discipline_count':df.iloc[i,k]}
                dfA = dfA.append(new_row, ignore_index=True)
dfA.head(10)

The script works for 10 to 100 applications and 20 disciplines as columns. It also works when there are multiple disciplines are given per grant application.

However, I notice that I get warning while running the code.

/tmp/ipykernel_26718/1290491379.py:19: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

The code is also slow..Any better method to get the same results?

CodePudding user response：

You just need to use pandas.idxmax in this case. Be aware: This only works if there is only 1 application per row. If there are multiple it will only choose the first it finds.

df['Application Count'] = df.iloc[:,1:].idxmax(axis=1) # iloc because we only want to apply idxmax on the applications (without application number)
#or
df['Application Count'] = df.set_index('Application number').idxmax(axis=1)

In case you have multiple subjects per row here is another solution (I changed Input for that a bit, in row 0 and 8 there are 2 subjects, in row 4 we have 4 subjects.

data={'Application number':[0,1,2,3,4,5,6,7,8,9],
      'Physics':[1,0,1,0,1,0,0,1,1,0],
     'Chemistry':[1,0,0,0,1,0,0,0,0,1],
     'Biology':[0,1,0,1,1,0,1,0,0,0],
     'Mathematics':[0,0,0,0,1,1,0,0,1,0]}

df=pd.DataFrame(data)
print(df)

   Application number  Physics  Chemistry  Biology  Mathematics
0                   0        1          1        0            0
1                   1        0          0        1            0
2                   2        1          0        0            0
3                   3        0          0        1            0
4                   4        1          1        1            1
5                   5        0          0        0            1
6                   6        0          0        1            0
7                   7        1          0        0            0
8                   8        1          0        0            1
9                   9        0          1        0            0

df = df.set_index('Application number')

out = (
    df[df==1]
    .stack()
    .reset_index()
    .drop(0, axis=1)
    .rename(columns={'level_1': 'Discipline_list'})
    .groupby('Application number', as_index=False)
    .agg(Discipline_list=('Discipline_list', lambda x: ', '.join(x)), All_Discipline_count=('Discipline_list', 'count'))
)

print(out)

   Application number                           Discipline_list  All_Discipline_count
0                   0                        Physics, Chemistry                     2
1                   1                                   Biology                     1
2                   2                                   Physics                     1
3                   3                                   Biology                     1
4                   4  Physics, Chemistry, Biology, Mathematics                     4
5                   5                               Mathematics                     1
6                   6                                   Biology                     1
7                   7                                   Physics                     1
8                   8                      Physics, Mathematics                     2
9                   9                                 Chemistry                     1

CodePudding user response：

you can make Discipline_list following code:

df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)

output:

0      Chemistry
1        Biology
2        Physics
3        Biology
4    Mathematics
5    Mathematics
6        Biology
7        Physics
8        Physics
9      Chemistry
dtype: object

If there is over 1 discipline list, it is expressed like 'Biology/Phsics'

make result to Discipline_list column

df['Discipline_list'] = df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)