Home > Back-end >  Is there a better way to summerize a table information instead of using iloc and append method with
Is there a better way to summerize a table information instead of using iloc and append method with

Time:11-07

I have a table of grant application numbers and their corresponding disciplines given by cell value 1.

import pandas as pd
import numpy as np
data={'Application number':[0,1,2,3,4,5,6,7,8,9],
      'Physics':[0,0,1,0,0,0,0,1,1,0],
     'Chemistry':[1,0,0,0,0,0,0,0,0,1],
     'Biology':[0,1,0,1,0,0,1,0,0,0],
     'Mathematics':[0,0,0,0,1,1,0,0,0,0]}

#creation of dataframe
df=pd.DataFrame(data)

#column counting all disciplines per grant
df['All_Discipline_count']=df.loc[:,'Physics' : 'Mathematics'].sum(axis=1)

df.head(10)

I would like to summarize discipline list and discipline count per grant application. I do that using iloc and multiple nested loops.

# Creation of resulting dataframe
dfA = pd.DataFrame(columns = ['Application number', 'Discipline_list', 'All_Discipline_count'])

# Pay attention to how iloc a cell selects. 'Application number' is zeroth column. 
i=0 #starts from oth row
j=1 #starts from 1st column
Aanvraag_nummer=0
k=df.columns.get_loc("All_Discipline_count") #column number where the All_Discipline_count is
l=len(df.index)#number of rows
for i in range (0,l):
    Discipline_count=0 #introducing zero discipline count
    Discipline_list=" " #introducing empty discipline list
    for j in range (1,k): #counting columns of disciplines
        if (df.iloc[i,j]==1) & (Discipline_count<df.iloc[i,k]): #if the given cell has 1 as value
            Discipline_list=Discipline_list  df.columns[j] #adds a column name to discipline list
            Discipline_count =1 #counts the number of disciplines with 1 as value
            if Discipline_count==df.iloc[i,k]:#if all disciplines are counted
                Aanvraag_nummer=df.iloc[i,0]
                new_row = {'Application number':Aanvraag_nummer, 'Discipline_list':Discipline_list, 'All_Discipline_count':df.iloc[i,k]}
                dfA = dfA.append(new_row, ignore_index=True)
dfA.head(10)

The script works for 10 to 100 applications and 20 disciplines as columns. It also works when there are multiple disciplines are given per grant application.

However, I notice that I get warning while running the code.

/tmp/ipykernel_26718/1290491379.py:19: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

The code is also slow..Any better method to get the same results?

CodePudding user response:

You just need to use pandas.idxmax in this case. Be aware: This only works if there is only 1 application per row. If there are multiple it will only choose the first it finds.

df['Application Count'] = df.iloc[:,1:].idxmax(axis=1) # iloc because we only want to apply idxmax on the applications (without application number)
#or
df['Application Count'] = df.set_index('Application number').idxmax(axis=1)

In case you have multiple subjects per row here is another solution (I changed Input for that a bit, in row 0 and 8 there are 2 subjects, in row 4 we have 4 subjects.

data={'Application number':[0,1,2,3,4,5,6,7,8,9],
      'Physics':[1,0,1,0,1,0,0,1,1,0],
     'Chemistry':[1,0,0,0,1,0,0,0,0,1],
     'Biology':[0,1,0,1,1,0,1,0,0,0],
     'Mathematics':[0,0,0,0,1,1,0,0,1,0]}

df=pd.DataFrame(data)
print(df)
   Application number  Physics  Chemistry  Biology  Mathematics
0                   0        1          1        0            0
1                   1        0          0        1            0
2                   2        1          0        0            0
3                   3        0          0        1            0
4                   4        1          1        1            1
5                   5        0          0        0            1
6                   6        0          0        1            0
7                   7        1          0        0            0
8                   8        1          0        0            1
9                   9        0          1        0            0
df = df.set_index('Application number')

out = (
    df[df==1]
    .stack()
    .reset_index()
    .drop(0, axis=1)
    .rename(columns={'level_1': 'Discipline_list'})
    .groupby('Application number', as_index=False)
    .agg(Discipline_list=('Discipline_list', lambda x: ', '.join(x)), All_Discipline_count=('Discipline_list', 'count'))
)

print(out)
   Application number                           Discipline_list  All_Discipline_count
0                   0                        Physics, Chemistry                     2
1                   1                                   Biology                     1
2                   2                                   Physics                     1
3                   3                                   Biology                     1
4                   4  Physics, Chemistry, Biology, Mathematics                     4
5                   5                               Mathematics                     1
6                   6                                   Biology                     1
7                   7                                   Physics                     1
8                   8                      Physics, Mathematics                     2
9                   9                                 Chemistry                     1

CodePudding user response:

you can make Discipline_list following code:

df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)

output:

0      Chemistry
1        Biology
2        Physics
3        Biology
4    Mathematics
5    Mathematics
6        Biology
7        Physics
8        Physics
9      Chemistry
dtype: object

If there is over 1 discipline list, it is expressed like 'Biology/Phsics'


make result to Discipline_list column

df['Discipline_list'] = df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)
  • Related