I have a table of grant application numbers and their corresponding disciplines given by cell value 1.
import pandas as pd
import numpy as np
data={'Application number':[0,1,2,3,4,5,6,7,8,9],
'Physics':[0,0,1,0,0,0,0,1,1,0],
'Chemistry':[1,0,0,0,0,0,0,0,0,1],
'Biology':[0,1,0,1,0,0,1,0,0,0],
'Mathematics':[0,0,0,0,1,1,0,0,0,0]}
#creation of dataframe
df=pd.DataFrame(data)
#column counting all disciplines per grant
df['All_Discipline_count']=df.loc[:,'Physics' : 'Mathematics'].sum(axis=1)
df.head(10)
I would like to summarize discipline list and discipline count per grant application. I do that using iloc and multiple nested loops.
# Creation of resulting dataframe
dfA = pd.DataFrame(columns = ['Application number', 'Discipline_list', 'All_Discipline_count'])
# Pay attention to how iloc a cell selects. 'Application number' is zeroth column.
i=0 #starts from oth row
j=1 #starts from 1st column
Aanvraag_nummer=0
k=df.columns.get_loc("All_Discipline_count") #column number where the All_Discipline_count is
l=len(df.index)#number of rows
for i in range (0,l):
Discipline_count=0 #introducing zero discipline count
Discipline_list=" " #introducing empty discipline list
for j in range (1,k): #counting columns of disciplines
if (df.iloc[i,j]==1) & (Discipline_count<df.iloc[i,k]): #if the given cell has 1 as value
Discipline_list=Discipline_list df.columns[j] #adds a column name to discipline list
Discipline_count =1 #counts the number of disciplines with 1 as value
if Discipline_count==df.iloc[i,k]:#if all disciplines are counted
Aanvraag_nummer=df.iloc[i,0]
new_row = {'Application number':Aanvraag_nummer, 'Discipline_list':Discipline_list, 'All_Discipline_count':df.iloc[i,k]}
dfA = dfA.append(new_row, ignore_index=True)
dfA.head(10)
The script works for 10 to 100 applications and 20 disciplines as columns. It also works when there are multiple disciplines are given per grant application.
However, I notice that I get warning while running the code.
/tmp/ipykernel_26718/1290491379.py:19: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
The code is also slow..Any better method to get the same results?
CodePudding user response:
You just need to use pandas.idxmax
in this case.
Be aware: This only works if there is only 1 application per row. If there are multiple it will only choose the first it finds.
df['Application Count'] = df.iloc[:,1:].idxmax(axis=1) # iloc because we only want to apply idxmax on the applications (without application number)
#or
df['Application Count'] = df.set_index('Application number').idxmax(axis=1)
In case you have multiple subjects per row here is another solution (I changed Input for that a bit, in row 0 and 8 there are 2 subjects, in row 4 we have 4 subjects.
data={'Application number':[0,1,2,3,4,5,6,7,8,9],
'Physics':[1,0,1,0,1,0,0,1,1,0],
'Chemistry':[1,0,0,0,1,0,0,0,0,1],
'Biology':[0,1,0,1,1,0,1,0,0,0],
'Mathematics':[0,0,0,0,1,1,0,0,1,0]}
df=pd.DataFrame(data)
print(df)
Application number Physics Chemistry Biology Mathematics
0 0 1 1 0 0
1 1 0 0 1 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 1 1
5 5 0 0 0 1
6 6 0 0 1 0
7 7 1 0 0 0
8 8 1 0 0 1
9 9 0 1 0 0
df = df.set_index('Application number')
out = (
df[df==1]
.stack()
.reset_index()
.drop(0, axis=1)
.rename(columns={'level_1': 'Discipline_list'})
.groupby('Application number', as_index=False)
.agg(Discipline_list=('Discipline_list', lambda x: ', '.join(x)), All_Discipline_count=('Discipline_list', 'count'))
)
print(out)
Application number Discipline_list All_Discipline_count
0 0 Physics, Chemistry 2
1 1 Biology 1
2 2 Physics 1
3 3 Biology 1
4 4 Physics, Chemistry, Biology, Mathematics 4
5 5 Mathematics 1
6 6 Biology 1
7 7 Physics 1
8 8 Physics, Mathematics 2
9 9 Chemistry 1
CodePudding user response:
you can make Discipline_list
following code:
df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)
output:
0 Chemistry
1 Biology
2 Physics
3 Biology
4 Mathematics
5 Mathematics
6 Biology
7 Physics
8 Physics
9 Chemistry
dtype: object
If there is over 1 discipline list, it is expressed like 'Biology/Phsics'
make result to Discipline_list
column
df['Discipline_list'] = df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)