Home > OS >  Using Pandas - Multiple Columns using in data for some analysis
Using Pandas - Multiple Columns using in data for some analysis

Time:04-09

In my data set, I have students data with some attempts - First,Second,Third,Four and Five, I have to find out each students for how many they taken attempt for taking >=50 marks, If one person complete in first attempt, then my output showing - "Column title" and sum of marks.

Here is my dummy data set

  Person  First  Second  Third  Four  Five
0      A     21      21     15    14     0
1      B     12      15     19     5    19
2      C     50       0     18     7     0
3      D     50       6      0    15    35
4      E      0      25     25    10    24

Output:

enter image description here

I am trying to find out but I am stuck in getting the correct solution.

Here is my work

import pandas as pd
import numpy as np
df
df['Total'] = df.sum(axis=1)


df['Output'] = df.apply(lambda row: (row['First'] row['Second'] row['Third'])  row['Four'] row['Five'] if row['A'] > 50)

Output = np.array(df[['First','Second','Third','Four','Five']].values.tolist())
df[['First','Second','Third','Four','Five']] = np.where(Output > 50, 50, Output).tolist()
df.groupby('Person')[['First','Second','Third','Four','Five']>50].sum()

df['Output'] = np.where(df.sum(axis=1)) >50

Please help.

Output

CodePudding user response:

Try this:

cmax = df.select_dtypes(include="number").cumsum(axis=1)
df['Output'] = cmax.ge(50).idxmax(axis=1)
df['Sum'] = cmax.stack().loc[zip(df.index, df['Output'])].to_numpy()
df

Output:

  Person  First  Second  Third  Four  Five Output  Sum
0      A     21      21     15    14     0  Third   57
1      B     12      15     19     5    19   Four   51
2      C     50       0     18     7     0  First   50
3      D     50       6      0    15    35  First   50
4      E      0      25     25    10    24  Third   50

Details:

Let's use select_dtypes to only get those attempts columns with numbers, then do a cumulative sum of the rows with axis=1.

Then, use that dataframe and find all values equal or greater than 50 to create a boolean dataframe. Using idxmax along the row, we can find the first True value, ie the first time the cumulative sum reaches 50. And, idxmax will return that column headers.

Lastly, we can use stack the cumulative sum dataframe and use loc to get values from cumulative sum that matches, the index with the output column.

CodePudding user response:

Try this:

df = pd.DataFrame([['A',21,21,15,14,0],['B',12,15,19,5,19],['C',50,0,18,7,0],['D',50,6,0,15,35],['E',0,25,25,10,14]])
df.columns= 'Person,First,Second,Third,Four,Five'.split(',')
data = []
for row in df.iloc:
    total = 0
    index = 1
    for value in row.values[1:]:
        total  = value
        if total >= 50:
            break
        index  = 1
    data.append({'Output': row.keys()[index], 'Sum': total})

df = pd.concat([df.iloc[:, 0],pd.DataFrame(data)],axis=1)

Output:

  Person Output  Sum
0      A  Third   57
1      B   Four   51
2      C  First   50
3      D  First   50
4      E  Third   50


In [2]: import pandas as pd

In [3]: df = pd.DataFrame([['A',21,21,15,14,0],['B',12,15,19,5,19],['C',50,0,18,7,0],['D',50,6,0,15,35],['E',0,25,25,10,14]])^M
   ...: df.columns= 'Person,First,Second,Third,Four,Five'.split(',')^M
   ...: data = []^M
   ...: for row in df.iloc:^M
   ...:     total = 0^M
   ...:     index = 1^M
   ...:     for value in row.values[1:]:^M
   ...:         total  = value^M
   ...:         if total >= 50:^M
   ...:             break^M
   ...:         index  = 1^M
   ...:     data.append({'Output': row.keys()[index], 'Sum': total})^M
   ...: ^M
   ...: df = pd.concat([df.iloc[:, 0],pd.DataFrame(data)],axis=1)

In [4]: df
Out[4]:
  Person Output  Sum
0      A  Third   57
1      B   Four   51
2      C  First   50
3      D  First   50
4      E  Third   50

CodePudding user response:

I think you can do something like this:

columns = ['First','Second','Third','Four','Five']

THRESHOLD = 50

def f(x):
    sum= 0
    x['Output'] = ''
    x['Sum']=0
    for column in columns:
        sum =x[column]
        if (sum>=THRESHOLD):
            x['Output'] = column
            x['Sum'] = sum
            break
    return x

df.apply(f, axis = 1)
  • Related