Home > other >  how to form the dataframe in turn wise using Python
how to form the dataframe in turn wise using Python

Time:09-29

I have the dataframe like below:

import pandas as pd
df = pd.DataFrame.from_dict(
    {
        'Name': ['Jenny', 'Matt', 'Jenny', 'Jenny','Jenny', 'Matt'],
        'Year': [2020, 2021, 2022, 2020, 2021,2030],
        'Income': [10000, 11000, 9000, 12000, 13000,5000],
        'Gender': ['F', 'M', 'F', 'F','F', 'M']
    }
)
print(df)

Output of this DataFrame looks like below

Name  Year  Income Gender
0  Jenny  2020   10000      F
1   Matt  2021   11000      M
2  Jenny  2022    9000      F
3  Jenny  2020   12000      F
4  Jenny  2021   13000      F
5   Matt  2030    5000      M

Now i want to use Gender column it should start from Female (F) and ends with Male (M) that finally should be single row, if two or three sequential female(F)/Male(M) rows occurs then all the corresponding Year and Income row values need to be added and form a single row for Female/male.

Expecting output as below:
  Year  Income Gender 
  2020   10000     F 
  2021   11000     M  
  6063   34000     F 
  2030    5000     M

I have thousands of columns so i am using iterator Below is the code i used but i am getting results only for two sequential gender and even on that i am having issues.

i=0
j=0
column_lst = list(df)
column_lst = [e for e in column_lst if e not in ('Name', 'Gender')]
column_lst
df_lst1 = []
for index, row in df.iterrows():
    if(row['Gender']=='F'):
        j=0
        i=i 1
        for lst in column_lst:
            print(lst)
            print(i)
            if(i>1):
                a = int(row[lst])
                print('a',a)
                c=a b
                print('c',c)
                f=row['Gender'] '_' str(c)
                del df_lst1[-1]
            else:
                f=row['Gender'] '_' str(row[lst])
                b=int(row[lst])
                print('b',b)
            print("df_lst1111",df_lst1)
            df_lst1.append(f)
    elif (row['Gender']=='M'):
        i=0
        j=j 1
        for lst in column_lst:
            if(j>1):
                a1=int(row[lst])
                print('a1',a1)
                c1=a1 b1
                print('c1',c1)
                f1=row['Gender'] '_' str(c1)
                del df_lst1[-1]
            else:
                f1=row['Gender'] '_' str(row[lst])
                b1=int(row[lst])
                print('b1',b1)
            print("df_lst1",df_lst1)
            df_lst1.append(f1)

I am getting output as below:

['F_2020',
 'F_10000',
 'M_2021',
 'M_11000',
 'F_2022',
 'F_22000',
 'M_2030',
 'M_5000']

it would be great if someone suggest how to implement

Other Ex:

df = pd.DataFrame.from_dict(
    {
        'Name': ['Jenny', 'Matt', 'Jenny', 'Jenny', 'Matt', 'Jenny', 'Jenny', 'Matt', 'Matt'],
        'Year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
        'Income': [10000, 11000, 9000, 12000, 13000, 11000, 14000, 15000, 13000],
        'Gender': ['F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M']
    }
)

Expected result would be:

Year Income Gender 
2020 10000 F
2021 11000 M
4042 21000 F
2021 13000 M
4042 25000 F
4042 28000 M

CodePudding user response:

Hope I understood your question correct:

To get your desire output I divide code into 3 steps

Code:

df = df.loc[:, df.columns != 'Name']
 
#Step 1 - Get M/F first records and create separate dataframe
f1 = df[df['Gender']=='F'].head(1)
m1 = df[df['Gender']=='M'].head(1)
mf = pd.concat([f1,m1])

#Step 2 - group by sum reset of M/F records and create separate dataframe
data= pd.concat([df,mf]).drop_duplicates(keep=False).groupby('Gender').agg(['sum']).stack().reset_index()

#Step 3 - lastly, merge the first records df and rest records df 
pd.concat([data,mf]).reset_index(drop=True)

Output:

   Gender   level_1 Year    Income
0   F       sum    6063    34000
1   M       sum    2030    5000
2   F       NaN    2020    10000
3   M       NaN    2021    11000

CodePudding user response:

With pandas you should only iterate as a last resort.

Something like this should get you going:

df.groupby((df.Gender!=df.Gender.shift()).cumsum()).agg({'Year': sum, 'Income': sum, 'Gender': 'first'}).reset_index(drop=True).sort_values('Gender')


   Year  Income Gender
0  2020   10000      F
2  6063   34000      F
1  2021   11000      M
3  2030    5000      M

Let's take that apart:

df.groupby((df.Gender!=df.Gender.shift()))

will create groups of rows, where Gender is the same for consecutive rows. I.e. the groups will be [0], [1], [2,3,4], [5]

.cumsum() will return a cumulative sum over each of the groups.

With df.groupby((df.Gender!=df.Gender.shift()).cumsum()).agg({'Year': sum, 'Income': sum, 'Gender': 'first'}) we tell pandas how we want to calculate the cumulative sum for each column. In this case, sum Year (whatever sense that makes) and Income, and just take the first value of Gender from each group (as we have grouped by Gender the first value will be the same as all other values for this group.

Finally reset_index(drop=True).sort_values('Gender') will remove the artificial index column, which groupby has created and sort the value by Gender, so that F comes before M

  • Related