I have the dataframe like below:
import pandas as pd
df = pd.DataFrame.from_dict(
{
'Name': ['Jenny', 'Matt', 'Jenny', 'Jenny','Jenny', 'Matt'],
'Year': [2020, 2021, 2022, 2020, 2021,2030],
'Income': [10000, 11000, 9000, 12000, 13000,5000],
'Gender': ['F', 'M', 'F', 'F','F', 'M']
}
)
print(df)
Output of this DataFrame looks like below
Name Year Income Gender
0 Jenny 2020 10000 F
1 Matt 2021 11000 M
2 Jenny 2022 9000 F
3 Jenny 2020 12000 F
4 Jenny 2021 13000 F
5 Matt 2030 5000 M
Now i want to use Gender column it should start from Female (F) and ends with Male (M) that finally should be single row, if two or three sequential female(F)/Male(M) rows occurs then all the corresponding Year and Income row values need to be added and form a single row for Female/male.
Expecting output as below:
Year Income Gender
2020 10000 F
2021 11000 M
6063 34000 F
2030 5000 M
I have thousands of columns so i am using iterator Below is the code i used but i am getting results only for two sequential gender and even on that i am having issues.
i=0
j=0
column_lst = list(df)
column_lst = [e for e in column_lst if e not in ('Name', 'Gender')]
column_lst
df_lst1 = []
for index, row in df.iterrows():
if(row['Gender']=='F'):
j=0
i=i 1
for lst in column_lst:
print(lst)
print(i)
if(i>1):
a = int(row[lst])
print('a',a)
c=a b
print('c',c)
f=row['Gender'] '_' str(c)
del df_lst1[-1]
else:
f=row['Gender'] '_' str(row[lst])
b=int(row[lst])
print('b',b)
print("df_lst1111",df_lst1)
df_lst1.append(f)
elif (row['Gender']=='M'):
i=0
j=j 1
for lst in column_lst:
if(j>1):
a1=int(row[lst])
print('a1',a1)
c1=a1 b1
print('c1',c1)
f1=row['Gender'] '_' str(c1)
del df_lst1[-1]
else:
f1=row['Gender'] '_' str(row[lst])
b1=int(row[lst])
print('b1',b1)
print("df_lst1",df_lst1)
df_lst1.append(f1)
I am getting output as below:
['F_2020',
'F_10000',
'M_2021',
'M_11000',
'F_2022',
'F_22000',
'M_2030',
'M_5000']
it would be great if someone suggest how to implement
Other Ex:
df = pd.DataFrame.from_dict(
{
'Name': ['Jenny', 'Matt', 'Jenny', 'Jenny', 'Matt', 'Jenny', 'Jenny', 'Matt', 'Matt'],
'Year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
'Income': [10000, 11000, 9000, 12000, 13000, 11000, 14000, 15000, 13000],
'Gender': ['F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M']
}
)
Expected result would be:
Year Income Gender
2020 10000 F
2021 11000 M
4042 21000 F
2021 13000 M
4042 25000 F
4042 28000 M
CodePudding user response:
Hope I understood your question correct:
To get your desire output I divide code into 3 steps
Code:
df = df.loc[:, df.columns != 'Name']
#Step 1 - Get M/F first records and create separate dataframe
f1 = df[df['Gender']=='F'].head(1)
m1 = df[df['Gender']=='M'].head(1)
mf = pd.concat([f1,m1])
#Step 2 - group by sum reset of M/F records and create separate dataframe
data= pd.concat([df,mf]).drop_duplicates(keep=False).groupby('Gender').agg(['sum']).stack().reset_index()
#Step 3 - lastly, merge the first records df and rest records df
pd.concat([data,mf]).reset_index(drop=True)
Output:
Gender level_1 Year Income
0 F sum 6063 34000
1 M sum 2030 5000
2 F NaN 2020 10000
3 M NaN 2021 11000
CodePudding user response:
With pandas you should only iterate as a last resort.
Something like this should get you going:
df.groupby((df.Gender!=df.Gender.shift()).cumsum()).agg({'Year': sum, 'Income': sum, 'Gender': 'first'}).reset_index(drop=True).sort_values('Gender')
Year Income Gender
0 2020 10000 F
2 6063 34000 F
1 2021 11000 M
3 2030 5000 M
Let's take that apart:
df.groupby((df.Gender!=df.Gender.shift()))
will create groups of rows, where Gender is the same for consecutive rows. I.e. the groups will be [0], [1], [2,3,4], [5]
.cumsum()
will return a cumulative sum over each of the groups.
With df.groupby((df.Gender!=df.Gender.shift()).cumsum()).agg({'Year': sum, 'Income': sum, 'Gender': 'first'})
we tell pandas how we want to calculate the cumulative sum for each column.
In this case, sum Year (whatever sense that makes) and Income, and just take the first value of Gender from each group (as we have grouped by Gender the first value will be the same as all other values for this group.
Finally reset_index(drop=True).sort_values('Gender')
will remove the artificial index column, which groupby
has created and sort the value by Gender, so that F
comes before M