I have the following table
Father | Son | Year |
---|---|---|
James | Harry | 1999 |
James | Alfi | 2001 |
Corey | Kyle | 2003 |
I would like to add a fourth column that makes the table look like below. It's supposed to show which child of each father was born first, second, third, and so on. How can I do that?
Father | Son | Year | Child |
---|---|---|---|
James | Harry | 1999 | 1 |
James | Alfi | 2001 | 2 |
Corey | Kyle | 2003 | 1 |
CodePudding user response:
here is one way to do it. using cumcount
# groupby Father and take a cumcount, offsetted by 1
df['Child']=df.groupby(['Father'])['Son'].cumcount() 1
df
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
it assumes that DF is sorted by Father and Year. if not, then
df['Child']=df.sort_values(['Father','Year']).groupby(['Father'] )['Son'].cumcount() 1
df
CodePudding user response:
Here is an idea of solving this using groupby
and cumsum
functions.
This assumes that the rows are ordered so that the younger sibling is always below their elder brother and all children of the same father are in a continuous pack of rows.
Assume we have the following setup
import pandas as pd
df = pd.DataFrame({'Father': ['James', 'James', 'Corey'],
'Son': ['Harry', 'Alfi', 'Kyle'],
'Year': [1999, 2001, 2003]})
then here is the trick we group the siblings with the same father into a groupby object and then compute the cumulative sum of ones to assign a sequential number to each row.
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
df.drop(columns='temp_column')
The result would look like this
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
Now to make the solution more general consider reordering the rows to satisfy the preconditions before applying the solution and then if necessary restore the dataframe to the original order.