I have a CSV file with more than 10,000,000 rows of data with below structures: I have an ID as my uniqueID per group:
Data Format
ID Type Name
1 Head abc-001
1 Senior abc-002
1 Junior abc-003
1 Junior abc-004
2 Head abc-005
2 Senior abc-006
2 Junior abc-007
3 Head abc-008
3 Junior abc-009
...
For defining parent relationship below conditions exist:
- Each group MUST has 1 Head.
- It is OPTIONAL to have ONLY 1 Senior in each group.
- Each group MUST have AT LEAST one Junior.
EXPECTED RESULT
ID Type Name Parent
1 Senior abc-002 abc-001
1 Junior abc-003 abc-002
1 Junior abc-004 abc-002
2 Senior abc-006 abc-005
2 Junior abc-007 abc-006
3 Junior abc-009 abc-008
Below code works when I have one Junior, I want to know if there is any way to define parent for more than one juniors:
order = ['Head', 'Senior', 'Junior']
key = pd.Series({x: i for i,x in enumerate(order)})
df2 = df.sort_values(by='Type', key=key.get)
df4=df.join(df2.groupby('IP')['Type'].shift().dropna().rename('Parent'),how='right')
print(df4)
CodePudding user response:
You could pivot the Type
and Name
columns then forword fill within ID
group. Then take the right-hand two non-NaN entries to get the Parent
and Name
.
Pivot and forward-fill:
dfn = pd.concat([df[['ID','Type']], df.pivot(columns='Type', values='Name')], axis=1) \
.groupby('ID').apply(lambda x: x.ffill())[['ID','Type','Head','Senior','Junior']]
print(dfn)
ID Type Head Senior Junior
0 1 Head abc-001 NaN NaN
1 1 Senior abc-001 abc-002 NaN
2 1 Junior abc-001 abc-002 abc-003
3 1 Junior abc-001 abc-002 abc-004
4 2 Head abc-005 NaN NaN
5 2 Senior abc-005 abc-006 NaN
6 2 Junior abc-005 abc-006 abc-007
7 3 Head abc-008 NaN NaN
8 3 Junior abc-008 NaN abc-009
A function to pull the last two non-NaN entries:
def get_np(x):
rc = [np.nan,np.nan]
if x.isna().sum() != 2:
if x.isna().sum() == 0:
rc = [x['Junior'],x['Senior']]
elif pd.isna(x['Junior']):
rc = [x['Senior'],x['Head']]
else:
rc = [x['Junior'],x['Head']]
return pd.concat([x[['ID','Type']], pd.Series(rc, index=['Name','Parent'])])
Apply it and drop the non-applicable rows:
dfn.apply(get_np, axis=1).dropna()
ID Type Name Parent
1 1 Senior abc-002 abc-001
2 1 Junior abc-003 abc-002
3 1 Junior abc-004 abc-002
5 2 Senior abc-006 abc-005
6 2 Junior abc-007 abc-006
8 3 Junior abc-009 abc-008