I have a CSV file with more than 10,000,000 rows of data with below structures:
I have a TeamID as my unique code, in each TeamID 2 members(Kevin,Smith) or 3 members (Kevin,John,Smith) can exist.
Data Format
TeamID Members MemberType
Team1 Kevin Gold
Team1 John Silver
Team1 Smith Bronze
Team2 Kevin Gold
Team2 Smith Bronze
Team3 Kevin Gold
Team3 Smith Bronze
...
As you see above in some Teams John exists and in some other rows doesn't exist, but Kevin and Smith are always available in all teams. For determining relationship, Kevin is always Parent of John and John is always Parent of Smith but If John doesn't exist Kevin would be the parent of Smith:
TeamID Members MemberType Parent
Team1 John Silver Kevin
Team1 Smith Bronze John
Team2 Smith Bronze Kevin
Team3 Smith Bronze Kevin
...
With this code below Kevin is always Parent and I don't consider John, I want to know if there is any way to consider this with above condition:
df["Parent"] = np.where(df["Members"].isin(['Kevin','John','Smith']), "Kevin", "-")
CodePudding user response:
You can sort by MemberType, shift per group, and join:
order = ['Gold', 'Silver', 'Bronze']
key = pd.Series({x: i for i,x in enumerate(order)})
df2 = df.sort_values(by='MemberType', key=key.get)
df.join(df2.groupby('TeamID')['Members'].shift().dropna().rename('Parent'),
how='right')
output:
TeamID Members MemberType Parent
1 Team1 John Silver Kevin
2 Team1 Smith Bronze John
4 Team2 Smith Bronze Kevin
6 Team3 Smith Bronze Kevin