I am trying to create a tally where the selected Region advances by one row each time the next ID is processed, so that I end up with exactly one result per ID.
I've tried a few methods but nothing seems to be working, and I'm running short on ideas.
Data Set
ID | Region |
---|---|
1 | North |
1 | South |
1 | East |
1 | West |
2 | North |
2 | South |
2 | East |
2 | West |
3 | North |
3 | South |
3 | East |
3 | West |
4 | North |
4 | South |
4 | East |
4 | West |
5 | Northwest |
5 | South West |
6 | Northwest |
6 | South West |
Expected Output
ID | Region |
---|---|
1 | North |
2 | South |
3 | East |
4 | West |
5 | Northwest |
6 | South West |
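For anyone who wants to reproduce this, the sample data above can be built as a pandas DataFrame (column names taken from the table):

```python
import pandas as pd

# Sample data from the question: IDs 1-4 each have four regions,
# IDs 5-6 each have two.
df = pd.DataFrame({
    'ID': [1]*4 + [2]*4 + [3]*4 + [4]*4 + [5]*2 + [6]*2,
    'Region': ['North', 'South', 'East', 'West'] * 4
              + ['Northwest', 'South West'] * 2,
})
```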
CodePudding user response:
You can factorize both columns and keep the rows for which the codes are equal:
out = df.loc[pd.factorize(df['Region'])[0] == pd.factorize(df['ID'])[0]]
Output:
ID Region
0 1 North
5 2 South
10 3 East
15 4 West
16 5 Northwest
19 6 South West
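For intuition, `pd.factorize` assigns each distinct value an integer code in order of first appearance, so the codes match exactly where the n-th ID lines up with the n-th region. A quick sanity check on a toy frame:

```python
import pandas as pd

# Toy frame: the n-th distinct ID should pair with the n-th distinct region.
df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Region': ['North', 'South', 'North', 'South'],
})

# factorize assigns 0, 1, 2, ... in order of first appearance.
id_codes = pd.factorize(df['ID'])[0]          # [0, 0, 1, 1]
region_codes = pd.factorize(df['Region'])[0]  # [0, 1, 0, 1]

# Keep rows where both codes agree: (1, 'North') and (2, 'South').
out = df.loc[id_codes == region_codes]
```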
Other idea, what about using an intermediate rectangular matrix and take its diagonal?
import numpy as np
df2 = (df.pivot(index='ID', columns='Region', values='Region')
         .reindex(index=df['ID'].unique(), columns=df['Region'].unique())
       )
out = pd.DataFrame({'ID': df2.index, 'Region': np.diag(df2)})
Output:
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
Intermediate rectangular matrix:
Region North South East West Northwest South West
ID
1 North South East West NaN NaN
2 North South East West NaN NaN
3 North South East West NaN NaN
4 North South East West NaN NaN
5 NaN NaN NaN NaN Northwest South West
6 NaN NaN NaN NaN Northwest South West
CodePudding user response:
I think the .drop_duplicates method might solve your problem: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
In particular, the "subset" parameter, as described in the documentation linked above, lets you deduplicate on the Region column only.
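A minimal sketch of that idea: `drop_duplicates(subset='Region')` keeps only the first row seen for each region. On its own it keeps the original IDs of those first rows (so for the question's data the ID column would not advance as desired), and it would still need to be combined with logic that reassigns the IDs, but it shows how the subset parameter behaves:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Region': ['North', 'South', 'North', 'South'],
})

# Keep only the first occurrence of each distinct Region value;
# note the ID column keeps whatever values those first rows had.
unique_regions = df.drop_duplicates(subset='Region')
```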
CodePudding user response:
The unique region names can be collected into a set; then for each group (grouped by ID) the next available region is extracted:
def get_next_region(r_group, regions):
    # Return the first region of this ID's group that is still unused,
    # removing it from the shared pool so the next ID skips it.
    for reg in r_group:
        if reg in regions:
            regions.remove(reg)
            break
    return reg

regions = set(df.Region.unique())  # pool of unique regions
reg_df = df.groupby('ID')['Region'].apply(get_next_region, regions=regions.copy())\
           .reset_index(name="Region")
print(reg_df)
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
CodePudding user response:
Here is a potential solution. It does result in the desired output for the example you gave, but would not necessarily generalize to any combination of ID values or combinations of Regions. It may be useful if you are looking for pandas methods that don't require setting up a loop.
import pandas as pd
import math
# Create pandas DataFrame
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6],
                   'Region': ['North', 'South', 'East', 'West',
                              'North', 'South', 'East', 'West',
                              'North', 'South', 'East', 'West',
                              'North', 'South', 'East', 'West',
                              'Northwest', 'South West', 'Northwest', 'South West']
                   })
# A method of indexing the regions per ID from 1 to n regions
df = df.reset_index().rename(columns={'index': 'Region Idx'})
offset = df['Region Idx'] * df['ID'].diff()
offset.loc[offset == 0] = math.nan
offset = offset.ffill().fillna(0)
df['Region Idx'] = df['Region Idx'] - offset + 1
# Assuming your IDs are integers in ascending order
# and you don't have any special cases with the number of regions per ID,
# this can be used for the rolling region selection per ID
df['Max Region'] = df[['ID','Region Idx']].groupby('ID').transform('max')
df['Selected Region Idx'] = (df['ID']-1) % df['Max Region'] + 1
# Final result
result = df.loc[df['Region Idx']==df['Selected Region Idx'],['ID','Region']]
CodePudding user response:
A possible solution:
unique_ids = df['ID'].unique()
res_df = pd.DataFrame(columns=['ID', 'Region'])
for i, id_ in enumerate(unique_ids):
    # NB: the hard-coded 4 assumes the i-th ID's group has a row at
    # position i % 4, which holds for this data set but is not general.
    region = df[df['ID'] == id_]['Region'].iloc[i % 4]
    res_df = pd.concat([res_df, pd.DataFrame({'ID': [id_], 'Region': [region]})],
                       ignore_index=True)
print(res_df)
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
CodePudding user response:
Another thought, using groupby.ngroup and drop_duplicates:
id_idx = df.groupby('ID').ngroup().drop_duplicates().index
region_idx = df.groupby('Region').ngroup().drop_duplicates().index
out = pd.DataFrame()
out['ID'], out['Region'] = df.loc[id_idx, 'ID'].values, df.loc[region_idx, 'Region'].values
print(out)
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West