Python: Data set that has repeating IDs with different column values. How do I select only one Row f-CodePudding

I am trying to create a tally that alternates the Region by one row when the next ID is run. I am expecting one result per each ID.

Tried a few methods but nothing seems to be working and running short on ideas.

Data Set

ID	Region
1	North
1	South
1	East
1	West
2	North
2	South
2	East
2	West
3	North
3	South
3	East
3	West
4	North
4	South
4	East
4	West
5	Northwest
5	South West
6	Northwest
6	South West

Expected Output

ID	Region
1	North
2	South
3	East
4	West
5	Northwest
6	South West

CodePudding user response：

You can factorize both columns and keep the rows for which the ranks are equal:

out = df.loc[pd.factorize(df['Region'])[0] == pd.factorize(df['ID'])[0]]

Output:

    ID      Region
0    1       North
5    2       South
10   3        East
15   4        West
16   5   Northwest
19   6  South West

Other idea, what about using an intermediate rectangular matrix and take its diagonal?

import numpy as np

df2 = (df.pivot(index='ID', columns='Region', values='Region')
         .reindex(index=df['ID'].unique(), columns=df['Region'].unique())
      )

out = pd.DataFrame({'ID': df2.index, 'Region': np.diag(df2)})

Output:

   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West

Intermediate rectangular matrix:

Region  North  South  East  West  Northwest  South West
ID                                                     
1       North  South  East  West        NaN         NaN
2       North  South  East  West        NaN         NaN
3       North  South  East  West        NaN         NaN
4       North  South  East  West        NaN         NaN
5         NaN    NaN   NaN   NaN  Northwest  South West
6         NaN    NaN   NaN   NaN  Northwest  South West

CodePudding user response：

I think the method ".drop_duplicates" might solve your problem. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

I think you should use the "subset" parameter, as indicated in documentation of the above link.

CodePudding user response：

The unique region names can be collected into a set, then for each group (grouped by ID) the next available region is extracted:

def get_next_region(r_group, regions):
    for reg in r_group:
        if reg in regions:
            regions.remove(reg)
            break
    return reg

regions = set(df.Region.unique())  # unique regions
reg_df = df.groupby('ID')['Region'].apply(get_next_region, regions=regions.copy())\
    .reset_index(name="Region")

print(reg_df)

   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West

CodePudding user response：

Here is a potential solution. It does result in the desired output for the example you gave, but would not necessarily generalize to any combination of ID values or combinations of Regions. It may be useful if you are looking for pandas methods that don't require setting up a loop.

import pandas as pd
import math
# Create pandas DataFrame
df = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6],
                   'Region':['North','South','East','West',
                            'North','South','East','West',
                            'North','South','East','West',
                            'North','South','East','West',
                            'Northwest','South West','Northwest','South West'
                            ]
                  })

# A method of indexing the regions per ID from 1 to n regions
df = df.reset_index().rename(columns={'index':'Region Idx'})
offset = df['Region Idx']*df['ID'].diff()
offset.loc[offset==0] = math.nan
offset = offset.ffill().fillna(0)
df['Region Idx'] = df['Region Idx'] - offset   1

# Assuming your IDs are integers in ascending order 
# and you don't have any special cases with the number of regions per ID,
# this can be used for the rolling region selection per ID
df['Max Region'] = df[['ID','Region Idx']].groupby('ID').transform('max')
df['Selected Region Idx'] = (df['ID']-1) % df['Max Region']   1

# Final result
result = df.loc[df['Region Idx']==df['Selected Region Idx'],['ID','Region']]

CodePudding user response：

A possible solution

unique_ids = df['ID'].unique()
res_df = pd.DataFrame(columns=['ID', 'Region'])

for i, id in enumerate(unique_ids):
    region = df[df['ID'] == id]['Region'].iloc[i % 4]
    res_df = pd.concat([res_df, pd.DataFrame({'ID': [id], 'Region': [region]})], 
                                                ignore_index=True)
print(res_df)

  ID      Region
0  1       North
1  2       South
2  3        East
3  4        West
4  5   Northwest
5  6  South West

CodePudding user response：

Another thought using groupby.ngroup and drop_duplicates:

id_idx = df.groupby('ID').ngroup().drop_duplicates().index
region_idx = df.groupby('Region').ngroup().drop_duplicates().index
out = pd.DataFrame()
out['ID'], out['Region'] = df.loc[id_idx, 'ID'].values, df.loc[region_idx, 'Region'].values

print(out)

   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West