Home > Enterprise >  How to count the numbers of rows based on an id column that have shifted from one category to anothe
How to count the numbers of rows based on an id column that have shifted from one category to anothe

Time:10-19

I have a pandas dataframe like this:

         id    year_group  category
0        8300           1     low
1        8300           2     medium
2       11725           1     low
3       11725           2     low
4       18068           1     medium
...       ...         ...       ...
59354   18962           1     low
59355   11669           1     low
59356   13110           3     low
59357    2378           1     low
59358   19363           1     low

[59359 rows x 3 columns]

I'm trying to determine how many ids shifted from one category to another based on a year (year_group column). For example, for id 8300, a shift from year 1 to year 2 shows a category change from low to medium (in that order). I want to count the number of times this happens between year_groups 1 to 3, for each category.

I'm not sure how to complete this. At the moment I have considered dropping each year_group and making separate data frames before trying to group everything, like this:

# year 1 and 2 
years_1_2  = df.drop(df[df.year_group == 3].index)

However, I don't know how to make sure the grouping is based on positions of the year_groups, i.e. year 1 going to year 2 and not year 2 going to year 1.

Maybe there is something more streamlined that I can do. Perhaps making use of np.where... Any suggestions on how best to tackle this problem?

CodePudding user response:

Maybe this can get you started. It's a bit verbose, but clear, I believe. Comments in line

# added and modified data for just 2 years
data='''         id    year_group  category
0        8300           1     low
1        8300           2     medium
2       11725           1     low
3       11725           2     low
4       18068           1     medium
5       18068           2     low
6   18962           1     low
7   18962           2     low
8        21           1     low
9        21           2     medium'''
df = pd.read_csv(io.StringIO(data), sep=' \s ', engine='python')

# sort to keep ids and year_groups ascending
df.sort_values(['id', 'year_group'], ascending=[True, True], inplace=True)

      id  year_group category
8     21           1      low
9     21           2   medium
0   8300           1      low
1   8300           2   medium
2  11725           1      low
3  11725           2      low
4  18068           1   medium
5  18068           2      low
6  18962           1      low
7  18962           2      low

# if you have year 3, this will only take years 1 and 2
# if a grouping has a count of 2, that means there is no change from one year to the next, so drop everything that didn't change
dft = df[df[df['year_group'] != 3].groupby(['id', 'category'])['year_group'].transform('count') < 2]

      id  year_group category
8     21           1      low
9     21           2   medium
0   8300           1      low
1   8300           2   medium
4  18068           1   medium
5  18068           2      low

# making lists that show movement from low - medium, medium - low, etc...
yearlychanges = dft.groupby('id')['category'].apply(list).reset_index()

      id       category
0     21  [low, medium]
1   8300  [low, medium]
2  18068  [medium, low]

# convert lists to strings for counting
yearlychanges['changes'] = yearlychanges.apply(lambda x: '-'.join(x['category']), axis=1)

      id       category     changes
0     21  [low, medium]  low-medium
1   8300  [low, medium]  low-medium
2  18068  [medium, low]  medium-low

# count number of changes
yearlychanges[['changes', 'id']].groupby('changes').count()

            id
changes
low-medium   2
medium-low   1

CodePudding user response:

If I understood you correctly:

Setup

df = pd.DataFrame(data={{'id': [8300, 8300, 8300, 8301, 8301, 8301], 'year_group': [1, 2, 3, 1, 2, 3], 'category': ['low', 'medium', 'low', 'low', 'medium', 'low']}})

Code

df['shift'] = df.groupby('id')['category'].apply(lambda x: x   ' '   x.shift(-1))
shifts = df.dropna(subset=['shift']).groupby(['shift']).size()

Input

         id  year_group category
0      8300           1      low
1      8300           2   medium
2      8300           3      low
59355  8301           1      low
59356  8301           2   medium
59357  8301           3      low

Output:

shift
low medium    2
medium low    2
  • Related