How to create a ranking variable/function for different periods in a panel data?-CodePudding

I have a dataset, df, that looks like this:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA
2020-01	11001	Los Angeles	CA	5000000	100000	0.02
2020-02	11001	Los Angeles	CA	5000000	125000	0.025
2020-03	11001	Los Angeles	CA	5000000	135000	0.027
2020-01	12002	Houston	TX	3000000	150000	0.05
2020-02	12002	Houston	TX	3000000	100000	0.033
2020-03	12002	Houston	TX	3000000	200000	0.066
...	...	...	...	...	...	...
2021-07	11001	Los Angeles	CA	5500499	340000	0.062
2021-07	12002	Houston	TX	3250012	211000	0.065

WhereQTDPERCAPITA is simply Quantity/Population. I have multiple cities (4149 to be more precise).

The quantities change according to every month, and so does the population.

I would like to create a new variable that serve as a ranking, ranging from [0,1], where 0 is the city with the lowest QTDPERCAPITA in that month, and 1 is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA	RANKING
2020-01	11001	Los Angeles	CA	5000000	100000	0.02	0
2020-02	11001	Los Angeles	CA	5000000	125000	0.025	0
2020-03	11001	Los Angeles	CA	5000000	135000	0.027	0
2020-01	12002	Houston	TX	3000000	150000	0.05	1
2020-02	12002	Houston	TX	3000000	100000	0.033	1
2020-03	12002	Houston	TX	3000000	200000	0.066	1
...	...	...	...	...	...	...	...
2021-07	11001	Los Angeles	CA	5500499	340000	0.062	0
2021-07	12002	Houston	TX	3250012	211000	0.065	1

How can I create this column such that the RANKING changes every month? I was thinking of a for loop that extracts the QTDPERCAPITA for every city on every unique date, and creates a new column, df['RANKING'] with the same date and city.

CodePudding user response：

You can use:

# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)

# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)

# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')

Output:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA	RANKING
2020-01	11001	Los Angeles	CA	5000000	100000	0.02	0
2020-02	11001	Los Angeles	CA	5000000	125000	0.025	0
2020-03	11001	Los Angeles	CA	5000000	135000	0.027	0
2020-01	12002	Houston	TX	3000000	150000	0.05	1
2020-02	12002	Houston	TX	3000000	100000	0.033	1
2020-03	12002	Houston	TX	3000000	200000	0.066	1
2021-07	11001	Los Angeles	CA	5500499	340000	0.618	1
2021-07	12002	Houston	TX	3250012	211000	0.065	0

CodePudding user response：

Try this:

# get unique values for each month
months = list(set(df['Date'].values))

# initialze a new column the length of our dataframe
rank_col = [None] * 9

for month in months:
  # get all the QTDPERCAPITA values for current month
  month_qtdpc = df[df['Date'] == month]['QTDPERCAPITA']

  # normalize such that best city has 1 and worst has 0 for this month
  max = month_qtdpc.max()
  min = month_qtdpc.min()
  rankings = (month_qtdpc - min) / (max - min)

  # insert each rank value at the proper index in our list 
  for i, rank in rankings.iteritems(): 
    rank_col[i] = rank

# add it as a column to the dataframe
df['RANKING'] = rank_col

Note: This will give you the score, not just the rank. For example, with three cities the output would be (fake data):

index	Date	City	QTDPERCAPITA	RANKING
0	2020-01	Los Angeles	0.02	0.0
1	2020-02	Los Angeles	0.025	0.33
2	2020-03	Los Angeles	0.027	0.0
3	2020-01	Houston	0.05	1.0
4	2020-02	Houston	0.033	1.0
5	2020-03	Houston	0.066	1.0
6	2020-01	Denver	0.03	0.33
7	2020-02	Denver	0.021	0.0
8	2020-03	Denver	0.056	0.74

With Corralien's answer, it will give you true normalized rankings:

index	Date	City	QTDPERCAPITA	RANKING
0	2020-01	Los Angeles	0.02	0.0
1	2020-02	Los Angeles	0.025	0.5
2	2020-03	Los Angeles	0.027	0.0
3	2020-01	Houston	0.05	1.0
4	2020-02	Houston	0.033	1.0
5	2020-03	Houston	0.066	1.0
6	2020-01	Denver	0.03	0.5
7	2020-02	Denver	0.021	0.0
8	2020-03	Denver	0.056	0.5