Home > Software design >  How to create a ranking variable/function for different periods in a panel data?
How to create a ranking variable/function for different periods in a panel data?

Time:05-27

I have a dataset, df, that looks like this:

Date Code City State Population Quantity QTDPERCAPITA
2020-01 11001 Los Angeles CA 5000000 100000 0.02
2020-02 11001 Los Angeles CA 5000000 125000 0.025
2020-03 11001 Los Angeles CA 5000000 135000 0.027
2020-01 12002 Houston TX 3000000 150000 0.05
2020-02 12002 Houston TX 3000000 100000 0.033
2020-03 12002 Houston TX 3000000 200000 0.066
... ... ... ... ... ... ...
2021-07 11001 Los Angeles CA 5500499 340000 0.062
2021-07 12002 Houston TX 3250012 211000 0.065

WhereQTDPERCAPITA is simply Quantity/Population. I have multiple cities (4149 to be more precise).

The quantities change according to every month, and so does the population.

I would like to create a new variable that serve as a ranking, ranging from [0,1], where 0 is the city with the lowest QTDPERCAPITA in that month, and 1 is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
... ... ... ... ... ... ... ...
2021-07 11001 Los Angeles CA 5500499 340000 0.062 0
2021-07 12002 Houston TX 3250012 211000 0.065 1

How can I create this column such that the RANKING changes every month? I was thinking of a for loop that extracts the QTDPERCAPITA for every city on every unique date, and creates a new column, df['RANKING'] with the same date and city.

CodePudding user response:

You can use:

# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)

# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)

# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')

Output:

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
2021-07 11001 Los Angeles CA 5500499 340000 0.618 1
2021-07 12002 Houston TX 3250012 211000 0.065 0

CodePudding user response:

Try this:

# get unique values for each month
months = list(set(df['Date'].values))

# initialze a new column the length of our dataframe
rank_col = [None] * 9

for month in months:
  # get all the QTDPERCAPITA values for current month
  month_qtdpc = df[df['Date'] == month]['QTDPERCAPITA']

  # normalize such that best city has 1 and worst has 0 for this month
  max = month_qtdpc.max()
  min = month_qtdpc.min()
  rankings = (month_qtdpc - min) / (max - min)

  # insert each rank value at the proper index in our list 
  for i, rank in rankings.iteritems(): 
    rank_col[i] = rank

# add it as a column to the dataframe
df['RANKING'] = rank_col

Note: This will give you the score, not just the rank. For example, with three cities the output would be (fake data):

index Date City QTDPERCAPITA RANKING
0 2020-01 Los Angeles 0.02 0.0
1 2020-02 Los Angeles 0.025 0.33
2 2020-03 Los Angeles 0.027 0.0
3 2020-01 Houston 0.05 1.0
4 2020-02 Houston 0.033 1.0
5 2020-03 Houston 0.066 1.0
6 2020-01 Denver 0.03 0.33
7 2020-02 Denver 0.021 0.0
8 2020-03 Denver 0.056 0.74

With Corralien's answer, it will give you true normalized rankings:

index Date City QTDPERCAPITA RANKING
0 2020-01 Los Angeles 0.02 0.0
1 2020-02 Los Angeles 0.025 0.5
2 2020-03 Los Angeles 0.027 0.0
3 2020-01 Houston 0.05 1.0
4 2020-02 Houston 0.033 1.0
5 2020-03 Houston 0.066 1.0
6 2020-01 Denver 0.03 0.5
7 2020-02 Denver 0.021 0.0
8 2020-03 Denver 0.056 0.5
  • Related