I have a dataset, df
, that looks like this:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA |
---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 |
... | ... | ... | ... | ... | ... | ... |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 |
WhereQTDPERCAPITA
is simply Quantity/Population
. I have multiple cities (4149 to be more precise).
The quantities change according to every month, and so does the population.
I would like to create a new variable that serve as a ranking, ranging from [0,1]
, where 0
is the city with the lowest QTDPERCAPITA
in that month, and 1
is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
---|---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 | 0 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 1 |
How can I create this column such that the RANKING
changes every month? I was thinking of a for
loop that extracts the QTDPERCAPITA
for every city on every unique date, and creates a new column, df['RANKING']
with the same date
and city
.
CodePudding user response:
You can use:
# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)
# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)
# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')
Output:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
---|---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.618 | 1 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 0 |
CodePudding user response:
Try this:
# get unique values for each month
months = list(set(df['Date'].values))
# initialze a new column the length of our dataframe
rank_col = [None] * 9
for month in months:
# get all the QTDPERCAPITA values for current month
month_qtdpc = df[df['Date'] == month]['QTDPERCAPITA']
# normalize such that best city has 1 and worst has 0 for this month
max = month_qtdpc.max()
min = month_qtdpc.min()
rankings = (month_qtdpc - min) / (max - min)
# insert each rank value at the proper index in our list
for i, rank in rankings.iteritems():
rank_col[i] = rank
# add it as a column to the dataframe
df['RANKING'] = rank_col
Note: This will give you the score, not just the rank. For example, with three cities the output would be (fake data):
index | Date | City | QTDPERCAPITA | RANKING |
---|---|---|---|---|
0 | 2020-01 | Los Angeles | 0.02 | 0.0 |
1 | 2020-02 | Los Angeles | 0.025 | 0.33 |
2 | 2020-03 | Los Angeles | 0.027 | 0.0 |
3 | 2020-01 | Houston | 0.05 | 1.0 |
4 | 2020-02 | Houston | 0.033 | 1.0 |
5 | 2020-03 | Houston | 0.066 | 1.0 |
6 | 2020-01 | Denver | 0.03 | 0.33 |
7 | 2020-02 | Denver | 0.021 | 0.0 |
8 | 2020-03 | Denver | 0.056 | 0.74 |
With Corralien's answer, it will give you true normalized rankings:
index | Date | City | QTDPERCAPITA | RANKING |
---|---|---|---|---|
0 | 2020-01 | Los Angeles | 0.02 | 0.0 |
1 | 2020-02 | Los Angeles | 0.025 | 0.5 |
2 | 2020-03 | Los Angeles | 0.027 | 0.0 |
3 | 2020-01 | Houston | 0.05 | 1.0 |
4 | 2020-02 | Houston | 0.033 | 1.0 |
5 | 2020-03 | Houston | 0.066 | 1.0 |
6 | 2020-01 | Denver | 0.03 | 0.5 |
7 | 2020-02 | Denver | 0.021 | 0.0 |
8 | 2020-03 | Denver | 0.056 | 0.5 |