Assign brackets to values in pandas dataframe

Time:07-31

I have a dataframe that looks like this:

import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in type
data = {'industryId': {0: '1730B' , 1: '1730B', 2: '1730B', 3: '1730B', 4: '3524A', 5: '3524A', 6: '3524A', 7: '3524A'},
 'year': {0: 2017, 1: 2018, 2: 2019, 3: 2020, 4: 2017, 5: 2018, 6: 2019, 7: 2020},
 'value': {0: 500, 1: 512, 2: 370, 3: 490, 4: 600, 5: 610, 6: 630, 7: 290}}

df = pd.DataFrame(data)

    industryId  year    value
0   1730B       2017    500
1   1730B       2018    512
2   1730B       2019    370
3   1730B       2020    490
4   3524A       2017    600
5   3524A       2018    610
6   3524A       2019    630
7   3524A       2020    290

I want to use the quantile function to assign a bracket to each value within each year, by comparing the values of the different industryIds against each other.

The output df should look like this:

    industryId  year    value  bracket
0   1730B       2017    500      0
1   3524A       2017    600      1
2   1730B       2018    512      0
3   3524A       2018    610      1
4   1730B       2019    370      0
5   3524A       2019    630      1
6   3524A       2020    290      0
7   1730B       2020    490      1

My code looks like this:

df_cutoffvals = df.groupby(['year','industryId'])['value'] \
                         .quantile(q=[i for i in range(0,2)]) \
                         .reset_index()

However, it duplicates all the values and I don't know how to correct this. Here is the output of my code:

    year    industryId  level_2 value
0   2017    1730B          0    500.00
1   2017    1730B          1    500.00
2   2017    3524A          0    600.00
3   2017    3524A          1    600.00
4   2018    1730B          0    512.00
5   2018    1730B          1    512.00
6   2018    3524A          0    610.00
7   2018    3524A          1    610.00
8   2019    1730B          0    370.00
9   2019    1730B          1    370.00
10  2019    3524A          0    630.00
11  2019    3524A          1    630.00
12  2020    1730B          0    490.00
13  2020    1730B          1    490.00
14  2020    3524A          0    290.00
15  2020    3524A          1    290.00

Does anyone have a suggestion for how to get from this to my desired output?

CodePudding user response:

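Your code groups by both 'year' and 'industryId', so every group contains exactly one row. Any quantile of a single value is that value itself, which is why each value shows up once for q=0 and once for q=1. To compare the industryIds against each other within a year, group by 'year' only, then merge the cutoff values back onto the original rows.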

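import pandas as pd

# Sample data from the question (named `data` to avoid shadowing the built-in `dict`)
data = {'industryId': {0: '1730B' , 1: '1730B', 2: '1730B', 3: '1730B', 4: '3524A', 5: '3524A', 6: '3524A', 7: '3524A'},
 'year': {0: 2017, 1: 2018, 2: 2019, 3: 2020, 4: 2017, 5: 2018, 6: 2019, 7: 2020},
 'value': {0: 500, 1: 512, 2: 370, 3: 490, 4: 600, 5: 610, 6: 630, 7: 290}}
df = pd.DataFrame(data)

# Group by year only, so each group holds one value per industryId.
# q=[0, 1] returns each year's minimum (level 0) and maximum (level 1).
df_cutoffvals = df.groupby(['year'])[['value']].quantile(q=[0, 1]).reset_index()
# Join the cutoffs back to the original rows; 'level_1' holds the quantile level.
df_final = pd.merge(df_cutoffvals, df, how='inner', on=['year', 'value'])
print(df_final)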

Output

   year  level_1  value industryId
0  2017      0.0  500.0      1730B
1  2017      1.0  600.0      3524A
2  2018      0.0  512.0      1730B
3  2018      1.0  610.0      3524A
4  2019      0.0  370.0      1730B
5  2019      1.0  630.0      3524A
6  2020      0.0  290.0      3524A
7  2020      1.0  490.0      1730B
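
The merge above matches rows only when a value sits exactly on a cutoff, which works here because each year has just two values. As an alternative that is not part of the original answer, the bracket can be assigned directly with pd.qcut inside a per-year transform. This is a minimal sketch; the names `data`, `n_brackets` and `result` are chosen for illustration, and it assumes the values within a year are distinct (with ties, pd.qcut may raise an error about non-unique bin edges unless duplicates='drop' is passed).

```python
import pandas as pd

# Same sample data as in the question, written as plain lists
data = {
    'industryId': ['1730B'] * 4 + ['3524A'] * 4,
    'year': [2017, 2018, 2019, 2020] * 2,
    'value': [500, 512, 370, 490, 600, 610, 630, 290],
}
df = pd.DataFrame(data)

# Number of quantile brackets per year (illustrative; 2 means lower/upper half)
n_brackets = 2

# Within each year, cut the values into equal-sized quantile bins.
# labels=False makes qcut return integer bin codes 0..n_brackets-1.
df['bracket'] = (
    df.groupby('year')['value']
      .transform(lambda s: pd.qcut(s, q=n_brackets, labels=False))
)

# Order the rows like the desired output in the question
result = df.sort_values(['year', 'bracket']).reset_index(drop=True)
print(result)
```

With more industryIds per year, raising `n_brackets` (for example to 4 for quartiles) would split each year into more brackets without changing the rest of the code.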