I have a dataframe that looks like this:
dict = {'industryId': {0: '1730B' , 1: '1730B', 2: '1730B', 3: '1730B', 4: '3524A', 5: '3524A', 6: '3524A', 7: '3524A'},
'year': {0: 2017, 1: 2018, 2: 2019, 3: 2020, 4: 2017, 5: 2018, 6: 2019, 7: 2020},
'value': {0: 500, 1: 512, 2: 370, 3: 490, 4: 600, 5: 610, 6: 630, 7: 290}}
df = pd.DataFrame(dict)
industryId year value
0 1730B 2017 500
1 1730B 2018 512
2 1730B 2019 370
3 1730B 2020 490
4 3524A 2017 600
5 3524A 2018 610
6 3524A 2019 630
7 3524A 2020 290
I want to use the quantile function to assign a bracket to each value across each year, by comparing the industryId
values against each other.
The output df should look like this:
industryId year value bracket
0 1730B 2017 500 0
1 3524A 2017 600 1
2 1730B 2018 512 0
3 3524A 2018 610 1
4 1730B 2019 370 0
5 3524A 2019 630 1
6 3524A 2020 290 0
7 1730B 2020 490 1
My code looks like this:
df_cutoffvals = df.groupby(['year','industryId'])['value'] \
.quantile(q=[i for i in range(0,2)]) \
.reset_index()
However, it duplicates all the values and I don't know how to correct this. Here is the output of my code:
year industryId level_2 value
0 2017 1730B 0 500.00
1 2017 1730B 1 500.00
2 2017 3524A 0 600.00
3 2017 3524A 1 600.00
4 2018 1730B 0 512.00
5 2018 1730B 1 512.00
6 2018 3524A 0 610.00
7 2018 3524A 1 610.00
8 2019 1730B 0 370.00
9 2019 1730B 1 370.00
10 2019 3524A 0 630.00
11 2019 3524A 1 630.00
12 2020 1730B 0 490.00
13 2020 1730B 1 490.00
14 2020 3524A 0 290.00
15 2020 3524A 1 290.00
Does anyone have any suggestion how to get from this to my desired output?
CodePudding user response:
In your code you groupby using the index as both 'year' and 'industryId', so there is a duplicate instance of every pair. You might only need to groupby using 'year' only.
import pandas as pd
dict = {'industryId': {0: '1730B' , 1: '1730B', 2: '1730B', 3: '1730B', 4: '3524A', 5: '3524A', 6: '3524A', 7: '3524A'},
'year': {0: 2017, 1: 2018, 2: 2019, 3: 2020, 4: 2017, 5: 2018, 6: 2019, 7: 2020},
'value': {0: 500, 1: 512, 2: 370, 3: 490, 4: 600, 5: 610, 6: 630, 7: 290}}
df = pd.DataFrame(dict)
df_cutoffvals = df.groupby(['year'])[['value']].quantile(q=[i for i in range(0,2)]).reset_index()
df_final = pd.merge(df_cutoffvals, df, how='inner', on=['year', 'value'])
print(df_final)
Output
year level_1 value industryId
0 2017 0.0 500.0 1730B
1 2017 1.0 600.0 3524A
2 2018 0.0 512.0 1730B
3 2018 1.0 610.0 3524A
4 2019 0.0 370.0 1730B
5 2019 1.0 630.0 3524A
6 2020 0.0 290.0 3524A
7 2020 1.0 490.0 1730B