The task is the following. It's required to find the top 5 routes that were most often delayed and count how many times they were delayed due to weather conditions.
There is the list of flights:
FlightNum CancellationCode
1 "B"
1 NA
1 NA
2 NA
2 "A"
2 "A"
3 NA
3 NA
3 NA
4 "B"
4 "B"
4 "B"
5 NA
5 "A"
5 "B"
6 "A"
6 "A"
6 "A"
6 "B"
7 "A"
7 "B"
7 "B"
CancellationCode is the reason of delaying. "A" - Carrier, "B" - weather, NA - departed in time. I wrote code which finds the top 5 routes that were most often delayed.
data[(data.CancellationCode.notnull())]['FlightNum'].value_counts()[:5]
Result:
6: 4
7: 3
4: 3
5: 2
2: 2
Now it's required to show the number of delayed flights due to the weather ("B") of these FlightNum's. The result must be the following:
6: 1
7: 2
4: 3
5: 1
2: 0
How my code could be improved?
CodePudding user response:
Here is a way. First get the value_counts
when it is due to weather and reindex
with the index of the current solution you have to get only the top 5 routes.
res = (
data.loc[data['CancellationCode'].eq('"B"'), 'FlightNum'].value_counts()
.reindex(data.loc[data['CancellationCode'].notnull(), 'FlightNum']
.value_counts()[:5].index,
fill_value=0)
)
print(res)
# 6 1
# 4 3
# 7 2
# 2 0
# 5 1
# Name: FlightNum, dtype: int64
CodePudding user response:
it could be very informative to use pivot table:
table = df.dropna().assign(n=1).pivot_table(index='FlightNum',
columns='CancellationCode',
aggfunc='sum',
margins=True,
fill_value=0).droplevel(0,1)
>>> table
'''
CancellationCode "A" "B" All
FlightNum
1 0 1 1
2 2 0 2
4 0 3 3
5 1 1 2
6 3 1 4
7 1 2 3
All 7 8 15
'''
# the top 5 routes that were most often delayed
table.drop('All').nlargest(5,'All')
>>> out
'''
CancellationCode "A" "B" All
FlightNum
6 3 1 4
4 0 3 3
7 1 2 3
2 2 0 2
5 1 1 2
'''
# or to show only the number of delayed flights due to the weather ("B")
table.drop('All').nlargest(5,'All')['"B"']
>>> out
'''
FlightNum
6 1
4 3
7 2
2 0
5 1
Name: "B", dtype: int64