How to count some data under certain conditions-CodePudding

The task is the following. It's required to find the top 5 routes that were most often delayed and count how many times they were delayed due to weather conditions.

There is the list of flights:

FlightNum CancellationCode
1 "B"
1 NA
1 NA
2 NA
2 "A"
2 "A"
3 NA
3 NA
3 NA
4 "B"
4 "B"
4 "B"
5 NA
5 "A"
5 "B"
6 "A"
6 "A"
6 "A"
6 "B"
7 "A"
7 "B"
7 "B"

CancellationCode is the reason of delaying. "A" - Carrier, "B" - weather, NA - departed in time. I wrote code which finds the top 5 routes that were most often delayed.

data[(data.CancellationCode.notnull())]['FlightNum'].value_counts()[:5]

Result:
6: 4
7: 3
4: 3
5: 2
2: 2

Now it's required to show the number of delayed flights due to the weather ("B") of these FlightNum's. The result must be the following:

6: 1
7: 2
4: 3
5: 1
2: 0

How my code could be improved?

CodePudding user response：

Here is a way. First get the value_counts when it is due to weather and reindex with the index of the current solution you have to get only the top 5 routes.

res = (
    data.loc[data['CancellationCode'].eq('"B"'), 'FlightNum'].value_counts()
    .reindex(data.loc[data['CancellationCode'].notnull(), 'FlightNum']
                 .value_counts()[:5].index,
             fill_value=0)
)
print(res)
# 6    1
# 4    3
# 7    2
# 2    0
# 5    1
# Name: FlightNum, dtype: int64

CodePudding user response：

it could be very informative to use pivot table:

table = df.dropna().assign(n=1).pivot_table(index='FlightNum',
                                            columns='CancellationCode',
                                            aggfunc='sum',
                                            margins=True,
                                            fill_value=0).droplevel(0,1)
>>> table
'''
CancellationCode  "A"  "B"  All
FlightNum                      
1                   0    1    1
2                   2    0    2
4                   0    3    3
5                   1    1    2
6                   3    1    4
7                   1    2    3
All                 7    8   15
'''
# the top 5 routes that were most often delayed
table.drop('All').nlargest(5,'All')

>>> out
'''
CancellationCode  "A"  "B"  All
FlightNum                      
6                   3    1    4
4                   0    3    3
7                   1    2    3
2                   2    0    2
5                   1    1    2
'''
# or to show only the number of delayed flights due to the weather ("B")
table.drop('All').nlargest(5,'All')['"B"']

>>> out
'''
FlightNum
6    1
4    3
7    2
2    0
5    1
Name: "B", dtype: int64