I have two dataframes (members and expeditions). In expeditions, there is a column that gives the number of members (also called members) of the expedition and in members, we have the shippers each linked to an expedition_id to make the link between the two dataframes. I have calculated for each expedition_id the total number of members per expedition and I would like to compare if the number of members given in expeditions is the same as the one I calculated. Can you help me?
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
Number of members by expedition
nbre_membres_expedition = members[["expedition_id", "member_id"]].groupby("expedition_id", as_index = False).count()
nbre_membres_expedition
CodePudding user response:
To check the difference use merge
and filter row where values are different from 2 columns:
nbre_memb_exp = members.value_counts('expedition_id').rename('nbre_memb_exp')
nbre_exp_memb = expeditions.set_index('expedition_id')['members'].rename('nbre_exp_memb')
diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb,
left_index=True, right_index=True, how='outer') \
.query('nbre_memb_exp != nbre_exp_memb')
Output:
>>> diff_df
nbre_memb_exp nbre_exp_memb
expedition_id
ACHN15302 11.0 9 # hired_staff=2
ACHN18301 9.0 8 # hired_staff=1
AMAD00106 3.0 1 # hired_staff=2
AMAD00110 10.0 8 # hired_staff=3 ???
AMAD00112 5.0 3 # hired_staff=2
... ... ...
YALU88301 10.0 8
YALU89301 10.0 8
YALU89401 7.0 5
YAUP13301 4.0 2
YAUP17101 9.0 6
[5431 rows x 2 columns]
I think you have to sum with the column hired_staff
. Change the previous row nbre_exp_memb = ...
by:
nbre_exp_memb = expeditions.set_index('expedition_id')[['members', 'hired_staff']] \
.sum(axis=1).rename('nbre_exp_memb')