Compare data in two dataframes-CodePudding

I have two dataframes (members and expeditions). In expeditions, there is a column that gives the number of members (also called members) of the expedition and in members, we have the shippers each linked to an expedition_id to make the link between the two dataframes. I have calculated for each expedition_id the total number of members per expedition and I would like to compare if the number of members given in expeditions is the same as the one I calculated. Can you help me?

import pandas as pd

members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

Number of members by expedition

nbre_membres_expedition = members[["expedition_id", "member_id"]].groupby("expedition_id", as_index = False).count()

nbre_membres_expedition

CodePudding user response：

To check the difference use merge and filter row where values are different from 2 columns:

nbre_memb_exp = members.value_counts('expedition_id').rename('nbre_memb_exp')
nbre_exp_memb = expeditions.set_index('expedition_id')['members'].rename('nbre_exp_memb')

diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb, 
                   left_index=True, right_index=True, how='outer') \
            .query('nbre_memb_exp != nbre_exp_memb')

Output:

>>> diff_df
               nbre_memb_exp  nbre_exp_memb
expedition_id                              
ACHN15302               11.0              9  #   hired_staff=2
ACHN18301                9.0              8  #   hired_staff=1
AMAD00106                3.0              1  #   hired_staff=2
AMAD00110               10.0              8  #   hired_staff=3 ???
AMAD00112                5.0              3  #   hired_staff=2
...                      ...            ...
YALU88301               10.0              8
YALU89301               10.0              8
YALU89401                7.0              5
YAUP13301                4.0              2
YAUP17101                9.0              6

[5431 rows x 2 columns]

I think you have to sum with the column hired_staff. Change the previous row nbre_exp_memb = ... by:

nbre_exp_memb = expeditions.set_index('expedition_id')[['members', 'hired_staff']] \
                           .sum(axis=1).rename('nbre_exp_memb')