There are two DataFrames. The first one has the contract id numbers and the names; the second one has the contract id numbers and the transaction types. The first DataFrame is
contract id | first name | last name |
---|---|---|
1 | John | Smith |
2 | Rob | Brown |
3 | Rob | Brown |
and the second DataFrame is
contract id | transaction |
---|---|
1 | cash |
1 | cash |
1 | cash |
2 | bank transfer |
2 | bank transfer |
2 | bank transfer |
3 | cash |
I want to count the number of individuals who have used only one type of transaction. In the example there are two individuals: John Smith (contract 1) has only paid in cash, while Rob Brown (contracts 2 and 3) has used both bank transfer and cash. So the answer would be 1.
The DataFrames are large and joining them together is not viable. What other options are there?
Data:
df1:
{'contract id': [1, 2, 3],
 'first name': ['John', 'Rob', 'Rob'],
 'last name': ['Smith', 'Brown', 'Brown']}
df2:
{'contract id': [1, 1, 1, 2, 2, 2, 3],
 'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                 'bank transfer', 'bank transfer', 'cash']}
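For reference, the dictionaries above can be loaded into DataFrames directly (a minimal setup, assuming pandas is available):

```python
import pandas as pd

# Build the example DataFrames from the dictionaries above
df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})
```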
CodePudding user response:
You could create a single "name" column in df1, then map contract ids to names in df2. If you have a lot of duplicate rows in df2, it might be worth calling drop_duplicates first. Then chain value_counts, eq, and sum on the "name" column to count the number of people who had a single type of transaction:
mapping = df1.assign(name=df1['first name'] + ' ' + df1['last name']).set_index('contract id')['name']
df2 = df2.drop_duplicates().copy()
df2['name'] = df2['contract id'].map(mapping)
out = df2['name'].value_counts().eq(1).sum()
Another option is to groupby the mapped names and count the number of unique transactions per person (but I suspect this will be slower than the other method):
df2['transaction'].groupby(df2['contract id'].map(mapping)).nunique().eq(1).sum()
Output:
1
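Putting the first method together on the sample data, as a self-contained sketch (assuming pandas; the variable names mirror the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

# contract id -> full name mapping
mapping = (df1.assign(name=df1['first name'] + ' ' + df1['last name'])
              .set_index('contract id')['name'])

# drop duplicate (contract id, transaction) pairs before mapping
dedup = df2.drop_duplicates().copy()
dedup['name'] = dedup['contract id'].map(mapping)

# a name left with a single row used exactly one transaction type
out = dedup['name'].value_counts().eq(1).sum()
# out == 1 (only John Smith used a single transaction type)
```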
CodePudding user response:
One solution with merge + groupby:
# merge of the 2 datasets based on the common column to get one table with all the information
# for the real dataset may have to be more precise in the type of merging (left, outer, ...)
data = df1.merge(df2)
data['name'] = data['first name'] + ' ' + data['last name'] # to get "unique" full names
dfg = (data.groupby('name')['transaction']  # group the data by name, keep the transaction column
           .unique()                        # take the array of unique transaction values per name
           .apply(len)                      # count the number of elements in each array
      )
res = dfg.index[dfg.values == 1].tolist() # list of the names for which the value is 1
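Since the question asks for a count rather than the names, the length of that list gives the answer. A minimal end-to-end sketch (assuming pandas, and using nunique as an equivalent shortcut for unique-then-len):

```python
import pandas as pd

df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

data = df1.merge(df2)                            # join on the common 'contract id' column
data['name'] = data['first name'] + ' ' + data['last name']

# number of distinct transaction types per person
n_types = data.groupby('name')['transaction'].nunique()

res = n_types.index[n_types.eq(1)].tolist()      # names with exactly one type
count = len(res)                                 # the requested count
```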