There are two DataFrames. The first one has the contract id numbers and the names; the second one has the contract id numbers and the transaction types. The first DataFrame is
contract id | first name | last name |
---|---|---|
1 | John | Smith |
2 | Rob | Brown |
3 | Rob | Brown |
and the second DataFrame is
contract id | transaction |
---|---|
1 | cash |
1 | cash |
1 | cash |
2 | bank transfer |
2 | bank transfer |
2 | bank transfer |
3 | cash |
I want to count the number of individuals who have used only one type of transaction. In the example there are two individuals: John Smith (contract 1) has only paid in cash, while Rob Brown (contracts 2 and 3) has used both bank transfer and cash. So the answer would be 1.
The DataFrames are large and joining them together is not viable. What other options are there?
Data:
df1:
{'contract id': [1, 2, 3],
 'first name': ['John', 'Rob', 'Rob'],
 'last name': ['Smith', 'Brown', 'Brown']}
df2:
{'contract id': [1, 1, 1, 2, 2, 2, 3],
 'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                 'bank transfer', 'bank transfer', 'cash']}
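For reference, the dictionaries above can be loaded into DataFrames directly (a minimal setup, assuming pandas is available):

```python
import pandas as pd

# Build the example DataFrames from the dictionaries above
df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})
```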
CodePudding user response:
You could create a single "name" column in df1, then map contract ids to names in df2. If you have a lot of duplicate rows in df2, it might be worth calling drop_duplicates first. Then chain value_counts, eq, and sum on the "name" column to count the number of people who had a single type of transaction:
mapping = df1.assign(name=df1['first name'] + ' ' + df1['last name']).set_index('contract id')['name']
df2 = df2.drop_duplicates().copy()
df2['name'] = df2['contract id'].map(mapping)
out = df2['name'].value_counts().eq(1).sum()
Another option is to groupby the mapped names and count the number of unique transactions per person (but I suspect this will be slower than the other method):
df2['transaction'].groupby(df2['contract id'].map(mapping)).nunique().eq(1).sum()
Output:
1
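Putting the first method together on the sample data, as a self-contained sketch (assuming pandas; the variable names mirror the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

# contract id -> full name mapping
mapping = (df1.assign(name=df1['first name'] + ' ' + df1['last name'])
              .set_index('contract id')['name'])

# drop duplicate (contract id, transaction) pairs before mapping
dedup = df2.drop_duplicates().copy()
dedup['name'] = dedup['contract id'].map(mapping)

# a name left with a single row used exactly one transaction type
out = dedup['name'].value_counts().eq(1).sum()
# out == 1 (only John Smith used a single transaction type)
```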
CodePudding user response:
One solution with merge + groupby:
# merge of the 2 datasets based on the common column to get one table with all the information
# for the real dataset may have to be more precise in the type of merging (left, outer, ...)
data = df1.merge(df2)
data['name'] = data['first name'] + ' ' + data['last name'] # to get "unique" full names
dfg = (data.groupby('name')['transaction']  # group the data by name, keep the transaction column
           .unique()                        # take the array of unique transaction values per name
           .apply(len)                      # count the number of elements in each array
      )
res = dfg.index[dfg.values == 1].tolist() # list of the names for which the value is 1
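Since the question asks for a count rather than the names, the length of that list gives the answer. A minimal end-to-end sketch (assuming pandas, and using nunique as an equivalent shortcut for unique-then-len):

```python
import pandas as pd

df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

data = df1.merge(df2)                            # join on the common 'contract id' column
data['name'] = data['first name'] + ' ' + data['last name']

# number of distinct transaction types per person
n_types = data.groupby('name')['transaction'].nunique()

res = n_types.index[n_types.eq(1)].tolist()      # names with exactly one type
count = len(res)                                 # the requested count
```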