I have two dataframes,
df1 df2
country country
US AR
US AD
CA AO
CN AU
AR US
How do I group by them by combining the country list to a set the compare the difference between two dataframes?
My expected output will be like,
country code df1_country_count df2_country_count
AR 1 1
AD 0 1
AO 0 1
AU 0 1
US 2 1
CA 1 0
CN 1 0
CodePudding user response:
(df1.value_counts().to_frame('df1_country_count')
.join(df2.value_counts().to_frame('df2_country_count'), how='outer')
.fillna(0).astype('int').rename_axis('country code'))
result:
df1_country_count df2_country_count
country code
AD 0 1
AO 0 1
AR 1 1
AU 0 1
CA 1 0
CN 1 0
US 2 1
CodePudding user response:
You can use value_counts
and then concat.
out = pd.concat([df1.country.value_counts(),
df2.country.value_counts()], axis=1).fillna(0).astype(int)
out.columns = ['df1_country', 'df2_country']
print(out)
df1_country df2_country
US 2 1
CA 1 0
CN 1 0
AR 1 1
AD 0 1
AO 0 1
AU 0 1
CodePudding user response:
- Combine all of the dataframes, regardless of how many, with
pd.concat
and use.assign
in a list comprehension to add a'source'
column.source=f'df{i}'
: structure the f-string for how the column name should appear in the frequency table.- If loading the data from files, see Option 4 of this answer to load the csv files directly into a single dataframe with a new column.
- Use
pd.crosstab
to compute a frequency table of the two columns.
import pandas as pd
# sample dataframes
df1 = pd.DataFrame({'country': ['US', 'US', 'CA', 'CN', 'AR']})
df2 = pd.DataFrame({'country': ['AR', 'AD', 'AO', 'AU', 'US']})
# list of dataframes
df_list = [df1, df2]
# combine dataframes
df = pd.concat([d.assign(source=f'df{i}') for i, d in enumerate(df_list, 1)], ignore_index=True)
# create frequency table
counts = pd.crosstab(df.country, df.source)
source df1 df2
country
AD 0 1
AO 0 1
AR 1 1
AU 0 1
CA 1 0
CN 1 0
US 2 1