I have done .value_counts()
on two dataFrames (similar column) and would like to compare the two.
I also tried with converting the resulting Series to dataframes (.to_frame('counts')
as suggested in this thread), but it doesn't help.
first = df1['company'].value_counts()
second = df2['company'].value_counts()
I tried to merge but I think the main problem is that I dont have the company name as a column but its the index (?). Is there a way to resolve it or to use a different way to get the comparison?
GOAL: The end goal is to be able to see which companies occur more in df2 than in df1, and the value_counts() themselves (or the difference between them).
CodePudding user response:
To compare the value counts between the two data frames, you can use the merge
function and specify the left_index
and right_index
parameters as True
. This will use the index values (in this case, the company names) as the keys to join the two data frames. You can then specify the how parameter as 'outer'
to get the union of the two value counts, and use the fillna
function to fill any missing values with 0.
Here's an example of how you can do this:
merged = first.to_frame('first').merge(second.to_frame('second'), left_index=True, right_index=True, how='outer').fillna(0)
This will give you a new data frame with the company names as the index and two columns 'first'
and 'second'
containing the value counts for each data frame. You can then use the apply
function to compare the two value counts and create a new column with the difference:
merged['difference'] = merged.apply(lambda x: x['second'] - x['first'], axis=1)
You can then use the sort_values
function to sort the data frame by the 'difference'
column to see which companies occur more in df2
than in df1
.
sorted_df = merged.sort_values('difference', ascending=False)
CodePudding user response:
You might use collections.Counter
ability to subtract
as follows
import collections
import pandas as pd
df1 = pd.DataFrame({'company':['A','A','A','B','B','C','Y']})
df2 = pd.DataFrame({'company':['A','B','B','C','C','C','Z']})
c1 = collections.Counter(df1['company'])
c2 = collections.Counter(df2['company'])
c1.subtract(c2)
print(c1)
gives output
Counter({'A': 2, 'Y': 1, 'B': 0, 'Z': -1, 'C': -2})
Explanation: where value is positive means more instances are in df1
, where value is zero then number is equal, where value is negative means more instances are in df2
.
CodePudding user response:
Use from this code
df2['x'] = '2'
df1['x'] = '1'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])
df = pd.pivot_table(df, index=['company'], columns=['x'], aggfunc={'values': 'sum'}).reset_index()
Now filter on df for related data