Subtract 2 dataframes with different column length if key matches


I have two dataframes with a different number of rows. I want to subtract the VIEWS columns from each other wherever the URL fields match. My code below gives me completely wrong results: almost exclusively NaN values, and floats, neither of which makes sense to me. Is there a better solution, or an obvious mistake in my code?

a = a.loc[:, ['VIEWS', 'URL']]
b = b.loc[:, ['VIEWS', 'URL']]

df = pd.concat([a,b], ignore_index=True)

df['VIEWS'] = pd.to_numeric(df['VIEWS'], errors='coerce').fillna(0).astype(int)
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].diff().abs()

CodePudding user response:

Great question!

Let's start with a possible solution.

I assume you want to subtract the per-URL total of one frame from the per-URL total of the other. Building on your cleaning step, here is a small, (hopefully) complete example, which uses .sum() and multiplies the views from b by -1 before grouping:

import pandas as pd
import numpy as np

a = pd.DataFrame(data = [
        [100, 'x.me'], [200, 'y.me'], [50, 'x.me'], [np.nan, 'y.me']
    ], columns=['VIEWS', 'URL'])

b = pd.DataFrame(data = [
        [90, 'x.me'], [200, 'z.me'],
    ], columns=['VIEWS', 'URL'])

for x in [a, b]:
    x['VIEWS'] = pd.to_numeric(x['VIEWS'], errors='coerce').fillna(0).astype(int)

df = pd.concat(
    [
        x.groupby(['URL'])['VIEWS']
         .apply(lambda y: y.sum() * (1 - 2 * cnt))  # factor +1 for a, -1 for b
         .reset_index(drop=False)
        for (cnt, x) in enumerate([a, b])
    ],
    ignore_index=True,
)
df = df.groupby(['URL'])['VIEWS'].sum().abs().reset_index()
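As a side note, the same per-URL subtraction can be written more directly by aligning the two groupby totals with Series.sub and a fill_value, so URLs missing from one frame count as 0 instead of being dropped. This is a sketch of an alternative, not the approach above, using the same sample data:

```python
import pandas as pd
import numpy as np

a = pd.DataFrame([[100, 'x.me'], [200, 'y.me'], [50, 'x.me'], [np.nan, 'y.me']],
                 columns=['VIEWS', 'URL'])
b = pd.DataFrame([[90, 'x.me'], [200, 'z.me']], columns=['VIEWS', 'URL'])

# Same cleaning as in the question.
for x in (a, b):
    x['VIEWS'] = pd.to_numeric(x['VIEWS'], errors='coerce').fillna(0).astype(int)

# Total views per URL in each frame; the two Series are indexed by URL.
ta = a.groupby('URL')['VIEWS'].sum()
tb = b.groupby('URL')['VIEWS'].sum()

# Subtract with alignment on URL; fill_value=0 keeps URLs that appear
# in only one of the frames.
diff = ta.sub(tb, fill_value=0).abs().reset_index()
```

This yields one row per URL (x.me, y.me, z.me) with the absolute difference of the totals.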

A few words on why your approach is currently not working

  1. diff(): there is a diff() method on the SeriesGroupBy class, but it takes the difference of each row from the previous row within its group — not the difference between the two frames. Check this out for a valid usage of diff() in this context: Pandas groupby multiple fields then diff
  2. NaNs appear in your last operation because you are trying to assign a Series whose index consists of the URLs onto a DataFrame with a completely different (positional) index.
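To make the first point concrete, here is a minimal sketch of what groupby diff() actually computes — row-to-previous-row differences within each group, with NaN for each group's first row:

```python
import pandas as pd

df = pd.DataFrame({'URL': ['x.me', 'x.me', 'y.me', 'x.me'],
                   'VIEWS': [100, 90, 200, 50]})

# diff() works row-wise within each URL group: the first row of every
# group has no predecessor (NaN); later rows get current - previous.
df['DIFF'] = df.groupby('URL')['VIEWS'].diff()
# DIFF: NaN, -10 (90-100), NaN, -40 (50-90)
```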

So if anything, an operation such as the following could work

df['VIEWS'] = df.groupby(['URL'])['VIEWS'].sum().reset_index(drop=True)

although this still assumes that df does not change in size and that the indices on the left-hand side match the ones after the reset on the right-hand side.
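The second point above — NaNs from index misalignment — is easy to reproduce in a small sketch: assigning a URL-indexed Series to a column of a positionally indexed frame aligns on index labels, and since no labels overlap, every value comes out NaN:

```python
import pandas as pd

df = pd.DataFrame({'URL': ['x.me', 'y.me', 'x.me'],
                   'VIEWS': [100, 200, 50]})

# groupby().sum() returns a Series indexed by URL ('x.me', 'y.me') ...
totals = df.groupby('URL')['VIEWS'].sum()

# ... so assigning it back aligns those string labels against the frame's
# RangeIndex (0, 1, 2): no labels match, and the whole column becomes NaN.
df['BAD'] = totals
```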
