Home > Software engineering >  Subtracting the values in another DataFrame from every column in a pandas DataFrame
Subtracting the values in another DataFrame from every column in a pandas DataFrame

Time:03-14

I have two DataFrames of 20 rows and 4 columns. The names and value types of the columns are the same. One of the columns is the title, the other 3 are values.

df1
title  col1 col2 col3
apple    a    d    g
pear     b    e    h
grape    c    f    i

df2
title  col1 col2 col3
carrot   q    t    w
pumpkin  r    u    x
sprouts  s    v    y

Now I would like to create 3 separate tables/lists subtracting each value of df1.col1 - df2.col1 | df1.col2 - df2.col2 | df1.col3 - df2.col3. For df1.col1 - df2.col1 I expect an output that looks something among the lines of:

df1.title  df2.title score
apple      carrot    (a - q)
apple      pumpkin    (a - r)
apple      sprouts   (a - s)
pear       carrot    (b - t)
pear       pumpkin   (b - u)
pear       sprouts   (b - v)
grape      carrot    (c - w)
grape      pumpkin   (c - x)
grape      sprouts   (c - y)

I tried to create a for loop using the following code:

for i in df1.iterrows():
    score_col1 = df1.col1[[i]] - df2.col2[[j]]
    score_col2 = df1.col2[[i]] - df2.col2[[j]]
    score_col3 = df1.col3[[i]] - df2.col3[[j]]
    score_total = score_col1   score_col2   score_col3
    i = i   1

In return, I received an output for score_col1 looking like this:

df1.title  df2.title score
apple      carrot    (a - q)
pear       carrot    (b - t)
grape      carrot    (c - w)

Can someone help me to obtain the expected output?

CodePudding user response:

Since you want 3 separate DataFrames, we could use a loop (if you want a single DataFrame, we could do a similar job a bit differently).

We could unstack df2 and iteratively subtract it from the repeated columns of df1:

out = []
df2_stacked = df2.set_index('title').unstack().droplevel(0).reset_index(name='score')
for col in df1.filter(like='col'):
    tmp = (df1[['title', col]]
           .loc[df1.index.repeat(len(df2))]
           .reset_index(drop=True)
           .join(df2_stacked, lsuffix='_df1', rsuffix='_df2'))
    tmp['score'] = tmp[col] - tmp['score']
    out.append(tmp.drop(columns=col))

Let's test it on a numerical example:

df1:

   title  col1  col2  col3
0  apple  1000   100    10
1   pear  2000   200    20
2  grape  3000   300    30

df2:

     title  col1  col2  col3
0   carrot     1     4     7
1  pumpkin     2     5     8
2  sprouts     3     6     9

Then if run the code above and print out, it contains the following three DataFrames:

      title_df1 title_df2  score
    0     apple    carrot    999
    1     apple   pumpkin    998
    2     apple   sprouts    997
    3      pear    carrot   1996
    4      pear   pumpkin   1995
    5      pear   sprouts   1994
    6     grape    carrot   2993
    7     grape   pumpkin   2992
    8     grape   sprouts   2991


      title_df1 title_df2  score
    0     apple    carrot     99
    1     apple   pumpkin     98
    2     apple   sprouts     97
    3      pear    carrot    196
    4      pear   pumpkin    195
    5      pear   sprouts    194
    6     grape    carrot    293
    7     grape   pumpkin    292
    8     grape   sprouts    291



      title_df1 title_df2  score
    0     apple    carrot      9
    1     apple   pumpkin      8
    2     apple   sprouts      7
    3      pear    carrot     16
    4      pear   pumpkin     15
    5      pear   sprouts     14
    6     grape    carrot     23
    7     grape   pumpkin     22
    8     grape   sprouts     21

CodePudding user response:

a1 = ['apple','pear', 'banana']
b1 = [56,32,23]
c1 = [12,34,90]
d1 = [87,65,23]

a2 = ['carrot','pumpkin','sprouts']
b2 = [16,12,93]
c2 = [12,32,70]
d2 = [81,55,21]

df1 = pd.DataFrame({'title':a1, 'col1':b1, 'col2':c1, 'col3':d1})
df2 = pd.DataFrame({'title':a2, 'col1':b2, 'col2':c2, 'col3':d2})

res_df = pd.DataFrame([])
cols = ['col1','col2','col3']

for c in cols:
    res_df = pd.DataFrame([])
    for i,j in df1.iterrows():
        for k,l in df2.iterrows():
            res_df = res_df.append(pd.DataFrame({'title_df1':j.title, 'title_df2':l.title, 'score':j[str(c)] - l[str(c)]},index=[0]), ignore_index=True)

    print(res_df)
  • Related