Home > Back-end >  Python: check if two dataframes contain filled cells in the same location
Python: check if two dataframes contain filled cells in the same location

Time:07-01

So basically what I would like to do is to make sure that cell (x, y) from either DF1 or DF2 is filled but not in both, for all cells in these dataframes. DF1 and DF2 are of equal shape so there is an equal amount of cells. If both cells in the same location in DF1 and DF2 are filled then it should raise an exception that something goes wrong.

For some reason, I can't seem to be able to wrap my head around it, although it sounds quite easy.

What I've tried:

  • Check with .notnull() and then compare both of them > results in a big boolean mess that is not distinguishable.
  • Could do it with a double for loop but that just does not seem pythonic enough.

See below examples of DF1/DF2. The indices and columns are identical, only different parts are filled, the empty cells are filled with np.nan. The cell values contain the number of orders on a certain day for a certain delivery day. The goal is to condense this to a matrix containing the x-week average from a certain order day (mon-sun) for a certain delivery day (mon - sun).

DF1

DF2


EDIT: text files and expected output

DF1.csv

order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,34.0,,214.0,74.0,46.0,21.0,19.0,,,,,,,,,,,,,,
Saturday,,,79.0,154.0,75.0,28.0,16.0,14.0,,,,,,,,,,,,,
Sunday,,,,301.0,183.0,60.0,42.0,25.0,,,,,,,,,,,,,
Monday,,,,49.0,61.0,216.0,104.0,36.0,,28.0,,,,,,,,,,,
Tuesday,,,,,47.0,180.0,77.0,36.0,,17.0,8.0,,,,,,,,,,
Wednesday,,,,,,84.0,200.0,69.0,,58.0,24.0,10.0,,,,,,,,,
Thursday,,,,,,,84.0,148.0,,87.0,37.0,10.0,3.0,,,,,,,,

DF2.csv

order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,,,,,,,,44.0,,290.0,86.0,54.0,13.0,16.0,,,,,,,
Saturday,,,,,,,,,,135.0,177.0,125.0,24.0,28.0,8.0,,,,,,
Sunday,,,,,,,,,,,358.0,181.0,58.0,48.0,29.0,,,,,,
Monday,,,,,,,,,,,101.0,156.0,96.0,60.0,32.0,,15.0,,,,
Tuesday,,,,,,,,,,,,3.0,38.0,20.0,6.0,,4.0,2.0,,,
Wednesday,,,,,,,,,,,,,,,,,,,,,
Thursday,,,,,,,,,,,,,,,,,,,,,

Load with pd.read_csv('DF2.csv', index_col='order_day')

Expected output

There is not really an exact expected output. It could be something like print('No filled cells overlap!'). For this MRE you can be fairly sure that there is no overlap. However, I am going to work with larger date ranges and I don't want to rely on good faith.

CodePudding user response:

Update

A most useful output to analyze:

dups = (pd.concat([df1.set_index('order_day').stack(),
                   df2.set_index('order_day').stack()],
                   keys=['df1', 'df2'], axis=1)
          .loc[lambda x: x.notna().all(axis=1)])
print(dups)

# Output:
                      df1  df2
order_day                     
Fri       2022-06-20  1.0  2.0
Sat       2022-06-18  1.0  3.0
          2022-06-20  3.0  2.0
Tue       2022-06-20  3.0  1.0
Thu       2022-06-19  1.0  3.0

Setup a MRE:

import pandas as pd
import numpy as np

wdays = ['Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu']
dates = ['2022-06-18', '2022-06-19', '2022-06-20']
np.random.seed(2022)
data1 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.2, .1, .2, .5])
np.random.seed(2021)
data2 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.1, .2, .2, .5])
df1 = pd.DataFrame(data1, wdays, dates).rename_axis('order_day').reset_index()
df2 = pd.DataFrame(data2, wdays, dates).rename_axis('order_day').reset_index()
print(df1)
print(df2)

# df1
  order_day  2022-06-18  2022-06-19  2022-06-20
0       Fri         1.0         3.0         1.0
1       Sat         1.0         NaN         3.0
2       Sun         NaN         NaN         NaN
3       Mon         NaN         NaN         NaN
4       Tue         NaN         NaN         3.0
5       Wed         3.0         3.0         NaN
6       Thu         NaN         1.0         NaN

# df2
  order_day  2022-06-18  2022-06-19  2022-06-20
0       Fri         NaN         NaN         2.0
1       Sat         3.0         NaN         2.0
2       Sun         2.0         NaN         NaN
3       Mon         NaN         1.0         1.0
4       Tue         NaN         NaN         1.0
5       Wed         NaN         NaN         NaN
6       Thu         NaN         3.0         3.0

Old answer

Flat your 2 dataframes (stack drops NaN values by default) then concatenate them and check duplicate index:

>>> dups = (pd.concat([df1.set_index('order_day').stack(),
                   df2.set_index('order_day').stack()])
              .loc[lambda x: x.index.duplicated(keep=False)])

Series([], dtype: float64)

CodePudding user response:

This does what I want, but it seems to me that this could be done easier/more pythonic.

for col in df1.columns:
    for idx in df1.index:
        if pd.notna(df1.loc[idx, col]) and pd.notna(df2.loc[idx, col]):
            raise Exception(f"Cells ({idx = }, {col = }) both contain values.")
  • Related