So basically what I would like to do is to make sure that cell (x, y)
from either DF1 or DF2 is filled but not in both, for all cells in these dataframes. DF1 and DF2 are of equal shape so there is an equal amount of cells. If both cells in the same location in DF1 and DF2 are filled then it should raise an exception that something goes wrong.
For some reason, I can't seem to be able to wrap my head around it, although it sounds quite easy.
What I've tried:
- Check with
.notnull()
and then compare both of them > results in a big boolean mess that is not distinguishable. - Could do it with a double for loop but that just does not seem pythonic enough.
See below examples of DF1/DF2. The indices and columns are identical, only different parts are filled, the empty cells are filled with np.nan
. The cell values contain the number of orders on a certain day for a certain delivery day. The goal is to condense this to a matrix containing the x-week average from a certain order day (mon-sun) for a certain delivery day (mon - sun).
EDIT: text files and expected output
DF1.csv
order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,34.0,,214.0,74.0,46.0,21.0,19.0,,,,,,,,,,,,,,
Saturday,,,79.0,154.0,75.0,28.0,16.0,14.0,,,,,,,,,,,,,
Sunday,,,,301.0,183.0,60.0,42.0,25.0,,,,,,,,,,,,,
Monday,,,,49.0,61.0,216.0,104.0,36.0,,28.0,,,,,,,,,,,
Tuesday,,,,,47.0,180.0,77.0,36.0,,17.0,8.0,,,,,,,,,,
Wednesday,,,,,,84.0,200.0,69.0,,58.0,24.0,10.0,,,,,,,,,
Thursday,,,,,,,84.0,148.0,,87.0,37.0,10.0,3.0,,,,,,,,
DF2.csv
order_day,2022-06-18,2022-06-19,2022-06-20,2022-06-21,2022-06-22,2022-06-23,2022-06-24,2022-06-25,2022-06-26,2022-06-27,2022-06-28,2022-06-29,2022-06-30,2022-07-01,2022-07-02,2022-07-03,2022-07-04,2022-07-05,2022-07-06,2022-07-07,2022-07-08
Friday,,,,,,,,44.0,,290.0,86.0,54.0,13.0,16.0,,,,,,,
Saturday,,,,,,,,,,135.0,177.0,125.0,24.0,28.0,8.0,,,,,,
Sunday,,,,,,,,,,,358.0,181.0,58.0,48.0,29.0,,,,,,
Monday,,,,,,,,,,,101.0,156.0,96.0,60.0,32.0,,15.0,,,,
Tuesday,,,,,,,,,,,,3.0,38.0,20.0,6.0,,4.0,2.0,,,
Wednesday,,,,,,,,,,,,,,,,,,,,,
Thursday,,,,,,,,,,,,,,,,,,,,,
Load with pd.read_csv('DF2.csv', index_col='order_day')
Expected output
There is not really an exact expected output. It could be something like print('No filled cells overlap!')
.
For this MRE you can be fairly sure that there is no overlap. However, I am going to work with larger date ranges and I don't want to rely on good faith.
CodePudding user response:
Update
A most useful output to analyze:
dups = (pd.concat([df1.set_index('order_day').stack(),
df2.set_index('order_day').stack()],
keys=['df1', 'df2'], axis=1)
.loc[lambda x: x.notna().all(axis=1)])
print(dups)
# Output:
df1 df2
order_day
Fri 2022-06-20 1.0 2.0
Sat 2022-06-18 1.0 3.0
2022-06-20 3.0 2.0
Tue 2022-06-20 3.0 1.0
Thu 2022-06-19 1.0 3.0
Setup a MRE:
import pandas as pd
import numpy as np
wdays = ['Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu']
dates = ['2022-06-18', '2022-06-19', '2022-06-20']
np.random.seed(2022)
data1 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.2, .1, .2, .5])
np.random.seed(2021)
data2 = np.random.choice([1, 2, 3, np.nan], (7, 3), p=[.1, .2, .2, .5])
df1 = pd.DataFrame(data1, wdays, dates).rename_axis('order_day').reset_index()
df2 = pd.DataFrame(data2, wdays, dates).rename_axis('order_day').reset_index()
print(df1)
print(df2)
# df1
order_day 2022-06-18 2022-06-19 2022-06-20
0 Fri 1.0 3.0 1.0
1 Sat 1.0 NaN 3.0
2 Sun NaN NaN NaN
3 Mon NaN NaN NaN
4 Tue NaN NaN 3.0
5 Wed 3.0 3.0 NaN
6 Thu NaN 1.0 NaN
# df2
order_day 2022-06-18 2022-06-19 2022-06-20
0 Fri NaN NaN 2.0
1 Sat 3.0 NaN 2.0
2 Sun 2.0 NaN NaN
3 Mon NaN 1.0 1.0
4 Tue NaN NaN 1.0
5 Wed NaN NaN NaN
6 Thu NaN 3.0 3.0
Old answer
Flat your 2 dataframes (stack
drops NaN values by default) then concatenate them and check duplicate index:
>>> dups = (pd.concat([df1.set_index('order_day').stack(),
df2.set_index('order_day').stack()])
.loc[lambda x: x.index.duplicated(keep=False)])
Series([], dtype: float64)
CodePudding user response:
This does what I want, but it seems to me that this could be done easier/more pythonic.
for col in df1.columns:
for idx in df1.index:
if pd.notna(df1.loc[idx, col]) and pd.notna(df2.loc[idx, col]):
raise Exception(f"Cells ({idx = }, {col = }) both contain values.")