I have 2 pandas dataframes, both of them having the same columns but different line numbers depending on missing rows, one of the columns is the Date
having the following format 29/09/2020 13.22.57
the day month year omitted sometimes below for simplicity and irrelevance
The Dates might be an exact match in df
as df_2
or there might be an acceptable delay of a threshold that we preset, in this case it's 2s.
Sample data of df['Date']
:
13.22.57
13.23.12
13.23.44
13.24.01
13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
13.26.54
Sample data of df_2['Date']
:
13.22.57
13.23.13
13.23.44
13.24.02
13.24.19
13.24.35
13.25.23
13.26.13
13.26.38
In the data sample shared above the first 6 values are a match since the absolute difference of date is less than 2 seconds.
In the 7th line however we have 13.25.07
and 13.25.23
, so the expected in this case is to add an empty line above the line of 13.25.23
since that date is bigger than the other:
13.22.57
13.23.13
13.23.44
13.24.02
13.24.19
13.24.35
0
13.25.23
13.26.13
13.26.38
So now we have 13.25.07
and 0
having the same index, meaning that if we increment the index for both 13.25.23
will now match for both on the same index 8.
If we continue the same process we will be left with:
df['Date']:
13.22.57
13.23.13
13.23.44
13.24.02
13.24.19
13.24.35
0
13.25.23
0
13.26.13
13.26.38
df_2['Date']:
13.22.57
13.23.12
13.23.44
13.24.01
13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
0
13.26.54
the increment can happen for either df
or df_2
depends which has a bigger time for the missing column and at the end both should have the same number of rows as the rows that are not matching will have a 0 value now and an increment will happen to the one under.
My Code:
for idx, val in enumerate(df['Date']):
print(idx)
print(val)
print(df_2['Date'][idx 1])
print(abs(datetime.timedelta.total_seconds(val - df_2['Date'][idx 1])))
if abs(datetime.timedelta.total_seconds(val - df_2['Date'][idx 1])) > 2:
if val < df_2['Date'][idx 1]:
df_2.loc[idx 1.5] = 0
df_2 = df_2.sort_index().reset_index(drop=True)
print('smaller')
else:
df.loc[idx 0.5] = 0
df = df.sort_index().reset_index(drop=True)
print(df.loc[idx])
print('bigger')
else:
print('pass for index ' str(idx))
The code was working fine and detected the first two 0 lines, but then on the second kept printing 0s:
2020-09-29 13.22.57
2020-09-29 13.23.13
2020-09-29 13.23.44
2020-09-29 13.24.02
2020-09-29 13.24.19
2020-09-29 13.24.35
0
2020-09-29 13.25.23
0
0
0
0
0
...
Dataframes:
d = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data=d)
pd.to_datetime(df['Date'], format='%H.%M.%S')
d2 = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df_2 = pd.DataFrame(data=d2)
pd.to_datetime(df_2['Date'], format='%H.%M.%S')
CodePudding user response:
This works for me: Note, Assumption is that len(df)>len(df_2)
a={"Date": [
"29-09-2020 13:22:57",
"29-09-2020 13:23:12",
"29-09-2020 13:23:44",
"29-09-2020 13:24:01",
"29-09-2020 13:24:19",
"29-09-2020 13:24:35",
"29-09-2020 13:25:07",
"29-09-2020 13:25:23",
"29-09-2020 13:26:00",
"29-09-2020 13:26:13",
"29-09-2020 13:26:54",
]}
b={"Date":[
"29-09-2020 13:22:57",
"29-09-2020 13:23:13",
"29-09-2020 13:23:44",
"29-09-2020 13:24:02",
"29-09-2020 13:24:19",
"29-09-2020 13:24:35",
"29-09-2020 13:25:23",
"29-09-2020 13:26:13",
"29-09-2020 13:26:38",
]
}
df=pd.DataFrame(a)
df["Date"]=pd.to_datetime(df["Date"])
df_2=pd.DataFrame(b)
df_2["Date"]=pd.to_datetime(df_2["Date"])
def add_zero(dataframe,index,increment):
dataframe.loc[index increment]=0
dataframe = dataframe.sort_index().reset_index(drop=True)
return dataframe
flag=True
idx=0
while flag==True:
if idx >= len(df_2["Date"]):
df_2=add_zero(df_2,idx,0.5)
break
if idx >= len(df["Date"]):
df=add_zero(df,idx,0.5)
break
print(idx)
print(df['Date'][idx])
print(df_2['Date'][idx])
diff=datetime.timedelta.total_seconds(df['Date'][idx] - df_2['Date'][idx])
print(f"Diff: {diff}")
if diff > 2:
df=add_zero(df,idx,-0.5)
print("greater")
elif diff < -2:
df_2=add_zero(df_2,idx,-0.5)
print("smaller")
else:
print("Acceptable")
idx=idx 1
if idx>=max(len(df_2),len(df)):
flag=False
Output
Date Date2
0 2020-09-29 13:22:57 2020-09-29 13:22:57
1 2020-09-29 13:23:12 2020-09-29 13:23:13
2 2020-09-29 13:23:44 2020-09-29 13:23:44
3 2020-09-29 13:24:01 2020-09-29 13:24:02
4 2020-09-29 13:24:19 2020-09-29 13:24:19
5 2020-09-29 13:24:35 2020-09-29 13:24:35
6 2020-09-29 13:25:07 0
7 2020-09-29 13:25:23 2020-09-29 13:25:23
8 2020-09-29 13:26:00 0
9 2020-09-29 13:26:13 2020-09-29 13:26:13
10 0 2020-09-29 13:26:38
11 2020-09-29 13:26:54 0
CodePudding user response:
This should work for every dataframe:
df.loc[df.shape[0]] = [None for _ in range(len(df.columns))]