How can i insert a blank row in pandas and properly increment the index multiple times-CodePudding

I have 2 pandas dataframes, both of them having the same columns but different line numbers depending on missing rows, one of the columns is the Date having the following format 29/09/2020 13.22.57 the day month year omitted sometimes below for simplicity and irrelevance The Dates might be an exact match in df as df_2 or there might be an acceptable delay of a threshold that we preset, in this case it's 2s.

Sample data of df['Date']:

13.22.57
13.23.12
13.23.44
13.24.01
13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
13.26.54

Sample data of df_2['Date']:

In the data sample shared above the first 6 values are a match since the absolute difference of date is less than 2 seconds.

In the 7th line however we have 13.25.07 and 13.25.23 , so the expected in this case is to add an empty line above the line of 13.25.23 since that date is bigger than the other:

So now we have 13.25.07 and 0 having the same index, meaning that if we increment the index for both 13.25.23 will now match for both on the same index 8.

If we continue the same process we will be left with:

df['Date']:

13.22.57    
13.23.13    
13.23.44    
13.24.02    
13.24.19    
13.24.35
0                       
13.25.23
0                       
13.26.13    
13.26.38



df_2['Date']:

13.22.57
13.23.12
13.23.44
13.24.01
13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
0
13.26.54

the increment can happen for either df or df_2 depends which has a bigger time for the missing column and at the end both should have the same number of rows as the rows that are not matching will have a 0 value now and an increment will happen to the one under.

My Code:

for idx, val in enumerate(df['Date']):
    print(idx)
    print(val)
    print(df_2['Date'][idx 1])
    print(abs(datetime.timedelta.total_seconds(val - df_2['Date'][idx 1])))
    if abs(datetime.timedelta.total_seconds(val - df_2['Date'][idx 1])) > 2:

        if val < df_2['Date'][idx 1]:
            df_2.loc[idx 1.5] = 0
            df_2 = df_2.sort_index().reset_index(drop=True)
            print('smaller')

        else:
            df.loc[idx   0.5] = 0
            df = df.sort_index().reset_index(drop=True)
            print(df.loc[idx])
            print('bigger')

    else:
        print('pass for index '   str(idx))

The code was working fine and detected the first two 0 lines, but then on the second kept printing 0s:

2020-09-29 13.22.57
2020-09-29 13.23.13
2020-09-29 13.23.44
2020-09-29 13.24.02
2020-09-29 13.24.19
2020-09-29 13.24.35
0
2020-09-29 13.25.23
0
0
0
0
0
...

Dataframes:

d = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data=d)
pd.to_datetime(df['Date'], format='%H.%M.%S')

d2 = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df_2 = pd.DataFrame(data=d2)
pd.to_datetime(df_2['Date'], format='%H.%M.%S')

CodePudding user response：

This works for me: Note, Assumption is that len(df)>len(df_2)

a={"Date": [
"29-09-2020 13:22:57",
"29-09-2020 13:23:12",
"29-09-2020 13:23:44",
"29-09-2020 13:24:01",
"29-09-2020 13:24:19",
"29-09-2020 13:24:35",
"29-09-2020 13:25:07",
"29-09-2020 13:25:23",
"29-09-2020 13:26:00",
"29-09-2020 13:26:13",
"29-09-2020 13:26:54",
]}
b={"Date":[
    "29-09-2020 13:22:57",    
    "29-09-2020 13:23:13",    
    "29-09-2020 13:23:44",    
    "29-09-2020 13:24:02",    
    "29-09-2020 13:24:19",    
    "29-09-2020 13:24:35",                        
    "29-09-2020 13:25:23",                        
    "29-09-2020 13:26:13",    
    "29-09-2020 13:26:38",
]
}
df=pd.DataFrame(a)
df["Date"]=pd.to_datetime(df["Date"])
df_2=pd.DataFrame(b)
df_2["Date"]=pd.to_datetime(df_2["Date"])


def add_zero(dataframe,index,increment):
    dataframe.loc[index increment]=0
    dataframe = dataframe.sort_index().reset_index(drop=True)
    return dataframe

flag=True
idx=0
while flag==True:
    if idx >= len(df_2["Date"]):
        df_2=add_zero(df_2,idx,0.5)
        break
    if idx >= len(df["Date"]):
        df=add_zero(df,idx,0.5)
        break
    print(idx)
    print(df['Date'][idx])
    print(df_2['Date'][idx])
    diff=datetime.timedelta.total_seconds(df['Date'][idx] - df_2['Date'][idx])
    print(f"Diff: {diff}")
    if diff > 2:
        df=add_zero(df,idx,-0.5)
        print("greater")
    elif diff < -2:
        df_2=add_zero(df_2,idx,-0.5)
        print("smaller")
    else:
        print("Acceptable")

    idx=idx 1

    if idx>=max(len(df_2),len(df)):
        flag=False

Output

    Date                Date2
0   2020-09-29 13:22:57 2020-09-29 13:22:57
1   2020-09-29 13:23:12 2020-09-29 13:23:13
2   2020-09-29 13:23:44 2020-09-29 13:23:44
3   2020-09-29 13:24:01 2020-09-29 13:24:02
4   2020-09-29 13:24:19 2020-09-29 13:24:19
5   2020-09-29 13:24:35 2020-09-29 13:24:35
6   2020-09-29 13:25:07 0
7   2020-09-29 13:25:23 2020-09-29 13:25:23
8   2020-09-29 13:26:00 0
9   2020-09-29 13:26:13 2020-09-29 13:26:13
10  0                   2020-09-29 13:26:38
11  2020-09-29 13:26:54 0

CodePudding user response：

This should work for every dataframe:

df.loc[df.shape[0]] = [None for _ in range(len(df.columns))]