Delete rows of a Pandas DataFrame that match a string and maintain the index

I have a dataframe named last_modified_tracking_df with the columns Source file, Last_modified_date and Target_folder.

I also have a dictionary, last_modified_file_date_dict, whose keys are file names. I want to remove the rows from last_modified_tracking_df whose files don't exist in last_modified_file_date_dict.

I have written the code below for that:

    for log_path in last_modified_tracking_df['Source file']:
        if log_path not in last_modified_file_date_dict.keys():
            last_modified_tracking_df = last_modified_tracking_df[last_modified_tracking_df['Source file'].ne(log_path)]
    print(last_modified_tracking_df)

This code removes the required rows, but the output is:

0   C:\test\file1...  2021-07-05T10:38:58   target1
1   C:\test\file2...  2021-10-20T09:18:43   target2
3   C:\test\file4...  2021-10-21T04:54:37   target4
4   C:\test\file5...  2021-10-18T13:28:16   target5

In the output above, the index skips the number of the deleted row. My questions are: is there a more efficient approach to do this, and how can I get consecutive indexes even after removing rows? Please help!

CodePudding user response：

DataFrame.reset_index(drop=True) gives you consecutive indexes again. If you don't want to reassign the dataframe to a new variable, also pass inplace=True:

last_modified_tracking_df.reset_index(drop=True, inplace=True)

Otherwise, if you want to assign the result to a new variable, go for

new_df = last_modified_tracking_df.reset_index(drop=True)

This discards the old index and numbers the remaining rows consecutively, starting from 0.
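To see the effect in isolation, here is a minimal sketch. The four-row frame and its "name" column are made up purely for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the index behaviour.
df = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# Boolean filtering keeps the original labels, so a gap appears at 2.
filtered = df[df["name"] != "c"]
print(list(filtered.index))

# drop=True throws the old labels away instead of adding them as a column.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```

Without drop=True, the old labels would be kept as an extra column named "index", which is usually not what you want here.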

As for checking the rows' values, you could iterate through the rows (DataFrame.iterrows()) and drop just the single row by its index label, rather than building a new dataframe without that row on every pass:

for index, row in last_modified_tracking_df.iterrows():
    if row["Source file"] not in last_modified_file_date_dict.keys():
        # drop on the dataframe being iterated (the name df was never defined)
        last_modified_tracking_df.drop(index, inplace=True)
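If changing the dataframe while looping over it feels fragile, one possible variation is to collect the index labels first and drop them in a single call. This is only a sketch: the three sample rows and the single dictionary entry below are assumptions made up for the demonstration.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
last_modified_tracking_df = pd.DataFrame({
    "Source file": [r"C:\test\file1", r"C:\test\file2", r"C:\test\file3"],
    "Target_folder": ["target1", "target2", "target3"],
})
last_modified_file_date_dict = {r"C:\test\file1": "2021-07-05T10:38:58"}

# Gather the labels of rows whose file is missing from the dict...
labels_to_drop = [
    index
    for index, row in last_modified_tracking_df.iterrows()
    if row["Source file"] not in last_modified_file_date_dict
]

# ...then drop them in one call and renumber the remaining rows.
last_modified_tracking_df = (
    last_modified_tracking_df.drop(index=labels_to_drop).reset_index(drop=True)
)
print(last_modified_tracking_df)
```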

CodePudding user response：

Alternatively, you could get rid of the loop entirely and use isin instead of ne.

So assuming you have

>>> last_modified_tracking_df
     source_file   last_modified_date target_folder
0  C:\test\file1  2021-07-05T10:38:58       target1
1  C:\test\file2  2021-10-20T09:18:43       target2
2  C:\test\file3  2021-10-18T13:28:16       target3
3  C:\test\file4  2021-10-21T04:54:37       target4
4  C:\test\file5  2021-10-18T13:28:16       target5
5  C:\test\file6  2021-10-18T13:28:16       target6

And

>>> last_modified_file_date_dict
{'C:\\test\\file1': '2021-07-05T10:38:58',
 'C:\\test\\file3': '2021-10-21T04:54:37'}

Just call

last_modified_tracking_df = last_modified_tracking_df[
    last_modified_tracking_df["source_file"].isin(last_modified_file_date_dict.keys())
].reset_index(drop=True)

Result

>>> last_modified_tracking_df
     source_file   last_modified_date target_folder
0  C:\test\file1  2021-07-05T10:38:58       target1
1  C:\test\file3  2021-10-18T13:28:16       target3
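For anyone who wants to run this end to end, the snippet below rebuilds the sample frame and dictionary shown above and applies the same isin filter. The keep_mask and removed_files names are additions for illustration; removed_files is an optional extra step that lists what was filtered out, and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the frames printed above.
last_modified_tracking_df = pd.DataFrame({
    "source_file": [r"C:\test\file1", r"C:\test\file2", r"C:\test\file3",
                    r"C:\test\file4", r"C:\test\file5", r"C:\test\file6"],
    "last_modified_date": ["2021-07-05T10:38:58", "2021-10-20T09:18:43",
                           "2021-10-18T13:28:16", "2021-10-21T04:54:37",
                           "2021-10-18T13:28:16", "2021-10-18T13:28:16"],
    "target_folder": ["target1", "target2", "target3",
                      "target4", "target5", "target6"],
})
last_modified_file_date_dict = {
    r"C:\test\file1": "2021-07-05T10:38:58",
    r"C:\test\file3": "2021-10-21T04:54:37",
}

# True for rows whose source_file is one of the dictionary keys.
keep_mask = last_modified_tracking_df["source_file"].isin(
    list(last_modified_file_date_dict)
)

# Optional: remember which files are being removed (e.g. for logging).
removed_files = last_modified_tracking_df.loc[~keep_mask, "source_file"].tolist()

# Keep the matching rows and renumber them from 0.
last_modified_tracking_df = last_modified_tracking_df[keep_mask].reset_index(drop=True)
print(removed_files)
print(last_modified_tracking_df)
```

Because the mask is computed once for the whole column, the dataframe is filtered in a single pass instead of being rebuilt for every missing file.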