I have a dataframe named last_modified_tracking_df with the columns Source file, Last_modified_date and Target_folder.
I also have a dictionary, last_modified_file_date_dict, whose keys are file names. I want to remove the rows of last_modified_tracking_df whose files don't exist in last_modified_file_date_dict.
I have written the code below for that:
for log_path in last_modified_tracking_df['Source file']:
    if log_path not in last_modified_file_date_dict.keys():
        last_modified_tracking_df = last_modified_tracking_df[last_modified_tracking_df['Source file'].ne(log_path)]
print(last_modified_tracking_df)
This code removes the required rows, but the output looks like this:
0 C:\test\file1... 2021-07-05T10:38:58 target1
1 C:\test\file2... 2021-10-20T09:18:43 target2
3 C:\test\file4... 2021-10-21T04:54:37 target4
4 C:\test\file5... 2021-10-18T13:28:16 target5
In the output above, the index value of the deleted row is skipped. My questions are: is there a more efficient approach to do this, and how can I get consecutive index values after removing the rows? Please help!
CodePudding user response:
DataFrame.reset_index(drop=True) works. If you don't want to reassign the dataframe to a new variable, you'd also pass inplace=True:
last_modified_tracking_df.reset_index(drop=True, inplace=True)
Otherwise, if you want to reassign, you'd go for:
new_df = last_modified_tracking_df.reset_index(drop=True)
This discards the old index and numbers the rows consecutively, starting from 0.
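To make the effect of reset_index concrete, here is a minimal sketch on a made-up three-row frame. The frame name df and its values are illustrative assumptions, not the asker's data:

```python
import pandas as pd

# Hypothetical sample frame, only for illustration.
df = pd.DataFrame(
    {"Source file": ["a", "b", "c"], "Target_folder": ["t1", "t2", "t3"]}
)

# Filtering keeps the original row labels, so the index now has a gap.
filtered = df[df["Source file"].ne("b")]
labels_before = list(filtered.index)

# drop=True throws the old labels away instead of adding them as a column.
renumbered = filtered.reset_index(drop=True)
labels_after = list(renumbered.index)

print(labels_before, labels_after)
```

Without drop=True, the old labels would be kept as an extra column named index.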
As for checking the rows' values, you could iterate through the rows with DataFrame.iterrows() and drop just the single row, rather than creating a new dataframe without said row:
for index, row in last_modified_tracking_df.iterrows():
    if row["Source file"] not in last_modified_file_date_dict.keys():
        # drop the row by its index label, modifying the dataframe in place
        last_modified_tracking_df.drop(index, inplace=True)
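The pandas documentation advises against modifying something you are iterating over, so a more cautious variant of the loop above collects the labels first and drops them in one call. This is only a sketch: the names tracking and known and the sample values are assumptions standing in for the asker's dataframe and dictionary.

```python
import pandas as pd

# Hypothetical stand-ins for last_modified_tracking_df and
# last_modified_file_date_dict.
tracking = pd.DataFrame(
    {
        "Source file": ["file1", "file2", "file3"],
        "Target_folder": ["target1", "target2", "target3"],
    }
)
known = {"file1": "2021-07-05T10:38:58", "file3": "2021-10-21T04:54:37"}

# Collect the index labels of rows whose file is not a key of the dict.
to_drop = [
    label
    for label, path in tracking["Source file"].items()
    if path not in known
]

# Drop all of them at once, then renumber the remaining rows from 0.
tracking = tracking.drop(index=to_drop).reset_index(drop=True)
print(tracking)
```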
CodePudding user response:
Alternatively, you could get rid of the loop entirely and use isin instead of ne.
So assuming you have
>>> last_modified_tracking_df
source_file last_modified_date target_folder
0 C:\test\file1 2021-07-05T10:38:58 target1
1 C:\test\file2 2021-10-20T09:18:43 target2
2 C:\test\file3 2021-10-18T13:28:16 target3
3 C:\test\file4 2021-10-21T04:54:37 target4
4 C:\test\file5 2021-10-18T13:28:16 target5
5 C:\test\file6 2021-10-18T13:28:16 target6
And
>>> last_modified_file_date_dict
{'C:\\test\\file1': '2021-07-05T10:38:58',
'C:\\test\\file3': '2021-10-21T04:54:37'}
Just call
last_modified_tracking_df = last_modified_tracking_df[
    last_modified_tracking_df["source_file"].isin(last_modified_file_date_dict.keys())
].reset_index(drop=True)
Result
>>> last_modified_tracking_df
source_file last_modified_date target_folder
0 C:\test\file1 2021-07-05T10:38:58 target1
1 C:\test\file3 2021-10-18T13:28:16 target3
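For anyone who wants to try this end to end, the following sketch rebuilds the sample data shown above and applies the same isin filter. The short names df, known, mask, kept and removed are assumptions introduced here for readability, and the keys are passed as a list, which is one of the list-like inputs isin accepts.

```python
import pandas as pd

# Sample data copied from the example above; raw strings keep the backslashes.
df = pd.DataFrame(
    {
        "source_file": [rf"C:\test\file{i}" for i in range(1, 7)],
        "last_modified_date": [
            "2021-07-05T10:38:58",
            "2021-10-20T09:18:43",
            "2021-10-18T13:28:16",
            "2021-10-21T04:54:37",
            "2021-10-18T13:28:16",
            "2021-10-18T13:28:16",
        ],
        "target_folder": [f"target{i}" for i in range(1, 7)],
    }
)
known = {
    r"C:\test\file1": "2021-07-05T10:38:58",
    r"C:\test\file3": "2021-10-21T04:54:37",
}

# True for rows whose source_file is one of the dictionary keys.
mask = df["source_file"].isin(list(known))

kept = df[mask].reset_index(drop=True)      # rows to keep, renumbered from 0
removed = df[~mask].reset_index(drop=True)  # rows that were filtered out

print(kept)
```

Keeping the mask in its own variable also makes it easy to inspect the removed rows with ~mask before discarding them.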