I have a large Pandas DataFrame with 298k rows. One column, ID, is unique for each row.
I want to perform two operations:
Filter the dataframe rows so that each row's ID value is contained in a Python list, called my_list, which holds 13k elements. I tried using query() and it works fine:
df = df.query(f'ID in {my_list}')
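(As a side note, an equivalent filter that avoids interpolating the 13k-element list into a query string is a boolean mask with isin, which is usually faster for large lists:)

df = df[df['ID'].isin(my_list)]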
Now I want to rearrange the dataframe so that the order of the IDs matches the order in my_list, which may contain duplicates.
If an ID appears more than once in the list, I want the corresponding row duplicated in the dataframe as well, keeping the order.
I tried using categories to order the dataframe, but that does not work with duplicates, and I have no idea how to add a duplicate row when the list contains one:
df['ID'] = pd.Categorical(df['ID'], categories=my_list, ordered=True)
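For reference, a minimal sketch (using the numerical example below) of where this approach breaks down: pd.Categorical requires its categories to be unique, so a list with duplicates is rejected before any sorting can happen:

import pandas as pd

df = pd.DataFrame({"ID": range(7),
                   "Name": ["John", "Paul", "Rick", "Jean", "Mark", "Neil", "Martin"]})
my_list = [1, 2, 3, 1, 4, 2]
try:
    df["ID"] = pd.Categorical(df["ID"], categories=my_list, ordered=True)
except ValueError as exc:
    print(exc)  # raises: Categorical categories must be unique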
Any efficient solutions? Thanks
Numerical example:
I have a list with these IDs: [1, 2, 3, 1, 4, 2].
And I have a dataframe with these rows:
ID | Name | ... |
---|---|---|
0 | John | ... |
1 | Paul | ... |
2 | Rick | ... |
3 | Jean | ... |
4 | Mark | ... |
5 | Neil | ... |
6 | Martin | ... |
As output, I want this dataframe, which follows the list order and includes the duplicated rows:
ID | Name | ... |
---|---|---|
1 | Paul | ... |
2 | Rick | ... |
3 | Jean | ... |
1 | Paul | ... |
4 | Mark | ... |
2 | Rick | ... |
CodePudding user response:
If the original index is not critical, you can reindex:
df.set_index('ID').reindex(my_list).reset_index()
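Applied to the numerical example in the question, a sketch of the result (this works because ID is unique in df; any ID in my_list that is missing from df would come back as a NaN row):

my_list = [1, 2, 3, 1, 4, 2]
out = df.set_index('ID').reindex(my_list).reset_index()
#    ID  Name
# 0   1  Paul
# 1   2  Rick
# 2   3  Jean
# 3   1  Paul
# 4   4  Mark
# 5   2  Rick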
CodePudding user response:
This should work:
import pandas as pd

id_list = [43849175, 13239461, 14461786, 14467946, 14461786]
df_concat = pd.DataFrame()
for row_id in id_list:
    # Look up the row for this ID and append it, so the output
    # follows the list order and repeats rows for duplicate IDs.
    row = df[df["ID"] == row_id]
    df_concat = pd.concat([df_concat, row], ignore_index=True)
Maybe a more efficient way to do it:
# Build a one-column frame in the desired ID order, then left-merge:
df_result = pd.DataFrame({"ID": id_list})
df_result = df_result.merge(df, on="ID", how="left")
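Applied to the question's numerical example, a sketch of what this produces. Because ID is unique in df, the left merge cannot multiply rows beyond the duplicates already present in id_list, and it keeps the list order:

id_list = [1, 2, 3, 1, 4, 2]
df_result = pd.DataFrame({"ID": id_list}).merge(df, on="ID", how="left")
#    ID  Name
# 0   1  Paul
# 1   2  Rick
# 2   3  Jean
# 3   1  Paul
# 4   4  Mark
# 5   2  Rick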