I have a dataframe and a dictionary in Python as shown below and I need to filter the dataframe based on the dictionary. As you see, the keys and values of the dictionary are two columns of the dataframe. I want to have a subset of dataframe which contains the keys and values of dictionary plus other columns.
df :
Customer_ID | Category | Type | Delivery |
---|---|---|---|
40275 | Book | Buy | True |
40275 | Software | Sell | False |
40275 | Video Game | Sell | False |
40275 | Cell Phone | Sell | False |
39900 | CD/DVD | Sell | True |
39900 | Book | Buy | True |
39900 | Software | Sell | True |
35886 | Cell Phone | Sell | False |
35886 | Video Game | Buy | False |
35886 | CD/DVD | Sell | False |
35886 | Software | Sell | False |
40350 | Software | Sell | True |
28129 | Software | Buy | False |
And dictionary is:
d = {
40275: ['Book','Software'],
39900: ['Book'],
35886: ['Software'],
40350: ['Software'],
28129: ['Software']
}
And I need the following dataframe:
Customer_ID | Category | Type | Delivery |
---|---|---|---|
40275 | Book | Buy | True |
40275 | Software | Sell | False |
39900 | Book | Buy | True |
35886 | Software | Sell | False |
40350 | Software | Sell | True |
28129 | Software | Buy | False |
CodePudding user response:
Flatten the dictionary and create a new dataframe, then inner merge
df
with the new dataframe
df.merge(pd.DataFrame([{'Customer_ID': k, 'Category': i}
for k, v in d.items() for i in v]))
Customer_ID Category Type Delivery
0 40275 Book Buy True
1 40275 Software Sell False
2 39900 Book Buy True
3 35886 Software Sell False
4 40350 Software Sell True
5 28129 Software Buy False
CodePudding user response:
We can set_index
to the Customer_ID
and Category
columns then build a list of tuples from the dictionary d
and reindex
the DataFrame to include only the rows which match the list of tuples, then reset_index
to restore the columns:
new_df = df.set_index(['Customer_ID', 'Category']).reindex(
[(k, v) for k, lst in d.items() for v in lst]
).reset_index()
new_df
:
Customer_ID Category Type Delivery
0 40275 Book Buy True
1 40275 Software Sell False
2 39900 Book Buy True
3 35886 Software Sell False
4 40350 Software Sell True
5 28129 Software Buy False
*Note this only works if the MultiIndex is unique (like the shown example). It will also add rows if the dictionary does not represent a subset of the DataFrame's MultiIndex (which may or may not be the desired behaviour).
Setup:
import pandas as pd
d = {
40275: ['Book', 'Software'],
39900: ['Book'],
35886: ['Software'],
40350: ['Software'],
28129: ['Software']
}
df = pd.DataFrame({
'Customer_ID': [40275, 40275, 40275, 40275, 39900, 39900, 39900, 35886,
35886, 35886, 35886, 40350, 28129],
'Category': ['Book', 'Software', 'Video Game', 'Cell Phone', 'CD/DVD',
'Book', 'Software', 'Cell Phone', 'Video Game', 'CD/DVD',
'Software', 'Software', 'Software'],
'Type': ['Buy', 'Sell', 'Sell', 'Sell', 'Sell', 'Buy', 'Sell', 'Sell',
'Buy', 'Sell', 'Sell', 'Sell', 'Buy'],
'Delivery': [True, False, False, False, True, True, True, False, False,
False, False, True, False]
})
CodePudding user response:
You can use df.merge
with df.append
:
In [444]: df1 = pd.DataFrame.from_dict(d, orient='index', columns=['Cat1', 'Cat2']).reset_index()
In [449]: res = df.merge(df1[['index', 'Cat1']], left_on=['Customer_ID', 'Category'], right_on=['index', 'Cat1']).drop(['index', 'Cat1'], 1)
In [462]: res = res.append(df.merge(df1[['index', 'Cat2']], left_on=['Customer_ID', 'Category'], right_on=['index', 'Cat2']).drop(['index', 'Cat2'], 1)).sort_values('Customer_ID', ascending=False)
In [463]: res
Out[463]:
Customer_ID Category Type Delivery
3 40350 Software Sell True
0 40275 Book Buy True
0 40275 Software Sell False
1 39900 Book Buy True
2 35886 Software Sell False
4 28129 Software Buy False