I have two large CSV files with sample data as follows:
df1 =
Index Fruit Vegetable
0 Mango Spinach
1 Berry Carrot
2 Banana Cabbage
df2 =
Index Unit Price
0 Mango 30
1 Artichoke 45
2 Banana 12
3 Berry 10
4 Cabbage 25
5 Rice 40
6 Spinach 34
7 Carrot 08
8 Lentil 12
9 Pot 32
I would like to create the following dataframe:
df3 =
Index Fruit Price Vegetable Price
0 Mango 30 Spinach 34
1 Berry 10 Carrot 08
2 Banana 12 Cabbage 25
I am looking to compare the price of each unit row-wise in df1. If the prices are within $5, I would like to output them in a separate dataframe as follows:
df4 =
Index Fruit Price Vegetable Price
0 Mango 30 Spinach 34
1 Berry 10 Carrot 08
What would be a generic way to achieve that? Thank you in advance.
CodePudding user response:
You can use replace
to create a price dataframe based on df2
, then join
to concatenate with the original data.
Note that duplicate column names are discouraged:
# print to see what it does
item_prices = dict(zip(df2.Unit, df2.Price))
out = df1.join(df1.replace(item_prices).add_suffix('_Price')).sort_index(axis=1)
Output:
Fruit Fruit_Price Vegetable Vegetable_Price
Index
0 Mango 30 Spinach 34
1 Berry 10 Carrot 8
2 Banana 12 Cabbage 25
For the next question, you need a boolean loc access:
out[abs(out['Fruit_Price'] - out['Vegetable_Price']) < 5]
or a query
:
out.query('abs(Fruit_Price-Vegetable_Price)<5')
Output:
Fruit Fruit_Price Vegetable Vegetable_Price
Index
0 Mango 30 Spinach 34
1 Berry 10 Carrot 8
CodePudding user response:
You can use a double merge:
fruit = df1[['Fruit']].merge(df2.rename(columns={'Unit': 'Fruit'}), on='Fruit')
veggie = df1[['Vegetable']].merge(df2.rename(columns={'Unit': 'Vegetable'}), on='Vegetable')
df3 = pd.concat([fruit, veggie], axis=1)
print(df3)
# Output:
Fruit Price Vegetable Price
0 Mango 30 Spinach 34
1 Berry 10 Carrot 8
2 Banana 12 Cabbage 25
Then
df4 = df3[np.abs(np.subtract(*out['Price'].values.T)) <= 5]
print(df4)
# Output:
Fruit Price Vegetable Price
0 Mango 30 Spinach 34
1 Berry 10 Carrot 8
CodePudding user response:
One generic alternative (can handle an arbitrary number of categories) is to reshape before (using melt
) and after (using pivot
). This has the advantage to create a MultiIndex that is quite convenient to explicitly identify the Prices categories:
out = (df1.melt(id_vars='Index', value_name='Unit')
.merge(df2.drop(columns='Index'), on='Unit')
.pivot(index='Index', columns='variable', values=['Unit', 'Price'])
)
output:
Unit Price
variable Fruit Vegetable Fruit Vegetable
Index
0 Mango Spinach 30 34
1 Berry Carrot 10 8
2 Banana Cabbage 12 25
subsetting the rows with diff ≤ 5:
out[out['Price'].diff(axis=1).abs().le(5).any(1)]
output:
Unit Price
variable Fruit Vegetable Fruit Vegetable
Index
0 Mango Spinach 30 34
1 Berry Carrot 10 8