Concatenate two csv files based on column content in Pandas-CodePudding

I have two large CSV files with sample data as follows:

df1 = 
Index    Fruit   Vegetable    
    0    Mango   Spinach
    1    Berry   Carrot
    2    Banana  Cabbage

df2 = 
Index   Unit        Price
   0    Mango       30
   1    Artichoke   45
   2    Banana      12
   3    Berry       10
   4    Cabbage     25
   5    Rice        40
   6    Spinach     34
   7    Carrot      08
   8    Lentil      12
   9    Pot         32

I would like to create the following dataframe:

df3 = 
Index    Fruit   Price      Vegetable    Price   
    0    Mango   30         Spinach      34
    1    Berry   10         Carrot       08   
    2    Banana  12         Cabbage      25

I am looking to compare the price of each unit row-wise in df1. If the prices are within $5, I would like to output them in a separate dataframe as follows:

df4 = 
Index    Fruit   Price      Vegetable    Price   
    0    Mango   30         Spinach      34
    1    Berry   10         Carrot       08

What would be a generic way to achieve that? Thank you in advance.

CodePudding user response：

You can use replace to create a price dataframe based on df2, then join to concatenate with the original data.

Note that duplicate column names are discouraged:

# print to see what it does
item_prices = dict(zip(df2.Unit, df2.Price))

out = df1.join(df1.replace(item_prices).add_suffix('_Price')).sort_index(axis=1)

Output:

        Fruit  Fruit_Price Vegetable  Vegetable_Price
Index                                                
0       Mango           30   Spinach               34
1       Berry           10    Carrot                8
2      Banana           12   Cabbage               25

For the next question, you need a boolean loc access:

out[abs(out['Fruit_Price'] - out['Vegetable_Price']) < 5]

or a query:

out.query('abs(Fruit_Price-Vegetable_Price)<5')

Output:

       Fruit  Fruit_Price Vegetable  Vegetable_Price
Index                                               
0      Mango           30   Spinach               34
1      Berry           10    Carrot                8

CodePudding user response：

You can use a double merge:

fruit = df1[['Fruit']].merge(df2.rename(columns={'Unit': 'Fruit'}), on='Fruit')
veggie = df1[['Vegetable']].merge(df2.rename(columns={'Unit': 'Vegetable'}), on='Vegetable')

df3 = pd.concat([fruit, veggie], axis=1)
print(df3)

# Output:
    Fruit  Price Vegetable  Price
0   Mango     30   Spinach     34
1   Berry     10    Carrot      8
2  Banana     12   Cabbage     25

Then

df4 = df3[np.abs(np.subtract(*out['Price'].values.T)) <= 5]
print(df4)

# Output:
   Fruit  Price Vegetable  Price
0  Mango     30   Spinach     34
1  Berry     10    Carrot      8

CodePudding user response：

One generic alternative (can handle an arbitrary number of categories) is to reshape before (using melt) and after (using pivot). This has the advantage to create a MultiIndex that is quite convenient to explicitly identify the Prices categories:

out = (df1.melt(id_vars='Index', value_name='Unit')
          .merge(df2.drop(columns='Index'), on='Unit')
          .pivot(index='Index', columns='variable', values=['Unit', 'Price'])
       )

output:

            Unit           Price          
variable   Fruit Vegetable Fruit Vegetable
Index                                     
0          Mango   Spinach    30        34
1          Berry    Carrot    10         8
2         Banana   Cabbage    12        25

subsetting the rows with diff ≤ 5:

out[out['Price'].diff(axis=1).abs().le(5).any(1)]

output:

           Unit           Price          
variable  Fruit Vegetable Fruit Vegetable
Index                                    
0         Mango   Spinach    30        34
1         Berry    Carrot    10         8