I want to optimize a "vlookup"-style process in Python that works but is not scalable in its current form. I have tried pandas' pivot_table and pivot, but they've been limited by the alphanumeric and string values in the cells. I have two tables:
table1:
ProductID | Sales |
---|---|
123456 | 34 |
abc123 | 34 |
123def | 34 |
a1234f | 34 |
1abcd6 | 34 |
table2:
Brand | Site1 | Site2 | Site3 |
---|---|---|---|
Brand1 | 123456 | N/A | N/A |
Brand2 | N/A | abc123 | N/A |
Brand1 | N/A | N/A | 123def |
Brand2 | N/A | 1abcd6 | N/A |
Brand1 | a1234f | N/A | N/A |
What I originally wanted to see was sales by brand:
Brand | Sales |
---|---|
Brand1 | 102 |
Brand2 | 68 |
Here's the code I've basically built out in Python and pandas:
import pandas as pd
# read sales and product tables into pandas
sales_df = pd.read_csv(table1)
product_df = pd.read_csv(table2)
# isolate each product ID column into a separate df
product_site1_df = product_df.drop(['Site2', 'Site3'], axis=1)
product_site2_df = product_df.drop(['Site1', 'Site3'], axis=1)
product_site3_df = product_df.drop(['Site1', 'Site2'], axis=1)
# rename each site column to ProductID and append them into a single column
product_site1_df = product_site1_df.rename(columns={"Site1": "ProductID"})
product_site2_df = product_site2_df.rename(columns={"Site2": "ProductID"})
product_site3_df = product_site3_df.rename(columns={"Site3": "ProductID"})
product_list_master_df = pd.concat([product_site1_df, product_site2_df, product_site3_df])
# join sales against the product list, pulling Brand in as a new column on the sales table
inner_join = pd.merge(sales_df,
                      product_list_master_df,
                      on='ProductID',
                      how='inner')
This is obviously very procedural, not scalable, computationally redundant, and a very roundabout way to get what I want. I'm also losing information, for example if I later want to pivot by site rather than by brand. Short of changing the data model itself, what can I do here to improve speed, versatility, and line count?
CodePudding user response:
Assuming the dataframes are named df1 and df2, you can reshape and map to perform the VLOOKUP, then groupby + sum:
(df2.set_index('Brand')
    .stack()                                    # long form: one product ID per (Brand, Site) cell
    .map(df1.set_index('ProductID')['Sales'])   # the "VLOOKUP": product ID -> sales
    .groupby(level='Brand').sum()               # unmatched/N-A cells become NaN and add nothing
)
Output:
Brand
Brand1 102
Brand2 68
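If you also want to keep the site information (say, to pivot by site later), a melt/merge version along the same lines is a minimal sketch, assuming the same df1/df2 names and that the N/A cells were read in as NaN (pandas' default na_values covers 'N/A'):
# reshape to long form: one row per (Brand, Site, ProductID), dropping empty cells
long_df = (df2.melt(id_vars='Brand', var_name='Site', value_name='ProductID')
              .dropna(subset=['ProductID']))
# the "VLOOKUP" as a merge, keeping both Brand and Site
merged = long_df.merge(df1, on='ProductID', how='inner')
merged.groupby('Brand')['Sales'].sum()             # sales by brand
merged.groupby(['Brand', 'Site'])['Sales'].sum()   # sales by brand and site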
CodePudding user response:
Here's how you can do it without Pandas, using just Python's standard csv module and a Counter for the sales-by-brand totals:
import csv
from collections import Counter
# Create a product/sales lookup
sales_by_product = {}
with open('sales.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        p_id, sales = row
        sales_by_product[p_id] = int(sales)
# Accumulate sales per brand
sales_by_brand_counter = Counter()
with open('products.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        brand_id = row[0]
        for p_id in row[1:]:
            sales = sales_by_product.get(p_id, 0)
            sales_by_brand_counter[brand_id] += sales
# Write out the sales-by-brand totals
with open('sales_by_brand.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Brand', 'Sales'])
    rows = [[brand, total] for (brand, total) in sales_by_brand_counter.items()]
    writer.writerows(rows)
When I run that with sales.csv:
ProductID,Sales
123456,34
abc123,34
123def,34
a1234f,34
1abcd6,34
and products.csv:
Brand,Site1,Site2,Site3
Brand1,123456,N/A,N/A
Brand2,N/A,abc123,N/A
Brand1,N/A,N/A,123def
Brand2,N/A,1abcd6,N/A
Brand1,a1234f,N/A,N/A
I get sales_by_brand.csv:
Brand,Sales
Brand1,102
Brand2,68
The work that really matters, finding product IDs and summing sales, is handled here:
for row in reader:
    brand_id = row[0]
    for p_id in row[1:]:
        sales = sales_by_product.get(p_id, 0)
        sales_by_brand_counter[brand_id] += sales
It can read through as many Site columns as there are. If the site contains 'N/A' or a product ID that isn't in the lookup dict, it just adds 0 to that brand.
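If the number or order of Site columns might change, a DictReader variant (a sketch under the same assumptions about file names, treating every non-Brand column as a site) avoids relying on column positions:
import csv
from collections import Counter

with open('sales.csv', newline='') as f:
    # product ID -> sales lookup, keyed by column name instead of position
    sales_by_product = {row['ProductID']: int(row['Sales']) for row in csv.DictReader(f)}

sales_by_brand = Counter()
with open('products.csv', newline='') as f:
    for row in csv.DictReader(f):
        for col, p_id in row.items():
            if col != 'Brand':  # every other column is assumed to hold a product ID (or N/A)
                sales_by_brand[row['Brand']] += sales_by_product.get(p_id, 0)

print(dict(sales_by_brand))  # {'Brand1': 102, 'Brand2': 68}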