Python Pandas VLOOKUP function with categorical and non-numeric values


I want to optimize a "VLOOKUP"-style process in Python that works but isn't scalable in its current form. I have tried Pandas' pivot_table and pivot, but they've been of limited use because the cells contain alphanumeric and string values. I have two tables:

table1:

ProductID Sales
123456 34
abc123 34
123def 34
a1234f 34
1abcd6 34

table2:

Brand Site1 Site2 Site3
Brand1 123456 N/A N/A
Brand2 N/A abc123 N/A
Brand1 N/A N/A 123def
Brand2 N/A 1abcd6 N/A
Brand1 a1234f N/A N/A

What I originally wanted to see was sales by brand:

Brand Sales
Brand1 102
Brand2 68

Here's the pseudocode I've basically built out in Python and Pandas:

import pandas as pd

# read sales and product tables into pandas
sales_df = pd.read_csv(table1)
product_df = pd.read_csv(table2)

# isolate each product id column into a separate df
product_site1_df = product_df.drop(['Site2', 'Site3'], axis=1)
product_site2_df = product_df.drop(['Site1', 'Site3'], axis=1)
product_site3_df = product_df.drop(['Site1', 'Site2'], axis=1)

# rename each site column to ProductID (rename returns a new df, so assign it back)
product_site1_df = product_site1_df.rename(columns={"Site1": "ProductID"})
product_site2_df = product_site2_df.rename(columns={"Site2": "ProductID"})
product_site3_df = product_site3_df.rename(columns={"Site3": "ProductID"})

# append all product ids into a single column
product_list_master_df = pd.concat([product_site1_df, product_site2_df, product_site3_df])

# join sales against the combined product list, pulling Brand in as a new column
inner_join = pd.merge(sales_df,
                      product_list_master_df,
                      on='ProductID',
                      how='inner')

This is obviously very procedural, not scalable, computationally redundant, and a very roundabout way to get what I want. I'm also losing data; for example, I can't easily pivot by site rather than by brand. Short of changing the data model itself, what can I do here to improve speed, versatility, and line count?

CodePudding user response:

Assuming the dataframes are named df1 and df2, you can reshape and map to perform the VLOOKUP, then groupby sum:

(df2.set_index('Brand')
    .stack()
    .map(df1.set_index('ProductID')['Sales'])
    .groupby(level='Brand').sum()
 )

Output:

Brand
Brand1    102
Brand2     68
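Since the question also mentions wanting to pivot by site, an alternative is to melt the wide site columns into a long table that keeps Site as a regular column; the same reshaped table then supports both views. A sketch with the question's data inlined (the io.StringIO wrappers stand in for the real CSV files):

```python
import io
import pandas as pd

# Inline copies of the question's two tables
df1 = pd.read_csv(io.StringIO("""ProductID,Sales
123456,34
abc123,34
123def,34
a1234f,34
1abcd6,34"""))

df2 = pd.read_csv(io.StringIO("""Brand,Site1,Site2,Site3
Brand1,123456,N/A,N/A
Brand2,N/A,abc123,N/A
Brand1,N/A,N/A,123def
Brand2,N/A,1abcd6,N/A
Brand1,a1234f,N/A,N/A"""))

# Reshape the Site1..SiteN columns to long form; read_csv already
# parsed "N/A" as NaN, so dropna removes the empty cells
long_df = (df2.melt(id_vars='Brand', var_name='Site', value_name='ProductID')
              .dropna(subset=['ProductID']))

# VLOOKUP equivalent: left-join sales onto each (Brand, Site, ProductID) row
merged = long_df.merge(df1, on='ProductID', how='left')

print(merged.groupby('Brand')['Sales'].sum())
print(merged.groupby('Site')['Sales'].sum())
```

Because Brand and Site both survive the reshape, swapping the groupby key is all it takes to pivot the other way.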

CodePudding user response:

Here's how you can do it, without Pandas, just using Python's standard CSV lib, and a Counter (for your sales-by-brand):

import csv
from collections import Counter

# Create a product/sales lookup
sales_by_product = {}

with open('sales.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header

    for row in reader:
        p_id, sales = row
        sales_by_product[p_id] = int(sales)


sales_by_brand_counter = Counter()

with open('products.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header

    for row in reader:
        brand_id = row[0]
        for p_id in row[1:]:
            sales = sales_by_product.get(p_id, 0)
            sales_by_brand_counter[brand_id] += sales

with open('sales_by_brand.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Brand', 'Sales'])

    writer.writerows(sales_by_brand_counter.items())

When I run that with sales.csv:

ProductID,Sales
123456,34
abc123,34
123def,34
a1234f,34
1abcd6,34

and products.csv:

Brand,Site1,Site2,Site3
Brand1,123456,N/A,N/A
Brand2,N/A,abc123,N/A
Brand1,N/A,N/A,123def
Brand2,N/A,1abcd6,N/A
Brand1,a1234f,N/A,N/A

I get sales_by_brand.csv:

Brand,Sales
Brand1,102
Brand2,68

The work that really matters, finding product IDs and summing sales, is handled here:

for row in reader:
    brand_id = row[0]
    for p_id in row[1:]:
        sales = sales_by_product.get(p_id, 0)
        sales_by_brand_counter[brand_id] += sales

It can read through as many Site columns as there are. If a site cell contains 'N/A', or a product ID that isn't in the lookup dict, it just adds 0 to that brand.
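That fallback can be seen in isolation, using a hypothetical two-product lookup dict (the IDs here are illustrative):

```python
from collections import Counter

# Hypothetical lookup: only two known product IDs
sales_by_product = {'123456': 34, 'abc123': 34}

counter = Counter()
# 'N/A' and unknown IDs fall back to 0 via dict.get, contributing nothing
for p_id in ['123456', 'N/A', 'zzz999', 'abc123']:
    counter['Brand1'] += sales_by_product.get(p_id, 0)

print(counter['Brand1'])  # 68
```

A Counter also starts missing brands at 0 automatically, so no per-brand initialization is needed.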
