I want to optimize a "vlookup"-style process in Python that works but is not scalable in its current form. I have tried pandas' pivot_table and pivot, but they've been limited by the alphanumeric and string values in the cells. I have two tables:
table1:
ProductID | Sales |
---|---|
123456 | 34 |
abc123 | 34 |
123def | 34 |
a1234f | 34 |
1abcd6 | 34 |
table2:
Brand | Site1 | Site2 | Site3 |
---|---|---|---|
Brand1 | 123456 | N/A | N/A |
Brand2 | N/A | abc123 | N/A |
Brand1 | N/A | N/A | 123def |
Brand2 | N/A | 1abcd6 | N/A |
Brand1 | a1234f | N/A | N/A |
What I originally wanted to see was sales by brand:
Brand | Sales |
---|---|
Brand1 | 102 |
Brand2 | 68 |
Here's the code I've basically built out in Python and pandas:
import pandas as pd
# read sales and product tables into pandas
sales_df = pd.read_csv(table1)
product_df = pd.read_csv(table2)
# isolate each product ID column into a separate df
product_site1_df = product_df.drop(['Site2', 'Site3'], axis=1)
product_site2_df = product_df.drop(['Site1', 'Site3'], axis=1)
product_site3_df = product_df.drop(['Site1', 'Site2'], axis=1)
# rename each site column to ProductID and append them into a single column
product_site1_df = product_site1_df.rename(columns={"Site1": "ProductID"})
product_site2_df = product_site2_df.rename(columns={"Site2": "ProductID"})
product_site3_df = product_site3_df.rename(columns={"Site3": "ProductID"})
product_list_master_df = pd.concat([product_site1_df, product_site2_df, product_site3_df])
# join sales against the product list, pulling Brand in as a new column on the sales table
inner_join = pd.merge(sales_df,
                      product_list_master_df,
                      on='ProductID',
                      how='inner')
This is obviously very procedural, not scalable, computationally redundant, and a very roundabout way to get what I want. I'm also losing information, for example if I later want to pivot by site rather than by brand. Short of changing the data model itself, what can I do here to improve speed, versatility, and line count?
CodePudding user response:
Assuming the dataframes are named df1 and df2, you can reshape and map to perform the VLOOKUP, then groupby + sum:
(df2.set_index('Brand')
    .stack()                                    # long form: one product ID per (Brand, Site) cell
    .map(df1.set_index('ProductID')['Sales'])   # the "VLOOKUP": product ID -> sales
    .groupby(level='Brand').sum()               # unmatched/N-A cells become NaN and add nothing
)
Output:
Brand
Brand1 102
Brand2 68
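If you also want to keep the site information (say, to pivot by site later), a melt/merge version along the same lines is a minimal sketch, assuming the same df1/df2 names and that the N/A cells were read in as NaN (pandas' default na_values covers 'N/A'):
# reshape to long form: one row per (Brand, Site, ProductID), dropping empty cells
long_df = (df2.melt(id_vars='Brand', var_name='Site', value_name='ProductID')
              .dropna(subset=['ProductID']))
# the "VLOOKUP" as a merge, keeping both Brand and Site
merged = long_df.merge(df1, on='ProductID', how='inner')
merged.groupby('Brand')['Sales'].sum()             # sales by brand
merged.groupby(['Brand', 'Site'])['Sales'].sum()   # sales by brand and site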
CodePudding user response:
Here's how you can do it without Pandas, using just Python's standard csv module and a Counter for the sales-by-brand totals:
import csv
from collections import Counter
# Create a product/sales lookup
sales_by_product = {}
with open('sales.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        p_id, sales = row
        sales_by_product[p_id] = int(sales)
# Accumulate sales per brand
sales_by_brand_counter = Counter()
with open('products.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        brand_id = row[0]
        for p_id in row[1:]:
            sales = sales_by_product.get(p_id, 0)
            sales_by_brand_counter[brand_id] += sales
# Write out the sales-by-brand totals
with open('sales_by_brand.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Brand', 'Sales'])
    rows = [[brand, total] for (brand, total) in sales_by_brand_counter.items()]
    writer.writerows(rows)
When I run that with sales.csv:
ProductID,Sales
123456,34
abc123,34
123def,34
a1234f,34
1abcd6,34
and products.csv:
Brand,Site1,Site2,Site3
Brand1,123456,N/A,N/A
Brand2,N/A,abc123,N/A
Brand1,N/A,N/A,123def
Brand2,N/A,1abcd6,N/A
Brand1,a1234f,N/A,N/A
I get sales_by_brand.csv:
Brand,Sales
Brand1,102
Brand2,68
The work that really matters, finding product IDs and summing sales, is handled here:
for row in reader:
    brand_id = row[0]
    for p_id in row[1:]:
        sales = sales_by_product.get(p_id, 0)
        sales_by_brand_counter[brand_id] += sales
It can read through as many Site columns as there are. If the site contains 'N/A' or a product ID that isn't in the lookup dict, it just adds 0 to that brand.
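If the number or order of Site columns might change, a DictReader variant (a sketch under the same assumptions about file names, treating every non-Brand column as a site) avoids relying on column positions:
import csv
from collections import Counter

with open('sales.csv', newline='') as f:
    # product ID -> sales lookup, keyed by column name instead of position
    sales_by_product = {row['ProductID']: int(row['Sales']) for row in csv.DictReader(f)}

sales_by_brand = Counter()
with open('products.csv', newline='') as f:
    for row in csv.DictReader(f):
        for col, p_id in row.items():
            if col != 'Brand':  # every other column is assumed to hold a product ID (or N/A)
                sales_by_brand[row['Brand']] += sales_by_product.get(p_id, 0)

print(dict(sales_by_brand))  # {'Brand1': 102, 'Brand2': 68}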