I have these two dataframes, one for the product file and one for the category file (loaded from CSVs):
Product File csv:
prod_id | prod_name | category_id |
---|---|---|
123 | Red T-Shirt | 501,502,999 |
345 | Blue Dress | 601,602 |
Category File csv:
category_id | category_name |
---|---|
501 | Mens |
502 | T-Shirts |
601 | Women |
602 | Dresses |
999 | Sale |
Expected output as csv:
prod_id | prod_name | category_id | category_name |
---|---|---|---|
123 | Red T-Shirt | 501,502,999 | Mens,T-Shirts,Sale |
345 | Blue Dress | 601,602 | Women,Dresses |
Code so far:
import pandas as pd
df_products = pd.read_csv("products.csv", dtype="str")
df_categories = pd.read_csv("categories.csv", dtype="str")
cached_prod_id = df_products["prod_id"]
cached_prod_name = df_products["prod_name"]
cached_prod_category_id = df_products["category_id"]
cached_category_id = df_categories["category_id"]
cached_category_name = df_categories["category_name"]
inner_join = pd.merge(df_products,df_categories, on="category_id", how="category_name")
print(inner_join)
Error:
Traceback (most recent call last):
File "C:\Users\Admin\projects\python\catagories_builder.py", line 29, in <module>
inner_join = pd.merge(df_products,
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 106, in merge
op = _MergeOperation(
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 699, in __init__
) = self._get_merge_keys()
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 1096, in _get_merge_keys
right_keys.append(right._get_label_or_level_values(rk))
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py", line 1779, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'category_id'
I realise this is probably not the correct code, and maybe even the approach itself is wrong; I am still new to and learning Python / pandas. My feeling is that I need to iterate over the category_id column in the product CSV to get the comma-separated values and THEN do some kind of replace??
I have explored the index, map, split and replace functionality in pandas but unfortunately haven't worked it out. I am sure this is a simple thing and there is hopefully already an efficient approach to solve this problem :)
Thank you!!
CodePudding user response:
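Two quick observations on the merge as written, before the fix. First, the `how` argument of `pd.merge` selects the join type ('left', 'right', 'inner', 'outer', 'cross'), not a column name, so `how="category_name"` is not valid. Second, the `KeyError: 'category_id'` means pandas could not find a column literally named `category_id` in `df_categories`, so it is worth checking the CSV header for stray spaces or delimiter issues. Even once those are sorted out, a plain merge cannot produce the expected output, because `category_id` in the product file is a comma-separated string such as "501,502,999", which will never equal a single ID in the category file:

# Assuming both CSVs parse into the columns shown above, this runs
# but matches nothing: "501,502,999" never equals a single id like "501".
pd.merge(df_products, df_categories, on="category_id", how="inner")

So the IDs have to be split out first, which is what the solution below does.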
Solution

- Create a mapping series from `df_categories`
- Split and `explode` the `category_id` column
- Substitute the values in the exploded `category_id` with the values from the mapping series
- Group the column by index and aggregate it with `join`
# Mapping series: category_id -> category_name
d = df_categories.astype(str).set_index('category_id')['category_name']

df_products['category_name'] = (
    df_products['category_id'].str.split(',')  # "501,502,999" -> ['501', '502', '999']
    .explode()                                 # one row per id, original index repeated
    .map(d)                                    # replace each id with its category name
    .groupby(level=0)                          # regroup rows belonging to the same product
    .agg(','.join)                             # e.g. "Mens,T-Shirts,Sale"
)
>>> df_products
prod_id prod_name category_id category_name
0 123 Red T-Shirt 501,502,999 Mens,T-Shirts,Sale
1 345 Blue Dress 601,602 Women,Dresses
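To get the result back on disk as a CSV (the filename here is just an example), write the enriched frame out without the index:

df_products.to_csv("products_with_categories.csv", index=False)

For reference, the same output can also be built with the merge the question started from, once the IDs are exploded first. This is only a sketch under the same column-name assumptions as above, not part of the original answer, and it assumes every ID in the product file exists in the category file (otherwise the join on category_name would hit NaNs):

out = (
    df_products
    .assign(category_id=lambda df: df['category_id'].str.split(','))  # list of ids per product
    .explode('category_id')                                           # one row per (product, id)
    .merge(df_categories.astype(str), on='category_id', how='left')   # attach category_name
    .groupby('prod_id', as_index=False, sort=False)                   # back to one row per product
    .agg({'prod_name': 'first',
          'category_id': ','.join,
          'category_name': ','.join})
)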