Home > Blockchain >  Replace Multiple ID Values by Name Values in a dataframe's single column by iterating through
Replace Multiple ID Values by Name Values in a dataframe's single column by iterating through

Time:12-14

I have these two dataframes one for product file and one for category file (from CSVs):

Product File csv:

prod_id prod_name category_id
123 Red T-Shirt 501,502,999
345 Blue Dress 601,602

Category File csv:

category_id category_name
501 Mens
502 T-Shirts
601 Women
602 Dresses
999 Sale

Expected output as csv:

prod_id prod_name category_id category_name
123 Red T-Shirt 501,502,999 Mens,T-Shirts,Sale
345 Blue Dress 601,602 Women,Dresses

Code so far:

import pandas as pd

df_products = pd.read_csv("products.csv", dtype="str")
df_categories = pd.read_csv("categories.csv", dtype="str")

cached_prod_id = df_products["prod_id"]
cached_prod_name = df_products["prod_name"]
cached_prod_category_id = df_products["category_id"]

cached_category_id = df_categories["category_id"]
cached_category_name = df_categories["category_name"]

inner_join = pd.merge(df_products,df_categories, on="category_id", how="category_name")

print(inner_join)

Error:

Traceback (most recent call last):
  File "C:\Users\Admin\projects\python\catagories_builder.py", line 29, in <module>
    inner_join = pd.merge(df_products,
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 106, in merge
    op = _MergeOperation(
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 699, in __init__
    ) = self._get_merge_keys()
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 1096, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py", line 1779, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'category_id'

I believe this is obviously not the correct code or even the approach is incorrect; as I am still new and still learning python / pandas. I may have a feeling that I need to iterate in the category_id column in product csv to get the values that are separated by commas THEN do a replace or something??!

I have explored index, map, split and replace functionality in pandas but unfortunately haven't worked it out. I am sure this is simple thing and is there already an efficient approach to solve this problem, hopefully :)

Thank you!!

CodePudding user response:

Solution

  • Create a mapping series from df_categories
  • Split and explode the category_id column
  • Substitute the values in the exploded catgeory_id using the values from mapping series
  • Group the column by index and aggregate the column using join
d = df_categories.astype(str).set_index('category_id')['category_name']

df_products['category_name'] = (
    df_products['category_id'].str.split(',')
    .explode().map(d).groupby(level=0).agg(','.join)
)

>>> df_products

   prod_id    prod_name  category_id       category_name
0      123  Red T-Shirt  501,502,999  Mens,T-Shirts,Sale
1      345   Blue Dress      601,602       Women,Dresses
  • Related