Working with two pandas dataframes with a one-to-many relationship?-CodePudding

I have 2 DFs with one-to-many mapping like so... (goal is to search all medications that have ingredient X)

import pandas as pd

medications = pd.DataFrame({
                    'med_id': [1, 2],
                    'med_name': ['multivitamin', 'aspirin'],
                    })

ingredients = pd.DataFrame({
                    'med_id': [1, 1, 2],
                    'ingredient_id': [1, 2, 3],
                    'ingredient_name': ['vitamin C', 'vitamin D', 'acetylacetic acid'] })

ingredients.groupby('med_id')['ingredient_name'].apply(list)
# Creates df, example row:  1 | multivitamin | ['vitamin D', 'vitamin C']

merged = pd.merge(medications, ingredients, how="inner", on="med_id")

This is what it creates:

medication_id | medication_name    | ingredient_list
1               'multivitamin'     ['vitamin D', 'vitamin C', ...]
...

However, I read online that pandas.df isn't meant to have column cells be lists. It's a pain if we have to set df.loc[..., 'ingredients'] = ['apple extract'] because it expects a list matching the number of rows in df.loc, not the list that is the "value" of the cell. Also, doing something like merged.loc[['vitamin C'] in merged['ingredient_list']] doesn't work.

Can someone suggests how to manage such one-to-many data, and query left-side of relation based on mapping of another df?
In this example I merge and make the column be a list. Another option is to have multiple columns ['ingredient1','ingredient2'...], but it can get quite large. Do you keep them separate DFs and query df1 somehow based on mapping of df2?

CodePudding user response：

If I understand correctly you are trying to add the list to the dataframe

ingredients = pd.DataFrame({
                    'med_id': [1, 1, 2],
                    'ingredient_id': [1, 2, 3],
                    'ingredient_name': [str(['vitamin C', 'vitamin D', 'acetylacetic acid']),"",""] })

and you can use this string as list like:

eval(merged["ingredient_name"].loc[0])

CodePudding user response：

I interpreted this slightly differently, given your 'goal' in parentheses at the start of your question:

to search all medications that have ingredient X

Data:

import pandas as pd

medications = pd.DataFrame({
                    'med_id': [1, 2],
                    'med_name': ['multivitamin', 'aspirin'],
                    })

med_dict = medications.set_index("med_id").to_dict()["med_name"]

ingredients = pd.DataFrame({
                    'med_id': [1, 1, 2, 2],
                    'ingredient_id': [1, 2, 3, 1],
                    'ingredient_name': ['vitamin C', 'vitamin D', 'acetylacetic acid', 'vitamin C'] })

I added in an additional "vitamin C" for the other medication, to make it easier to see how this code works.

The one-to-many mapping:

ingredient_meds = ingredients.groupby("ingredient_name").apply(lambda x: list(x["med_id"].map(med_dict)))
ingredient_meds
#Out: 
#ingredient_name
#acetylacetic acid                  [aspirin]
#vitamin C            [multivitamin, aspirin]
#vitamin D                     [multivitamin]
#dtype: object

And then to search:

ingredient_meds["vitamin C"]
#Out: ['multivitamin', 'aspirin']

So this will create a list of all medications that use a particular ingredient, so that you can search by ingredient and return the medications that use it.