I have 2 DFs with one-to-many mapping like so... (goal is to search all medications that have ingredient X)
import pandas as pd
medications = pd.DataFrame({
'med_id': [1, 2],
'med_name': ['multivitamin', 'aspirin'],
})
ingredients = pd.DataFrame({
'med_id': [1, 1, 2],
'ingredient_id': [1, 2, 3],
'ingredient_name': ['vitamin C', 'vitamin D', 'acetylacetic acid'] })
ingredients.groupby('med_id')['ingredient_name'].apply(list)
# Creates df, example row: 1 | multivitamin | ['vitamin D', 'vitamin C']
merged = pd.merge(medications, ingredients, how="inner", on="med_id")
This is what it creates:
medication_id | medication_name | ingredient_list
1 'multivitamin' ['vitamin D', 'vitamin C', ...]
...
However, I read online that pandas.df isn't meant to have column cells be lists. It's a pain if we have to set df.loc[..., 'ingredients'] = ['apple extract'] because it expects a list matching the number of rows in df.loc, not the list that is the "value" of the cell. Also, doing something like merged.loc[['vitamin C'] in merged['ingredient_list']]
doesn't work.
Can someone suggests how to manage such one-to-many data, and query left-side of relation based on mapping of another df?
In this example I merge and make the column be a list. Another option is to have multiple columns ['ingredient1','ingredient2'...], but it can get quite large. Do you keep them separate DFs and query df1 somehow based on mapping of df2?
CodePudding user response:
If I understand correctly you are trying to add the list to the dataframe
ingredients = pd.DataFrame({
'med_id': [1, 1, 2],
'ingredient_id': [1, 2, 3],
'ingredient_name': [str(['vitamin C', 'vitamin D', 'acetylacetic acid']),"",""] })
and you can use this string as list like:
eval(merged["ingredient_name"].loc[0])
CodePudding user response:
I interpreted this slightly differently, given your 'goal' in parentheses at the start of your question:
to search all medications that have ingredient X
Data:
import pandas as pd
medications = pd.DataFrame({
'med_id': [1, 2],
'med_name': ['multivitamin', 'aspirin'],
})
med_dict = medications.set_index("med_id").to_dict()["med_name"]
ingredients = pd.DataFrame({
'med_id': [1, 1, 2, 2],
'ingredient_id': [1, 2, 3, 1],
'ingredient_name': ['vitamin C', 'vitamin D', 'acetylacetic acid', 'vitamin C'] })
I added in an additional "vitamin C" for the other medication, to make it easier to see how this code works.
The one-to-many mapping:
ingredient_meds = ingredients.groupby("ingredient_name").apply(lambda x: list(x["med_id"].map(med_dict)))
ingredient_meds
#Out:
#ingredient_name
#acetylacetic acid [aspirin]
#vitamin C [multivitamin, aspirin]
#vitamin D [multivitamin]
#dtype: object
And then to search:
ingredient_meds["vitamin C"]
#Out: ['multivitamin', 'aspirin']
So this will create a list of all medications that use a particular ingredient, so that you can search by ingredient and return the medications that use it.