I try to solve the following problem:
I have these two dataframes, df1
:
label | term |
---|---|
A01.12 | "Culture de légumes, de melons, de racines et de tubercules" |
A01.14 | Culture de la canne à sucre |
and df2
:
company | organization_activity |
---|---|
coca-cola | "Culture de légumes, de melons, de racines et de tubercules" |
rum-factory | Culture de la canne à sucre |
I now would like to map the "term" of df1
to the "organization_activity" of df2
and create a new column with the name "codes" in df2
with the respective label in df1
.
The resulting dataframe should look like this:
organization_activity | codes |
---|---|
"Culture de légumes, de melons, de racines et de tubercules" | A01.12 |
Culture de la canne à sucre | A01.14 |
This partially works by using:
df2['codes'] = df2['organization_activity'].map(df1.set_index('term')['label'].to_dict())
Oddly the new column is created and filled but for a lot of terms the mapping does not work and a Nan
value is inserted even though the "organization_activity" I am mapping exists in the "term" column. Does anyone have an idea on why this could happen? The formatting for the values in "term" and "organization_activity" is equal and there doesn't seem to be a difference in the format of values that are mappable and ones that are not. Is there maybe a different way to do this?
The dataframe I showed is just an abstract so there are multiple companies that can have the same "organization_activity".
CodePudding user response:
Normally, that is what you call a merge. Assuming you have the following dataframes:
df1 = pd.DataFrame({'Label': ['A01.12', 'A01.14'],
'term': ['Culture de légumes, de melons, de racines et de tubercules',
'Culture de la canne à sucre']})
df2 = pd.DataFrame({'organization_activity': [
'Culture de légumes, de melons, de racines et de tubercules',
'Culture de la canne à sucre']})
>>> df1
Label term
0 A01.12 Culture de légumes, de melons, de racines et d...
1 A01.14 Culture de la canne à sucre
>>> df2
organization_activity
0 Culture de légumes, de melons, de racines et d...
1 Culture de la canne à sucre
This is what a merge would give you after some cleaning:
>>> df3 = df2.merge(df1, left_on="organization_activity", right_on="term")
>>> df3.drop("term", axis=1).rename(columns={"Label": "Code"})
organization_activity Code
0 Culture de légumes, de melons, de racines et d... A01.12
1 Culture de la canne à sucre A01.14
I also assume that there are some more columns in df2
, too, which should also correctly be mapped to Labels
/Codes
with a merge.
CodePudding user response:
Consider below dataframes from the question:
df1 = pd.DataFrame({'label': ['A01.12', 'A01.14'],
'term': ['"Culture de légumes, de melons, de racines et de tubercules"', 'Culture de la canne à sucre']})
df2 = pd.DataFrame({'company': ['coca-cola', 'rum-factory'],
'organization_activity': ['"Culture de légumes, de melons, de racines et de tubercules"', ' Culture de la canne à sucre']})
print(df1)
label term
0 A01.12 "Culture de légumes, de melons, de racines et de tubercules"
1 A01.14 Culture de la canne à sucre
print(df2)
company organization_activity
0 coca-cola "Culture de légumes, de melons, de racines et de tubercules"
1 rum-factory Culture de la canne à sucre
Actually there are some white spaces in the above dataframe which are barely visible. I suggest you do some data cleaning first. For example, to replace double spaces by single space, and to remove white spaces in front and behind the stings.
for df in [df1, df2]:
for col in df.columns:
df[col] = df[col].str.strip().str.replace(' ', ' ')
df2['codes'] = df2['organization_activity'].map(df1.set_index('term')['label'].to_dict())
print(df2)
Output:
company organization_activity codes
0 coca-cola "Culture de légumes, de melons, de racines et de tubercules" A01.12
1 rum-factory Culture de la canne à sucre A01.14
Alternatively you could use pd.merge()
to merge the 2 dataframes:
df3 = pd.merge(df2, df1, left_on='organization_activity', right_on='term').drop(columns=['term'])
df3.columns = ['company', 'organization_activity', 'codes']
print(df3)
company organization_activity codes
0 coca-cola "Culture de légumes, de melons, de racines et de tubercules" A01.12
1 rum-factory Culture de la canne à sucre A01.14