I have two dataframes:
`df_1_long = pd.DataFrame({
'company_name': ['Company A', 'Company B', 'Company C'],
'company_country': ['USA', 'Poland', 'Canada'],
'keyword': ['holding', 'services', 'source'],
'value': [1,0,1]
})`
and second df:
`df_training = pd.DataFrame({
'holding': [1, 0, 0],
'services': [0, 1, 0],
'source': [0, 0, 1],
'sector': ['Retail', 'Finance', 'Energy']
})`
columns in df_training ['holding', 'services', 'source']
are the keywords in column 'keyword' in df_1_long. I would like to assign a sector to df_1_long - if 'keyword' in df_1_long value is 1 and in df_training value is 1 for the keyword in a column -> then assign a sector from df_training.
The output should look like that:
`expected_output = pd.DataFrame({
'company_name': ['Company A', 'Company B', 'Company C'],
'company_country': ['USA', 'Poland', 'Canada'],
'keyword': ['holding', 'services', 'source'],
'value': [1,0,1],
'sector': ['Retail', 'no_sector', 'Energy']
})`
I tried this piece of code, but keep getting errors:
`merged_df = pd.merge(df_1_long, df_training, left_on='keyword', right_on=df_training.columns[:-1])
`df_1_long['sector'] = merged_df['sector'].where(merged_df['value'] == 1, np.nan)``
CodePudding user response:
Use DataFrame.stack
for reshape with filter 1
values and get dictionary from MultiIndex
used for Series.map
:
s = df_training.set_index('sector').stack().loc[lambda x: x.eq(1)].index.swaplevel()
m = df_1_long['value'].eq(1)
df_1_long['sector'] = np.where(m, df_1_long['keyword'].map(dict(s)), 'no_sector')
Another solution with DataFrame.melt
, DataFrame.query
and DataFrame.set_index
:
d = df_training.melt('sector').query('value == 1').set_index('variable')['sector']
m = df_1_long['value'].eq(1)
df_1_long['sector'] = np.where(m, df_1_long['keyword'].map(d), 'no_sector')
print (df_1_long)
company_name company_country keyword value sector
0 Company A USA holding 1 Retail
1 Company B Poland services 0 no_sector
2 Company C Canada source 1 Energy
CodePudding user response:
You can use from_dummies
to reshape and map
those values:
mapper = dict(zip(pd.from_dummies(df_training.set_index('sector')).iloc[:, 0],
df_training['sector']))
df_1_long['sector'] = (df_1_long['keyword'].map(mapper)
.mask(df_1_long['value'].eq(0), 'no_sector')
)
Output:
company_name company_country keyword value sector
0 Company A USA holding 1 Retail
1 Company B Poland services 0 no_sector
2 Company C Canada source 1 Energy