I am trying to join two the dataframes as shown below on the code
column values present in the name_data
dataframe.
I have two dataframes shown below and I expect to have a resulting dataframe which would only have the rows from the `team_datadataframe where the corresponding
codevalue column is present in the
name_data``` dataframe.
I am using koalas for this on databricks and I have the following code using the join operation.
import databricks.koalas as ks
name_data= ks.DataFrame({'code':['123a', '345b', '678c'],
'id':[1, 2, 3]})
team_data = ks.DataFrame({'code':['123a', '23s', '34a'],
'id':[1, 2, 3]})
team_data_filtered = team_data.join(name_data.set_index('code'), on='code')
display(team_data_filtered)
The expected output would be to see only the following in team_data_filtered
.
Code id
'123a' 1
But my code is throwing an error stating that columns overlap but no suffix specified: ['id']
.
May someone help to resolve this issue?
CodePudding user response:
Try adding suffix parameters:
team_data_filtered = team_data.join(name_data.set_index('code'), on='code',
lsuffix='_1', rsuffix='_2')
team_data_filtered = team_data_filtered.loc[team_data_filtered.id_1==team_data_filtered.id2]
display(team_data_filtered)
An then to clean the columns, if desired:
team_data_filtered.rename({'id_1':'id'}, inplace=True, axis=1)