I have a DataFrame like the following. This is just an excerpt; in reality it is a lot bigger (a few million rows).
topic keyword
0 String A String A
1 String A String B
3 String B String A
4 String B String B
5 String B String D
6 String C String D
...
I want to keep only the first co-occurrence: if a string such as String B is already "taken" in the "keyword" column, it cannot be used in the "topic" column anymore. If it's the first time it appears, though, keep it.
topic keyword
0 String A String A
1 String A String B
3 String B String A -> Topic is already used in keyword, so delete it
4 String B String B -> Topic is already used in keyword, so delete it
5 String B String D -> Topic is already used in keyword, so delete it
6 String C String D
...
In the end I'd like the following result.
topic keyword
0 String A String A
1 String A String B
2 String C String D
...
How can I achieve this in the fastest fashion?
CodePudding user response:
You can try removing duplicates after reshaping with DataFrame.stack:
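# stack both columns and keep only the first occurrence of each string
s = (df[['topic','keyword']].stack()
       .drop_duplicates()
       .unstack()['topic']
       .reindex(df.index)
       .ffill())
# s holds, per row, the most recent topic that was first seen in the topic column
df = df[df['topic'].eq(s)]
print(df)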
topic keyword
0 String A String A
1 String A String B
5 String C String D
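To see what each step of the chain does, here is a self-contained sketch on the sample from the question. The intermediate names (`stacked`, `first_seen`, `wide`) are only for illustration, and it assumes rows of the same topic sit next to each other, as in the excerpt, because the forward fill depends on that.

```python
import pandas as pd

# Sample data copied from the question (row label 2 is missing there as well).
df = pd.DataFrame(
    {
        "topic": ["String A", "String A", "String B", "String B", "String B", "String C"],
        "keyword": ["String A", "String B", "String A", "String B", "String D", "String D"],
    },
    index=[0, 1, 3, 4, 5, 6],
)

# Long format: one value per (row label, column name), read row by row.
stacked = df[["topic", "keyword"]].stack()

# Keep only the first occurrence of each string across both columns.
first_seen = stacked.drop_duplicates()

# Back to wide format: 'topic' is NaN wherever the string was not first seen as a topic.
wide = first_seen.unstack()

# Align with the original rows and carry the last valid topic forward.
s = wide["topic"].reindex(df.index).ffill()

# A row survives only if its topic equals the carried-forward topic.
result = df[df["topic"].eq(s)]
print(result)
```

On this sample the rows labelled 0, 1 and 6 remain. Call `reset_index(drop=True)` on the result if you want a fresh 0..n-1 index.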
CodePudding user response:
You can do it with a for loop to create a new dataframe:
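import pandas as pd

new_rows = []
keywords = set()  # keywords of the rows kept so far
topics = set()    # topics of the rows kept so far
for ind, row in df.iterrows():
    # skip a topic that was taken as a keyword before it was ever kept as a topic
    if row['topic'] in keywords and row['topic'] not in topics:
        continue
    topics.add(row['topic'])
    keywords.add(row['keyword'])
    new_rows.append(row)
# this new df is what you want
new_df = pd.DataFrame(new_rows)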
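With a few million rows, `iterrows` can be slow because it builds a Series for every row. A sketch of the same loop idea that walks the two columns with `zip` and builds a boolean mask instead is shown here. It also remembers topics that were already accepted, so that later String A rows survive once String A has shown up as a keyword. That rule is my reading of the expected output in the question, not something the question states outright, and the names `seen_keywords`, `kept_topics` and `mask` are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "topic": ["String A", "String A", "String B", "String B", "String B", "String C"],
        "keyword": ["String A", "String B", "String A", "String B", "String D", "String D"],
    },
    index=[0, 1, 3, 4, 5, 6],
)

seen_keywords = set()  # keywords of the rows kept so far
kept_topics = set()    # topics that have already been accepted
mask = []              # one boolean per row: keep it or not

for topic, keyword in zip(df["topic"], df["keyword"]):
    # Keep the row if its topic was accepted before,
    # or if it has not been taken as a keyword yet.
    keep = topic in kept_topics or topic not in seen_keywords
    if keep:
        kept_topics.add(topic)
        seen_keywords.add(keyword)
    mask.append(keep)

new_df = df.loc[mask]
print(new_df)
```

Keywords of dropped rows are not recorded here, same as in the loop above. If they should also count as "taken", move the `seen_keywords.add(keyword)` line out of the `if`.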