Pandas: iterate over rows and build a new frame

I currently have a dataframe like the following. This is just an excerpt; the real one is much bigger (a few million rows).

            topic     keyword                                                             
    0    String A    String A
    1    String A    String B
    3    String B    String A
    4    String B    String B
    5    String B    String D
    6    String C    String D
...

Now I want to keep only the first co-occurrence, so to speak: if String B is already "taken" in the "keyword" column, it can no longer appear in the "topic" column. If it is the first time the string appears, though, keep it.

            topic     keyword                                                             
    0    String A    String A
    1    String A    String B
    3    String B    String A -> Topic is already used in keyword, so delete it
    4    String B    String B -> Topic is already used in keyword, so delete it
    5    String B    String D -> Topic is already used in keyword, so delete it
    6    String C    String D
...

In the end I'd like the following result.

            topic     keyword                                                             
    0    String A    String A
    1    String A    String B
    2    String C    String D
...

What is the fastest way to achieve this?

CodePudding user response：

You can try removing duplicates after reshaping with DataFrame.stack:

s = (df[['topic','keyword']].stack()
                            .drop_duplicates()
                            .unstack()['topic']
                            .reindex(df.index)
                            .ffill())

df = df[df['topic'].eq(s)]
print (df)
      topic   keyword
0  String A  String A
1  String A  String B
5  String C  String D
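
To see why this works, it can help to rebuild the sample frame and look at the intermediate steps. This is only a sketch: it assumes a default 0..5 index (which the printed output above suggests), and the names `stacked`, `first_seen` and `result` are mine, not part of the answer.

```python
import pandas as pd

# Sample data from the question, with a default RangeIndex (an assumption).
df = pd.DataFrame({
    "topic":   ["String A", "String A", "String B", "String B", "String B", "String C"],
    "keyword": ["String A", "String B", "String A", "String B", "String D", "String D"],
})

# Flatten both columns row by row: (0, topic), (0, keyword), (1, topic), ...
stacked = df[["topic", "keyword"]].stack()

# Keep only the first appearance of each string, then go back to wide form.
first_seen = stacked.drop_duplicates().unstack()

# A string stays in the 'topic' column only if it showed up as a topic first;
# forward-fill carries that topic down to the following rows.
s = first_seen["topic"].reindex(df.index).ffill()

result = df[df["topic"].eq(s)]
print(result)
```

Note that the forward-fill step seems to rely on all rows of one topic sitting next to each other, as they do in the sample. If the same topic can reappear after a different kept topic, the later rows would be dropped too.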

CodePudding user response：

You can do it with a for loop to create a new dataframe:

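import pandas as pd

new_rows = []
kept_topics = set()  # topics that already have at least one kept row
keywords = set()     # keywords claimed by kept rows
for ind, row in df.iterrows():
    # skip the row only if its topic was claimed as a keyword
    # before it was ever kept as a topic (so "String A / String B" survives)
    if row['topic'] in keywords and row['topic'] not in kept_topics:
        continue
    kept_topics.add(row['topic'])
    keywords.add(row['keyword'])
    new_rows.append(row)

# this new df is what you want
new_df = pd.DataFrame(new_rows)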
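
Since the question mentions a few million rows, note that `iterrows` builds a Series per row and tends to be slow at that size. A possible variant, sketched below, walks the two columns with `zip` and collects a boolean mask instead. It assumes my reading of the question (a row is dropped only when its topic was claimed as a keyword before it was ever kept as a topic); the function name `keep_first_cooccurrence` is made up for this example.

```python
import pandas as pd

def keep_first_cooccurrence(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose topic was not claimed as a keyword before being kept as a topic."""
    kept_topics = set()     # topics that already have at least one kept row
    taken_keywords = set()  # keywords claimed by kept rows
    mask = []
    # zip over the columns avoids creating a Series object for every row
    for topic, keyword in zip(df["topic"], df["keyword"]):
        if topic in taken_keywords and topic not in kept_topics:
            mask.append(False)
            continue
        kept_topics.add(topic)
        taken_keywords.add(keyword)
        mask.append(True)
    # boolean list selects the kept rows and preserves the original index
    return df.loc[mask]

# Sample data from the question, with a default RangeIndex (an assumption).
df = pd.DataFrame({
    "topic":   ["String A", "String A", "String B", "String B", "String B", "String C"],
    "keyword": ["String A", "String B", "String A", "String B", "String D", "String D"],
})
out = keep_first_cooccurrence(df)
print(out)
```

Unlike the forward-fill approach, this does not need rows of the same topic to be adjacent, but it is still a Python-level loop, so it is worth timing both on the real data.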