For all values in a row, if a certain word is duplicated more than once, we want to remove it from

I have the following dataframe

en	ko
Tuberculosis of heart	심장의 결핵
Tuberculosis of myocardium	심근의 결핵
Tuberculosis of endocardium	심내막의 결핵
Tuberculosis of oesophagus	식도의 결핵
Zoster keratoconjunctivitis	대상포진 각막결막염
Zoster blepharitis	대상포진 안검염
Zoster iritis	대상포진 홍채염

I want a result like this.

en	ko
heart	심장의
myocardium	심근의
endocardium	심내막의
oesophagus	식도의
keratoconjunctivitis	각막결막염
blepharitis	안검염
iritis	홍채염

This is just an example; I have about 50,000 word pairs and have been working on this for a week now.

CodePudding user response：

You can use:

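import re

# identify words that occur more than once
s = df.stack().str.split().explode()
dups = s[s.duplicated()].groupby(level=1).unique().to_dict()
# {'en': array(['Tuberculosis', 'of', 'Zoster'], dtype=object),
#  'ko': array(['결핵', '대상포진'], dtype=object)}

# remove them (re.escape keeps regex metacharacters in a word literal),
# then collapse the whitespace that the removal leaves behind
# caveat: the pattern also matches inside longer words, e.g. the "of" in "soft"
out = df.apply(lambda s: s.str.replace('|'.join(map(re.escape, dups[s.name])), '', regex=True)
                          .str.replace(r'\s+', ' ', regex=True)
                          .str.strip())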

Output:

                     en     ko
0                 heart    심장의
1            myocardium    심근의
2           endocardium   심내막의
3            oesophagus    식도의
4  keratoconjunctivitis  각막결막염
5           blepharitis    안검염
6                iritis    홍채염
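
If the repeated words should only be dropped when they stand as whole words, a token-based variant is one option. This is a minimal sketch rather than the code from the answer above: the helper name drop_repeated_words is made up for illustration, repeats are counted separately per column, and a word that occurs twice within a single row also counts as repeated.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "en": [
        "Tuberculosis of heart",
        "Tuberculosis of myocardium",
        "Tuberculosis of endocardium",
        "Tuberculosis of oesophagus",
        "Zoster keratoconjunctivitis",
        "Zoster blepharitis",
        "Zoster iritis",
    ],
    "ko": [
        "심장의 결핵",
        "심근의 결핵",
        "심내막의 결핵",
        "식도의 결핵",
        "대상포진 각막결막염",
        "대상포진 안검염",
        "대상포진 홍채염",
    ],
})


def drop_repeated_words(col: pd.Series) -> pd.Series:
    """Remove every whitespace-separated word that occurs more than once in col."""
    tokens = col.str.split()                    # one list of words per row
    counts = tokens.explode().value_counts()    # word -> occurrences in this column
    repeated = set(counts[counts > 1].index)
    # Rebuild each row from the words that are not repeated; rows stay aligned.
    return tokens.map(lambda words: " ".join(w for w in words if w not in repeated))


result = df.apply(drop_repeated_words)
print(result)
```

Because each row is rebuilt from its own words, the 'en' and 'ko' values stay paired by row, and no stray whitespace is left over.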

CodePudding user response：

I don't know how well this will extend to a larger dataset, since I don't know how Korean uses whitespace between words, but it works on the given data.

We handle the two columns separately, because the preposition "of" has no counterpart in the 'ko' column and that affects the following steps. For each column, we split on whitespace to get lists of words, explode those lists into individual rows, and then take the value counts to see which words appear more than once:

ko=df['ko'].str.split().explode().value_counts()
en=df['en'].str.split().explode().value_counts()

ko
결핵       4
대상포진     3
심장의      1
심근의      1
심내막의     1
식도의      1
각막결막염    1
안검염      1
홍채염      1
Name: ko, dtype: int64

After that, we use boolean indexing to keep only the words that appear exactly once in each series:

ko_col=ko[ko==1]
en_col=en[en==1]

en_col
heart                   1
myocardium              1
endocardium             1
oesophagus              1
keratoconjunctivitis    1
blepharitis             1
iritis                  1
Name: en, dtype: int64

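This relies on the order being preserved through the steps above, and on every row contributing exactly one unique word to each column; otherwise the two indexes no longer line up row by row (or differ in length, in which case pd.DataFrame raises a ValueError). That is worth spot checking on your larger dataset. We then recombine the two indexes to create your output dataframe: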

new_df=pd.DataFrame({'en':en_col.index,'ko':ko_col.index})
new_df
    en                      ko
0   heart                   심장의
1   myocardium              심근의
2   endocardium             심내막의
3   oesophagus              식도의
4   keratoconjunctivitis    각막결막염
5   blepharitis             안검염
6   iritis                  홍채염
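
The spot check mentioned above could look something like the following. It is only a sketch under the same assumptions as this answer (words are separated by whitespace); the helper name unique_words_per_row is invented for illustration. It counts, for each row, how many of its words occur exactly once in the column, and lists the rows where that count is not 1 and the recombination would therefore misalign.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "en": [
        "Tuberculosis of heart",
        "Tuberculosis of myocardium",
        "Tuberculosis of endocardium",
        "Tuberculosis of oesophagus",
        "Zoster keratoconjunctivitis",
        "Zoster blepharitis",
        "Zoster iritis",
    ],
    "ko": [
        "심장의 결핵",
        "심근의 결핵",
        "심내막의 결핵",
        "식도의 결핵",
        "대상포진 각막결막염",
        "대상포진 안검염",
        "대상포진 홍채염",
    ],
})


def unique_words_per_row(col: pd.Series) -> pd.Series:
    """For each row, count the words that occur exactly once in the whole column."""
    words = col.str.split().explode()          # index repeats the original row label
    counts = words.map(words.value_counts())   # occurrences of each word in the column
    return (counts == 1).groupby(level=0).sum()


check = df.apply(unique_words_per_row)
# Rows where either column has zero or several unique words.
bad_rows = check[(check != 1).any(axis=1)]
print(bad_rows)
```

If bad_rows is not empty on the full dataset, the index-based recombination is not safe for those rows.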