For all values in a row, if a certain word is duplicated more than once, we want to remove it from

Time:11-22

I have the following dataframe

en                           ko
Tuberculosis of heart        심장의 결핵
Tuberculosis of myocardium   심근의 결핵
Tuberculosis of endocardium  심내막의 결핵
Tuberculosis of oesophagus   식도의 결핵
Zoster keratoconjunctivitis  대상포진 각막결막염
Zoster blepharitis           대상포진 안검염
Zoster iritis                대상포진 홍채염

I want a result like this.

en                    ko
heart                 심장의
myocardium            심근의
endocardium           심내막의
oesophagus            식도의
keratoconjunctivitis  각막결막염
blepharitis           안검염
iritis                홍채염

This is just an example; I have about 50,000 word pairs and have been working on this for a week now.

CodePudding user response:

You can use:

import re

# identify duplicates
s = df.stack().str.split().explode()
dups = s[s.duplicated()].groupby(level=1).unique().to_dict()
# {'en': array(['Tuberculosis', 'of', 'Zoster'], dtype=object),
#  'ko': array(['결핵', '대상포진'], dtype=object)}

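# remove them: escape each word, match whole whitespace-delimited tokens only,
# then collapse the whitespace left behind
def remove_dups(col):
    pattern = r'(?<!\S)(?:' + '|'.join(map(re.escape, dups[col.name])) + r')(?!\S)'
    return col.str.replace(pattern, '', regex=True).str.split().str.join(' ')

df.apply(remove_dups)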

Output:

                     en     ko
0                 heart    심장의
1            myocardium    심근의
2           endocardium   심내막의
3            oesophagus    식도의
4  keratoconjunctivitis  각막결막염
5           blepharitis    안검염
6                iritis    홍채염
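
With around 50,000 pairs the joined pattern can get long. A minimal sketch of a token-based variant that avoids building a regex at all: it is not the approach above, just a plain `Counter` filter over whitespace-separated tokens. The helper name `drop_repeated_tokens` and the shortened sample frame are only illustrative, and the sketch assumes there are no missing values in the columns.

```python
from collections import Counter

import pandas as pd

# Shortened sample, assumed to look like the frame in the question
df = pd.DataFrame({
    "en": ["Tuberculosis of heart", "Tuberculosis of myocardium",
           "Zoster iritis", "Zoster blepharitis"],
    "ko": ["심장의 결핵", "심근의 결핵", "대상포진 홍채염", "대상포진 안검염"],
})


def drop_repeated_tokens(col: pd.Series) -> pd.Series:
    """Keep only the tokens that occur exactly once in the whole column."""
    tokens = col.str.split()  # one list of tokens per row
    counts = Counter(tok for row in tokens for tok in row)
    # Rebuild each row from the tokens that were seen only once
    return tokens.map(lambda row: " ".join(t for t in row if counts[t] == 1))


result = df.apply(drop_repeated_tokens)
print(result)
```

Because each row is rebuilt from its own tokens, the rows stay aligned with the original index, and a row whose words are all repeated simply ends up as an empty string.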

CodePudding user response:

I don't know how well this will extend to a larger dataset, given that I don't know how Korean uses whitespace between entities, but it works on the given data.

We handle the two columns separately, because the preposition "of" has no counterpart in the 'ko' column and that affects the following steps. For each column, we split on whitespace to get a column of lists, explode those lists into individual rows, and then take the value counts to see which words appear more than once:

import pandas as pd

# count how often each whitespace-separated word occurs in each column
ko=df['ko'].str.split().explode().value_counts()
en=df['en'].str.split().explode().value_counts()

ko
결핵       4
대상포진     3
심장의      1
심근의      1
심내막의     1
식도의      1
각막결막염    1
안검염      1
홍채염      1
Name: ko, dtype: int64

After that, we use boolean indexing to keep only the words that appear exactly once in each series:

ko_col=ko[ko==1]
en_col=en[en==1]

en_col
heart                   1
myocardium              1
endocardium             1
oesophagus              1
keratoconjunctivitis    1
blepharitis             1
iritis                  1
Name: en, dtype: int64

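This relies on the order being preserved through the steps above, and on every row contributing exactly one unique word per column. If some row has none or several, the two indexes will no longer line up, or will differ in length, in which case pd.DataFrame raises an error. It is worth spot checking this on your larger dataset. We then recombine the two indexes to create the output dataframe: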

new_df=pd.DataFrame({'en':en_col.index,'ko':ko_col.index})
new_df
    en                      ko
0   heart                   심장의
1   myocardium              심근의
2   endocardium             심내막의
3   oesophagus              식도의
4   keratoconjunctivitis    각막결막염
5   blepharitis             안검염
6   iritis                  홍채염
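
If relying on the order of `value_counts` feels fragile, one option is to carry the original row index through `explode`, so each surviving word stays attached to the row it came from. A rough sketch, assuming words are whitespace-separated; `keep_unique_words` is a made-up helper name and the interleaved sample rows are only there to show that alignment holds:

```python
import pandas as pd

# Sample rows deliberately interleaved, assumed to resemble the question's data
df = pd.DataFrame({
    "en": ["Tuberculosis of heart", "Zoster iritis",
           "Tuberculosis of myocardium", "Zoster blepharitis"],
    "ko": ["심장의 결핵", "대상포진 홍채염", "심근의 결핵", "대상포진 안검염"],
})


def keep_unique_words(col: pd.Series) -> pd.Series:
    # One word per row; explode repeats the original index label for each word
    words = col.str.split().explode()
    # Look up how often each word occurs in the whole column
    counts = words.map(words.value_counts())
    kept = words[(counts == 1).to_numpy()]
    # Rejoin per original row; rows that lose every word become empty strings
    return kept.groupby(level=0).agg(" ".join).reindex(col.index, fill_value="")


new_df = df.apply(keep_unique_words)
print(new_df)
```

Since the grouping is done on the original index rather than on the order of the counts, rows with zero or several unique words no longer break the pairing between 'en' and 'ko'.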