I have the following dataframe:
en | ko |
---|---|
Tuberculosis of heart | 심장의 결핵 |
Tuberculosis of myocardium | 심근의 결핵 |
Tuberculosis of endocardium | 심내막의 결핵 |
Tuberculosis of oesophagus | 식도의 결핵 |
Zoster keratoconjunctivitis | 대상포진 각막결막염 |
Zoster blepharitis | 대상포진 안검염 |
Zoster iritis | 대상포진 홍채염 |
I want a result like this.
en | ko |
---|---|
heart | 심장의 |
myocardium | 심근의 |
endocardium | 심내막의 |
oesophagus | 식도의 |
keratoconjunctivitis | 각막결막염 |
blepharitis | 안검염 |
iritis | 홍채염 |
This is just an example; I have about 50,000 word pairs and have been working on this for a week now.
CodePudding user response:
You can use:
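import re
# identify words that occur more than once
s = df.stack().str.split().explode()
dups = s[s.duplicated()].groupby(level=1).unique().to_dict()
# {'en': array(['Tuberculosis', 'of', 'Zoster'], dtype=object),
# 'ko': array(['결핵', '대상포진'], dtype=object)}
# build a whole-word pattern per column (escaped, so regex metacharacters are safe)
pattern = lambda col: r'\b(?:' + '|'.join(map(re.escape, dups[col])) + r')\b'
# remove the repeated words, then trim the leftover whitespace
df.apply(lambda s: s.str.replace(pattern(s.name), '', regex=True).str.strip())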
Output:
en ko
0 heart 심장의
1 myocardium 심근의
2 endocardium 심내막의
3 oesophagus 식도의
4 keratoconjunctivitis 각막결막염
5 blepharitis 안검염
6 iritis 홍채염
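With 50,000 rows it is worth checking how the pattern treats short words such as "of": a plain alternation also matches inside longer words. Below is a minimal sketch of the difference, using made-up rows that are not from the question and a hand-written `dups` list standing in for the computed one.

```python
import re
import pandas as pd

# Hypothetical rows, chosen so that "of" also occurs inside another word ("roof")
s = pd.Series(["Abscess of roof of mouth", "Abscess of tongue"])
dups = ["Abscess", "of"]  # stand-in for the repeated words found earlier

# Plain alternation: matches the substrings anywhere, even inside words
plain = "|".join(dups)
# Whole-word alternation: escaped terms wrapped in word boundaries
bounded = r"\b(?:" + "|".join(map(re.escape, dups)) + r")\b"

# Split and rejoin to collapse any whitespace left behind by the removal
naive = s.str.replace(plain, "", regex=True).str.split().str.join(" ")
safe = s.str.replace(bounded, "", regex=True).str.split().str.join(" ")

print(naive.tolist())  # "roof" loses its trailing "of"
print(safe.tolist())   # "roof" is kept intact
```

The split-and-join step is just one way to tidy the spacing; `.str.strip()` is enough when the removed words only sit at the start or end.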
CodePudding user response:
I don't know how well this will extend to a larger dataset, given that I don't know how Korean uses whitespace between words, but it works on the given data.
We process the two columns separately, since the preposition "of" doesn't appear to have a counterpart in the 'ko' column and that affects the following steps. For each column, we split on whitespace to get list values, explode those into individual rows, and then take the value counts to see which words appear more than once:
ko=df['ko'].str.split().explode().value_counts()
en=df['en'].str.split().explode().value_counts()
ko
결핵 4
대상포진 3
심장의 1
심근의 1
심내막의 1
식도의 1
각막결막염 1
안검염 1
홍채염 1
Name: ko, dtype: int64
After that, we use boolean indexing to keep only the words that appear exactly once in each series:
ko_col=ko[ko==1]
en_col=en[en==1]
en_col
heart 1
myocardium 1
endocardium 1
oesophagus 1
keratoconjunctivitis 1
blepharitis 1
iritis 1
Name: en, dtype: int64
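This relies on the surviving words coming out in the original row order, and on every row contributing exactly one unique word per column; otherwise the two indexes will have different lengths or pair up the wrong rows. That is worth spot checking on your larger dataset. We then recombine to create your output dataframe: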
new_df=pd.DataFrame({'en':en_col.index,'ko':ko_col.index})
new_df
en ko
0 heart 심장의
1 myocardium 심근의
2 endocardium 심내막의
3 oesophagus 식도의
4 keratoconjunctivitis 각막결막염
5 blepharitis 안검염
6 iritis 홍채염
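If the ordering assumption turns out not to hold on the full data, one alternative is to filter the words while they are still attached to their original row index and only then join them back. This is a sketch under that assumption, not a tested solution for the 50,000-pair dataset; the helper name `keep_unique_words` is made up for illustration, and the sample `df` is rebuilt from the question so the block runs on its own.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "en": ["Tuberculosis of heart", "Tuberculosis of myocardium",
           "Tuberculosis of endocardium", "Tuberculosis of oesophagus",
           "Zoster keratoconjunctivitis", "Zoster blepharitis", "Zoster iritis"],
    "ko": ["심장의 결핵", "심근의 결핵", "심내막의 결핵", "식도의 결핵",
           "대상포진 각막결막염", "대상포진 안검염", "대상포진 홍채염"],
})

def keep_unique_words(col: pd.Series) -> pd.Series:
    """Keep only the words that occur exactly once in the whole column."""
    # One word per row; the exploded index still points at the source row
    words = col.str.split().explode()
    counts = words.value_counts()
    kept = words[words.map(counts) == 1]
    # Rejoin per source row; rows left with no unique word become ""
    return kept.groupby(level=0).agg(" ".join).reindex(col.index, fill_value="")

result = df.apply(keep_unique_words)
print(result)
```

Because each word keeps its row label throughout, a row with zero or several unique words yields an empty or multi-word cell rather than shifting the pairs that follow it.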