I have more than 100,000 phonetic entries in a dictionary file used for my Chinese input method. Some contributors cannot type zhuyin (bopomofo) and can only type pinyin (with numeric tone marks).
I wrote a sed script that automatically converts such pinyin notation to zhuyin. However, sed takes a very long time to process the entire file because the script itself is big: almost 500 lines of regex.
Is there a way to make sed apply the script only to those lines that contain pinyin?
Sample input:
鸞膠續斷 luan2 jiao1 xu4 duan4
鸞駕 luan2 jia4
鸞儔鳳侶 ㄌㄨㄢˊ ㄔㄡˊ ㄈㄥˋ ㄌㄩˇ
鸞輿 ㄌㄨㄢˊ ㄩˊ
鸞鵠停峙 ㄌㄨㄢˊ ㄏㄨˊ ㄊㄧㄥˊ ㄓˋ
鸞飄鳳泊 ㄌㄨㄢˊ ㄆㄧㄠ ㄈㄥˋ ㄅㄛˊ
鸞鶴 ㄌㄨㄢˊ ㄏㄜˋ
鸞鑑 ㄌㄨㄢˊ ㄐㄧㄢˋ
灩澦堆 yan4 yu4 dui1
灩灩 yan4 yan4
籲天 ㄩˋ ㄊㄧㄢ
籲求 ㄩˋ ㄑㄧㄡˊ
籲請 ㄩˋ ㄑㄧㄥˇ
麤服亂頭 ㄘㄨ ㄈㄨˊ ㄌㄨㄢˋ ㄊㄡˊ
麤疏 cu1 shu1
麤糲 cu1 li4
齾齾 ya4 ya4
齉鼻兒 nang4 bi2 er1
㔩葉 e4 ye4
㟏岈 ㄏㄢ ㄒㄧㄚ
㥄遽 ㄌㄧㄥˊ ㄐㄩˋ
㥏墨 ㄊㄧㄢˇ ㄇㄛˋ
㩳身 ㄙㄨㄥˇ ㄕㄣ
㲯毿 ㄌㄢˊ ㄙㄢ
㶁㶁 ㄍㄨㄛˊ ㄍㄨㄛˊ
㶟水 ㄌㄟˊ ㄕㄨㄟˇ
䀇子 gu3 zi5
䈾箕 shao1 ji1
䍪羯 wa4 jie2
䫄外 chua4 wai4
䰐鬖 lan2 san1
䰖兒 zuan3 er1
䰰䰰 ru2 ru2
Sample Bash Command (macOS):
sed -i '' -f ./CONV-HYPY2BPMF.SED SAMPLE_INPUT.txt
Sample sed script: https://github.com/ShikiSuen/vchewing-lingual-data/blob/main/utilities/CONV-HYPY2BPMF.SED
Desired output:
鸞膠續斷 ㄌㄨㄢˊ ㄐㄧㄠ ㄒㄩˋ ㄉㄨㄢˋ
鸞駕 ㄌㄨㄢˊ ㄐㄧㄚˋ
鸞儔鳳侶 ㄌㄨㄢˊ ㄔㄡˊ ㄈㄥˋ ㄌㄩˇ
鸞輿 ㄌㄨㄢˊ ㄩˊ
鸞鵠停峙 ㄌㄨㄢˊ ㄏㄨˊ ㄊㄧㄥˊ ㄓˋ
鸞飄鳳泊 ㄌㄨㄢˊ ㄆㄧㄠ ㄈㄥˋ ㄅㄛˊ
鸞鶴 ㄌㄨㄢˊ ㄏㄜˋ
鸞鑑 ㄌㄨㄢˊ ㄐㄧㄢˋ
灩澦堆 ㄧㄢˋ ㄩˋ ㄉㄨㄟ
灩灩 ㄧㄢˋ ㄧㄢˋ
籲天 ㄩˋ ㄊㄧㄢ
籲求 ㄩˋ ㄑㄧㄡˊ
籲請 ㄩˋ ㄑㄧㄥˇ
麤服亂頭 ㄘㄨ ㄈㄨˊ ㄌㄨㄢˋ ㄊㄡˊ
麤疏 ㄘㄨ ㄕㄨ
麤糲 ㄘㄨ ㄌㄧˋ
齾齾 ㄧㄚˋ ㄧㄚˋ
齉鼻兒 ㄋㄤˋ ㄅㄧˊ ㄦ
㔩葉 ㄜˋ ㄧㄝˋ
㟏岈 ㄏㄢ ㄒㄧㄚ
㥄遽 ㄌㄧㄥˊ ㄐㄩˋ
㥏墨 ㄊㄧㄢˇ ㄇㄛˋ
㩳身 ㄙㄨㄥˇ ㄕㄣ
㲯毿 ㄌㄢˊ ㄙㄢ
㶁㶁 ㄍㄨㄛˊ ㄍㄨㄛˊ
㶟水 ㄌㄟˊ ㄕㄨㄟˇ
䀇子 ㄍㄨˇ ㄗ˙
䈾箕 ㄕㄠ ㄐㄧ
䍪羯 ㄨㄚˋ ㄐㄧㄝˊ
䫄外 ㄔㄨㄚˋ ㄨㄞˋ
䰐鬖 ㄌㄢˊ ㄙㄢ
䰖兒 ㄗㄨㄢˇ ㄦ
䰰䰰 ㄖㄨˊ ㄖㄨˊ
CodePudding user response:
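If the order of the lines in the file does not matter, you can use grep to split the file, run sed only on the lines that contain pinyin, and then append the untouched lines: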
# get lines to be translated
grep '[a-z0-9]' input.txt | sed -f CONV-HYPY2BPMF.SED > output.txt
# append lines that are not to be translated
grep -v '[a-z0-9]' input.txt >> output.txt
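If the line order does matter, one possible alternative is to let sed itself skip the lines without pinyin. sed concatenates its -e and -f arguments in order, and the b command with no label jumps to the end of the script, so putting -e '/[a-z]/!b' before -f should leave every line that has no ASCII lowercase letter untouched and in place. The sketch below is only an illustration under assumptions: rules.sed is a hypothetical two-rule stand-in for CONV-HYPY2BPMF.SED, the two input lines are taken from the sample input, and the assumption is that pinyin lines always contain at least one ASCII lowercase letter while zhuyin lines never do.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Hypothetical stand-in for CONV-HYPY2BPMF.SED: just two rules for illustration.
cat > "$workdir/rules.sed" <<'EOF'
s/luan2/ㄌㄨㄢˊ/g
s/jia4/ㄐㄧㄚˋ/g
EOF

# Two lines from the sample input: one in pinyin, one already in zhuyin.
printf '%s\n' '鸞駕 luan2 jia4' '鸞輿 ㄌㄨㄢˊ ㄩˊ' > "$workdir/input.txt"

# '/[a-z]/!b': if the line has no ASCII lowercase letter, branch to the end
# of the script, so none of the rules loaded with -f are tried on that line.
sed -e '/[a-z]/!b' -f "$workdir/rules.sed" "$workdir/input.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

The same -e expression could presumably be placed in front of -f in the in-place command from the question as well. How much time this saves depends on what share of the lines is already in zhuyin, so it is worth timing on the real file.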