I have Tibetan words and their POS tags, as below:
སྩོལ་ VERB
སྐབས་ NOUN
ཆོས་ NOUN
ཞུ་བ་ VERB
ཚོས་ NOUN
དེབ་ NOUN
How can I extract only the POS tags, as shown here:
VERB
NOUN
NOUN
VERB
NOUN
NOUN
Code I tried:
file = # given input file containing word and pos
for line in file:
    word = line.split(' ')[0]
    pos = line.split(' ')[2]
The code above does not produce the desired result. If anyone has an idea, it would help a lot!
CodePudding user response:
Try using a list comprehension with string.ascii_uppercase to keep only the uppercase ASCII letters on each line:
import string
text = """སྩོལ་ VERB
སྐབས་ NOUN
ཆོས་ NOUN
ཞུ་བ་ VERB
ཚོས་ NOUN
དེབ་ NOUN"""
for line in text.split('\n'):
    POS = ''.join([i for i in line if i in string.ascii_uppercase])
    print(POS)
Or from a file:
with open(filename, encoding='utf8') as fd:
    for line in fd:
        POS = ''.join([i for i in line if i in string.ascii_uppercase])
        print(POS)
output:
VERB
NOUN
NOUN
VERB
NOUN
NOUN
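For reference, here is a minimal sketch of why the indexing in the question likely fails. It assumes each line has a single space between the word and the tag and ends with a newline, which may not match the real file:

```python
# One line as it would be read from a file (assumed: single space, trailing newline).
line = "སྩོལ་ VERB\n"

parts_single = line.split(' ')  # splits on each single space only
parts_any = line.split()        # splits on any run of whitespace and drops the newline

print(len(parts_single))  # only two fields here, so index [2] would raise IndexError
print(parts_single)       # the tag field still carries the trailing newline
print(parts_any)          # clean fields

pos = parts_any[-1]       # the tag is the last field
print(pos)
```

Taking the last field with [-1] avoids having to know how many separators precede the tag.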
CodePudding user response:
You may have several spaces or tabs between the words and the POS tags. I'd try sep=None (the default), which splits on runs of whitespace, like the regex '\s+'; see help(str.split). It also seems reasonable to split from the right side and only once:
data = '''སྩོལ་ VERB
སྐབས་ NOUN
ཆོས་ NOUN
ཞུ་བ་ VERB
ཚོས་ NOUN
དེབ་ NOUN'''
records = data.split('\n')
records = [rec.rstrip().rsplit(maxsplit=1) for rec in records]
POS = [r[-1] for r in records]
# As an option, build a word -> POS mapping (a repeated word keeps only its last tag):
data = {word: pos for word, pos in records}
POS = [*data.values()]  # same as list(dict(records).values())
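The same right-split idea can be wrapped in a small helper that also tolerates blank lines. This is only a sketch: the name extract_pos and the sample text are made up for illustration, and it assumes the tag is always the last whitespace-separated field on a line.

```python
def extract_pos(lines):
    """Return the last whitespace-separated field of each non-blank line."""
    tags = []
    for line in lines:
        # sep=None splits on any run of whitespace; maxsplit=1 splits once from the right
        fields = line.rsplit(maxsplit=1)
        if fields:  # an empty or whitespace-only line gives an empty list
            tags.append(fields[-1])
    return tags


# Hypothetical sample mixing a space, a tab, a blank line and several spaces.
sample = "སྩོལ་ VERB\nསྐབས་\tNOUN\n\nཞུ་བ་   VERB\n"
print(extract_pos(sample.splitlines()))
```

Because the helper only iterates over lines, an open file object (for example one opened with encoding='utf8') can be passed in place of sample.splitlines().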