So I have two files. One yaml file that contains tibetan words : its meaning
. Another csv file that contains only word and it's POStag. As below:
yaml file :
ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།
csv file :
ད་ཆུ PART
ད་གདོད DET
Desired output:
ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།
Any idea on how to make text match from csv file to yaml file and extract its meaning in csv?
CodePudding user response:
On a functional point of view, you have:
- a dictionary, meaning here a key: value thing
- a list of words to search in that dictionary, and that will produce a record
If everything can fit in memory, you can first read the yaml file to produce a Python dictionary, and then read the words file, one line at a time and use the above dictionary to generate the expected line. If the yaml file is too large, you could use the dbm
(or shelve
) module as an on disk dictionary.
As you have not shown any code, I cannot either... I can just say that you can simply use process the second file as a plain text one and just read it one line at a time. For the first one, you can either look for a yaml
module from PyPI, or if the syntax is always as simple as the lines you have shown, just process it as text one line at a time and use split
to extract the key and the value.
CodePudding user response:
The easiest solution that came to my mind would be iterating over all lines in the YAML-file and checking if the word is inside the CSV-file:
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")
for line in YAML_LINES:
word, meaning = line.split(": ")
if word in CSV_LINES:
output = word " " meaning
print(output)
The YAML_LINES
and CSV_LINES
lists are only to provide a quick and dirty example.
CodePudding user response:
Assuming your files are called dict.yml
and input.csv
.
You can start by turning the yaml file into a dictionary with
import yaml
with open('dict.yaml', 'r') as file:
trans_dict = yaml.safe_load(file)
Which should give you
>>> trans_dict
{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}
Then, you can iterate over the lines in the CSV and use the dictionary to get the definition:
outputs = []
with open('text.txt', 'r') as file:
for line in file:
term = line.strip()
definition = trans_dict.get(term.strip())
outputs.append(
term if definition is None
else f"{term} {definition}"
)
From here, your outputs
variable should contain ['ད་ཆུ དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད ད་གཟོད་དང་དོན་འདྲ།']
. If you optionally wanted to write this out to a file, you could do
with open('output.txt', 'w') as file:
file.write('\n'.join(outputs))
If you had more tokens on each line of the CSV (unclear from your post), you could iterate over those tokens within a line, but you'd be able to apply basically the same approach.