Given a list [[["source1"], ["target1"], "alignment1"], [["source2"], ["target2"], "alignment2"], ...], I want to extract the words in the source that align with the words in the target.
For example, for the English-German sentence pair "The hat is on the table ." / "Der Hut liegt auf dem Tisch .", I want to print the following:
The - Der
hat - Hut
is - liegt
on - auf
the - dem
table - Tisch
. - .
So I have written the following:
en_de = [
[['The', 'hat', 'is', 'on', 'the', 'table', '.'], ['Der', 'Hut', 'liegt', 'auf', 'dem', 'Tisch', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
[['The', 'picture', 'is', 'on', 'the', 'wall', '.'], ['Das', 'Bild', 'hängt', 'an', 'der', 'Wand', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
[['The', 'bottle', 'is', 'under', 'the', 'sink', '.'], ['Die', 'Flasche', 'ist', 'under', 'dem', 'Waschbecken', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6']
]
for group in en_de:
    src_sent = group[0]
    tgt_sent = group[1]
    aligns = group[2]
    split_aligns = aligns.split()
    hyphen_split = [align.split("-") for align in split_aligns]
    align_index = hyphen_split[0]
    print(src_sent[int(align_index[0])], "-", tgt_sent[int(align_index[1])])
This prints, as expected, the words at index position 0 of src_sent and tgt_sent:
The - Der
The - Das
The - Die
Now, I don't know how to print the words at all index positions of src_sent and tgt_sent. Obviously, I could manually update align_index to a new index position for each position in the sentence pair, but on the full dataset, some sentences will have up to 25 index positions.
Is there a way to loop over every index position?
When I try:
align_index = hyphen_split[0:]
print(src_sent[int(align_index[0])],"-", tgt_sent[int(align_index[1])])
I get a TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
It's clear that align_index can't be a list, but I'm not sure how to convert it into something that will do what I want it to do.
Any advice or help would be greatly appreciated. Thank you in advance.
CodePudding user response:
If I understand correctly, you want this:
en_de = [
[['The', 'hat', 'is', 'on', 'the', 'table', '.'], ['Der', 'Hut', 'liegt', 'auf', 'dem', 'Tisch', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
[['The', 'picture', 'is', 'on', 'the', 'wall', '.'], ['Das', 'Bild', 'hängt', 'an', 'der', 'Wand', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
[['The', 'bottle', 'is', 'under', 'the', 'sink', '.'], ['Die', 'Flasche', 'ist', 'under', 'dem', 'Waschbecken', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6']
]
for sentences in en_de:
    for en, de in zip(*sentences[:2]):
        print(f'{en} - {de}')
This prints the English and German words pairwise for each sentence. It works as long as the words always pair up position by position. If the alignment is always linear, the alignment string is not needed at all.
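Whether the alignment string can safely be ignored could be checked up front. This is only a minimal sketch, assuming the alignment is a space-separated string of "src-tgt" index pairs as in the question; the helper name is_linear is made up for illustration.

```python
def is_linear(src_sent, tgt_sent, aligns):
    """Return True if aligns maps position i to position i for every token."""
    pairs = [tuple(map(int, pair.split("-"))) for pair in aligns.split()]
    same_length = len(src_sent) == len(tgt_sent)
    # Linear means exactly the pairs (0, 0), (1, 1), ..., one per token.
    return same_length and pairs == [(i, i) for i in range(len(src_sent))]


src = ['The', 'hat', 'is', 'on', 'the', 'table', '.']
tgt = ['Der', 'Hut', 'liegt', 'auf', 'dem', 'Tisch', '.']

linear_ok = is_linear(src, tgt, '0-0 1-1 2-2 3-3 4-4 5-5 6-6')
# Hypothetical alignment in which positions 1 and 2 are swapped.
swapped = is_linear(src, tgt, '0-0 1-2 2-1 3-3 4-4 5-5 6-6')
print(linear_ok, swapped)
```

If this returns False for any sentence in the dataset, the plain zip above would pair the wrong words for that sentence.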
If the alignment is not always going to be linear, you would need to account for that too:
en_de = [
[['The', 'hat', 'is', 'on', 'the', 'table', '.'], ['Der', 'Hut', 'liegt', 'auf', 'dem', 'Tisch', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
[['The', 'picture', 'is', 'on', 'the', 'wall', '.'], ['Das', 'Bild', 'hängt', 'an', 'der', 'Wand', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
[['The', 'bottle', 'is', 'under', 'the', 'sink', '.'], ['Die', 'Flasche', 'ist', 'under', 'dem', 'Waschbecken', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6']
]
for sentences in en_de:
    # alternative to the below for loop
    # alignment = [(int(a), int(b)) for a, b in [p.split('-') for p in sentences[2].split()]]
    alignment = []
    for pair in sentences[2].split():
        e, g = pair.split('-')
        alignment.append((int(e), int(g)))
    english = [sentences[0][i] for i, _ in alignment]
    german = [sentences[1][i] for _, i in alignment]
    for en, ge in zip(english, german):
        print(f'{en} - {ge}')
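To see where the alignment-based version differs from a plain zip, here is a sketch with a sentence pair whose word order differs between the two languages. The sentence pair and its alignment string are my own illustration, not taken from your dataset, and the function name aligned_pairs is hypothetical.

```python
def aligned_pairs(src_sent, tgt_sent, aligns):
    """Return (source word, target word) tuples for each 'i-j' entry in aligns."""
    pairs = []
    for pair in aligns.split():
        s, t = pair.split('-')
        pairs.append((src_sent[int(s)], tgt_sent[int(t)]))
    return pairs


# Hypothetical example: the German participle moves to the end of the clause.
src = ['I', 'have', 'seen', 'the', 'film', '.']
tgt = ['Ich', 'habe', 'den', 'Film', 'gesehen', '.']
aligns = '0-0 1-1 2-4 3-2 4-3 5-5'

result = aligned_pairs(src, tgt, aligns)
for en, de in result:
    print(f'{en} - {de}')
```

A plain zip over the two word lists would pair "seen" with "den" here, whereas following the alignment indices pairs "seen" with "gesehen".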
CodePudding user response:
You are forgetting to loop over your hyphen_split list:
for group in en_de:
    src_sent = group[0]
    tgt_sent = group[1]
    aligns = group[2]
    split_aligns = aligns.split()
    hyphen_split = [align.split("-") for align in split_aligns]
    for align_index in hyphen_split:
        print(src_sent[int(align_index[0])], "-", tgt_sent[int(align_index[1])])
See the last two lines, updated from your code.
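As for the TypeError in the question: hyphen_split[0:] is a copy of the whole list of pairs, so align_index[0] is itself a list such as ['0', '0'], which int() cannot convert. A short sketch of this, using one alignment string from the question (the variable names first_pair, sliced, error_name and indices are only for illustration):

```python
aligns = '0-0 1-1 2-2 3-3 4-4 5-5 6-6'
hyphen_split = [align.split("-") for align in aligns.split()]

first_pair = hyphen_split[0]      # a single pair: ['0', '0']
sliced = hyphen_split[0:]         # still the full list of pairs

try:
    int(sliced[0])                # sliced[0] is ['0', '0'], a list
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Looping hands int() one string at a time, so the conversion works.
indices = [(int(s), int(t)) for s, t in hyphen_split]
print(first_pair, error_name, indices[:2])
```

Unpacking each pair as s, t in the loop is an optional variation on align_index[0] and align_index[1]; both do the same thing.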