How to combine consecutive strings based on values in another list and conditions?

Time:10-19

I have 2 lists:

tokens = ['[CLS]', 'Thinking', 'historically', 'is', ',', 'first', ',', 'an', 'attitude', 'acknowledging', 'that', 'every', 'event', 'can', 'be', 'meaningful', '##ly', 'understood', 'only', 'in', 'relation', 'to', 'previous', 'events', ',', 'and', ',', 'second', ',', 'the', 'method', '##ical', 'application', 'of', 'this', 'attitude', ',', 'which', 'en', '##tails', 'both', 'analyzing', 'events', 'context', '##ually', '-', '-', 'as', 'having', 'occurred', 'in', 'the', 'midst', 'of', 'pre', '-', 'existing', 'circumstances', '-', '-', 'and', 'comprehend', '##ing', 'them', 'from', 'historical', 'actors', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
labels = [0, 0, 0, 0, 0, 2, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0]

I also have a dictionary that maps the labels to their meaning:

labels_meaning = {}
labels_meaning[0] = 'subject'
labels_meaning[1] = 'relation'
labels_meaning[2] = 'object'
labels_meaning[3] = 'na'

The goal is to place each combined string into the list that matches its label (ignoring na):

subjects = []
relations = []
objects = []

There are 3 conditions:

  1. Tokens with consecutive identical labels (e.g., 0, 0, 0) should be combined into one string. For example, the first 5 labels are 0s, so the first string is "[CLS] Thinking historically is ," (or "Thinking historically is ," once [CLS] is dropped under condition 3), and it should be appended to the matching list: subjects.append(string)
  2. If a token contains "##", it should be concatenated with the previous token without a space, e.g., "meaningful", "##ly" --> "meaningfully", provided both have the same label. Otherwise the "##" should be removed and the remainder appended to the list for its own label, e.g., subjects.append("ly")
  3. A few tokens should be ignored: [CLS], [SEP], [PAD]
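
Condition 2 on its own can be sketched as follows, for the tokens of a single run that already share one label. This is only an illustration: the helper name merge_wordpieces is made up here, and it assumes "##" only ever appears as a prefix of a token.

```python
def merge_wordpieces(pieces):
    """Join tokens with spaces, gluing '##' continuations onto the previous token."""
    words = []
    for piece in pieces:
        if piece.startswith("##"):
            suffix = piece[2:]
            if words:
                words[-1] += suffix   # same run: attach without a space
            else:
                words.append(suffix)  # nothing to attach to: keep the bare suffix
        else:
            words.append(piece)
    return " ".join(words)


print(merge_wordpieces(["be", "meaningful", "##ly", "understood"]))  # be meaningfully understood
print(merge_wordpieces(["##ing"]))  # ing
```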

Update:

Adding my attempt, but I'm stuck on combining the consecutive tokens:

labels_meaning = {}
labels_meaning[0] = 'subject'
labels_meaning[1] = 'relation'
labels_meaning[2] = 'object'
labels_meaning[3] = 'na'
ignore = ['[CLS]', '[SEP]', '[PAD]']

def get_sentence_triples_from_token_labels(tokens, token_labels):
    for tok, label in zip(tokens, token_labels):
        current_label = label
        if tok == '[CLS]': # initialize
            previous_label = current_label
            prev = False
            current_string = ''
        if tok not in ignore:
            if previous_label != current_label and prev==True:
                current_string = f'{tok} ' 
                pass
                
            else:
                pass
            
            prev = True


        break


get_sentence_triples_from_token_labels(tokens, labels)

CodePudding user response:

solution

Not sure if this is what you want.

labels_meaning = { 0:'subject', 1:'relation', 2:'object', 3:'na' }
ignore = ['[CLS]', '[SEP]', '[PAD]']


tokens = ['[CLS]', 'Thinking', 'historically', 'is', ',', 'first', ',', 'an', 'attitude', 'acknowledging', 'that', 'every', 'event', 'can', 'be', 'meaningful', '##ly', 'understood', 'only', 'in', 'relation', 'to', 'previous', 'events', ',', 'and', ',', 'second', ',', 'the', 'method', '##ical', 'application', 'of', 'this', 'attitude', ',', 'which', 'en', '##tails', 'both', 'analyzing', 'events', 'context', '##ually', '-', '-', 'as', 'having', 'occurred', 'in', 'the', 'midst', 'of', 'pre', '-', 'existing', 'circumstances', '-', '-', 'and', 'comprehend', '##ing', 'them', 'from', 'historical', 'actors', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
labels = [0, 0, 0, 0, 0, 2, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0]

subjects = []
relations = []
objects = []

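def get_sentence_triples_from_token_labels(tokens, token_labels):
    # One buffer per label, holding the tokens of that label's most recent run
    dicRlt = {lab: [] for lab in [0, 1, 2, 3]}
    last_label = token_labels[0]
    for tok, label in zip(tokens, token_labels):
        if tok not in ignore:
            if last_label != label:
                # A new run starts for `label`: flush that label's previous run.
                # When a label is switched to for the first time, its buffer may
                # still be empty, which is where the '' entries in the output come from.
                if label == 0:
                    subjects.append(" ".join(dicRlt[0]).replace(" ##", ""))
                elif label == 1:
                    relations.append(" ".join(dicRlt[1]).replace(" ##", ""))
                elif label == 2:
                    objects.append(" ".join(dicRlt[2]).replace(" ##", ""))
                dicRlt[label] = []
            dicRlt[label].append(tok)
            last_label = label
    # Flush whatever is left in each buffer after the loop
    subjects.append(" ".join(dicRlt[0]).replace(" ##", ""))
    relations.append(" ".join(dicRlt[1]).replace(" ##", ""))
    objects.append(" ".join(dicRlt[2]).replace(" ##", ""))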
  • Test

    get_sentence_triples_from_token_labels(tokens, labels)
    print(subjects)
    print(relations)
    print(objects)

  • Output:

    ['Thinking historically is ,', 'attitude', 'that every event can be meaningfully understood only', 'relation', 'previous events ,', ', second', 'methodical', 'attitude', 'entails', 'events contextually -', 'as having occurred in', 'circumstances -', 'comprehend', 'them from', 'actors']
    ['', ', an', 'acknowledging', 'in', 'to', 'and', ', the', 'application of this', ', which', 'both analyzing', '-', 'the midst of pre - existing', '- and', '##ing', 'historical']
    ['', 'first']
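
The output above still contains empty strings and an unmerged '##ing'. As an alternative, here is a minimal sketch built on itertools.groupby that returns the lists instead of filling globals. The names group_tokens_by_label, IGNORE and LABEL_NAMES are made up for this illustration, and it assumes "##" only appears as a token prefix and that a "##" token starting a run should just lose its prefix (condition 2 of the question).

```python
from itertools import groupby

IGNORE = {"[CLS]", "[SEP]", "[PAD]"}
LABEL_NAMES = {0: "subject", 1: "relation", 2: "object", 3: "na"}


def group_tokens_by_label(tokens, labels):
    """Return a dict mapping 'subject', 'relation', 'object' to lists of strings."""
    result = {name: [] for name in LABEL_NAMES.values() if name != "na"}
    # Drop special tokens first, so runs on either side of one count as adjacent.
    pairs = [(tok, lab) for tok, lab in zip(tokens, labels) if tok not in IGNORE]
    # groupby yields one group per run of consecutive identical labels.
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        name = LABEL_NAMES.get(label, "na")
        if name == "na":
            continue
        words = []
        for tok, _ in run:
            if tok.startswith("##"):
                if words:
                    words[-1] += tok[2:]   # glue onto the previous token of the run
                else:
                    words.append(tok[2:])  # run starts with a word piece: strip '##'
            else:
                words.append(tok)
        result[name].append(" ".join(words))
    return result


# Small hand-made example (not the full lists from the question)
demo_tokens = ["[CLS]", "be", "meaningful", "##ly", "understood", "in",
               "comprehend", "##ing", "first", "[SEP]", "[PAD]"]
demo_labels = [0, 0, 0, 0, 0, 1, 0, 1, 2, 2, 0]
triples = group_tokens_by_label(demo_tokens, demo_labels)
print(triples["subject"])   # ['be meaningfully understood', 'comprehend']
print(triples["relation"])  # ['in', 'ing']
print(triples["object"])    # ['first']
```

Because each run is emitted as soon as groupby closes it, no empty buffers are ever flushed, and calling the function twice does not accumulate results from the first call.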
    