Deleting the sentence and updating the index

I am working with data in the following format.

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"}]}]

I want to convert it into the format below. Sentences that do not contain any entity have to be removed, and the start and end of the remaining entities have to be updated to account for the removed sentences.

result_data = [{"content":'''Hello I am Aniyya. I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":33,"end":39,"tag":"fruit"}]}]

I cannot work out the logic for this. I know this is close to asking someone to write the code for me, but if anyone has time to help, I would appreciate it a lot; I am stuck on this. I asked a similar question previously, but that did not work out for me either, so I thought I would describe the problem in more detail. A solution would be helpful for anyone preparing datasets for NLP tasks. Thanks in advance.

The visualization is done with spaCy's displacy; the code is in "Visualizing NER training data and entity using displacy".

CodePudding user response：

import re

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes. Aniyya is great.''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"},
                                {"id":3,"start":67,"end":73,"tag":"name"}]}]
         
         
         
# Store the annotated text on each annotation so it can be located again later.
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word

# Split the content into sentences on '.', restoring the trailing dot.
sentences = [{'sentence': x.strip() + '.', 'checked': False} for x in data[0]['content'].split('.')]

new_data = [{'content':'', 'annotations':[]}]
for idx, each in enumerate(data[0]['annotations']):
    for idx_alpha, sentence in enumerate(sentences):
        # Skip sentences that were already copied to the output.
        if sentence['checked']:
            continue
        temp = each.copy()
        check_word = temp['word']
        if check_word in sentence['sentence']:
            # Position of the word inside this sentence.
            start_idx = re.search(r'\b({})\b'.format(re.escape(check_word)), sentence['sentence']).start()
            end_idx = start_idx + len(check_word)

            # Length of the output so far; the new offsets are relative to it.
            current_len = len(new_data[0]['content'])

            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)

            sentences[idx_alpha]['checked'] = True
            break

Output:

print(new_data)
[{'content': 'Hello I am Aniyya. I love eating grapes. Aniyya is great. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 33, 'end': 39, 'tag': 'fruit', 'word': 'grapes'}, {'id': 3, 'start': 41, 'end': 47, 'tag': 'name', 'word': 'Aniyya'}]}]
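
One way to sanity-check a result like the one above is to slice the new content with each annotation's start and end and compare the slice with the stored word. The snippet below is only a minimal sketch: offsets_match is a hypothetical helper name, not part of the answer's code, and new_data is copied from the printed output.

```python
def offsets_match(row):
    """Return True if every annotation's [start:end] slice equals its 'word'."""
    content = row["content"]
    return all(content[a["start"]:a["end"]] == a["word"]
               for a in row["annotations"])


# Copied from the printed output above.
new_data = [{
    "content": "Hello I am Aniyya. I love eating grapes. Aniyya is great. ",
    "annotations": [
        {"id": 1, "start": 11, "end": 17, "tag": "name", "word": "Aniyya"},
        {"id": 2, "start": 33, "end": 39, "tag": "fruit", "word": "grapes"},
        {"id": 3, "start": 41, "end": 47, "tag": "name", "word": "Aniyya"},
    ],
}]

check_passed = offsets_match(new_data[0])
print(check_passed)
```

If the check returns False, at least one offset no longer points at the text it was meant to label.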

CodePudding user response：

From what I see in the question, the sentences are separated by the delimiter '.' (dot). That means you can split the content into individual sentences and then check, for each sentence, whether it contains an annotation. If it does not, delete that sentence from the string.

I've written a draft of a solution along those lines, and it gets the job done. Feel free to suggest any changes. You will probably also need to tune it to your exact requirements.

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"}]}]
identifier = '#'

def processRow(row):
    annotations = row["annotations"]
    temp = row["content"]
    # Mark the last character of every annotated span with the identifier.
    for annotation in annotations:
        end = annotation["end"] - 1
        temp = temp[:end] + identifier + temp[end+1:]

    result = ""
    temp = temp.split(".")
    content = row["content"].split(".")

    # Keep only the sentences whose marked copy contains the identifier.
    for tempRow, sentence in zip(temp, content):
        if identifier in tempRow:
            result = result + sentence + "."

    return result

def processData(data):
    for row in data:
        temp = processRow(row)
        row["content"] = temp
    print(data)


processData(data)
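
Note that the draft above rewrites only row["content"]; the start and end values of the annotations are left as they were. The following is a minimal sketch of one way to shift them as well, working from the original offsets instead of searching for the word again. It assumes that '.' is the only sentence delimiter (so abbreviations or decimals would be split incorrectly), that no annotation spans two sentences, and that kept sentences are joined with a single space. drop_unannotated_sentences and joiner are hypothetical names chosen for illustration.

```python
import re


def drop_unannotated_sentences(row, joiner=" "):
    """Keep only sentences containing at least one annotation and
    shift the surviving annotations' offsets to the new content."""
    content = row["content"]
    new_content = ""
    new_annotations = []
    # Each match is a run of non-dot characters plus an optional closing dot.
    for match in re.finditer(r"[^.]+\.?", content):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Offsets of the stripped sentence inside the original content.
        sent_start = match.start() + (len(raw) - len(raw.lstrip()))
        sent_end = sent_start + len(sentence)
        inside = [a for a in row["annotations"]
                  if sent_start <= a["start"] and a["end"] <= sent_end]
        if not inside:
            continue  # no entity in this sentence, so drop it
        if new_content:
            new_content += joiner
        # Distance the sentence moves when appended to the new content.
        shift = len(new_content) - sent_start
        for a in inside:
            new_annotations.append(
                {**a, "start": a["start"] + shift, "end": a["end"] + shift})
        new_content += sentence
    return {"content": new_content, "annotations": new_annotations}


# The data from the question, including the line break in the content.
row = {"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
       "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                       {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}
result = drop_unannotated_sentences(row)
print(result)
```

Because the new offsets are derived from the old ones, the same word appearing in several sentences does not cause an annotation to be attached to the wrong sentence.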