I am trying to create a training dataset for NER. I have a large amount of data that needs to be tagged, and I want to remove the unnecessary (untagged) sentences. When a sentence is removed, the index positions of the remaining annotations must be updated. Yesterday I saw some very useful code from other users about this, which I cannot find now. Adapting their code, I can summarize my issue as follows.
Let's take a sample of the training data:
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
This can be visualized using the following spaCy displacy code:
import json
import spacy
from spacy import displacy
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
data_index = 0  # index of the record to visualize
annot_tags = data[data_index]["annotations"]
entities = []
for j in annot_tags:
    start = j["start"]
    end = j["end"]
    tag = j["tag"]
    entitie = (start, end, tag)
    entities.append(entitie)
data_gen = (data[data_index]["content"], {"entities": entities})
data_one = []
data_one.append(data_gen)
nlp = spacy.blank('en')
raw_text = data_one[0][0]
doc = nlp.make_doc(raw_text)
spans = data_one[0][1]["entities"]
ents = []
for span_start, span_end, label in spans:
    ent = doc.char_span(span_start, span_end, label=label)
    if ent is None:  # span does not line up with token boundaries
        continue
    ents.append(ent)
doc.ents = ents
displacy.render(doc, style="ent", jupyter=True)
The output will be
Now I want to remove every sentence that is not tagged and update the index values accordingly. The required output looks like this:
The data must also be in the following format: the untagged sentence is removed and the index values are updated, so that I get the output shown above.
Required output data:
[{"content":'''Hello we are hans and john.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":42,"end":48,"tag":"fruit"},
{"id":4,"start":50,"end":56,"tag":"name"}]}]
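A quick way to confirm that updated index values are right is to slice the content with each start/end pair and look at the text that comes back. The helper below is only a sketch; the name `tagged_text` is made up for illustration and is not part of spaCy.

```python
def tagged_text(record):
    """Return (id, tag, text) for each annotation by slicing content[start:end]."""
    content = record["content"]
    return [
        (a["id"], a["tag"], content[a["start"]:a["end"]])
        for a in record["annotations"]
    ]


# The required output data from above, written as a single record.
expected = {
    "content": "Hello we are hans and john.\n"
               "I love eating grapes. Hanaan is great.",
    "annotations": [
        {"id": 1, "start": 13, "end": 17, "tag": "name"},
        {"id": 2, "start": 22, "end": 26, "tag": "name"},
        {"id": 3, "start": 42, "end": 48, "tag": "fruit"},
        {"id": 4, "start": 50, "end": 56, "tag": "name"},
    ],
}

checked = tagged_text(expected)
for item in checked:
    print(item)
```

If any slice shows a shifted or truncated word, the offsets for that annotation were not updated correctly.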
I was following a post yesterday and got code that nearly works.
Code
import re
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
# Store the tagged word alongside each annotation
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word
sentences = [{'sentence': x.strip() + '.', 'checked': False} for x in data[0]['content'].split('.')]
new_data = [{'content': '', 'annotations': []}]
for idx, each in enumerate(data[0]['annotations']):
    for idx_alpha, sentence in enumerate(sentences):
        if sentence['checked'] == True:
            continue
        temp = each.copy()
        check_word = temp['word']
        if check_word in sentence['sentence']:
            start_idx = re.search(r'\b({})\b'.format(check_word), sentence['sentence']).start()
            end_idx = start_idx + len(check_word)
            current_len = len(new_data[0]['content'])
            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)
            sentences[idx_alpha]['checked'] = True
            break
print(new_data)
Output
[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. ',
'annotations': [{'id': 1,
'start': 13,
'end': 17,
'tag': 'name',
'word': 'hans'},
{'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'}]}]
Here the name john is lost. If a sentence contains more than one tag, I cannot afford to lose any of them.
I know this is a lot to ask, but any help is appreciated.
Thanks in advance.
Please upvote the question; since I am a beginner, that would give me access to more Stack Overflow features.
CodePudding user response:
It's a pretty complicated task, because you need to identify sentences, and a simple split on '.' may not work: it will also split on things like 'Mr.'.
Since you are using spaCy, why not let it identify the sentences, then iterate through them, calculate the new start and end indexes, and skip any sentence that doesn't contain an entity? Then reconstruct the content.
import json
import spacy
from spacy import displacy
import re
data = [{"content":'''Hello we are hans and john. I enjoy playing Football. \
I love eating grapes. Hanaan is great. Mr. Jones is nice.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"},
{"id":5,"start":93,"end":102,"tag":"name"}]}]
# Store the tagged word alongside each annotation
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word
text = data[0]['content']
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('sentencizer')
doc = nlp(text)
sentences = [i for i in doc.sents]
annotations = data[0]['annotations']
new_data = [{"content": '',
             'annotations': []}]
for sentence in sentences:
    idx_to_remove = []
    for idx, annotation in enumerate(annotations):
        if annotation['word'] in sentence.text:
            temp = annotation.copy()
            # re.escape keeps characters such as '.' in 'Mr. Jones' literal
            start_idx = re.search(r'\b({})\b'.format(re.escape(annotation['word'])), sentence.text).start()
            end_idx = start_idx + len(annotation['word'])
            current_len = len(new_data[0]['content'])
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)
            idx_to_remove.append(idx)
    # Keep the sentence only if it contained at least one annotation
    if len(idx_to_remove) > 0:
        new_data[0]['content'] += sentence.text + ' '
    # Annotations are in document order, so the matched ones are at the front
    for x in range(0, len(idx_to_remove)):
        del annotations[0]
Output:
print(new_data)
[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. Mr. Jones is nice. ',
'annotations': [
{'id': 1, 'start': 13, 'end': 17, 'tag': 'name', 'word': 'hans'},
{'id': 2, 'start': 22, 'end': 26, 'tag': 'name', 'word': 'john'},
{'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'},
{'id': 5, 'start': 67, 'end': 76, 'tag': 'name', 'word': 'Mr. Jones'}]}]
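The same idea can also be sketched with character offsets instead of word search, so that a tagged word that also occurs untagged in another sentence is not picked up by accident. This is only a sketch under a few assumptions: `drop_untagged` and `sentence_spans` are hypothetical helper names, and the regex splitter is a naive stand-in so the example runs without spaCy. With spaCy, the spans could instead be built as `[(s.start_char, s.end_char) for s in doc.sents]`.

```python
import re


def sentence_spans(text):
    """Naive stand-in for a real sentence splitter: (start, end) character spans."""
    return [(m.start(), m.end())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)]


def drop_untagged(content, annotations, spans, sep=" "):
    """Keep only sentences containing an annotation and shift the offsets."""
    new_content = ""
    new_annotations = []
    for s_start, s_end in spans:
        # Annotations that lie fully inside this sentence
        inside = [a for a in annotations
                  if s_start <= a["start"] and a["end"] <= s_end]
        if not inside:
            continue  # untagged sentence: drop it
        if new_content:
            new_content += sep
        # Distance between the sentence's old and new position
        shift = len(new_content) - s_start
        new_content += content[s_start:s_end]
        for a in inside:
            new_annotations.append(
                {**a, "start": a["start"] + shift, "end": a["end"] + shift})
    return {"content": new_content, "annotations": new_annotations}


content = ("Hello we are hans and john. I enjoy playing Football.\n"
           "I love eating grapes. Hanaan is great.")
annotations = [
    {"id": 1, "start": 13, "end": 17, "tag": "name"},
    {"id": 2, "start": 22, "end": 26, "tag": "name"},
    {"id": 3, "start": 68, "end": 74, "tag": "fruit"},
    {"id": 4, "start": 76, "end": 82, "tag": "name"},
]

result = drop_untagged(content, annotations, sentence_spans(content))
print(result)
```

Because each annotation is assigned to a sentence by position, a sentence with several tags keeps all of them and is appended only once.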
CodePudding user response:
Just delete (or comment out) these two lines in your code:
#sentences[idx_alpha]['checked'] = True
#break
Output
[{'content': 'Hello we are hans and john. Hello we are hans and john. I love eating grapes. Hanaan is great. ',
'annotations':
[{'id': 1, 'start': 13, 'end': 17, 'tag': 'name', 'word': 'hans'},
{'id': 2, 'start': 50, 'end': 54, 'tag': 'name', 'word': 'john'},
{'id': 3, 'start': 70, 'end': 76, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 78, 'end': 84, 'tag': 'name', 'word': 'Hanaan'}]}]
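In the output above the first sentence is added once per tag it contains. One possible way around that, sketched here rather than offered as a definitive fix, is to loop over the sentences on the outside and over the annotations on the inside, so each sentence is appended at most once. The sketch still matches by word, as the question's code does, so it assumes a tagged word does not also appear in another sentence.

```python
import re

content = ("Hello we are hans and john. I enjoy playing Football.\n"
           "I love eating grapes. Hanaan is great.")
# Annotations with the 'word' key already filled in, as in the question's code
annotations = [
    {"id": 1, "tag": "name", "word": "hans"},
    {"id": 2, "tag": "name", "word": "john"},
    {"id": 3, "tag": "fruit", "word": "grapes"},
    {"id": 4, "tag": "name", "word": "Hanaan"},
]

# Naive split on '.', as in the question; empty pieces are skipped
sentences = [s.strip() + "." for s in content.split(".") if s.strip()]

new_content = ""
new_annotations = []
for sentence in sentences:
    found = []
    for ann in annotations:
        match = re.search(r"\b{}\b".format(re.escape(ann["word"])), sentence)
        if match:
            offset = len(new_content)  # where this sentence will start
            found.append({**ann,
                          "start": offset + match.start(),
                          "end": offset + match.end()})
    if found:  # append the sentence once, however many tags it has
        new_annotations.extend(found)
        new_content += sentence + " "

new_data = [{"content": new_content, "annotations": new_annotations}]
print(new_data)
```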