Extracting important entities from unstructured data

Time:03-09

I am working on an NLP problem and am completely stuck at a certain point. I am new to this, so pardon me if the question is dumb. I have completely unstructured text, for example: "a person named x y is travelling to country ab, he spent xyz (alpha/currency/beta/gamma), ate a b c d e f food items and many more." From this text I have to extract:

| name of person | country's name | amount spent and the currency | food items he ate | place of stay |

The constraint is that the text contains some false information. For example, foods b and c cannot be found in that particular country, so they should not be extracted. I have a nested dictionary that looks like this:

{country_name: {place 1: {name of hotels:[hotel1, hotel2, hotel3....],
                          eatables: [food1, food2, food3, food4.....],
                          currency_accepted: [c1, c2, c3, c4.......],
                          }
                }
} 

I want to use this dictionary to parse the unstructured text and extract the relevant entities into separate columns of a dataframe. I have seen NER-based approaches, but I believe they require tagging words, and I have a huge amount of data.

I have tried a regex-based approach for pattern matching, but it does not return all the results. I have also tried matching against all the entities stored in a list, but that extracts many false entities, and accuracy is quite important here.

I am looking for better parsing-based approaches. Also, is there a way to train a model on this dictionary so that it looks for the values of the nested dictionary only if the corresponding key is found in the unstructured text?

CodePudding user response:

Before you turn to machine learning, you could try fuzzywuzzy. I had a similar problem at work and was able to achieve high accuracy by adjusting the ratio threshold. For each extracted entity, you would run it through fuzzywuzzy against your dictionary.
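To make the idea concrete, here is a minimal sketch of threshold-based fuzzy matching. It uses the standard library's difflib.SequenceMatcher as a stand-in for fuzzywuzzy's ratio score so that it runs without extra installs; the names similarity, best_match, known_foods and the threshold of 85 are assumptions made for illustration, not part of the original setup.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (a stand-in for a fuzzy ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(candidate, known_values, threshold=85):
    """Return the closest known value, or None if no score reaches the threshold."""
    scored = [(similarity(candidate, value), value) for value in known_values]
    if not scored:
        return None
    score, value = max(scored)
    return value if score >= threshold else None


# Hypothetical list of values taken from the dictionary.
known_foods = ["sushi", "ramen", "tempura"]

misspelled = best_match("tempra", known_foods)   # close to "tempura"
unrelated = best_match("pizza", known_foods)     # nothing similar in the list
```

Raising the threshold trades recall for precision, so it is worth tuning on a small hand-checked sample of your own text.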

For the issue of

"but this creates the problem of many false entities being extracted"

I would implement a filter: if the extracted and matched entity is not in the list, leave it out; otherwise, continue with the logic.
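One way such a filter could look, as a rough sketch: find which country keys appear in the text, then keep only the values stored under those countries. The sample data, the sentence, and the helper names contains_term and extract_valid_entities are hypothetical; the dictionary shape follows the one described in the question, and plain whole-word matching is used here instead of fuzzy matching to keep the example short.

```python
import re

# Hypothetical reference data in the nested shape described in the question.
TRAVEL_DATA = {
    "japan": {
        "tokyo": {
            "name of hotels": ["park hotel"],
            "eatables": ["sushi", "ramen"],
            "currency_accepted": ["yen"],
        }
    },
    "italy": {
        "rome": {
            "name of hotels": ["hotel roma"],
            "eatables": ["pizza", "pasta"],
            "currency_accepted": ["euro"],
        }
    },
}

FIELDS = ("name of hotels", "eatables", "currency_accepted")


def contains_term(text, term):
    """True if term occurs in text as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def extract_valid_entities(text, data):
    """Return one row per country found in text, keeping only values listed for it."""
    rows = []
    for country, places in data.items():
        if not contains_term(text, country):
            continue  # only look at the nested values when the key is present
        row = {"country": country}
        for field in FIELDS:
            row[field] = []
        for place_info in places.values():
            for field in FIELDS:
                for value in place_info.get(field, []):
                    if contains_term(text, value) and value not in row[field]:
                        row[field].append(value)
        rows.append(row)
    return rows


text = "John Doe travelled to Japan, spent 5000 yen and ate sushi, pizza and ramen."
rows = extract_valid_entities(text, TRAVEL_DATA)
```

Here "pizza" is dropped because it is not listed under the country found in the text. The resulting list of dicts can be passed to pandas.DataFrame(rows) to get one column per field.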
