I have two lists, one containing true values selected by humans and a second list with extracted values. I would like to measure how well the pipeline is performing based on how many true values are contained in the extracted list. Example:
extracted_value = ["value", "of", "words", "that", "were", "tracked"]
real_value = ["value", "words", "that"]
I need a metric that describes: 3 out of 3 real values were extracted
For multiple documents:
5 out of 10 real values were extracted
2 out of 3 real values were extracted
1 out of 9 real values were extracted
Based on the individual comparisons, can I get a score that describes how well the extracted keywords perform on average across all documents?
CodePudding user response:
Will something simple like this work?
# fraction of real values that appear in the extracted list
score = len([x for x in real_value if x in extracted_value]) / len(real_value)
print(score)
>>> 1.0
CodePudding user response:
The metric you're looking for is recall.
@sfat's solution works well for a single document; you can then get the average over multiple documents by summing the per-document scores and dividing by the number of documents.
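A minimal sketch of that averaging, assuming the per-document lists are collected into two parallel lists (the names here are just for illustration):
def recall(real_value, extracted_value):
    # fraction of real values that appear in the extracted list
    return len([x for x in real_value if x in extracted_value]) / len(real_value)

real_per_doc = [["value", "words", "that"], ["a", "b", "c"]]
extracted_per_doc = [["value", "of", "words", "that"], ["a", "x"]]

scores = [recall(r, e) for r, e in zip(real_per_doc, extracted_per_doc)]
average_recall = sum(scores) / len(scores)
print(average_recall)  # (1.0 + 1/3) / 2 ≈ 0.667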
For more advanced scoring of your retrieval, look into the F-score, which combines recall with precision (the fraction of extracted values that are actually correct).
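As a rough illustration of how precision, recall, and the F1 score relate, using the example lists from the question (this is not code from the original answer):
extracted_value = ["value", "of", "words", "that", "were", "tracked"]
real_value = ["value", "words", "that"]

true_positives = len(set(real_value) & set(extracted_value))
precision = true_positives / len(extracted_value)   # 3/6 = 0.5
recall = true_positives / len(real_value)           # 3/3 = 1.0
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ≈ 0.667
print(precision, recall, f1)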
CodePudding user response:
To check how many values are shared between extracted_value and real_value (I believe you're looking for the recall of your model), you can use set operations, specifically the intersection operator &, divided by the size of your ground truth (real_value):
recall = len(set(real_value) & set(extracted_value))/len(real_value)
Or, if you want to know exactly which values are shared (you can always take the len of the result):
shared_vals = set(real_value) & set(extracted_value)
If you want to then calculate recall with shared_vals:
recall = len(shared_vals)/len(real_value)
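Putting it together with the example lists from the question, the two steps look like this:
extracted_value = ["value", "of", "words", "that", "were", "tracked"]
real_value = ["value", "words", "that"]

shared_vals = set(real_value) & set(extracted_value)
print(shared_vals)  # {'value', 'words', 'that'}

recall = len(shared_vals) / len(real_value)
print(recall)  # 1.0 -> 3 out of 3 real values were extracted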