I'm working on automating the training of a binary relevance multilabel classification model in Python. I'm using skmultilearn, with the key elements being a TF-IDF vectorizer and a BinaryRelevance(MultinomialNB()) classifier.
I'm running into accuracy problems and need to improve the quality of my training data.
This is very labour intensive (reading or manually filtering hundreds of news articles in Excel), so I'm looking for ways to automate it. My data comes from a university database where I search for articles relevant to what I'm studying. My end goal is to assign up to six labels to each article, where an article can have zero, one or multiple labels. My current idea for producing training data quickly is to search the university database using criteria for each label, then tag the results to produce something that looks like this:
ID | Title | Full Text | Label 1 | Label 2 | Search Criteria |
---|---|---|---|---|---|
0 | Article 1 | blahblah | 1 | 0 | Search terms associated with label 1 |
1 | Article 2 | blah | 1 | 0 | Search terms associated with label 1 |
2 | Article 2 | blah | 0 | 1 | Search terms associated with label 2 |
3 | Article 4 | balala | 0 | 1 | Search terms associated with label 2 |
4 | Article 5 | baaa | 0 | 1 | Search terms associated with label 2 |
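
A minimal sketch of how a tagged table like the one above might be assembled in pandas, assuming each label's search results are available as a separate DataFrame. The names `label1_hits`, `label2_hits` and `tag_hits` are hypothetical and only for illustration; real exports would more likely be loaded with `pd.read_csv` or `pd.read_excel`.

```python
import pandas as pd

# Hypothetical per-label search exports (stand-ins for real database exports).
label1_hits = pd.DataFrame({
    "Title": ["Article 1", "Article 2"],
    "Full Text": ["blahblah", "blah"],
})
label2_hits = pd.DataFrame({
    "Title": ["Article 2", "Article 4", "Article 5"],
    "Full Text": ["blah", "balala", "baaa"],
})

label_columns = ["Label 1", "Label 2"]


def tag_hits(hits, label, all_labels):
    """Return a copy of `hits` with one 0/1 column per label; only `label` is 1."""
    tagged_hits = hits.copy()
    for name in all_labels:
        tagged_hits[name] = int(name == label)
    tagged_hits["Search Criteria"] = f"Search terms associated with {label.lower()}"
    return tagged_hits


# Stack the tagged results; an article matching two searches appears twice.
tagged = pd.concat(
    [
        tag_hits(label1_hits, "Label 1", label_columns),
        tag_hits(label2_hits, "Label 2", label_columns),
    ],
    ignore_index=True,
)
tagged.insert(0, "ID", list(range(len(tagged))))
print(tagged)
```

Under these assumptions, "Article 2" appears in two rows, one per matching search.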
Searching per label like this returns the same article once for every label it matches. This is shown above for Article 2, which meets the search criteria for both label 1 and label 2. I now need to consolidate such instances to this:
ID | Title | Full Text | Label 1 | Label 2 |
---|---|---|---|---|
1 | Article 2 | blah | 1 | 1 |
Instead of this:
ID | Title | Full Text | Label 1 | Label 2 | Search Criteria |
---|---|---|---|---|---|
1 | Article 2 | blah | 1 | 0 | label 1 |
2 | Article 2 | blah | 0 | 1 | label 2 |
I'm very new to Python data processing; I only started using Python in order to explore its NLP packages. Any ideas on how to go about solving this problem? Is there some pandas DataFrame functionality that I could use?
User response:
Try this:
df.groupby('Title').agg('max').reset_index().drop('Search Criteria', axis=1)
Before:
   ID      Title  Full Text  Label 1  Label 2                       Search Criteria
0   0  Article 1   blahblah        1        0  Search terms associated with label 1
1   1  Article 2       blah        1        0  Search terms associated with label 1
2   2  Article 2       blah        0        1  Search terms associated with label 2
3   3  Article 4     balala        0        1  Search terms associated with label 2
4   4  Article 5       baaa        0        1  Search terms associated with label 2
After:
       Title  ID  Full Text  Label 1  Label 2
0  Article 1   0   blahblah        1        0
1  Article 2   2       blah        1        1
2  Article 4   3     balala        0        1
3  Article 5   4       baaa        0        1
Notice that there is only one "Article 2" row, and "Label 1" and "Label 2" are both 1. Its ID is 2 because 'max' is applied to every column, including ID.
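
The one-liner applies 'max' to every column, including the text and ID columns. A slightly more explicit variant, sketched here on the same sample data, takes the maximum only over the label columns and keeps the first ID and text seen for each title. It assumes that the title uniquely identifies an article, and `label_columns` is a name introduced for illustration.

```python
import pandas as pd

# Sample data matching the "Before" table.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4],
    "Title": ["Article 1", "Article 2", "Article 2", "Article 4", "Article 5"],
    "Full Text": ["blahblah", "blah", "blah", "balala", "baaa"],
    "Label 1": [1, 1, 0, 0, 0],
    "Label 2": [0, 0, 1, 1, 1],
    "Search Criteria": [
        "Search terms associated with label 1",
        "Search terms associated with label 1",
        "Search terms associated with label 2",
        "Search terms associated with label 2",
        "Search terms associated with label 2",
    ],
})

label_columns = ["Label 1", "Label 2"]

# Keep the first ID and text per title; a label is 1 if any duplicate row has it.
aggregations = {"ID": "first", "Full Text": "first"}
aggregations.update({col: "max" for col in label_columns})

# Columns not named in `aggregations` (here "Search Criteria") are left out.
consolidated = df.groupby("Title", as_index=False, sort=False).agg(aggregations)
print(consolidated)
```

With six labels, only `label_columns` would need to change. If two different articles can share a title, grouping on a more reliable key (for example title plus full text) would be safer.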