I have a long for loop that ends with inserting data into a specific column for each row in a DataFrame:
for n in range(df_length):
    keywords_list = []
    keywords_dict = {}
    try:
        text_for_ner = str(str(unique_new_articles['first_para'][n].split("--", 1)[1] + "\n" + unique_new_articles['headline'][n]))
    except:
        pass
    doc = nlp(text_for_ner)
    for token in doc.ents:
        if token.label_ == 'GPE' or token.label_ == 'LOC' or token.label_ == 'ORG' or token.label_ == 'PRODUCT' or token.label_ == 'FAC' or token.label_ == 'NORP' or token.label_ == 'PERSON' or token.label_ == 'EVENT' or token.label_ == 'LAW' or token.label_ == 'WORK_OF_ART':
            keywords_list.append(str(token))
    keywords_list = [word.replace('the ', '') if word.startswith('the') else word for word in keywords_list]
    keywords_list = [word.replace('The ', '') if word.startswith('The') else word for word in keywords_list]
    keywords_list = [word.replace("'s", "") if word.endswith("'s") else word for word in keywords_list]
    keywords_list = [word.replace("'", "") if word.endswith("s'") else word for word in keywords_list]
    keywords_dict = Counter(keywords_list)
    unique_new_articles['keywords'][n] = keywords_dict
I'm getting this warning though:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unique_new_articles['keywords'][n] = keywords_dict
When the DataFrame is small (I tested it with 10 rows), I get no such warning. But when running the code on my full DataFrame (6,000 rows), the warning appears. This makes me think the issue is related to DataFrame size, and that Pandas is being forced to treat things differently in memory.
But then I read this from Pandas: "You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect." This sounds relevant... but ultimately I'm not iterating over the DataFrame, right? I'm iterating over a range/integer object instead.
Or maybe the problem is the chained indexing I'm using (i.e., unique_new_articles['keywords'][n])?
Basically, I see several potential spots where the issue is arising, but I don't know how to fix any of them. Anyone have any insights?
CodePudding user response:
Your code simplifies to a df.apply form like the following, and it's possibly faster too.
from collections import Counter

allowed_labels = {
    "GPE",
    "LOC",
    "ORG",
    "PRODUCT",
    "FAC",
    "NORP",
    "PERSON",
    "EVENT",
    "LAW",
    "WORK_OF_ART",
}
def clean_token(token):
    # Strip a leading article and a trailing possessive from an entity.
    token = str(token)
    if token.startswith("the"):
        token = token[3:]
    if token.startswith("The"):
        token = token[3:]
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("'"):
        token = token[:-1]
    return token.strip()
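As a quick sanity check (the sample strings here are just illustrative):

    >>> clean_token("The Fed's")
    'Fed'
    >>> clean_token("the United States")
    'United States'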
def process_keywords(row):
    keywords_list = []
    # Run NER on everything after the first "--" plus the headline;
    # [-1] falls back to the whole paragraph when there is no "--",
    # which is the case your try/except was papering over.
    text_for_ner = row.first_para.split("--", 1)[-1] + "\n" + row.headline
    doc = nlp(text_for_ner)
    for token in doc.ents:
        if token.label_ in allowed_labels:
            keywords_list.append(clean_token(token))
    return Counter(keywords_list)
# ...
unique_new_articles["keywords"] = unique_new_articles.apply(
    process_keywords, axis=1
)
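This also explains your warning: assigning the whole column in one step avoids the chained indexing (unique_new_articles['keywords'][n] = ...) that triggers SettingWithCopyWarning, so DataFrame size was never the real issue. If you'd rather keep your explicit loop, writing through a single indexer such as .at sidesteps the warning too; a minimal sketch, assuming your DataFrame has a default integer index so that n is a valid row label:

    # inside the original loop, instead of chained indexing:
    unique_new_articles.at[n, 'keywords'] = keywords_dict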