I have a long for loop that ends with inserting data into a specific column for each row in a DataFrame:
for n in range(df_length):
    keywords_list = []
    keywords_dict = {}
    try:
        text_for_ner = str(str(unique_new_articles['first_para'][n].split("--", 1)[1] + "\n" + unique_new_articles['headline'][n]))
    except:
        pass
    doc = nlp(text_for_ner)
    for token in doc.ents:
        if token.label_ == 'GPE' or token.label_ == 'LOC' or token.label_ == 'ORG' or token.label_ == 'PRODUCT' or token.label_ == 'FAC' or token.label_ == 'NORP' or token.label_ == 'PERSON' or token.label_ == 'EVENT' or token.label_ == 'LAW' or token.label_ == 'WORK_OF_ART':
            keywords_list.append(str(token))
    keywords_list = [word.replace('the ', '') if word.startswith('the') else word for word in keywords_list]
    keywords_list = [word.replace('The ', '') if word.startswith('The') else word for word in keywords_list]
    keywords_list = [word.replace("'s", "") if word.endswith("'s") else word for word in keywords_list]
    keywords_list = [word.replace("'", "") if word.endswith("s'") else word for word in keywords_list]
    keywords_dict = Counter(keywords_list)
    unique_new_articles['keywords'][n] = keywords_dict
I'm getting this warning though:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unique_new_articles['keywords'][n] = keywords_dict
When the DataFrame is small (I tested it with 10 rows), I get no such warning. But when running the code on my full DataFrame (6,000 rows), the warning appears. This makes me think the issue is related to DataFrame size, and that Pandas is being forced to treat things differently in memory.
But then I read this from Pandas: "You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect." This sounds relevant... but ultimately I'm not iterating over the DataFrame, right? I'm iterating over a range/integer object instead.
Or maybe the problem is the chained indexing I'm using (i.e., unique_new_articles['keywords'][n])?
Basically, I see several potential spots where the issue is arising, but I don't know how to fix any of them. Anyone have any insights?
CodePudding user response:
Your code simplifies to a df.apply form like the following, and it's possibly faster too.
from collections import Counter

allowed_labels = {
    "GPE",
    "LOC",
    "ORG",
    "PRODUCT",
    "FAC",
    "NORP",
    "PERSON",
    "EVENT",
    "LAW",
    "WORK_OF_ART",
}
def clean_token(token):
    # Strip a leading article and a trailing possessive from an entity.
    token = str(token)
    if token.startswith("the"):
        token = token[3:]
    if token.startswith("The"):
        token = token[3:]
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("'"):
        token = token[:-1]
    return token.strip()
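As a quick sanity check (the sample strings here are just illustrative):

    >>> clean_token("The Fed's")
    'Fed'
    >>> clean_token("the United States")
    'United States'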
def process_keywords(row):
    keywords_list = []
    # Run NER on everything after the first "--" plus the headline;
    # [-1] falls back to the whole paragraph when there is no "--",
    # which is the case your try/except was papering over.
    text_for_ner = row.first_para.split("--", 1)[-1] + "\n" + row.headline
    doc = nlp(text_for_ner)
    for token in doc.ents:
        if token.label_ in allowed_labels:
            keywords_list.append(clean_token(token))
    return Counter(keywords_list)
# ...
unique_new_articles["keywords"] = unique_new_articles.apply(
    process_keywords, axis=1
)
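This also explains your warning: assigning the whole column in one step avoids the chained indexing (unique_new_articles['keywords'][n] = ...) that triggers SettingWithCopyWarning, so DataFrame size was never the real issue. If you'd rather keep your explicit loop, writing through a single indexer such as .at sidesteps the warning too; a minimal sketch, assuming your DataFrame has a default integer index so that n is a valid row label:

    # inside the original loop, instead of chained indexing:
    unique_new_articles.at[n, 'keywords'] = keywords_dict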