How to delete specific values from a list-column in pandas

Time:12-06

I've used POS tagging on German text (so nouns are tagged "NN" or "NE"), and now I'm having trouble extracting the nouns into a new column of the pandas DataFrame.

Example:

import pandas as pd

data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
df["nouns"] = df["tagged"].apply(lambda x: [word for word, tag in x if tag in ["NN", "NE"]])

This results in the following error message: "ValueError: too many values to unpack (expected 2)"

I think the code would work if I could drop the first value of each tagged word, but I cannot figure out how to do that.
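
The error can be reproduced without pandas, which may make its cause easier to see. A minimal sketch using one tuple from the example data: the loop target has two names, but each tuple carries three values.

```python
# One tagged token from the example data: two word forms plus the POS tag
tagged = [("waffe", "Waffe", "NN")]

try:
    # Two names on the left, three values in each tuple
    nouns = [word for word, tag in tagged if tag in ["NN", "NE"]]
except ValueError as err:
    message = str(err)

print(message)
```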

CodePudding user response:

Because each tuple holds three values, unpack it into three variables (word1, word2 and tag) and keep word2:

df["nouns"] = df["tagged"].apply(
    lambda x: [word2 for word1, word2, tag in x if tag in ["NN", "NE"]]
)

Or use the same approach in a plain list comprehension instead of apply:

df["nouns"] = [[word2 for word1, word2, tag in x if tag in ["NN", "NE"]]
               for x in df["tagged"]]

print(df)
                                         tagged          nouns
0        [(waffe, Waffe, NN), (haus, Haus, NN)]  [Waffe, Haus]
1  [(groß, groß, ADJD), (bereich, Bereich, NN)]      [Bereich]
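
If some tuples might carry more than three fields, positional indexing is one alternative to unpacking. The following is only a sketch, assuming the tag is always the last element and the wanted form is always at index 1; `NOUN_TAGS` and `extract_nouns` are illustrative names, not part of the answer above.

```python
import pandas as pd

NOUN_TAGS = {"NN", "NE"}  # hypothetical constant holding the noun tags


def extract_nouns(tokens):
    """Return the second field of every token whose last field is a noun tag."""
    return [tok[1] for tok in tokens if tok[-1] in NOUN_TAGS]


data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")],
                   [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
df["nouns"] = df["tagged"].apply(extract_nouns)

print(df["nouns"].tolist())  # [['Waffe', 'Haus'], ['Bereich']]
```

Indexing does not care how many fields sit between the word and the tag, but it fails with an IndexError on tuples shorter than two elements.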

CodePudding user response:

I think this is easier with a named function. Note that the function below collects the matching tags (NN or NE) from each row, not the words themselves. If you would like to deduplicate the result, you need to extend the function.

import pandas as pd

data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)

# function
def getNoun(obj):
    ret = []  # start with an empty list as the default return value
    for token in obj:  # iterate over the tuples in one row
        for tag in token:  # iterate over every element of the tuple
            if tag in ['NN', 'NE']:
                ret.append(tag)  # add the matching tag to the return list
    return ret

# create the new column
df['noun'] = df['tagged'].apply(getNoun)

# result
print(df['noun'])

#output:
#0    [NN, NN]
#1        [NN]
#Name: noun, dtype: object
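
The function above returns the tags. If the goal is the nouns themselves, with duplicates removed, one possible variant is sketched below. It assumes every tuple has exactly three fields with the tag last, as in the example data; `get_unique_nouns` is a hypothetical name, and a repeated token has been added to the first row purely to show the deduplication.

```python
import pandas as pd


def get_unique_nouns(tokens):
    """Collect the second field of NN/NE tokens, dropping repeats in order."""
    nouns = [word for _, word, tag in tokens if tag in ("NN", "NE")]
    # dict keys are unique and keep insertion order, so this deduplicates
    return list(dict.fromkeys(nouns))


# Example data with one repeated token added for illustration
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN"),
                    ("waffe", "Waffe", "NN")],
                   [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
df["noun"] = df["tagged"].apply(get_unique_nouns)

print(df["noun"].tolist())  # [['Waffe', 'Haus'], ['Bereich']]
```

A `set` would also remove duplicates but would not keep the order in which the nouns appear in the text.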