Removing multiple occurrences of an element from the given dataset

I have a dataset as follows:

Name	Elements
Cat	friend, friend, friend
Dog	friend, friend
Crow	friend
Cow	friend, domestic
Parrot	friend, friend, domestic, domestic
Rabbit	domestic

I have to remove all the rows whose Elements column contains only occurrences of friend. That is, my final output should look like:

Name	Elements
Cow	friend, domestic
Parrot	friend, friend, domestic, domestic
Rabbit	domestic

I tried creating a list and filtering it, using the following code:

list = list(data["Elements"])
list[:] = [x for x in list if x != 'friend']

But with the above code, only the one row that contains friend exactly once (the element corresponding to Crow) gets deleted, and I am unable to map the remaining data back to the corresponding 'Name' column.

How can I remove all the rows that contain only friend, i.e. the elements corresponding to Cat, Dog and Crow? Also, how can I map the remaining data to the corresponding 'Name' column?

Any other methods?

Please guide.

CodePudding user response：

I am going to assume that there may be more types of elements, and possibly rows with no elements.

I am assuming that each value in df['Elements'] is a string (since you don't give a code example to see the formatting). For a single value s, the list of items will be [item.strip() for item in s.split(',')].

I would then convert that list to a set and check whether it is {'friend'}:

def make_set(s):
    # Split on commas and strip whitespace so ' friend' and 'friend' compare equal.
    return {item.strip() for item in s.split(',')}

# True for rows to keep, i.e. rows whose set of items is not exactly {'friend'}.
cond = df['Elements'].apply(lambda s: make_set(s) != {'friend'})
df[cond]
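
Since rows with no elements were mentioned as a possibility, here is a minimal, self-contained sketch of the same idea that also tolerates missing values. The sample frame is rebuilt from the question; the extra "Fish" row and the choice to treat a missing value as an empty set are assumptions made for illustration only.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row with a missing value.
df = pd.DataFrame({
    "Name": ["Cat", "Dog", "Crow", "Cow", "Parrot", "Rabbit", "Fish"],
    "Elements": [
        "friend, friend, friend",
        "friend, friend",
        "friend",
        "friend, domestic",
        "friend, friend, domestic, domestic",
        "domestic",
        None,  # hypothetical: no elements recorded for this row
    ],
})

def make_set(s):
    """Return the distinct, stripped items of a comma-separated string."""
    if not isinstance(s, str):
        return set()  # assumption: a missing value means "no elements"
    return {item.strip() for item in s.split(",") if item.strip()}

# Keep every row whose set of items is not exactly {"friend"}.
keep = df["Elements"].apply(lambda s: make_set(s) != {"friend"})
result = df[keep]
print(result["Name"].tolist())
```

With this choice, a row without elements is kept, because an empty set is not equal to {"friend"}. If such rows should be dropped instead, the condition would need an extra check.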

CodePudding user response：

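You could convert the list of elements to a set, which is an object that is similar to a list but can't contain duplicates (among other features). Split each string into its items first, because calling set() directly on a string gives a set of its characters. Your filter then becomes the following (the result is named elements rather than list so that the built-in list is not overwritten):

elements = [x for x in data["Elements"] if set(x.split(", ")) != {"friend"}]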
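
One pitfall worth checking when comparing sets here: set() applied to a string iterates over its characters, not over its comma-separated items. The short sketch below, using a made-up list of strings in the same format as the question, shows the difference and why the string has to be split first.

```python
# set() on a string yields its characters, including ',' and ' '.
chars = set("friend, friend")
# Splitting first yields the items, and the set removes the duplicates.
tokens = set("friend, friend".split(", "))

print(chars == set("friend"))  # False: chars also contains ',' and ' '
print(tokens == {"friend"})    # True

# Hypothetical sample values in the question's format.
elements = ["friend, friend, friend", "friend", "friend, domestic", "domestic"]
kept = [x for x in elements if set(x.split(", ")) != {"friend"}]
print(kept)  # ['friend, domestic', 'domestic']
```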

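In order to remove the corresponding name as well, it might be better not to use a list comprehension and instead use a for loop, like so (the output is a list of two tuples, one with the names and one with the elements):

cleanList = []
for name, element in zip(data["Name"], data["Elements"]):
    # Keep the row unless every one of its items is "friend".
    if set(element.split(", ")) != {"friend"}:
        cleanList.append((name, element))

# Transpose the (name, element) pairs into (names, elements).
# This needs the built-in list, so it must not be shadowed by a variable named list.
cleanList = list(zip(*cleanList))

print(cleanList)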

CodePudding user response：

Try this:

ix = df['Elements'].str.split(r',\s*', regex=True).apply(set) != {'friend'}
newdf = df.loc[ix].reset_index(drop=True).copy()

>>> newdf
     Name                            Elements
0     Cow                    friend, domestic
1  Parrot  friend, friend, domestic, domestic
2  Rabbit                            domestic

Note: if you are using pandas < 1.4 (you shouldn't), then remove the regex=True argument:

ix = df['Elements'].str.split(r',\s*').apply(set) != {'friend'}

Explanations

The first part, str.split(...), splits the text values in Elements into lists. We apply set to those lists to obtain sets, and then we make a mask ix that is True where these sets differ from the singleton {'friend'}.

The second part selects rows according to that mask and resets the index to a fresh RangeIndex, so that the rows are numbered 0, 1, 2, and so on. The final .copy() ensures that newdf has its own data and is not merely a slice of the original, which helps if you are going to modify it further.
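
To make the steps described above easier to inspect, here is a sketch that keeps each intermediate result in its own variable. The frame is rebuilt from the question's data; the names tokens and sets are introduced here for illustration, and the mask is built with an explicit lambda rather than a direct comparison, which is a stylistic choice and not a correction.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Cat", "Dog", "Crow", "Cow", "Parrot", "Rabbit"],
    "Elements": [
        "friend, friend, friend",
        "friend, friend",
        "friend",
        "friend, domestic",
        "friend, friend, domestic, domestic",
        "domestic",
    ],
})

# Step 1: split each string on a comma followed by optional whitespace.
tokens = df["Elements"].str.split(r",\s*")

# Step 2: turn each list of items into a set, which drops the duplicates.
sets = tokens.apply(set)

# Step 3: boolean mask, True where the set is not exactly {"friend"}.
ix = sets.apply(lambda s: s != {"friend"})

# Step 4: select the rows and renumber them from 0.
newdf = df.loc[ix].reset_index(drop=True)

print(ix.tolist())
print(newdf)
```

Printing tokens, sets and ix one at a time is a quick way to see which step misbehaves if the real data uses a different separator or contains stray whitespace.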