How to filter a row and remove those values from a column?

I have the following DataFrame:

import pandas as pd

x = [{'name': 'a.1,b,c,d,e,a,f,g,h'}, {'name': 'b.1,c,a.1,d,e.1,g,a,h'}, {'name': 'b.1,d,e,a,f.1,c,r'}]
df = pd.DataFrame(x)

I need to filter the 'name' column and delete the values a and a.1 from every row.

I would be grateful for any help.

CodePudding user response：

You can use list comprehension:

df['name_1'] = [','.join([a for a in d.split(',') if a not in set(('a','a.1'))]) for d in df['name']]

You can also apply the very same function to the 'name' column:

df['name_1'] = df['name'].apply(lambda row : ','.join([a for a in row.split(',') if a not in set(('a','a.1'))]))

Output:

                    name           name_1
0    a.1,b,c,d,e,a,f,g,h    b,c,d,e,f,g,h
1  b.1,c,a.1,d,e.1,g,a,h  b.1,c,d,e.1,g,h
2      b.1,d,e,a,f.1,c,r  b.1,d,e,f.1,c,r
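
The same split/filter/join logic can also be pulled into a small named function, which avoids rebuilding the set for every item. This is only a sketch: the names `drop_items`, `UNWANTED` and `unwanted` are made up for illustration, and plain strings stand in for the column values.

```python
# Hypothetical helper (not from the answer above): drop unwanted items
# from a comma-separated string.
UNWANTED = frozenset({'a', 'a.1'})  # built once, reused for every cell


def drop_items(cell, unwanted=UNWANTED):
    """Return `cell` without the comma-separated items listed in `unwanted`."""
    return ','.join(item for item in cell.split(',') if item not in unwanted)


# The three values of the 'name' column from the question
names = ['a.1,b,c,d,e,a,f,g,h', 'b.1,c,a.1,d,e.1,g,a,h', 'b.1,d,e,a,f.1,c,r']
cleaned = [drop_items(cell) for cell in names]
print(cleaned)
```

With the DataFrame from the question, it would presumably be applied as df['name_1'] = df['name'].apply(drop_items).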

CodePudding user response：

You can use the apply method on the column like this:

def update(r):
    items = r.split(',')
    return ','.join(filter(lambda x: x != 'a.1' and x != 'a', items))
df['name_1'] = df['name'].apply(update)
df

Output:

                    name           name_1
0    a.1,b,c,d,e,a,f,g,h    b,c,d,e,f,g,h
1  b.1,c,a.1,d,e.1,g,a,h  b.1,c,d,e.1,g,h
2      b.1,d,e,a,f.1,c,r  b.1,d,e,f.1,c,r

CodePudding user response：

You can just use replace:

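# Note: with regex=True the 'a' pattern is removed wherever it occurs, even inside longer items
df['name'] = df['name'].replace({r'a\.1,': '', 'a': ''}, regex=True).str.replace(',,', ',', regex=False)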

              name
0    b,c,d,e,f,g,h
1  b.1,c,d,e.1,g,h
2  b.1,d,e,f.1,c,r
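
One caveat, sketched here with the standard `re` module rather than pandas: the patterns match substrings, not whole items, so an item that merely contains an a is altered as well, and a trailing a leaves a dangling comma. The input string and the function names `substring_replace` and `item_filter` are invented for this illustration; `substring_replace` is modeled on the substitutions above and does not claim to reproduce pandas internals.

```python
import re


def substring_replace(cell):
    # Substitutions modeled on the answer, applied one after another
    cell = re.sub(r'a\.1,', '', cell)
    cell = re.sub('a', '', cell)
    return cell.replace(',,', ',')


def item_filter(cell, unwanted=('a', 'a.1')):
    # Compare whole items instead of substrings
    return ','.join(item for item in cell.split(',') if item not in unwanted)


cell = 'b,x,a.1,bar,a'  # made-up value: 'bar' contains an 'a', and 'a' is last
print(substring_replace(cell))  # b,x,br,
print(item_filter(cell))        # b,x,bar
```

For the data in the question both approaches give the same result; the difference only shows up with values like the made-up one above.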

CodePudding user response：

Another regex option:

df['name'].str.replace(r"a\.1|a", "", regex=True).str.replace(r'(?<=,),|^,', "", regex=True)

0      b,c,d,e,f,g,h
1    b.1,c,d,e.1,g,h
2    b.1,d,e,f.1,c,r

CodePudding user response：

Try this:

df['name_1'] = df['name'].str.split(',').apply(lambda cell_value: ",".join(filter(lambda x: x not in {'a', 'a.1'}, cell_value)))

CodePudding user response：

I am surprised that all proposed regex answers use two passes to achieve the goal.

Here is a regex answer that relies on a single expression:

# list of substrings to remove
# IMPORTANT: if one element is a prefix of another,
# list the longer string first
# (here "a.1" comes before "a")

remove = ['a.1', 'a']

import re
regex = '({0}),|,({0})'.format('|'.join(map(re.escape, remove)))
df['name_1'] = df['name'].str.replace(regex, '', regex=True)

Alternative if group capturing is unwanted:

regex = '(?:{0}),|,(?:{0})'.format('|'.join(map(re.escape, remove)))

Example:

                      name           name_1
0      a.1,b,c,d,e,a,f,g,h    b,c,d,e,f,g,h
1  b.1,c,a.1,d,e.1,g,a,h,a  b.1,c,d,e.1,g,h
2        b.1,d,e,a,f.1,c,r  b.1,d,e,f.1,c,r

Format of the generated regex:

'(a\\.1|a),|,(a\\.1|a)'

# or for the alternative
'(?:a\\.1|a),|,(?:a\\.1|a)'
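
If some items merely start or end with one of the removed strings (say, a hypothetical ba or ab), the pattern above would clip them. One possible variant, sketched here with the standard `re` module, pins each alternative to item boundaries using the lookarounds `(?<![^,])` and `(?![^,])`, which read as "preceded by" and "followed by" a comma or the edge of the string. The names `build_regex` and `ANCHORED` are assumptions made for the example.

```python
import re

remove = ['a.1', 'a']


def build_regex(items):
    # Hypothetical helper: match "item," when the item starts an item slot,
    # or ",item" when the item ends at a comma or at the end of the string.
    alternatives = '|'.join(map(re.escape, items))
    return re.compile(
        r'(?<![^,])(?:{0}),|,(?:{0})(?![^,])'.format(alternatives)
    )


ANCHORED = build_regex(remove)

print(ANCHORED.sub('', 'b.1,c,a.1,d,e.1,g,a,h,a'))  # b.1,c,d,e.1,g,h
print(ANCHORED.sub('', 'ba,c,ab'))                  # ba,c,ab (left untouched)
```

The pattern string (`ANCHORED.pattern`) can be passed to str.replace(..., regex=True) in the same way as above. With the boundaries in place, the order of the list no longer matters. A cell made up solely of removed items still keeps one of them, since no comma is left to anchor on.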

CodePudding user response：

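This might help, as long as no other item contains an "a" (replace works on substrings):

df['name1'] = [x.replace("a.1", "") for x in df.name]
df['name1'] = [x.replace("a", "") for x in df.name1]
# drop the empty items left behind by the two replacements
df['name1'] = [",".join(i for i in x.split(",") if i) for x in df.name1]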