How to remove text which repeats in multiple items in a Python list?

I wonder if anyone could please help. I have a Python list consisting of antibody names:

['anti-human CD86',
 'anti-human CD274 (B7-H1, PD-L1)',
 'anti-human CD270 (HVEM, TR2)',
...
 'anti-human CD155 (PVR)',
 'anti-human CD112 (Nectin-2)',
 'anti-human CD47']

I want to remove the 'anti-human ' part so I just have a list of the actual protein targets e.g. [CD86, CD274 ... CD47].

I've tried multiple methods, including:

for i in parsed_protein_names:
    i.split('anti-human ')

But I don't seem to be getting anywhere. Could anyone please advise?

CodePudding user response：

A simple list comprehension with `str.replace` will do:

>>> antibodies
['anti-human CD86', 'anti-human CD274 (B7-H1, PD-L1)', 'anti-human CD270 (HVEM, TR2)']
>>> [e.replace("anti-human ", "") for e in antibodies]
['CD86', 'CD274 (B7-H1, PD-L1)', 'CD270 (HVEM, TR2)']
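
For context, the loop in the question does not fail; it simply discards its results. Python strings are immutable, so `split` (like `replace`) returns a new object and leaves the list untouched. A minimal sketch of that behaviour, reusing the question's variable name with a shortened list; `cleaned` is a name introduced here for illustration:

```python
parsed_protein_names = ['anti-human CD86', 'anti-human CD47']

# The loop from the question: each split() result is computed, then thrown away.
for name in parsed_protein_names:
    name.split('anti-human ')

# The list itself is unchanged.
print(parsed_protein_names)  # ['anti-human CD86', 'anti-human CD47']

# Splitting on the prefix yields an empty string followed by the target.
pieces = 'anti-human CD86'.split('anti-human ')
print(pieces)  # ['', 'CD86']

# Keeping the results means building a new list (or assigning back by index).
cleaned = [name.replace('anti-human ', '') for name in parsed_protein_names]
print(cleaned)  # ['CD86', 'CD47']
```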

CodePudding user response：

Assuming your list is defined as follows:

parsed_protein_names = ['anti-human CD86',
                        'anti-human CD274 (B7-H1, PD-L1)',
                        'anti-human CD270 (HVEM, TR2)',
                        '...',
                        'anti-human CD155 (PVR)',
                        'anti-human CD112 (Nectin-2)',
                        'anti-human CD47']

You have a couple of options, each using a list comprehension.

`str.replace`

result_list = [n.replace('anti-human ', '', 1) for n in parsed_protein_names]
print(result_list)

`str.split`

result_list = [n.split('anti-human', 1)[-1].lstrip() for n in parsed_protein_names]
print(result_list)

Either way, the output is:

['CD86', 'CD274 (B7-H1, PD-L1)', 'CD270 (HVEM, TR2)', '...', 'CD155 (PVR)', 'CD112 (Nectin-2)', 'CD47']
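
Both options remove the first occurrence of the text wherever it sits in the string, which is fine here because every entry starts with it. If an entry might contain 'anti-human' somewhere other than the start, anchoring the match is one way to be stricter. A sketch using `re.sub`, under the assumption that only a leading prefix should go; the 'mouse anti-human CD3' entry is made up for illustration and is not from the question's data:

```python
import re

names = ['anti-human CD86',
         'anti-human CD274 (B7-H1, PD-L1)',
         'mouse anti-human CD3']  # hypothetical entry with the text mid-string

# ^ anchors the pattern to the start of the string;
# \s+ absorbs the whitespace that follows the prefix.
pattern = re.compile(r'^anti-human\s+')

anchored = [pattern.sub('', n) for n in names]
print(anchored)  # ['CD86', 'CD274 (B7-H1, PD-L1)', 'mouse anti-human CD3']

# For comparison, replace() with count=1 also rewrites the mid-string occurrence.
replaced = [n.replace('anti-human ', '', 1) for n in names]
print(replaced)  # ['CD86', 'CD274 (B7-H1, PD-L1)', 'mouse CD3']
```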

CodePudding user response：

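What you want is to remove a prefix, which `split` on its own does not do. Be careful with `lstrip` here: it treats its argument as a set of characters rather than a literal prefix, so `lstrip('anti-human')` leaves the leading space behind and can eat into a name that begins with any of those letters. On Python 3.9 or later, `str.removeprefix` removes the exact prefix instead.

Here is code that should work:

mylist = ['anti-human CD86','anti-human CD274 (B7-H1, PD-L1)','anti-human CD270 (HVEM, TR2)','anti-human CD155 (PVR)','anti-human CD112 (Nectin-2)','anti-human CD47']

my_output_list = []

for i in mylist:
    # removeprefix (Python 3.9+) drops the exact leading text and nothing else
    a = i.removeprefix('anti-human ')
    my_output_list.append(a)

print(my_output_list)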
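
To see the difference between stripping a character set and removing a prefix, the following sketch compares `str.lstrip` with `str.removeprefix` (Python 3.9+). The 'anti-human mouse IgG' string is a made-up example, not taken from the question's data:

```python
name = 'anti-human CD86'

# lstrip removes any leading characters found in the set {a, n, t, i, -, h, u, m}.
# It stops at the space, which is not in that set, so the space is left behind.
stripped = name.lstrip('anti-human')
print(repr(stripped))  # ' CD86'

# Adding the space to the argument looks right for this name, but it is still
# a character set: a target starting with one of those letters loses it too.
overstripped = 'anti-human mouse IgG'.lstrip('anti-human ')
print(repr(overstripped))  # 'ouse IgG'

# removeprefix (Python 3.9+) removes the exact prefix and nothing more.
exact = name.removeprefix('anti-human ')
print(repr(exact))  # 'CD86'
```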

CodePudding user response：

If you know the length of the piece you want to remove, you can just slice it off:

# len('anti-human ') == 11
parsed_protein_names = [string[11:] for string in parsed_protein_names]

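Otherwise, it gets more complicated. Note that the following algorithm removes the longest prefix shared by every string, so for this list it will also remove the 'CD' part.

# Count the leading positions at which every string has the same character.
# zip stops at the shortest string, so no index can run past its end.
prefix_len = 0
for chars in zip(*parsed_protein_names):
    if len(set(chars)) != 1:
        break
    prefix_len += 1

parsed_protein_names = [string[prefix_len:] for string in parsed_protein_names]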
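
The standard library's `os.path.commonprefix` performs the same character-by-character comparison, so it can stand in for the loop. The sketch below also shows one possible way to keep the 'CD' part, under the assumption (mine, not the question's) that the text to drop always ends at a space; the list is a shortened version of the question's data:

```python
import os

parsed_protein_names = ['anti-human CD86',
                        'anti-human CD274 (B7-H1, PD-L1)',
                        'anti-human CD47']

# commonprefix compares character by character, so it also captures the shared 'CD'.
common = os.path.commonprefix(parsed_protein_names)
print(repr(common))  # 'anti-human CD'

# Assumption: the removable part ends at a space, so trim back to the last space.
# If there is no space, rfind returns -1 and the prefix becomes empty.
prefix = common[:common.rfind(' ') + 1]
print(repr(prefix))  # 'anti-human '

result = [s[len(prefix):] for s in parsed_protein_names]
print(result)  # ['CD86', 'CD274 (B7-H1, PD-L1)', 'CD47']
```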