Remove every word after hitting an integer in a list of strings (including the number)

Time:02-18

The following is a subset of a list I'm having trouble with:

array(['DORFLEX 10 CP AV CH', 'CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C',
       'ADVIL MULHER 400MG AVULSO', 'SPIDUFEN MENTA 600MG C/10 SACHES L',
       'PONSTAN 500MG 8X3 CP', 'TANDRILAX 30 CP',
       'PARACETAMOL 750MG 20 CP NEOQ GEN',
       'DICLOFENACO SOD 50MG 20 CP MEDL GEN C','DORFLEX 30CP',
       'BENLYSTA 200MG/ML SOL INJ 4 SER PRE 1ML GELAD'], dtype=object)

The problem is that many names, such as the first element "DORFLEX 10 ...", appear repeatedly in the list with the same name but a different number after it (different sizes), for example "DORFLEX 15 ...". I'm trying to keep only the word "DORFLEX". Dropping everything after the first space would solve this specific case, but I have a lot of compound names like "DICLOFENACO SOD 50MG ...". That's why I want to drop everything from the first number onward (including the number), so that products repeated in different sizes collapse into one name.

So far I haven't found anything that brings me close to that. Any help is welcome. Thank you very much in advance

CodePudding user response:

You can use a regex lookahead:

import re

re.search(r'.*?(?=( \d)|$)', some_string).group()

Or apply this to the entire list:

[re.search(r'.*?(?=( \d)|$)', line).group() for line in lines]

Then you get all your names in one go:

['DORFLEX',
 'CLOR.CICLOBENZAPRINA',
 'ADVIL MULHER',
 'SPIDUFEN MENTA',
 'PONSTAN',
 'TANDRILAX',
 'PARACETAMOL',
 'DICLOFENACO SOD',
 'DORFLEX',
 'BENLYSTA']
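
To see what the lookahead is doing, here is a minimal sketch. The pattern lazily consumes characters up to the position just before a space followed by a digit, or up to the end of the string if there is no such position. The helper name name_before_number and the last sample string are only illustrative and not part of the original answer.

```python
import re

# Lazily match up to (but not including) the first " <digit>",
# or up to the end of the string when no such spot exists.
NAME = re.compile(r'.*?(?= \d|$)')

def name_before_number(text):
    # Hypothetical helper: the part of `text` before the first number.
    return NAME.match(text).group()

print(name_before_number('DORFLEX 10 CP AV CH'))         # DORFLEX
print(name_before_number('DICLOFENACO SOD 50MG 20 CP'))  # DICLOFENACO SOD
print(name_before_number('NO NUMBER HERE'))              # NO NUMBER HERE
```

Note that the lookahead requires a space before the digit, so a string that begins with a digit is not cut at position 0.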

CodePudding user response:

Not sure why you would use a numpy array, but you could use a regex to split on the first digit (if any).

Taking a list as example here:

l =   ['DORFLEX 10 CP AV CH', 'CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C',
       'ADVIL MULHER 400MG AVULSO', 'SPIDUFEN MENTA 600MG C/10 SACHES L',
       'PONSTAN 500MG 8X3 CP', 'TANDRILAX 30 CP',
       'PARACETAMOL 750MG 20 CP NEOQ GEN',
       'DICLOFENACO SOD 50MG 20 CP MEDL GEN C','DORFLEX 30CP',
       'BENLYSTA 200MG/ML SOL INJ 4 SER PRE 1ML GELAD']

import re

out = [re.split(r'\s?\d', s, maxsplit=1)[0] for s in l]

or using re.search (\D* also captures the space before the number, so strip it):

out = [re.search(r'\D*', line).group().rstrip() for line in l]

output:

['DORFLEX', 'CLOR.CICLOBENZAPRINA', 'ADVIL MULHER', 'SPIDUFEN MENTA', 'PONSTAN',
 'TANDRILAX', 'PARACETAMOL', 'DICLOFENACO SOD', 'DORFLEX', 'BENLYSTA']

If you really do have a NumPy array a, the same logic applies:

import numpy as np

out = np.array([re.split(r'\s?\d', s, maxsplit=1)[0] for s in a])
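
Since the stated goal is to collapse products that differ only in size, a natural follow-up is to de-duplicate the cleaned names. A minimal sketch, assuming a plain list and that first-seen order should be kept; unique_names is a hypothetical helper name:

```python
import re

def unique_names(items):
    # Cut each string at the first digit (and the optional space before it).
    names = [re.split(r'\s?\d', s, maxsplit=1)[0] for s in items]
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(names))

sample = ['DORFLEX 10 CP AV CH', 'TANDRILAX 30 CP', 'DORFLEX 30CP']
print(unique_names(sample))  # ['DORFLEX', 'TANDRILAX']
```

With a NumPy array, np.unique on the cleaned array would do the same job, but it returns the names sorted rather than in first-seen order.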

CodePudding user response:

You can do it like this:

input_list = [
    'DORFLEX 10 CP AV CH',
    'CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C',
    'ADVIL MULHER 400MG AVULSO',
    'SPIDUFEN MENTA 600MG C/10 SACHES L',
    'PONSTAN 500MG 8X3 CP', 
    'TANDRILAX 30 CP',
    'PARACETAMOL 750MG 20 CP NEOQ GEN',
    'DICLOFENACO SOD 50MG 20 CP MEDL GEN C',
    'DORFLEX 30CP',
    'BENLYSTA 200MG/ML SOL INJ 4 SER PRE 1ML GELAD'
]
output_list = []

for element in input_list:
    for index, char in enumerate(element):
        if char.isnumeric():
            # Cut at the first digit and drop the space before it.
            output_list.append(element[:index].rstrip())
            break
    else:
        # No digit found: keep the whole name.
        output_list.append(element)

print(output_list)
# output:
# ['DORFLEX',
#  'CLOR.CICLOBENZAPRINA',
#  'ADVIL MULHER',
#  'SPIDUFEN MENTA',
#  'PONSTAN',
#  'TANDRILAX',
#  'PARACETAMOL',
#  'DICLOFENACO SOD',
#  'DORFLEX',
#  'BENLYSTA']

We iterate over the elements of input_list, then over the indices and characters of each element. Once we hit a numeric character, we slice the element up to that index, strip the trailing space, and append the result to output_list. We then break out of the inner for loop to move on to the next element immediately. The else branch of the inner loop runs only when no break occurred, so an element without any digit is kept whole.
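
The same scan can also be packaged as a small function. This is only a sketch, assuming that a string without any digit should be returned unchanged and that the space before the number is unwanted; strip_from_first_digit is a hypothetical name.

```python
def strip_from_first_digit(text):
    # Index of the first digit character, or None when there is none.
    cut = next((i for i, ch in enumerate(text) if ch.isdigit()), None)
    # Keep the whole string if no digit was found; otherwise slice and
    # drop the space that preceded the number.
    return text if cut is None else text[:cut].rstrip()

print(strip_from_first_digit('PONSTAN 500MG 8X3 CP'))  # PONSTAN
print(strip_from_first_digit('SPIDUFEN MENTA'))        # SPIDUFEN MENTA
```

It can then be applied with a list comprehension, for example [strip_from_first_digit(s) for s in input_list].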

CodePudding user response:

Out of curiosity, I decided to compare the performance of various methods. I used this code:

#!/usr/bin/env python3
from __future__ import annotations
import re
from timeit import timeit

l = [
    "DORFLEX 10 CP AV CH",
    "CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C",
    "ADVIL MULHER 400MG AVULSO",
    "SPIDUFEN MENTA 600MG C/10 SACHES L",
    "PONSTAN 500MG 8X3 CP",
    "TANDRILAX 30 CP",
    "PARACETAMOL 750MG 20 CP NEOQ GEN",
    "DICLOFENACO SOD 50MG 20 CP MEDL GEN C",
    "DORFLEX 30CP",
    "BENLYSTA 200MG/ML SOL INJ 4 SER PRE 1ML GELAD",
]

PROG = re.compile(r".*?(?=( \d)|$)")


def with_compiled_regex(l: list[str]) -> list[str]:
    return [PROG.search(x).group() for x in l]


def with_regex(l: list[str]) -> list[str]:
    return [re.search(r".*?(?=( \d)|$)", x).group() for x in l]


def without(l: list[str]) -> list[str]:
    updated = []
    for item in l:
        new = []
        for sub in item.split(" "):
            # sub[:1] avoids an IndexError on empty tokens (double spaces).
            if sub[:1].isnumeric():
                break
            new.append(sub)
        updated.append(" ".join(new))
    return updated


print(timeit("with_regex(l)", globals=globals(), number=10000))
print(timeit("with_compiled_regex(l)", globals=globals(), number=10000))
print(timeit("without(l)", globals=globals(), number=10000))

This gives very consistent results, close to the following (in seconds):

0.130591894
0.07869331399999999
0.07071069099999999

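As you can see, the version without regex is the fastest, as you might expect, but only slightly faster than the precompiled regex (about 0.071 s versus 0.079 s for 10,000 calls). Passing the pattern string to re.search on every call is the slowest, at about 0.131 s. So if you intend to do this a lot, you'd benefit from either avoiding regex or compiling the pattern once.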
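
For a somewhat more robust comparison, timeit.repeat can run each measurement several times so the minimum can be taken. The sketch below is an assumption-laden variation on the benchmark above: the function names, the shortened sample list and the repeat counts are illustrative, and no particular timings are implied.

```python
import re
from timeit import repeat

PATTERN = re.compile(r'.*?(?= \d|$)')
data = ['DORFLEX 10 CP AV CH', 'TANDRILAX 30 CP', 'DORFLEX 30CP']

def compiled(items):
    return [PATTERN.match(x).group() for x in items]

def uncompiled(items):
    return [re.match(r'.*?(?= \d|$)', x).group() for x in items]

# Both variants should agree before their timings are compared.
assert compiled(data) == uncompiled(data)

for func in (compiled, uncompiled):
    # repeat() returns one total per run; the minimum is the least noisy.
    best = min(repeat(lambda: func(data), repeat=5, number=2000))
    print(f'{func.__name__}: {best:.4f} s for 2000 calls')
```

Checking that the variants return the same result first guards against timing two functions that do not actually do the same work.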

CodePudding user response:

Other answers seem to remove all numeric values from your name strings, or use regex.

Here is a more pythonic alternative:

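def extractName(fullStr):
    spltName = fullStr.split(' ')
    newSpltName = []
    for subStr in spltName:
        # Stop at the first token made up entirely of digits (e.g. '10');
        # a token such as '5MG' is kept.
        if not subStr.isdigit():
            newSpltName.append(subStr)
        else:
            break
    return ' '.join(newSpltName)

# map() returns a lazy iterator, so wrap it in list() to get the names.
new_l = list(map(extractName, l))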
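
How a token is judged to be numeric changes the result: str.isdigit() on a whole token is true only when every character is a digit, so a token such as 5MG does not stop the loop. The sketch below contrasts that with stopping at any token that starts with a digit; name_tokens and its stop parameter are hypothetical and only serve the comparison.

```python
def name_tokens(text, stop):
    # Collect space-separated tokens until `stop(token)` is true.
    kept = []
    for token in text.split():
        if stop(token):
            break
        kept.append(token)
    return ' '.join(kept)

s = 'CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C'
# Stop only at tokens made up entirely of digits.
print(name_tokens(s, str.isdigit))               # CLOR.CICLOBENZAPRINA 5MG
# Stop at any token whose first character is a digit.
print(name_tokens(s, lambda t: t[0].isdigit()))  # CLOR.CICLOBENZAPRINA
```

Which rule is appropriate depends on whether dosages like 5MG should count as part of the product name.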