How to remove numbers from a string column that starts with 4 zeros?-CodePudding

I have a column of names and informations of products, i need to remove the codes from the names and every code starts with four or more zeros, some names have four zeros or more in the weight and some are joined with the name as the example below:

data = {
    'Name' : ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
  
testdf = pd.DataFrame(data)

The correct output would be:

results = {
        'Name' : ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
    }
      
    results = pd.DataFrame(results)

CodePudding user response：

you can split the strings by the start of the code pattern, which is expressed by the regex (?<!\d)0{4,}. this pattern consumes four 0s that are not preceded by any digit. after splitting the string, take the first fragment, and the str.strip gets rid of possible trailing space

testdf.Name.str.split('(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()[0].str.strip()
# outputs:
0                 ANOA 250g
1               ANOA 10000g
2    80%c asjw 150000001568
3                  Shivangi

note that this works for the case where the codes are always at the end of your string.

CodePudding user response：

Use a regex with str.replace:

testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*',
                                            '', regex=True)

Or, similar to @HaleemurAli, with a negative match

testdf['Name'] = testdf['Name'].str.replace(r'(?<!\d)0{4,}0{4}\d*',
                                            '', regex=True)

Output:

                      Name
0                ANOA 250g
1              ANOA 10000g
2  80%c asjw 150000001568 
3                 Shivangi

regex1 demo

regex2 demo

CodePudding user response：

try splitting it at each space and checking if the each item has 0000 in it like:

answer=[]
for i in results["Name"]:
    answer.append("".join([j for j in i.split() if "0000" not in j]))