I have a column of product names and information, and I need to remove the codes from the names. Every code starts with four or more zeros, but some names also have four or more zeros in the weight, and some codes are joined directly to the name, as in the example below:
import pandas as pd
data = {
'Name' : ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
'Name' : ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
CodePudding user response:
You can split each string at the start of the code pattern, which is expressed by the regex (?<!\d)0{4,}. This pattern matches four or more 0s that are not preceded by a digit. After splitting, take the first fragment; str.strip then removes any trailing whitespace:
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Note that this only works when the code is always at the end of the string, since everything after the start of the code is discarded.
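To see how the lookbehind behaves without pandas, here is a minimal sketch using the standard library re module on the same sample names. The helper name strip_code is only for illustration and is not part of the original answer.

```python
import re

# Same pattern as above: four or more zeros not preceded by a digit.
CODE_START = re.compile(r'(?<!\d)0{4,}')

def strip_code(name):
    # Keep only the text before the first code start, then trim whitespace.
    return CODE_START.split(name, maxsplit=1)[0].strip()

names = ['ANOA 250g 00004689', 'ANOA 10000g 00000059884',
         '80%c asjw 150000001568 ', 'Shivangi000000478761']
cleaned = [strip_code(n) for n in names]
print(cleaned)
# ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568', 'Shivangi']
```

The zeros in 10000g and 150000001568 are left alone because each run of zeros there is preceded by a digit, so the lookbehind rejects it.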
CodePudding user response:
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*',
'', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4,}\d*',
'', regex=True)
Output:
Name
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
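One difference from splitting is that replacing removes only the code itself, so any text after the code is kept. A rough sketch with the standard library re module, assuming a lookbehind-based pattern of the form \s*(?<!\d)0{4,}\d*; the helper name remove_code and the third input are made up for illustration:

```python
import re

# Assumed pattern: optional leading whitespace, then four or more zeros
# not preceded by a digit, then the remaining digits of the code.
CODE = re.compile(r'\s*(?<!\d)0{4,}\d*')

def remove_code(name):
    # Delete every code match and leave the surrounding text untouched.
    return CODE.sub('', name)

print(remove_code('ANOA 250g 00004689'))       # ANOA 250g
print(remove_code('Shivangi000000478761'))     # Shivangi
print(remove_code('ANOA 10000g 00000059884'))  # ANOA 10000g
# Hypothetical input with the code in the middle of the name:
print(remove_code('ANOA 00004689 250g'))       # ANOA 250g
```

A string with no code, such as '80%c asjw 150000001568 ', is returned unchanged, including its trailing space, so add a .strip() if that matters.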
CodePudding user response:
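Try splitting each name at the spaces and dropping every token that starts with 0000. Note that this does not remove a code that is joined directly to the name, such as Shivangi000000478761:
answer = []
for i in testdf["Name"]:
    # keep only the tokens that do not start with the code prefix
    answer.append(" ".join([j for j in i.split() if not j.startswith("0000")]))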