Remove a '0' prefix from a size range with different size formats

I have a column in a df that holds a size range in different size formats.

      artikelkleurnummer  size
6725          0161810ZWA  B080
6726          0161810ZWA  B085
6727          0161810ZWA  B090
6728          0161810ZWA  B095
6729          0161810ZWA  B100

The size range also contains other size formats, such as XS - XXL, 36-50, 36/38 - 52/54, ONE, XS/S - XL/XXL and 363-545.

I am trying to remove the prefix '0' from all sizes that start with a letter in the range A to K. For example: I want to change B080 into B80, while B100 stays B100.

Steps: 1. look for items in column ['size'] whose first character is a letter in the range A to K; 2. if True, replace the second character of the string with ''.

For the range I use:

from string import ascii_letters

def range_alpha(start_letter, end_letter):
  return ascii_letters[ascii_letters.index(start_letter):ascii_letters.index(end_letter) + 1]

Then I tried a for loop:

for items in df['size']:
    if df.loc[df['size'].str[0] in range_alpha('A','K'):
            df.loc[df['size'].str[1] == ''

Error message:

SyntaxError: unexpected EOF while parsing

What's wrong?

CodePudding user response：

You can do it with a regex and pd.Series.str.replace:

import pandas as pd

df = pd.DataFrame([['0161810ZWA']*5, ['B080', 'B085', 'B090', 'B095', 'B100']]).T
df.columns = "artikelkleurnummer  size".split()
# Rebuild the match from all groups except the one at position 1 (the leading zeros).
replacement = lambda mpat: ''.join(g for i, g in enumerate(mpat.groups()) if i != 1)
# regex=True is required when the replacement is a callable.
df['size_cleaned'] = df['size'].str.replace(r'([a-kA-K])(0*)(\d+)', replacement, regex=True)

Output

  artikelkleurnummer  size size_cleaned
0         0161810ZWA  B080          B80
1         0161810ZWA  B085          B85
2         0161810ZWA  B090          B90
3         0161810ZWA  B095          B95
4         0161810ZWA  B100         B100

TL;DR

Find a pattern "LetterZeroDigits" and change it to "LetterDigits" using a regular expression.

Slightly longer explanation

Regexes are very handy but also hard. In the solution above, we are trying to find the pattern of interest and then replace it. In our case, the pattern of interest is made of 3 parts:

A letter from A to K
Zero or more 0's
Some more digits

In regex terms, this can be written as r'([a-kA-K])(0*)(\d+)'. Note that the 3 pairs of brackets make up the 3 parts; they are called groups. This may make little or no sense depending on how much exposure you have had to regexes, but any online introduction to regexes covers it.
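
To make the three groups visible, here is a minimal sketch that uses Python's built-in re module instead of pandas. The sample values come from the question; the variable names are only illustrative.

```python
import re

# Same pattern as in the answer: a letter A-K, optional zeros, remaining digits.
pattern = re.compile(r'([a-kA-K])(0*)(\d+)')

for size in ['B080', 'B100', 'XS', '36/38']:
    match = pattern.match(size)
    # groups() returns one string per bracketed group; match is None if nothing fits.
    print(size, match.groups() if match else None)
```

Sizes that do not start with a letter from A to K followed by digits produce no match, so a replacement based on this pattern would leave them as they are.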

Once we have the parts, what we want to do is retain everything else except part-2, which is the 0s.

The pd.Series.str.replace documentation has the details on the replacement portion. In essence, replacement is a function that receives the regex match object, which holds all the matched groups, and returns the string to substitute.

Earlier we identified three groups, or parts. These groups are accessed with the mpat.groups() method, which returns a tuple containing the match for each group. We want to reconstruct the string with the middle part excluded, which is what the replacement function does.
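
The same idea can be tried outside pandas. The sketch below is one possible way to write the replacement as a named function and run it with re.sub on the mixed formats mentioned in the question; drop_zeros and clean_size are hypothetical helper names, not part of any library.

```python
import re

def drop_zeros(mpat):
    # Keep group 1 (the letter) and group 3 (the digits); skip group 2 (the zeros).
    return mpat.group(1) + mpat.group(3)

def clean_size(size):
    # Hypothetical helper: apply the pattern and the replacement to one string.
    return re.sub(r'([a-kA-K])(0*)(\d+)', drop_zeros, size)

samples = ['B080', 'B100', 'XS', '36/38', 'ONE', '363']
cleaned = [clean_size(s) for s in samples]
print(cleaned)
```

Only the values that match the pattern change; the other size formats pass through untouched.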

CodePudding user response：

sizes = [{"size": "B080"},{"size": "B085"},{"size": "B090"},{"size": "B095"},{"size": "B100"}]

def range_char(start, stop):
    return (chr(n) for n in range(ord(start), ord(stop) + 1))

for s in sizes:
    if s['size'][0].upper() in range_char("A", "K"):
        s['size'] = s['size'][0] + s['size'][1:].lstrip('0')

print(sizes)

I'm using a list of dicts here as an example. Let me know if it helps.
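
A slightly more defensive variant of this loop, written as a standalone function, might look like the sketch below. It assumes that only values shaped like one letter followed by digits should change; strip_zero_prefix and PREFIX_LETTERS are hypothetical names introduced for illustration.

```python
from string import ascii_uppercase

# Letters A through K, matching the range in the question.
PREFIX_LETTERS = ascii_uppercase[:ascii_uppercase.index('K') + 1]

def strip_zero_prefix(size):
    # Hypothetical helper: only touch values shaped like letter + digits.
    if len(size) > 1 and size[0].upper() in PREFIX_LETTERS and size[1:].isdigit():
        # Fall back to '0' so an all-zero number does not disappear entirely.
        return size[0] + (size[1:].lstrip('0') or '0')
    return size

print([strip_zero_prefix(s) for s in ['B080', 'B100', 'XS', 'ONE', '36/38', 'C000']])
```

With a DataFrame, a function like this could be passed to df['size'].map(strip_zero_prefix) to produce the cleaned column.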