I have column in a df with a size range with different sizeformats.
artikelkleurnummer size
6725 0161810ZWA B080
6726 0161810ZWA B085
6727 0161810ZWA B090
6728 0161810ZWA B095
6729 0161810ZWA B100
in the sizerange are also these other size formats like XS - XXL, 36-50 , 36/38 - 52/54, ONE, XS/S - XL/XXL, 363-545
I have tried to get the prefix '0' out of all sizes with start with a letter in range (A:K). For exemple: Want to change B080 into B80. B100 stays B100.
steps: 1 look for items in column ['size'] with first letter of string in range (A:K), 2 if True change second position in string into ''
for range I use:
from string import ascii_letters
def range_alpha(start_letter, end_letter):
return ascii_letters[ascii_letters.index(start_letter):ascii_letters.index(end_letter) 1]
then I've tried a for loop
for items in df['size']:
if df.loc[df['size'].str[0] in range_alpha('A','K'):
df.loc[df['size'].str[1] == ''
message
SyntaxError: unexpected EOF while parsing
what's wrong?
CodePudding user response:
You can do it with regex and the pd.Series.str.replace
-
df = pd.DataFrame([['0161810ZWA']*5, ['B080', 'B085', 'B090', 'B095', 'B100']]).T
df.columns = "artikelkleurnummer size".split()
replacement = lambda mpat: ''.join(g for g in mpat.groups() if mpat.groups().index(g) != 1)
df['size_cleaned'] = df['size'].str.replace(r'([a-kA-K])(0*)(\d )', replacement)
Output
artikelkleurnummer size size_cleaned
0 0161810ZWA B080 B80
1 0161810ZWA B085 B85
2 0161810ZWA B090 B90
3 0161810ZWA B095 B95
4 0161810ZWA B100 B100
TL;DR
Find a pattern "LetterZeroDigits" and change it to "LetterDigits" using a regular expression.
Slightly longer explanation
Regexes are very handy but also hard. In the solution above, we are trying to find the pattern of interest and then replace it. In our case, the pattern of interest is made of 3 parts -
- A letter in from A-K
- Zero or more
0
's - Some more digits
In regex terms - this can be written as r'([a-kA-K])(0*)(\d )'
. Note that the 3 brackets make up the 3 parts - they are called groups. It might make a little or no sense depending on how exposed you have been to regexes in the past - but you can get it from any introduction to regexes online.
Once we have the parts, what we want to do is retain everything else except part-2, which is the 0
s.
The pd.Series.str.replace
documentation has the details on the replacement portion. In essence replacement
is a function that takes all the matching groups as the input and produces an output.
In the first part - where we identified three groups or parts. These groups are accessed with the mpat.groups()
function - which returns a tuple
containing the match for each group. We want to reconstruct a string with the middle part excluded, which is what the replacement
function does
CodePudding user response:
sizes = [{"size": "B080"},{"size": "B085"},{"size": "B090"},{"size": "B095"},{"size": "B100"}]
def range_char(start, stop):
return (chr(n) for n in range(ord(start), ord(stop) 1))
for s in sizes:
if s['size'][0].upper() in range_char("A", "K"):
s['size'] = s['size'][0] s['size'][1:].lstrip('0')
print(sizes)
Using a List/Dict here for example. Lmk if its helping you.