My pandas dataframe currently has a column titled BinLocation that contains the location of a material in a warehouse. For example: If a part is located in column A02, row 33, and then level B21 then the BinLocation ID is A02033B21. For some columns, the format maybe A0233B21. The naming convention is not consistent but that was not up to me, and now I have to clean the data up. I want to split the string such that for any given input for the BinLocation, I can return the column, row and level. Ultimately, I want to create 3 new columns for the dataframe (column, row, level). In case it is not clear, the general structure of the ID is ColumnChar_ColumnInt_RowInt_ColumnChar_LevelInt
Now,for some BinLocations, the ID is separated by a hyphen so I wrote this code for those:
def forHyphenRow(s):
return s.split('-')[1]
def forHyphenColumn(s):
return s.split('-')[0]
def forHyphenLevel(s):
return s.split('-')[2]
How do I do the same but for the other IDs? Also, in the dataframe is there anyway to group the columns in the dataframe all together? (so A02 are all grouped together, CB-22 are all grouped together etc)
CodePudding user response:
Here is an answer that:
- uses Python regular expression syntax to parse your ID (handles cases with and without hyphens and can be tweaked to accommodate other quirks of historical IDs if needed)
- puts the ID in a regularized format
- adds columns for the ID components
- sorts based on the ID components so rows are "grouped" together (though not in the "groupby" sense of pandas)
import pandas as pd
df = pd.DataFrame({'BinLocation':['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21', 'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()
df['RawBinLocation'] = df['BinLocation']
import re
def parse(s):
m = re.match('^([A-Z])([0-9]{2})-?([0-9] )-?([A-Z])([0-9]{2})$', s)
if not m:
return None
tup = m.groups()
colChar, colInt, rowInt, levelChar, levelInt = tup[0], int(tup[1]), int(tup[2]), tup[3], int(tup[4])
tup = (colChar, colInt, rowInt, levelChar, levelInt)
return pd.Series(tup)
df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)
Output:
BinLocation
0 A0233B21
1 A02033B21
2 A02-033-B21
3 A02-33-B21
4 A02-33-B15
5 A02-30-B21
6 A01-33-B21
BinLocation RawBinLocation ColChar ColInt RowInt LevChar LevInt
0 A01-033-B21 A01-33-B21 A 1 33 B 21
1 A02-030-B21 A02-30-B21 A 2 30 B 21
2 A02-033-B15 A02-33-B15 A 2 33 B 15
3 A02-033-B21 A0233B21 A 2 33 B 21
4 A02-033-B21 A02033B21 A 2 33 B 21
5 A02-033-B21 A02-033-B21 A 2 33 B 21
6 A02-033-B21 A02-33-B21 A 2 33 B 21
CodePudding user response:
If there will always be the first three characters of a string as Column, and last three as Level (and therefore Row as everything in-between):
def forNotHyphenColumn(s):
return s[:3]
def forNotHyphenLevel(s):
return s[-3:]
def forNotHyphenRow(s):
return s[3:-3]
Then, you could sort your DataFrame by Column by creating separate DataFrame columns for the BinLocation items and using df.sort_values():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
EDIT:
To also use the hyphen functions in the apply()
:
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21", "A01-33-C13"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forHyphenColumn(x) if "-" in x else forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forHyphenRow(x) if "-" in x else forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forHyphenLevel(x) if "-" in x else forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#3 A01-33-C13 A01 33 C13
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12