My pandas dataframe currently has a column titled BinLocation that contains the location of a material in a warehouse. For example: If a part is located in column A02, row 33, and then level B21 then the BinLocation ID is A02033B21. For some columns, the format maybe A0233B21. The naming convention is not consistent but that was not up to me, and now I have to clean the data up. I want to split the string such that for any given input for the BinLocation, I can return the column, row and level. Ultimately, I want to create 3 new columns for the dataframe (column, row, level). In case it is not clear, the general structure of the ID is ColumnChar_ColumnInt_RowInt_ColumnChar_LevelInt

Now,for some BinLocations, the ID is separated by a hyphen so I wrote this code for those:

def forHyphenRow(s):
    return s.split('-')[1]
def forHyphenColumn(s):
    return s.split('-')[0]

def forHyphenLevel(s):
    return s.split('-')[2]

How do I do the same but for the other IDs? Also, in the dataframe is there anyway to group the columns in the dataframe all together? (so A02 are all grouped together, CB-22 are all grouped together etc)

CodePudding user response：

Here is an answer that:

uses Python regular expression syntax to parse your ID (handles cases with and without hyphens and can be tweaked to accommodate other quirks of historical IDs if needed)
puts the ID in a regularized format
adds columns for the ID components
sorts based on the ID components so rows are "grouped" together (though not in the "groupby" sense of pandas)

import pandas as pd
df = pd.DataFrame({'BinLocation':['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21', 'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()
df['RawBinLocation'] = df['BinLocation']
import re
def parse(s):
    m = re.match('^([A-Z])([0-9]{2})-?([0-9] )-?([A-Z])([0-9]{2})$', s)
    if not m:
        return None
    tup = m.groups()
    colChar, colInt, rowInt, levelChar, levelInt = tup[0], int(tup[1]), int(tup[2]), tup[3], int(tup[4])
    tup = (colChar, colInt, rowInt, levelChar, levelInt)
    return pd.Series(tup)
df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)

Output:

   BinLocation
0     A0233B21
1    A02033B21
2  A02-033-B21
3   A02-33-B21
4   A02-33-B15
5   A02-30-B21
6   A01-33-B21

   BinLocation RawBinLocation ColChar  ColInt  RowInt LevChar  LevInt
0  A01-033-B21     A01-33-B21       A       1      33       B      21
1  A02-030-B21     A02-30-B21       A       2      30       B      21
2  A02-033-B15     A02-33-B15       A       2      33       B      15
3  A02-033-B21       A0233B21       A       2      33       B      21
4  A02-033-B21      A02033B21       A       2      33       B      21
5  A02-033-B21    A02-033-B21       A       2      33       B      21
6  A02-033-B21     A02-33-B21       A       2      33       B      21

CodePudding user response：

If there will always be the first three characters of a string as Column, and last three as Level (and therefore Row as everything in-between):

def forNotHyphenColumn(s):
    return s[:3]


def forNotHyphenLevel(s):
    return s[-3:]


def forNotHyphenRow(s):
    return s[3:-3]

Then, you could sort your DataFrame by Column by creating separate DataFrame columns for the BinLocation items and using df.sort_values():

df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)

df
#Out: 
#  BinLocation Column  Row Level
#0   A02033B21    A02  033   B21
#2    A0233B21    A02   33   B21
#1   C02044C12    C02  044   C12

EDIT:

To also use the hyphen functions in the apply():

df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21", "A01-33-C13"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forHyphenColumn(x) if "-" in x else forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forHyphenRow(x) if "-" in x else forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forHyphenLevel(x) if "-" in x else forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)

df
#Out: 
#  BinLocation Column  Row Level
#3  A01-33-C13    A01   33   C13
#0   A02033B21    A02  033   B21
#2    A0233B21    A02   33   B21
#1   C02044C12    C02  044   C12