Clean column in data frame-CodePudding

I'm trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: 'S1234567' and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:

         Remarks
0   S0252508 Shippment UK
1   S0255111 Shippment UK
2   S0256352 Shippment UK
3   S0259138 Shippment UK
4   S0260425 Shippment US

I've menaged to separate those rows which has the format S1234567 text using the code below:

merged_out['Remarks'] = merged_out['Remarks'].replace("\t", "\r")
merged_out['Remarks'] = merged_out['Remarks'].replace("\n", "\r")
s = merged_out['Remarks'].str.split("\r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)

After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:

    Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31  cok
5. S0246732-
6. 2
7. ok
8. nan

And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.

CodePudding user response：

I think the easiest solution is to use regular expressions and a list comprehension:

import re
import pandas as pd
merged_out['Remarks'] = re.split('\s', i)[0]  for i in merged_out['Remarks']]

CodePudding user response：

Here is a regex approach that will do what I think your question asks:

import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shippment UK',
'S0255111 Shippment UK',
'S0256352 Shippment UK',
'S0259138/S0259139 Shippment UK',
'S12345678 Shippment UK',
'S0260425 Shippment US']
})

pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
df = merged_out.Remarks.str.extractall(pat)
df = ( pd.concat([
    pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']), 
    merged_out], 
    axis=1).explode('Number') )
df.Remarks = df.Remarks.str.replace(pat   r'\s*', '', regex=True)

Input:

                          Remarks
0           S0252508 Shippment UK
1           S0255111 Shippment UK
2           S0256352 Shippment UK
3  S0259138/S0259139 Shippment UK
4          S12345678 Shippment UK
5           S0260425 Shippment US

Output:

     Number                 Remarks
0  S0252508            Shippment UK
1  S0255111            Shippment UK
2  S0256352            Shippment UK
3  S0259138            Shippment UK
3  S0259139            Shippment UK
5  S0260425            Shippment US
4       NaN  S12345678 Shippment UK

Explanation:

with Series.str.extractall(), use a pattern to obtain 0 or more occurrences of word boundary \b followed by S followed by 7 digits and a 1 occurrence of S followed by 7 digits (flanked by word boundaries \b)
use unstack() to eliminate multiple index levels
use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
use explode() to create a new rows for lists with more than one Number item
with Series.str.replace(), filter out the number matches using the previous pattern, plus r'\s*' to match trailing whitespace characters, to obtain the residual Remarks

Notes:

all rows in the sample input contain one valid Number except that one row contains multiple Number values separated by / delimiters, and another row contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)