How to standardize a column in pandas

I have a DataFrame that contains an id column with the following sample values:

16620625 5686

16310427-5502

16501010 4957

16110430 8679

16990624/4174
  
16230404.1177

16820221/3388

I want to standardise them to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash). How can I achieve that using Python?

Here's my code:

df['id']
df.replace(" ", "-")

CodePudding user response：

You can use the DataFrame.replace() function with a regular expression like this:

df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')

Here's a complete example with sample data:

import pandas as pd
df = pd.DataFrame({'id': [
            '16620625 5686',
            '16310427-5502',
            '16501010 4957',
            '16110430 8679',
            '16990624/4174',
            '16230404.1177',
            '16820221/3388']})

# normalize strings matching 8 digits + delimiter + 4 digits
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)

Output:

              id
0  16620625-5686
1  16310427-5502
2  16501010-4957
3  16110430-8679
4  16990624-4174
5  16230404-1177
6  16820221-3388

If a value does not match the regular expression for the expected format, it is left unchanged.
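
To find the values that were left unchanged, one option is to check every value against the target format afterwards. The following is a minimal sketch, assuming the column is named id and holds strings; the sample value 'bad value' and the helper column name is_standard are made up for illustration.

```python
import pandas as pd

# Two values in a known format plus one made-up value that does not fit.
df = pd.DataFrame({'id': ['16620625 5686', '16990624/4174', 'bad value']})

# Same substitution as above, applied to the id column only.
df['id'] = df['id'].str.replace(r'^(\d{8})\D(\d{4})$', r'\1-\2', regex=True)

# True where the value now has the form XXXXXXXX-XXXX (hypothetical helper column).
df['is_standard'] = df['id'].str.fullmatch(r'\d{8}-\d{4}')

# Rows that still need attention.
print(df[~df['is_standard']])
```

Restricting the replacement to one column with Series.str.replace also avoids touching other string columns that happen to match the pattern.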

CodePudding user response：

Inside a for loop:

Convert the DataFrame entry to a string.
Take the characters of this string up to index 7 (the first 8 digits).
Append '-' to them.
Append the rest of the string after the old separator at index 8.
Move on to the next DataFrame entry.
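
The steps above can be sketched as a small function. This is only an illustration, assuming every value has exactly 8 digits, then one separator character, then the remaining digits; the function name standardize_id is made up here, and the sketch skips the old separator at index 8 instead of keeping it.

```python
def standardize_id(value):
    """Rebuild an id as first 8 characters + '-' + everything after the separator."""
    text = str(value)   # convert the entry to a string
    head = text[:8]     # characters at index 0 through 7
    tail = text[9:]     # skip the old separator at index 8
    return head + '-' + tail


ids = ['16620625 5686', '16310427-5502', '16990624/4174', '16230404.1177']

# Loop over the entries and rebuild each one.
standardized = [standardize_id(v) for v in ids]
print(standardized)
```

With a DataFrame, the same function could be applied through df['id'].map(standardize_id), which avoids writing the loop by hand.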

CodePudding user response：

If your 'id' is that structured, then you can slice the string like this:

df['ID2'] = df['ID'].str[:8] + '-' + df['ID'].str[9:]

Output:

              ID            ID2
0  16620625 5686  16620625-5686
1  16310427-5502  16310427-5502
2  16501010 4957  16501010-4957
3  16110430 8679  16110430-8679
4  16990624/4174  16990624-4174
5  16230404.1177  16230404-1177
6  16820221/3388  16820221-3388