How to standardize column in pandas

Time:04-10

I have a dataframe that contains an id column with the following sample values:

16620625 5686

16310427-5502

16501010 4957

16110430 8679

16990624/4174
  
16230404.1177

16820221/3388

I want to standardise them to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash). How can I achieve that using Python?

Here's my code:

df['id']
df.replace(" ", "-")

CodePudding user response:

You can use the DataFrame.replace() function with a regular expression like this:

df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')

Here's example code with sample data.

import pandas as pd
df = pd.DataFrame({'id': [
            '16620625 5686',
            '16310427-5502',
            '16501010 4957',
            '16110430 8679',
            '16990624/4174',
            '16230404.1177',
            '16820221/3388']})

# normalize matching strings of the form 8 digits + delimiter + 4 digits
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)

Output:

              id
0  16620625-5686
1  16310427-5502
2  16501010-4957
3  16110430-8679
4  16990624-4174
5  16230404-1177
6  16820221-3388

Any value that does not match the regular expression for the expected format is left unchanged.
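
Since non-matching values pass through silently, it may be worth checking for them afterwards. Below is a minimal sketch, assuming a made-up sample in which two of the four ids do not fit the pattern; the `valid` column name is a hypothetical choice, and `Series.str.fullmatch` needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sample: two well-formed ids and two that do not fit the pattern.
df = pd.DataFrame({'id': [
            '16620625 5686',
            '16990624/4174',
            '1662062 5686',   # only 7 leading digits
            'unknown']})

# Same replacement as above: 8 digits, one non-digit delimiter, 4 digits.
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')

# Flag the rows that are in XXXXXXXX-XXXX form after the replacement.
# 'valid' is an illustrative column name, not part of the original question.
df['valid'] = df['id'].str.fullmatch(r'\d{8}-\d{4}')
print(df)
```

Rows where `valid` is False are the ones the regular expression skipped, so they can be inspected or cleaned separately.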

CodePudding user response:

Inside a for loop:

  1. Convert the data frame entry to a string.
  2. Take the characters of this string up to and including index 7 (the first 8 digits).
  3. Append '-' to them.
  4. Append the rest of the string, skipping the separator at index 8.
  5. Move on to the next data frame entry.
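
The steps above can be sketched roughly as follows. This assumes every entry is exactly 8 digits, one separator character and 4 digits; the separator at index 8 is dropped so that it is not duplicated. The helper name `insert_dash` and the plain list `ids` are hypothetical stand-ins for the data frame column.

```python
def insert_dash(value):
    """Return value as 'XXXXXXXX-XXXX', assuming 8 digits, 1 separator, 4 digits."""
    text = str(value)           # step 1: convert the entry to a string
    head = text[:8]             # step 2: characters at indexes 0 through 7
    tail = text[9:]             # everything after the separator at index 8
    return head + '-' + tail    # steps 3 and 4: add the dash, then the rest


# Sample values from the question, standing in for the 'id' column.
ids = ['16620625 5686', '16310427-5502', '16990624/4174', '16230404.1177']

# Step 5: repeat for every entry.
standardized = [insert_dash(v) for v in ids]
print(standardized)
```

On a real data frame the same helper could be applied with `df['id'] = df['id'].map(insert_dash)`, although the vectorized approaches in the other answers are usually faster than an explicit loop.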

CodePudding user response:

If your 'id' is that structured, then you can slice the string like this:

df['ID2'] = df['ID'].str[:8] + '-' + df['ID'].str[9:]

Output:

              ID            ID2
0  16620625 5686  16620625-5686
1  16310427-5502  16310427-5502
2  16501010 4957  16501010-4957
3  16110430 8679  16110430-8679
4  16990624/4174  16990624-4174
5  16230404.1177  16230404-1177
6  16820221/3388  16820221-3388