Multi-part manipulation post str.split() Pandas

Time:04-14

I have a subset of data (single column) we'll call ID:

    ID
0   07-1401469
1   07-89556629
2   07-12187595
3   07-381962
4   07-99999085

The current format is (usually) YY-[up to 8-character ID]. The desired output format is a more uniform YYYY-xxxxxxxx:

    ID
0   2007-01401469
1   2007-89556629
2   2007-12187595
3   2007-00381962
4   2007-99999085

Knowing that I've done padding in the past, the thought process was to combine

df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))

However I ran into a few problems:

  1. The fill in a format spec such as '{0:20>4}' must be a single character, so a two-character string like '20' is not accepted
  2. Trying something like the line below just results in df['id'] taking on the result of the last lambda, and every other way I tried to combine multiple apply/lambdas didn't work. I started going down the pad left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
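The format-spec limitation in point 1 can be seen on a single value. The following is a minimal sketch in plain Python, not part of the original attempt; the variable names are illustrative only. It suggests that the literal '20' can simply sit in the template instead of being used as a fill.

```python
# One sample value from the question, split into its two parts.
year, num = "07-381962".split("-")

# The fill in a format spec is exactly one character, so "X" is accepted.
padded_year = "{:X>4}".format(year)

# A two-character fill such as "20" is not a valid format spec.
try:
    "{:20>4}".format(year)
    two_char_fill_ok = True
except ValueError:
    two_char_fill_ok = False

# Putting the literal "20" in the template avoids the fill problem entirely;
# only the numeric part needs zero padding.
fixed = "20{}-{:0>8}".format(year, num)
```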

The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:

df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra

Which does work

      ID
0     2007-01401469
1     2007-89556629
2     2007-12187595
3     2007-00381962
4     2007-99999085

But my questions are:

  1. Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
  2. Is there a better way than replacing 'XX' with '20'?

I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen so far on SO and in the Pandas documentation only covers a single manipulation at a time.

CodePudding user response:

One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:

tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)

Output:

              ID
0  2007-01401469
1  2007-89556629
2  2007-12187595
3  2007-00381962
4  2007-99999085
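For anyone who wants to run this end to end, here is a self-contained sketch of the same split-and-zfill idea. The helper name normalize_ids and its century and width parameters are hypothetical, added only for illustration, and the sample frame is rebuilt from the data in the question. It uses plain string concatenation in place of radd and expand=True in place of the .str[0] / .str[1] indexing.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"ID": ["07-1401469", "07-89556629", "07-12187595",
                          "07-381962", "07-99999085"]})

def normalize_ids(ids: pd.Series, century: str = "20", width: int = 8) -> pd.Series:
    """Hypothetical helper: prepend the century, then zero-pad the numeric part."""
    # Prepend the century first, then split once on the dash into two columns.
    parts = (century + ids).str.split("-", n=1, expand=True)
    # Column 0 is now the four-digit year; column 1 is left-padded with zeros.
    return parts[0] + "-" + parts[1].str.zfill(width)

df["ID"] = normalize_ids(df["ID"])
```

This assumes every value contains exactly one dash after a two-digit year; rows that break that assumption would need separate handling.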

CodePudding user response:

I'd do it in two steps, using .str.replace:

df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)

Prints:

              ID
0  2007-01401469
1  2007-89556629
2  2007-12187595
3  2007-00381962
4  2007-99999085
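The two replacements can probably be folded into a single pass by capturing both parts in one pattern. This is a sketch under the assumption that every ID to be fixed looks like two digits, a dash, then one or more digits; the sample frame is rebuilt from the question's data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"ID": ["07-1401469", "07-89556629", "07-12187595",
                          "07-381962", "07-99999085"]})

# Group 1 is the two-digit year, group 2 is the numeric part.
# The callable prepends "20" to the year and zero-pads the number to 8 characters.
df["ID"] = df["ID"].str.replace(
    r"^(\d{2})-(\d+)$",
    lambda m: f"20{m[1]}-{m[2]:0>8}",
    regex=True,
)
```

Values that do not match the pattern are left unchanged, which may be useful given that the question says the format is only "usually" YY-ID.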