I'm performing a data analysis where one of the steps is to create a key by combining several fields.
Unfortunately, the number of digits in a given field is not always the same.
Some information:
- The datatype of my_field is object; nan values have been replaced by the '-' character.
- But, basically, my_field holds numbers (integers) formatted as text.
Code
import pandas as pd
import numpy as np
data ={'product': ['PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6', 'PA7', 'PA8'],
'my_field': ['001', '0000000000002', '3', '04', '-', '5', '-', '6']}
df = pd.DataFrame(data)
df
Raw Data
| | product | my_field |
---|---|---|
0 | PA1 | 001 |
1 | PA2 | 0000000000002 |
2 | PA3 | 3 |
3 | PA4 | 04 |
4 | PA5 | - |
5 | PA6 | 5 |
6 | PA7 | - |
7 | PA8 | 6 |
My Approach:
df['my_field'] = np.where(df['my_field'] == '-', '-' , df['my_field'].str.zfill(10) )
df
My Output:
| | product | my_field |
---|---|---|
0 | PA1 | 0000000001 |
1 | PA2 | 0000000000002 |
2 | PA3 | 0000000003 |
3 | PA4 | 0000000004 |
4 | PA5 | - |
5 | PA6 | 0000000005 |
6 | PA7 | - |
7 | PA8 | 0000000006 |
Desired Output:
| | product | my_field |
---|---|---|
0 | PA1 | 0000000001 |
1 | PA2 | 0000000002 |
2 | PA3 | 0000000003 |
3 | PA4 | 0000000004 |
4 | PA5 | - |
5 | PA6 | 0000000005 |
6 | PA7 | - |
7 | PA8 | 0000000006 |
The problem: Some outputs end up with more than 10 characters.
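For context, a minimal sketch of why this happens: as far as I can tell, str.zfill only pads strings that are shorter than the requested width and never truncates, so a value that is already longer than 10 characters passes through unchanged. The sample values are taken from the data above.

```python
import pandas as pd

s = pd.Series(['001', '0000000000002', '3'])

# zfill pads on the left up to the given width; longer strings are left as-is.
padded = s.str.zfill(10)
lengths = padded.str.len().tolist()

print(padded.tolist())
print(lengths)
```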
CodePudding user response:
What about slicing after zfill? This way you'll keep only the last 10 characters:
df['my_field'] = np.where(df['my_field'] == '-', '-', df['my_field'].str.zfill(10).str[-10:])
Alternative with boolean indexing:
df.loc[df['my_field'] != '-',
'my_field'] = df['my_field'].str.zfill(10).str[-10:]
Output:
product my_field
0 PA1 0000000001
1 PA2 0000000002
2 PA3 0000000003
3 PA4 0000000004
4 PA5 -
5 PA6 0000000005
6 PA7 -
7 PA8 0000000006
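One caveat with slicing: .str[-10:] would also silently cut off significant digits if a value ever had more than 10 digits that are not padding. A minimal sketch of a variation, assuming the extra length always comes from leading zeros: strip them first, then pad. The helper name pad_key and its width and placeholder parameters are my own invention for illustration, not part of pandas.

```python
import pandas as pd

def pad_key(series: pd.Series, width: int = 10, placeholder: str = '-') -> pd.Series:
    """Hypothetical helper: normalise numeric strings to a fixed width."""
    # Remove existing leading zeros, then left-pad with zeros to `width`.
    normalised = series.str.lstrip('0').str.zfill(width)
    # Keep the placeholder rows untouched.
    return normalised.where(series != placeholder, placeholder)

df = pd.DataFrame({
    'product': ['PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6', 'PA7', 'PA8'],
    'my_field': ['001', '0000000000002', '3', '04', '-', '5', '-', '6'],
})
df['my_field'] = pad_key(df['my_field'])
print(df)
```

With this variation a value with more than 10 real digits stays longer than 10 characters instead of being truncated, so it is easy to spot afterwards with df['my_field'].str.len().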
CodePudding user response:
def myfield_format(x):
    # Keep the '-' placeholder; otherwise drop leading zeros and re-pad to 10 characters.
    digits = str(x).lstrip('0')
    return (10 - len(digits)) * '0' + digits if x != '-' else '-'

df['my_field'] = df['my_field'].map(myfield_format)