The column I’d like to filter contains data from two different sources, and I’d like to normalize it. Some rows were collected one way and the rest another way: the first group holds values like 1.2, 2.3, 3.4 (nothing over 5), while the second group holds values above 1,000 formatted with a comma. I would like to multiply the small numbers by 1,000 to match up with the others and remove the comma from the numbers above 1,000.
col1 | col2
---|---
1 | 1.641
2 | 1.548
3 | 1,807.000
4 | 1,759.000
CodePudding user response:
It sounds like you want to filter some rows (`col2 < 5`), apply a transformation (`col2 * 1000`), then remove something (`,`).
df.loc[df['col2'] <= 5, 'col2'] *= 1000
Next would be to remove the comma. If you know all the values in col2 are whole numbers (no decimals), then I think you can just cast:
df['col2'] = df['col2'].astype(int)
But it's safer to apply a replace, which only works if the values are strings (if not, convert first with df['col2'] = df['col2'].astype(str)).
Then you can apply the following:
df['col2'] = df['col2'].str.replace(',', '')
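Note the order matters: if col2 arrives as strings (which the commas suggest), you need to strip the commas and cast before you can compare against 5. A minimal sketch putting the steps in a safe order, assuming the four-row sample from the question with col2 read in as strings:

```python
import pandas as pd

# Hypothetical sample matching the question; col2 arrives as strings.
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': ['1.641', '1.548', '1,807.000', '1,759.000']})

# Strip commas first, then cast, so every row becomes a comparable float.
df['col2'] = df['col2'].str.replace(',', '').astype(float)

# Scale only the rows that were collected on the small (< 5) scale.
df.loc[df['col2'] <= 5, 'col2'] *= 1000

print(df)
```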
CodePudding user response:
You can create a boolean mask on whether 'col2' contains a comma or not. If it contains a comma, remove the comma; for numbers with no comma, multiply by 1000:
mask = df['col2'].str.contains(',')
df.loc[mask, 'col2'] = df.loc[mask, 'col2'].str.replace(',','')
df['col2'] = df['col2'].astype(float)
df.loc[~mask, 'col2'] *= 1000
Output:
col1 col2
0 1 1641.0
1 2 1548.0
2 3 1807.0
3 4 1759.0
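The same mask logic can also be written as a single vectorized expression with `np.where`, which avoids mutating the column in two passes. A sketch, again assuming the question's sample with col2 as strings:

```python
import pandas as pd
import numpy as np

# Hypothetical input matching the question; col2 read as strings.
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': ['1.641', '1.548', '1,807.000', '1,759.000']})

mask = df['col2'].str.contains(',')                    # rows already on the large scale
vals = df['col2'].str.replace(',', '').astype(float)   # clean numeric values

# Keep comma-rows as-is, scale the rest, in one expression.
df['col2'] = np.where(mask, vals, vals * 1000)

print(df)
```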