I have a large dataset from a CSV file. It has two columns, the first is Date/Time in hh:mm:ss:ms form and the other is Pressure in number form. The pressure has randon values throughout that are not numerical values (things like 150 AA42BB43). These appear randomly throughout the 50,000 rows in the file and are not the same.
I need a way to change these Pressure values to numeric so I can perform data manipulation on them.
df_cleaned = df['Pressure'].loc[~df['Pressure'].map(lambda x: isinstance(x, float) | isinstance(x, int))]
I tried this, but it got rid of my Date/Time values and also did not clean all the pressure values while also getting rid of my headers.
I was wondering if anyone had any suggestions on how I can easily clean the data in the 2nd column while also keeping my Date/Time values in the first column accurate.
CodePudding user response:
I think I've an answer if all your non-numerical values are strings.
Have you tried using pandas replace()? Something like:
df['Pressure'].replace(to_replace = r'. ', value=0, inplace=True, regex=True)
I've used a regex to determine "any string". inplace=True
allows that the existing dataframe is modified, instead of creating a new one.
Here, the function will replace any string with a given integer. I am not sure which integer you'd like to put there, so I've just used zero as an example. If you'd like a different integer for each string, you could use a map as explained in this answer.
CodePudding user response:
Your problem is that you use
df_cleaned = df['Pressure']
and this get only one column (Pressure
) and skip other columns. And when you get single column then it may gives you Series
instead of DataFrame
- and Series
can keep only one column so it doesn't need header
to select columns.
You should run it without ['Pressure']
df_cleaned = df.loc[ ~df['Pressure'].map(...) ]
or even
df_cleaned = df[ ~df['Pressure'].map(...) ]
BTW: shorter isinstance(x, (float, int))
But using isinstance
may not work if you have float/int
values as strings - because isinstance("123", (float, int))
gives False
- and you would have to rather try to convert float("123")
and int("123")
and catch error.
import pandas as pd
data = {
'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
'Pressure': [78, '150 AA42BB43', 23],
}
df = pd.DataFrame(data)
df_cleaned = df[ df['Pressure'].map(lambda x:isinstance(x, (float, int))) ]
print(df_cleaned)
Result:
DateTime Pressure
0 2021.10.04 78
2 2021.10.06 23
EDIT:
If you have values as string then you can use to_numeric
to convert them and put NaN
if value can't be converted
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
and then you can filter it usinf isna()
df_cleaned = df[ ~df['Pressure'].isna() ]
import pandas as pd
data = {
'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
'Pressure': ['78.2', '150 AA42BB43', '23'],
}
df = pd.DataFrame(data)
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
print(df)
df_cleaned = df[ ~df['Pressure'].isna() ]
print(df_cleaned)