Pandas: Large CSV file data manipulation


I have a large dataset from a CSV file. It has two columns: the first is Date/Time in hh:mm:ss:ms format and the second is Pressure as a number. The Pressure column contains random values throughout that are not numeric (things like 150 AA42BB43). These appear at random across the 50,000 rows in the file and are not all the same.

I need a way to change these Pressure values to numeric so I can perform data manipulation on them.

df_cleaned = df['Pressure'].loc[~df['Pressure'].map(lambda x: isinstance(x, float) | isinstance(x, int))]

I tried this, but it dropped my Date/Time values and my headers, and it still did not clean all of the Pressure values.

Does anyone have suggestions for how I can easily clean the data in the second column while keeping the Date/Time values in the first column intact?

CodePudding user response:

I think I have an answer, assuming all your non-numeric values are strings.

Have you tried using pandas replace()? Something like:

df['Pressure'].replace(to_replace=r'.+', value=0, inplace=True, regex=True)

I've used the regex .+ to match "any string". inplace=True means the existing DataFrame is modified in place instead of a new one being returned.

Here, replace() will substitute the given value for any string it finds. I'm not sure which number you'd like to put there, so I've just used zero as an example. If you'd like a different value for each string, you could use a map, as explained in this answer.
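For illustration, here is a minimal, self-contained sketch of that approach, assuming the good readings are stored as real numbers and only the junk values are strings (the sample data below is made up):

import pandas as pd

df = pd.DataFrame({
    'DateTime': ['00:00:01:000', '00:00:02:000', '00:00:03:000'],
    'Pressure': [78, '150 AA42BB43', 23],
})

# replace any string value in the Pressure column with 0
df['Pressure'] = df['Pressure'].replace(to_replace=r'.+', value=0, regex=True)

print(df)

This leaves 78 and 23 untouched and turns the junk row into 0. I've assigned the result back instead of using inplace=True, because inplace=True on a selected column can trigger chained-assignment warnings in recent pandas versions.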

CodePudding user response:

Your problem is that you use

df_cleaned = df['Pressure']

and this selects only one column (Pressure) and skips the other columns. When you select a single column you get a Series instead of a DataFrame, and a Series holds only one column, so it does not carry the column headers you were expecting.

You should run it on the whole DataFrame, without the ['Pressure'] selection:

df_cleaned = df.loc[ ~df['Pressure'].map(...) ]

or even

df_cleaned = df[ ~df['Pressure'].map(...) ]

BTW: a shorter form is isinstance(x, (float, int)).

But using isinstance may not work if your float/int values are stored as strings, because isinstance("123", (float, int)) gives False; in that case you would instead have to try to convert with float("123") or int("123") and catch the error (see the sketch after the example below).


import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'], 
    'Pressure': [78, '150 AA42BB43', 23], 
}

df = pd.DataFrame(data)

# keep only the rows where Pressure is already a real number (int or float)
df_cleaned = df[ df['Pressure'].map(lambda x: isinstance(x, (float, int))) ]

print(df_cleaned)

Result:

     DateTime Pressure
0  2021.10.04       78
2  2021.10.06       23
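
If you want to go the try/except route mentioned above, a minimal sketch could look like this (the helper name to_number is mine, not from the question):

import pandas as pd

def to_number(x):
    # try to convert the value to a float; return None for junk like '150 AA42BB43'
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': ['78.2', '150 AA42BB43', '23'],
}

df = pd.DataFrame(data)

df['Pressure'] = df['Pressure'].map(to_number)

df_cleaned = df[ ~df['Pressure'].isna() ]
print(df_cleaned)

The to_numeric approach in the EDIT below does the same conversion in a single call.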

EDIT:

If your values are strings, then you can use to_numeric to convert them and put NaN where a value can't be converted:

df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')

and then you can filter it using isna():

df_cleaned = df[ ~df['Pressure'].isna() ]

import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'], 
    'Pressure': ['78.2', '150 AA42BB43', '23'], 
}

df = pd.DataFrame(data)

# convert Pressure to numbers; strings that can't be converted become NaN
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
print(df)

# keep only the rows where the conversion succeeded
df_cleaned = df[ ~df['Pressure'].isna() ]
print(df_cleaned)
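
As a side note, filtering with ~isna() should be equivalent here to dropping the rows with dropna:

df_cleaned = df.dropna(subset=['Pressure'])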