Home > Net >  Pandas df function to find and replace cell that dont have a value
Pandas df function to find and replace cell that dont have a value

Time:11-25

I'm using pandas DF to process a large amount of data (500 columns, 30,000 rows) from simulations in our lab.
The data is read from an external file and has small values in the following format:

0.12750246E-61           0.12850512E-26           0.36285493E-05           0.25550131E-57

if the value is smaller than E-99 the file omits the E and writes it as: 0.19768978-156
Unfortunately, it's a legacy simulation so I cant change the way it exports the data.
Right now in reading the df as 'str' and I have to go cell by cell to find where E is not present and replace it manually.
For large files, it's really slow and I'm sure there is a better way to do it.

What I'm actually asking:

  1. How would you find the out-of-format values and zero them (preferably while reading the data)?
  2. What dtype would you use to retain as much accuracy as possible?

The code:
Reading the data and adding headers:

qt_headers = pd.DataFrame.to_numpy(
        pd.read_csv("qt_species_list.txt", delim_whitespace=True, index_col=0,
                    names=["h"]))  # reads the CHEMKIN headers and converts them to a data frame
    qt_headers = np.append(np.array(["TIME", "TEMP"]), qt_headers)  # adds the Time & Temp col
    moles_df = pd.read_csv(f"OUTPUT\\output1.plt", index_col=False, delim_whitespace=True, dtype=str,
                           names=np.arange(0, len(qt_headers)))
    moles_df.columns = qt_headers

Finding the problematic cells:

    for col in moles_df.columns:
        for idx, cell in enumerate(moles_df[col]):
            if "E" not in cell:
                moles_df[col][idx] = "0.00000000E 00"
        moles_df[col] = moles_df[col].astype(dtype=np.float64)

things I have tried:

  1. separating to the columns and using isin
  2. pd.replace
  3. pd.mask

Thanks!

CodePudding user response:

Looping through all the rows in all the columns is quite slow.

Try this method (called vectorization):

Edited

for column in moles.df.columns
    moles_df.loc[moles_df[column].str.contains("E") == False, column] = "0.00000000E 00"
  • Related