I'm using pandas DF to process a large amount of data (500 columns, 30,000 rows) from simulations in our lab.
The data is read from an external file and has small values in the following format:
0.12750246E-61 0.12850512E-26 0.36285493E-05 0.25550131E-57
If the value is smaller than 1E-99, the file omits the E and writes it as: 0.19768978-156
Unfortunately, it's a legacy simulation, so I can't change the way it exports the data.
Right now I'm reading the df as 'str', and I have to go cell by cell to find where E is not present and replace it manually.
For large files, it's really slow and I'm sure there is a better way to do it.
What I'm actually asking:
- How would you find the out-of-format values and zero them (preferably while reading the data)?
- What dtype would you use to retain as much accuracy as possible?
The code:
Reading the data and adding headers:
import numpy as np
import pandas as pd

# Read the CHEMKIN species headers and convert them to a NumPy array
qt_headers = pd.DataFrame.to_numpy(
    pd.read_csv("qt_species_list.txt", delim_whitespace=True, index_col=0,
                names=["h"]))
qt_headers = np.append(np.array(["TIME", "TEMP"]), qt_headers)  # add the Time & Temp columns

# Read the simulation output as strings so the malformed values don't break parsing
moles_df = pd.read_csv("OUTPUT\\output1.plt", index_col=False, delim_whitespace=True, dtype=str,
                       names=np.arange(0, len(qt_headers)))
moles_df.columns = qt_headers
Finding the problematic cells:
for col in moles_df.columns:
    for idx, cell in enumerate(moles_df[col]):
        if "E" not in cell:
            # Out-of-format value (exponent below -99): zero it out
            moles_df.loc[idx, col] = "0.00000000E+00"
    moles_df[col] = moles_df[col].astype(dtype=np.float64)
Things I have tried:
- separating out the columns and using isin, pd.replace, and pd.mask
Thanks!
CodePudding user response:
Looping through all the rows in all the columns is quite slow.
Try this method (called vectorization):
for column in moles_df.columns:
    moles_df.loc[~moles_df[column].str.contains("E"), column] = "0.00000000E+00"
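If you would rather catch the bad values while reading the file (as you asked), one option is to give read_csv a converter that zeroes anything without an E and parses everything else straight to float. This is only a sketch built on the file name and qt_headers from your question; note that converters run per cell in Python, so on very large files the vectorized replacement above may still be faster:

import numpy as np
import pandas as pd

def parse_value(token):
    # Values written without an "E" (exponent below -99) are zeroed, as requested;
    # everything else parses as a normal float.
    return np.float64(token) if "E" in token else np.float64(0.0)

moles_df = pd.read_csv("OUTPUT\\output1.plt", index_col=False, delim_whitespace=True,
                       names=np.arange(0, len(qt_headers)),
                       converters={i: parse_value for i in range(len(qt_headers))})
moles_df.columns = qt_headers

As for the dtype: np.float64 is the practical choice. It keeps roughly 15-17 significant decimal digits and the file only writes 8, so nothing is lost, and pandas has no wider native float type to offer anyway. If you use the string replacement loop, finish with .astype(np.float64) per column as in your original code; with the converters the columns already come back as float64.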