How can I reshape data in a csv into a structured format?

Time:01-31

I have some .csv files that are generated from Computational Fluid Dynamics simulations. They contain the values of the velocity, pressure, density, etc., at given points in space. For each point, its coordinates and the values of the fields at that point are printed on one row of the csv file. For a 2D grid with x values of 1,2,3 and y values of 4,5,6, the data is arranged in the following way:


X Y (field variables)
1 4         :
2 4         :
3 4         :
1 5         :
2 5         :
3 5         :
1 6         :
2 6         :
3 6         :

We start with the lowest y value, cycle through all the x values, then go to the next y value and repeat.
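
To make the ordering concrete, here is a small sketch that generates the coordinate rows of the 3x3 example in the same order (x varying fastest, y slowest). The use of `np.meshgrid` and the names `x_vals`, `y_vals`, and `rows` are only illustrative assumptions; this is not how the simulation itself writes the file.

```python
import numpy as np

x_vals = np.array([1, 2, 3])
y_vals = np.array([4, 5, 6])

# With the default indexing="xy", meshgrid returns arrays of shape
# (len(y_vals), len(x_vals)); flattening them in C order makes x cycle
# fastest while y changes slowest, which matches the file layout.
xx, yy = np.meshgrid(x_vals, y_vals)
rows = np.column_stack([xx.ravel(), yy.ravel()])

print(rows)
```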

What I would like to do is put this data into a structured format. That is, I would like to put the data into an xarray dataset that uses the x and y values as coordinate axes, or put the values into a numpy ndarray of the proper shape (in this case, 3x3). I could load the file into a Pandas dataframe and then restructure the data manually using for loops, but this is extremely slow for even moderately large data files. I would like a faster way that uses built-in functions from the pandas, numpy, and xarray libraries.

Anybody have any ideas?

User response:

Fellow CFD researcher here. I believe this can be accomplished manually with a single for loop that iterates only over the rows of a state variable column (e.g. rho).

# https://stackoverflow.com/questions/75278985/how-can-i-reshape-data-in-a-csv-into-a-structured-format
import time
start = time.time()

import numpy as np
import pandas as pd

df = pd.read_csv('test_data.csv')

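# Smallest coordinate along each axis, used to shift indices so they start at 0
min_x_coor = min(df['x'])
min_y_coor = min(df['y'])

# Grid size along each axis (assumes integer coordinates with unit spacing)
x_dim = max(df['x']) - min_x_coor + 1
y_dim = max(df['y']) - min_y_coor + 1

# Fill the array one csv row at a time
rho_array = np.zeros((x_dim, y_dim))
for p in range(len(df['rho'])):
    x_coor = df['x'][p] - min_x_coor
    y_coor = df['y'][p] - min_y_coor

    rho_array[x_coor, y_coor] = df['rho'][p]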

print(rho_array)
print(time.time() - start)

For 3x3 data:

x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
1,5,0.340
2,5,0.269
3,5,0.288
1,6,0.511
2,6,0.732
3,6,0.195

Output:

[[0.503 0.34  0.511]
 [0.642 0.269 0.732]
 [0.041 0.288 0.195]]
0.31889796257019043

For 4x4 data:

x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
4,4,0.964
1,5,0.340
2,5,0.269
3,5,0.288
4,5,0.702
1,6,0.511
2,6,0.732
3,6,0.195
4,6,0.226
1,7,0.957
2,7,0.032
3,7,0.304
4,7,0.607

Output:

[[0.503 0.34  0.511 0.957]
 [0.642 0.269 0.732 0.032]
 [0.041 0.288 0.195 0.304]
 [0.964 0.702 0.226 0.607]]
0.48914408683776855
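
The same array can also be built without a Python-level loop. The following is a sketch rather than the answerer's code. It assumes every (x, y) pair appears exactly once, and the `reshape` variant additionally assumes the rows are sorted exactly as described in the question. The `pivot` variant does not depend on row order. The inline `csv_text` string is a stand-in for `test_data.csv`.

```python
from io import StringIO

import numpy as np
import pandas as pd

csv_text = """x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
1,5,0.340
2,5,0.269
3,5,0.288
1,6,0.511
2,6,0.732
3,6,0.195
"""
df = pd.read_csv(StringIO(csv_text))

nx = df["x"].nunique()
ny = df["y"].nunique()

# Variant 1: rely on the file order (x cycles fastest, y slowest).
# Reshaping to (ny, nx) and transposing gives an array indexed as [x, y].
rho_reshaped = df["rho"].to_numpy().reshape(ny, nx).T

# Variant 2: let pandas place each value by its coordinates.
# This works for any row order as long as no (x, y) pair is repeated.
rho_pivoted = df.pivot(index="x", columns="y", values="rho").to_numpy()

print(rho_reshaped)
```

Unlike the loop, neither variant needs integer coordinates with unit spacing, since the grid size comes from the number of distinct x and y values rather than from their range.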

User response:

Here's an example that uses Xarray and Pandas together:

from io import StringIO

import pandas as pd
import xarray as xr

s = StringIO("""x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
4,4,0.964
1,5,0.340
2,5,0.269
3,5,0.288
4,5,0.702
1,6,0.511
2,6,0.732
3,6,0.195
4,6,0.226
1,7,0.957
2,7,0.032
3,7,0.304
4,7,0.607
""")

# open csv, create a MultiIndex from x/y columns
df = pd.read_csv(s, index_col=['x', 'y'])

# convert to Xarray Dataset; to_xarray() turns each level of the MultiIndex
# into its own dimension, so the result is already on an (x, y) grid and no
# separate unstack step is needed
ds = df.to_xarray()

This will produce an Xarray Dataset that looks like:

<xarray.Dataset>
Dimensions:  (x: 4, y: 4)
Coordinates:
  * x        (x) int64 1 2 3 4
  * y        (y) int64 4 5 6 7
Data variables:
    rho      (x, y) float64 0.503 0.34 0.511 0.957 ... 0.964 0.702 0.226 0.607
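
Once the data is in a Dataset, a plain NumPy array or a single grid value can be pulled out of it. A minimal sketch, using the 3x3 sample for brevity; the names `csv_text`, `rho`, and `value` are illustrative and not part of the answer above.

```python
from io import StringIO

import pandas as pd
import xarray as xr

csv_text = """x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
1,5,0.340
2,5,0.269
3,5,0.288
1,6,0.511
2,6,0.732
3,6,0.195
"""
ds = pd.read_csv(StringIO(csv_text), index_col=["x", "y"]).to_xarray()
assert isinstance(ds, xr.Dataset)

# Plain NumPy array, indexed as [x, y]
rho = ds["rho"].values

# Label-based lookup: select by coordinate value rather than by position
value = ds["rho"].sel(x=2, y=5).item()

print(rho.shape, value)
```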