I have a very simple file:
[Name]
Streamline 1
[Data]
X [ m ], Y [ m ], Z [ m ], Velocity [ m s^-1 ]
2.66747564e-01, 0.00000000e 00, 2.03140453e-01, (0.00000000e 00, 8.17744827e 00, 0.00000000e 00)
2.66958952e-01, 0.00000000e 00, 2.07407191e-01, (0.00000000e 00, 6.77392197e 00, 0.00000000e 00)
2.63460875e-01, 0.00000000e 00, 2.06593186e-01, (0.00000000e 00, 7.04168701e 00, 0.00000000e 00)
2.65424699e-01, 0.00000000e 00, 2.00831652e-01, (0.00000000e 00, 8.93691921e 00, 0.00000000e 00)
2.70607203e-01, 0.00000000e 00, 2.02286631e-01, (0.00000000e 00, 8.45830917e 00, 0.00000000e 00)
2.68299729e-01, 0.00000000e 00, 1.97365344e-01, (0.00000000e 00, 1.00771456e 01, 0.00000000e 00)
...
I need to load the velocity as a vector, into a single row.
My basic code:
df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5)
But this attempt leads to 1st 2 cols becoming index and the rest splits into 4 columns. index_col=False
can solve the issue with index, but leads to index out of range. I need a delimiter that implicitly tells pandas to ignore whatever is in brackets. I thought python ignore the separator withing brackets while reading a csv file might work but yes, I have spaces everywhere. I found some solutions that use extended functions to load files and handle them by lines, such as CSV file containing column with occasional comma in parentheses crashes pandas.read_csv and Load CSV with data surrounded by parentheses into a pandas dataframe . I however believe that this is a very easy scenario, as all lines are similar and can be solved by one-liner adding delimiter='some_regex'
. I however cannot figure out, how this regex should look. It should look for delimiter ,
but not (.*,.*)
.
I have tried with following, but this results in a single column:
df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5, delimiter=',^(\(.*,.*\))')
EDIT: got to something like this - ,|(?:(\(.*,.*\)))
, but this adds an empty column after each comma.
CodePudding user response:
You can parse manually the file:
data = []
with open('data.csv') as fp:
[next(fp) for i in range(5)] # skiprows=5
headers = [c.strip() for c in next(fp).split(',')]
for line in fp:
data.append([i.strip() for i in re.split(r',(?![^\(]*[\)])', line)])
df = pd.DataFrame(data, columns=headers).apply(pd.eval)
Output:
>>> df
X [ m ] Y [ m ] Z [ m ] Velocity [ m s^-1 ]
0 0.266748 0.0 0.203140 [0.0, 8.17744827, 0.0]
1 0.266959 0.0 0.207407 [0.0, 6.77392197, 0.0]
2 0.263461 0.0 0.206593 [0.0, 7.04168701, 0.0]
3 0.265425 0.0 0.200832 [0.0, 8.93691921, 0.0]
4 0.270607 0.0 0.202287 [0.0, 8.45830917, 0.0]
5 0.268300 0.0 0.197365 [0.0, 10.0771456, 0.0]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X [ m ] 6 non-null float64
1 Y [ m ] 6 non-null float64
2 Z [ m ] 6 non-null float64
3 Velocity [ m s^-1 ] 6 non-null object
dtypes: float64(3), object(1)
memory usage: 320.0 bytes
>>> type(df.iloc[0, 3]) # [0.0, 8.17744827, 0.0]
list
>>> type(df.iloc[0, 3][1]) # 8.17744827
float
CodePudding user response:
After numerous attempts, I have found an answer how to create a very simple one-liner on this. Here it is if anyone is interested:
df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5, delimiter=',(?![^\(]*[\)])', engine="python")
Delimiter checks for the comma in everything outside the brackets. Simple like a charm :)