Home > Software engineering >  pd.read_csv ignore comma if it is inside parenthesis
pd.read_csv ignore comma if it is inside parenthesis

Time:10-12

I have a very simple file:

[Name]
Streamline 1


[Data]
X [ m ], Y [ m ], Z [ m ], Velocity [ m s^-1 ]
2.66747564e-01, 0.00000000e 00, 2.03140453e-01, (0.00000000e 00, 8.17744827e 00, 0.00000000e 00)
2.66958952e-01, 0.00000000e 00, 2.07407191e-01, (0.00000000e 00, 6.77392197e 00, 0.00000000e 00)
2.63460875e-01, 0.00000000e 00, 2.06593186e-01, (0.00000000e 00, 7.04168701e 00, 0.00000000e 00)
2.65424699e-01, 0.00000000e 00, 2.00831652e-01, (0.00000000e 00, 8.93691921e 00, 0.00000000e 00)
2.70607203e-01, 0.00000000e 00, 2.02286631e-01, (0.00000000e 00, 8.45830917e 00, 0.00000000e 00)
2.68299729e-01, 0.00000000e 00, 1.97365344e-01, (0.00000000e 00, 1.00771456e 01, 0.00000000e 00)
...

I need to load the velocity as a vector, into a single row.

My basic code:

df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5)

But this attempt leads to 1st 2 cols becoming index and the rest splits into 4 columns. index_col=False can solve the issue with index, but leads to index out of range. I need a delimiter that implicitly tells pandas to ignore whatever is in brackets. I thought python ignore the separator withing brackets while reading a csv file might work but yes, I have spaces everywhere. I found some solutions that use extended functions to load files and handle them by lines, such as CSV file containing column with occasional comma in parentheses crashes pandas.read_csv and Load CSV with data surrounded by parentheses into a pandas dataframe . I however believe that this is a very easy scenario, as all lines are similar and can be solved by one-liner adding delimiter='some_regex'. I however cannot figure out, how this regex should look. It should look for delimiter , but not (.*,.*).

I have tried with following, but this results in a single column:

df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5,  delimiter=',^(\(.*,.*\))')

EDIT: got to something like this - ,|(?:(\(.*,.*\))), but this adds an empty column after each comma.

CodePudding user response:

You can parse manually the file:

data = []
with open('data.csv') as fp:
    [next(fp) for i in range(5)]  # skiprows=5
    headers = [c.strip() for c in next(fp).split(',')]
    for line in fp:
        data.append([i.strip() for i in re.split(r',(?![^\(]*[\)])', line)])

df = pd.DataFrame(data, columns=headers).apply(pd.eval)

Output:

>>> df
    X [ m ]  Y [ m ]   Z [ m ]     Velocity [ m s^-1 ]
0  0.266748      0.0  0.203140  [0.0, 8.17744827, 0.0]
1  0.266959      0.0  0.207407  [0.0, 6.77392197, 0.0]
2  0.263461      0.0  0.206593  [0.0, 7.04168701, 0.0]
3  0.265425      0.0  0.200832  [0.0, 8.93691921, 0.0]
4  0.270607      0.0  0.202287  [0.0, 8.45830917, 0.0]
5  0.268300      0.0  0.197365  [0.0, 10.0771456, 0.0]

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   X [ m ]              6 non-null      float64
 1   Y [ m ]              6 non-null      float64
 2   Z [ m ]              6 non-null      float64
 3   Velocity [ m s^-1 ]  6 non-null      object 
dtypes: float64(3), object(1)
memory usage: 320.0  bytes

>>> type(df.iloc[0, 3])  # [0.0, 8.17744827, 0.0]
list

>>> type(df.iloc[0, 3][1])  # 8.17744827
float

CodePudding user response:

After numerous attempts, I have found an answer how to create a very simple one-liner on this. Here it is if anyone is interested:

df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5,  delimiter=',(?![^\(]*[\)])', engine="python")

Delimiter checks for the comma in everything outside the brackets. Simple like a charm :)

  • Related