Python read csv with array columns


I have some csv files that contain array columns. For example:

a,b,c
1,1|2|3,4.5|5.5|6.5
2,7|8|9,10.5|11.5|12.5

Delimiter 1 is , and separates the fields a, b and c. Delimiter 2 is | in this case, but it could be changed.

Is there a way in Python to read this directly as a pandas dataframe? Fields b and c should each be an array/series inside the dataframe.

What I do now is read the csv as strings:

import pandas as pd

df = pd.read_csv('data.csv', dtype='str')

Then I use np.fromstring to convert all the strings to numpy arrays:

import numpy as np

type_dict = {
    "a": "int",
    "b": "int",
    "c": "float"
}

def make_split(text, dt):
    # Parse one '|'-separated cell into a NumPy array of the given dtype.
    return np.fromstring(text, sep="|", dtype=dt)

# Apply the parser column by column, looking up each column's dtype by name.
df = df.apply(lambda x: x.apply(make_split, dt=type_dict[x.name]))

But this takes several minutes for my files. Is there a faster option?

CodePudding user response:

You can use the converters parameter of .read_csv() to parse the columns:

import pandas as pd
import numpy as np


df = pd.read_csv('data.csv', converters={
    'b': lambda x: np.array(x.split('|'), dtype='int'),
    'c': lambda x: np.array(x.split('|'), dtype='float')
})
print(df)
print(df)

Output:

   a          b                   c
0  1  [1, 2, 3]     [4.5, 5.5, 6.5]
1  2  [7, 8, 9]  [10.5, 11.5, 12.5]
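If speed is the main concern: each converter above is a Python lambda called once per cell, which can still add up on large files. A possible bulk variant (a sketch, not from the original answer; it assumes every row of a column holds the same number of |-separated values, as in the sample) parses each whole column in a single call:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', dtype='str')

def split_column(col, dt):
    # Join all cells into one long '|'-separated string, parse it in a
    # single np.fromstring call, then cut it back into one row per line.
    flat = np.fromstring('|'.join(col), sep='|', dtype=dt)
    return list(flat.reshape(len(col), -1))

df['a'] = df['a'].astype('int')
df['b'] = split_column(df['b'], 'int')
df['c'] = split_column(df['c'], 'float')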

CodePudding user response:

import pandas as pd

df = pd.read_csv(r'csv2.txt', sep=',')

df['b'] = df['b'].str.split('|').apply(lambda x: [int(i) for i in x])
df['c'] = df['c'].str.split('|').apply(lambda x: [float(i) for i in x])
df
   a          b                   c
0  1  [1, 2, 3]     [4.5, 5.5, 6.5]
1  2  [7, 8, 9]  [10.5, 11.5, 12.5]
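Note that .str.split() leaves Python lists in the cells; if NumPy arrays are wanted, as the question asks, one more step converts them (a small sketch, assuming the df built just above):

import numpy as np

# Convert the split lists into 1-D NumPy arrays (dtype is inferred).
df['b'] = df['b'].map(np.array)
df['c'] = df['c'].map(np.array)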

CodePudding user response:

Does your data always respect the same format? I mean, one value for a, three for b and three for c?

If so, why not just change all the | to commas and have 7 columns: a,b1,b2,b3,c1,c2,c3?

By doing so, you avoid dealing with strings at all, since it is clear that your data is numeric. A minimal sketch of this idea follows.
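For instance (a sketch, assuming the sample file from the question; the regex separator needs pandas' slower Python engine, and the 7 column names here are made up, not from the original post):

import pandas as pd

# Treat both ',' and '|' as field separators so every value lands in its
# own numeric column. The original header only names 3 fields, so skip it
# and supply 7 names explicitly.
df = pd.read_csv(
    'data.csv',
    sep=r'[,|]',
    engine='python',
    skiprows=1,
    names=['a', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3'],
)
print(df.dtypes)  # all columns parse as numeric, no string handling needed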
