I have some CSV files that contain array columns. For example:
a,b,c
1,1|2|3,4.5|5.5|6.5
2,7|8|9,10.5|11.5|12.5
Delimiter 1 is , and separates the fields a, b and c. Delimiter 2 is | in this case, but could be changed.
Is there a way in Python to read this directly as a pandas dataframe? Fields b and c should become arrays/series inside the dataframe.
What I do now is read the CSV as strings:
df = pd.read_csv('data.csv', dtype='str')
Then I use np.fromstring to convert all the strings to NumPy arrays:
type_dict = {
    "a": "int",
    "b": "int",
    "c": "float"
}

def make_split(text, dt):
    return np.fromstring(text, sep="|", dtype=dt)

df = df.apply(lambda x: x.apply(make_split, dt=type_dict[x.name]))
But this takes several minutes for my files. Is there a faster option?
CodePudding user response:
You can use the converters parameter of .read_csv() to parse the columns:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv', converters={
    'b': lambda x: np.array(x.split('|'), dtype='int'),
    'c': lambda x: np.array(x.split('|'), dtype='float')
})
print(df)
Output:

   a          b                   c
0  1  [1, 2, 3]     [4.5, 5.5, 6.5]
1  2  [7, 8, 9]  [10.5, 11.5, 12.5]
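Since the question says the inner delimiter could be changed, the converter can also be built from a small factory instead of hardcoding '|'. A minimal sketch; make_converter is an illustrative helper, not a pandas API:

import numpy as np
import pandas as pd

def make_converter(dtype, sep='|'):
    # Build a converter that splits on `sep` and casts the pieces to `dtype`.
    return lambda text: np.array(text.split(sep), dtype=dtype)

df = pd.read_csv('data.csv', converters={
    'b': make_converter('int'),
    'c': make_converter('float'),
})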
CodePudding user response:
df = pd.read_csv(r'csv2.txt', sep=',')
df['b'] = df['b'].str.split('|').apply(lambda x: [int(i) for i in x])
df['c'] = df['c'].str.split('|').apply(lambda x: [float(i) for i in x])
df
   a          b                   c
0  1  [1, 2, 3]     [4.5, 5.5, 6.5]
1  2  [7, 8, 9]  [10.5, 11.5, 12.5]
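If the row-by-row apply is still too slow and every row holds the same number of values, one bulk cast of the split lists into a 2-D NumPy array may be faster than converting each cell separately. A sketch under that equal-length assumption:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', dtype='str')

# Assumes every 'b' cell has the same number of '|'-separated values,
# so the split lists form a rectangular 2-D array cast in one step.
b = np.array(df['b'].str.split('|').tolist(), dtype='int')
df['b'] = list(b)  # one 1-D array per row

c = np.array(df['c'].str.split('|').tolist(), dtype='float')
df['c'] = list(c)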
CodePudding user response:
Does your data always follow the same format? I mean one value for a, 3 for b and 3 for c? If so, why not just replace all the | with commas and have 7 columns: a,b1,b2,b3,c1,c2,c3? That way you avoid dealing with strings entirely, since your data is clearly numeric.
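If editing the files themselves is not an option, the same flattening can be done at read time with str.split(expand=True). A minimal sketch; the b1..c3 column names are only illustrative:

import pandas as pd

df = pd.read_csv('data.csv', dtype='str')

# Expand each array column into its own set of numeric columns.
b = df['b'].str.split('|', expand=True).astype('int')
b.columns = ['b1', 'b2', 'b3']
c = df['c'].str.split('|', expand=True).astype('float')
c.columns = ['c1', 'c2', 'c3']

flat = pd.concat([df['a'].astype('int'), b, c], axis=1)
print(flat)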