I have comma separated datasets with more than 50k rows like this:
items
a,e
b,b,d
c
d,e,a,c,b
c,a,c,e
b,d,a,c,e
a,d
d,d
I want to add them to different columns using pandas:
item1 item2 item3 item4 item5
a e NaN NaN NaN
b b d NaN NaN
c NaN NaN NaN NaN
d e a c b
c a c e NaN
b d a c e
a d NaN NaN NaN
d d NaN NaN NaN
CodePudding user response:
A simple solution is to split the column, convert it to a list and then create a new dataframe. Then rename the columns from the default values:
df = pd.DataFrame(df['items'].str.split(',').values.tolist())
df.columns = 1
df = df.add_prefix('item_')
Result:
item_0 item_1 item_2 item_3 item_4
0 a e None None None
1 b b d None None
2 c None None None None
3 d e a c b
4 c a c e None
5 b d a c e
6 a d None None None
7 d d None None None
CodePudding user response:
You could split your file by \n first like this
data = '''a,e
b,b,d
c
d,e,a,c,b
c,a,c,e
b,d,a,c,e
a,d
d,d'''
lines=data.split("\n")
then create a empty Dataframe
df = pd.DataFrame(columns=['item1','item2','item3','item4','item5'])
and add data line by line
for line in lines:
line_data = line.split(",")
line_data = (5-len(line_data))*[None]
df = pd.concat( [pd.DataFrame([line_data], columns=df.columns), df] , ignore_index=True)
output:
item1 item2 item3 item4 item5
0 d d None None None
1 a d None None None
2 b d a c e
3 c a c e None
4 d e a c b
5 c None None None None
6 b b d None None
7 a e None None None
8 d d None None None
9 a d None None None
10 b d a c e
11 c a c e None
12 d e a c b
13 c None None None None
14 b b d None None
15 a e None None None
16 d d None None None
17 a d None None None
18 b d a c e
19 c a c e None
20 d e a c b
21 c None None None None
22 b b d None None
23 a e None None None