I have a list of tuples that I need to convert to a DataFrame.
res1_ = [
('z1', '1'),
('z1', '2'),
('x1', '1'),
('x2', '1'),
('x1', '3'),
('z1', '1')]
My expected DataFrame should look like this:
docid secid
z1 [1,2]
x1 [1]
x2 [1]
x1 [3]
z1 [1]
Note that the order is unchanged, and when a docid is repeated in the next row, the secids are merged into a single list. Although x1 occurs twice, secids 1 and 3 are not merged into one list, because docid x2 sits between the two x1 rows.
I tried this:
df = pd.DataFrame(res1_,columns=['docid','secid'])
df.groupby('docid')['secid'].apply(list)
But no luck: I lose the original order, and the two separate x1 rows get grouped together.
Any pointers appreciated.
Thank you.
CodePudding user response:
You can use the DataFrame constructor, then GroupBy.agg:
import pandas as pd
df = pd.DataFrame(res1_, columns=['docid', 'secid'])
# start a new group label each time docid differs from the previous row
group = df['docid'].ne(df['docid'].shift()).cumsum()
df = df.groupby(group.values).agg({'docid': 'first', 'secid': list})
output:
docid secid
1 z1 [1, 2]
2 x1 [1]
3 x2 [1]
4 x1 [3]
5 z1 [1]
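To see why this keeps the order and leaves the two x1 rows apart, it may help to look at the intermediate grouping key. This is only a sketch of the same approach; the variable name `starts` is mine and is not part of the answer above.

```python
import pandas as pd

res1_ = [('z1', '1'), ('z1', '2'), ('x1', '1'),
         ('x2', '1'), ('x1', '3'), ('z1', '1')]

df = pd.DataFrame(res1_, columns=['docid', 'secid'])

# True wherever docid differs from the previous row.
# The first row is always True because shift() puts NaN there.
starts = df['docid'].ne(df['docid'].shift())

# Running count of those "starts": rows in the same consecutive run
# share one label, and a later run of the same docid gets a new label.
group = starts.cumsum()

print(group.tolist())  # [1, 1, 2, 3, 4, 5]
```

Grouping on these labels rather than on docid itself is what keeps the second x1 (label 4) separate from the first (label 2). It also explains why the output index starts at 1.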
CodePudding user response:
You could use itertools.groupby to group the data, and then convert to a dataframe:
from itertools import groupby
import pandas as pd
# itertools.groupby only merges consecutive items with the same key
grps = [(k, [t[1] for t in g]) for k, g in groupby(res1_, key=lambda x: x[0])]
df = pd.DataFrame(grps, columns=['docid', 'secid'])
Output:
docid secid
0 z1 [1, 2]
1 x1 [1]
2 x2 [1]
3 x1 [3]
4 z1 [1]
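If you need this in more than one place, the same idea could be wrapped in a small helper. This is a minimal sketch, assuming the input is always a sequence of (docid, secid) pairs; the function name `merge_consecutive` is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

import pandas as pd


def merge_consecutive(pairs):
    """Collapse runs of consecutive equal docids into (docid, [secids]) rows.

    Hypothetical helper name; assumes each item is a (docid, secid) pair.
    """
    rows = [(docid, [secid for _, secid in run])
            for docid, run in groupby(pairs, key=itemgetter(0))]
    return pd.DataFrame(rows, columns=['docid', 'secid'])


res1_ = [('z1', '1'), ('z1', '2'), ('x1', '1'),
         ('x2', '1'), ('x1', '3'), ('z1', '1')]

df = merge_consecutive(res1_)
print(df)
```

Because the input is not sorted before grouping, the row order follows the original list and non-adjacent repeats of a docid stay in separate rows. Note that the secids remain strings, as in the input.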