Convert tuples into grouped rows in a dataframe without changing the order

I have a list of tuples that I need to convert to a dataframe.

res1_ =  [
  ('z1', '1'),
  ('z1', '2'),
  ('x1', '1'),
  ('x2', '1'),
  ('x1', '3'),
  ('z1', '1')]

My expected dataframe should look like this:

docid secid
z1    [1,2]
x1    [1]
x2    [1]
x1    [3]
z1    [1]

Note that the order is unchanged, and when the same docid repeats in consecutive rows, its secids are merged into a single list. Although x1 occurs twice, secids 1 and 3 are not merged into one list, because docid x2 sits between the two x1 rows.

I tried:

import pandas as pd

df = pd.DataFrame(res1_,columns=['docid','secid'])
df.groupby('docid')['secid'].apply(list)

But no luck: I lose the order, and the two separate x1 rows get grouped together.

Any pointers appreciated.

Thank you.

CodePudding user response：

You can use the DataFrame constructor, then GroupBy.agg:

df = pd.DataFrame(res1_, columns=['docid', 'secid'])
# a new group label starts wherever docid differs from the previous row
group = df['docid'].ne(df['docid'].shift()).cumsum()
df = df.groupby(group.values).agg({'docid': 'first', 'secid': list})

Output:

  docid   secid
1    z1  [1, 2]
2    x1     [1]
3    x2     [1]
4    x1     [3]
5    z1     [1]
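
To see why the shift/cumsum key in this answer only merges adjacent rows, it can help to print the intermediate values. This is just a small sketch on the question's data; the names starts and run_id are made up for illustration and are not part of the answer:

```python
import pandas as pd

res1_ = [('z1', '1'), ('z1', '2'), ('x1', '1'),
         ('x2', '1'), ('x1', '3'), ('z1', '1')]
df = pd.DataFrame(res1_, columns=['docid', 'secid'])

# True wherever docid differs from the row above it
# (the first row is compared against NaN, so it is always True)
starts = df['docid'].ne(df['docid'].shift())

# Cumulative count of the True values: every run of consecutive
# identical docids ends up sharing one integer label
run_id = starts.cumsum()

print(starts.tolist())  # [True, False, True, True, True, True]
print(run_id.tolist())  # [1, 1, 2, 3, 4, 5]
```

Because the two x1 rows get different labels (2 and 4), grouping on run_id keeps them apart, and since the labels increase from top to bottom the original row order is kept as well.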

CodePudding user response：

You could use itertools.groupby to group the data, and then convert to a dataframe:

from itertools import groupby
import pandas as pd

# groupby only merges adjacent items with the same key, so the order is kept
grps = [(k, [t[1] for t in g]) for k, g in groupby(res1_, key=lambda x: x[0])]
df = pd.DataFrame(grps, columns=['docid', 'secid'])

Output:

  docid   secid
0    z1  [1, 2]
1    x1     [1]
2    x2     [1]
3    x1     [3]
4    z1     [1]
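
If you want this as a reusable piece, the same idea could be wrapped in a small function. A minimal sketch, assuming the input is always an iterable of (docid, secid) pairs; group_consecutive is a hypothetical name, not something from pandas or itertools:

```python
from itertools import groupby
from operator import itemgetter

import pandas as pd


def group_consecutive(pairs):
    """Collapse runs of adjacent pairs that share the same first element.

    Hypothetical helper for illustration; assumes (docid, secid) pairs.
    """
    # itertools.groupby starts a new group whenever the key changes,
    # so non-adjacent repeats of a docid stay in separate rows
    rows = [(key, [sec for _, sec in run])
            for key, run in groupby(pairs, key=itemgetter(0))]
    return pd.DataFrame(rows, columns=['docid', 'secid'])


res1_ = [('z1', '1'), ('z1', '2'), ('x1', '1'),
         ('x2', '1'), ('x1', '3'), ('z1', '1')]
out = group_consecutive(res1_)
print(out)
```

Note that the secid values stay strings here, exactly as they are in the input tuples; pandas just prints lists of strings without quotes.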