Python: in dataframe, how to concatenate a column across rows (col. being list of lists)?-CodePudding

I have a Dataframe as shown below as dataframe1:

dataframe1

ID | PATH

ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach]]

ABC [[apple, plum, peach], [apple, pear, peach]]

BCD [[blueberry, plum, peach], [pear, apple, peach]]

BCD [[plum, apple, peach], [banana, raspberry, peach]]

-------------------------------------------------------------------

I would like to concatenate the value of column 'PATH' by column 'ID' and have the following results:

dataframe2 (ideal output, the output is the list of lists)

ID | PATH

ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]

BCD [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]

I used the following code:

df2 = df1.groupby('id')['PATH'].apply(list)

but got the results with [[[ ]]] as shown which is list of list of lists...and is not what I want.

dataframe3 (wrong output)

ID | PATH

ABC [ [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]]

BCD [[[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]]

can someone tell me what I need to change to have results like in dataframe2?

Thank you!

CodePudding user response：

If im interpreting this properly you might want to look into the 'merge' function. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

CodePudding user response：

Try:

df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()

df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()

Full working example:

data=[
  ["ABC", [["orange", "apple", "kiwi", "peach"], ["strawberry", "orange", "kiwi", "peach"]]],
  ["ABC", [["apple", "plum", "peach"], ["apple", "pear", "peach"]]],
  ["BCD", [["blueberry", "plum", "peach"], ["pear", "apple", "peach"]]],
  ["BCD", [["plum", "apple", "peach"], ["banana", "raspberry", "peach"]]],
]

columns = ["ID", "PATH"]

df = pd.DataFrame(data=data, columns=columns)

df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
#OR
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()

[Out]:
 ID                                                                                                           PATH
ABC   [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]
BCD              [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]