I have a Dataframe as shown below as dataframe1:
dataframe1
ID | PATH
ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach]]
ABC [[apple, plum, peach], [apple, pear, peach]]
BCD [[blueberry, plum, peach], [pear, apple, peach]]
BCD [[plum, apple, peach], [banana, raspberry, peach]]
-------------------------------------------------------------------
I would like to concatenate the value of column 'PATH' by column 'ID' and have the following results:
dataframe2 (ideal output, the output is the list of lists)
ID | PATH
ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]
BCD [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]
I used the following code:
df2 = df1.groupby('id')['PATH'].apply(list)
but got the results with [[[ ]]] as shown which is list of list of lists...and is not what I want.
dataframe3 (wrong output)
ID | PATH
ABC [ [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]]
BCD [[[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]]
can someone tell me what I need to change to have results like in dataframe2?
Thank you!
CodePudding user response:
If im interpreting this properly you might want to look into the 'merge' function. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
CodePudding user response:
Try:
df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
or
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()
Full working example:
data=[
["ABC", [["orange", "apple", "kiwi", "peach"], ["strawberry", "orange", "kiwi", "peach"]]],
["ABC", [["apple", "plum", "peach"], ["apple", "pear", "peach"]]],
["BCD", [["blueberry", "plum", "peach"], ["pear", "apple", "peach"]]],
["BCD", [["plum", "apple", "peach"], ["banana", "raspberry", "peach"]]],
]
columns = ["ID", "PATH"]
df = pd.DataFrame(data=data, columns=columns)
df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
#OR
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()
[Out]:
ID PATH
ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]
BCD [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]