I have the following dataframe:
column1 column2
0 Paul xx
1 John aa
2 Paul gg
3 John xx
4 John bb
5 George gg
6 Paul gg
7 John xx
.
n Jonathan ff
I want to have the information of each person in one row. On the same row, I want to have the index, but in another column. So I want a dataframe like this:
column1 column2 column3
0 Paul 0,2,6 xx, gg, gg
1 John 1,3,4,7 aa, xx, bb, xx
5 George 5 gg
.
.
.
n Jonathan n ff
In order to make the above dataframe, I execute:
df2 = df.reset_index().groupby('column1').agg(list).reset_index()
ix = pd.Index(df2['index'].str.get(0)).rename(None)
df3 = df2.set_index(ix).sort_index()
df3
Which returns:
column1 index column2
0 Paul [0, 2, 6] [xx, gg, gg]
1 John [1, 3, 4, 7] [aa, xx, bb,xx]
5 George [5] [gg]
After that, I delete column1 and index. To have the values of column2 as plain strings rather than lists, I execute:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(x))
    return df3
dfb = transform_list(df3)
df3.head()
which returns:
column2
0 xx, gg, gg
1 aa, xx, bb,xx
5 gg
So now what I want is to have only the unique values in each row, so my final dataframe will be:
column2
0 xx, gg
1 aa, xx, bb
5 gg
Any ideas?
CodePudding user response:
You could transform each list to a set and then back to a list to eliminate duplicate entries (note that this does not preserve the original order of the elements). This could be done within your lambda function:
import pandas as pd
df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])
df['column2'] = df.column2.apply(lambda x: ', '.join(list(set(x))))
df
column2
0 gg, xx
1 bb, xx, aa
5 gg
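Because sets are unordered, the element order shown above is not guaranteed to be the same on every run. If a reproducible result is wanted and alphabetical order is acceptable, one option is to sort the set before joining. This is a small variation on the snippet above, not part of the original answer, and it reuses the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])

# set() drops the duplicates; sorted() returns a list in alphabetical order,
# which replaces the arbitrary iteration order of the set
df['column2'] = df['column2'].apply(lambda x: ', '.join(sorted(set(x))))
print(df)
```

Note that this gives alphabetical order, not the order in which the values first appeared.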
CodePudding user response:
As long as the order of the elements in your output doesn't matter, you can redefine your function as follows:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(set(x)))
    return df3
A set inherently contains only unique elements, so converting the list x to a set will discard any duplicates. Sets are inherently unordered, however, so you may get unintended results if order matters.
If order does matter, you can use this version instead:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(list(dict.fromkeys(x))))
    return df3
This creates a dictionary (which is insertion-ordered in Python 3.7 and later) with keys taken from your initial list x, and since a dictionary cannot hold the same key twice, we end up with only the unique elements. Converting back to a list takes the keys from the dictionary, and the rest of the workflow can continue as needed without alteration.
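As a rough end-to-end sketch, the same dict.fromkeys idea can also be applied directly during the groupby, starting from the original two-column dataframe. The helper name unique_in_order and the temporary column name first_idx are made up for this illustration, and the sample rows are the eight visible rows from the question (with the last name read as John):

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['Paul', 'John', 'Paul', 'John', 'John', 'George', 'Paul', 'John'],
    'column2': ['xx', 'aa', 'gg', 'xx', 'bb', 'gg', 'gg', 'xx'],
})

def unique_in_order(values):
    # dict.fromkeys keeps the first occurrence of each value, in insertion order
    return ','.join(dict.fromkeys(values))

result = (
    df.reset_index()                       # expose the original index as a column
      .groupby('column1', sort=False)      # keep groups in order of first appearance
      .agg(first_idx=('index', 'first'),   # first original index of each person
           column2=('column2', unique_in_order))
      .set_index('first_idx')              # replaces the column1 group index
      .rename_axis(None)
)
print(result)
```

This leaves a single column2 column indexed by the first row in which each person appears, which matches the shape of the final dataframe asked for in the question.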