Creating new column with sets in Pandas DataFrame

Hello_world!

I have a DataFrame like this:

from pandas import DataFrame

df = DataFrame({"A": ['sd', 'df', 'gh', 'rv'],
                "B": ['hj', '4r', 'tg', '2s'],
                "C": ['hf', 'qw', 'e4', '7u'],
                "D": ['1q', 'nc', 'xf', '7y'],
                "E": ['9i', 'g7', 'ce', 'x3']})

    A   B   C   D   E
0   sd  hj  hf  1q  9i
1   df  4r  qw  nc  g7
2   gh  tg  e4  xf  ce
3   rv  2s  7u  7y  x3

I need to create a new column, F, whose values are of type set, each built from that row's values in the first five columns.

Expected result is:

    A   B   C   D   E   F
0   sd  hj  hf  1q  9i  {'sd', 'hj', 'hf', '1q', '9i'}
1   df  4r  qw  nc  g7  {'df', '4r', 'qw', 'nc', 'g7'}
2   gh  tg  e4  xf  ce  {'gh', 'tg', 'e4', 'xf', 'ce'}
3   rv  2s  7u  7y  x3  {'rv', '2s', '7u', '7y', 'x3'}

print(type(df.loc[0, 'F']))    # <class 'set'>
print(type(df.loc[0, 'A']))    # <class 'str'>

My code:

from pandas import DataFrame

df = DataFrame({"A": ['sd', 'df', 'gh', 'rv'],
                "B": ['hj', '4r', 'tg', '2s'],
                "C": ['hf', 'qw', 'e4', '7u'],
                "D": ['1q', 'nc', 'xf', '7y'],
                "E": ['9i', 'g7', 'ce', 'x3']})

f = {df.loc[0, 'A'], df.loc[0, 'B'], df.loc[0, 'C'], df.loc[0, 'D'], df.loc[0, 'E']}

df = df.assign(F = f)

print(df)

...raises ValueError: Length of values (5) does not match length of index (4).

If I rewrite the code so that the length of the values matches the length of the index:

from pandas import DataFrame

df = DataFrame({"A": ['sd', 'df', 'gh', 'rv'],
                "B": ['hj', '4r', 'tg', '2s'],
                "C": ['hf', 'qw', 'e4', '7u'],
                "D": ['1q', 'nc', 'xf', '7y'],
                "E": ['9i', 'g7', 'ce', 'x3']})

f = {df.loc[0, 'A'], df.loc[0, 'B'], df.loc[0, 'C'], df.loc[0, 'D']}

df = df.assign(F = f)

print(df)

...I get TypeError: 'set' type is unordered.

How can I create this column? Any help is appreciated.

CodePudding user response:

Simply use:

df['F'] = df.apply(set, axis=1)

Note, however, that you have no control over the displayed order of the sets, as sets are unordered containers.

Output:

    A   B   C   D   E                     F
0  sd  hj  hf  1q  9i  {hj, 1q, 9i, sd, hf}
1  df  4r  qw  nc  g7  {nc, df, qw, g7, 4r}
2  gh  tg  e4  xf  ce  {e4, xf, gh, ce, tg}
3  rv  2s  7u  7y  x3  {7u, x3, rv, 2s, 7y}
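If a stable display order matters, one option is to keep the set column and derive a sorted view from it. This is only a sketch: the column name F_sorted is made up for illustration and is not part of the question.

```python
from pandas import DataFrame

df = DataFrame({"A": ['sd', 'df', 'gh', 'rv'],
                "B": ['hj', '4r', 'tg', '2s'],
                "C": ['hf', 'qw', 'e4', '7u'],
                "D": ['1q', 'nc', 'xf', '7y'],
                "E": ['9i', 'g7', 'ce', 'x3']})

# One set per row, as in the answer above.
df['F'] = df.apply(set, axis=1)

# Sets have no defined order, so sort each one into a list for display.
# "F_sorted" is a hypothetical helper column.
df['F_sorted'] = df['F'].map(sorted)

print(type(df.loc[0, 'F']))    # <class 'set'>
print(df.loc[0, 'F_sorted'])   # ['1q', '9i', 'hf', 'hj', 'sd']
```

The sorted column holds lists, not sets, so use F for membership tests and set operations, and F_sorted only for display.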

CodePudding user response:

Use a list comprehension for better performance:

In [1069]: df['F'] = [set(i) for i in df.values]

In [1070]: df
Out[1070]: 
    A   B   C   D   E                     F
0  sd  hj  hf  1q  9i  {sd, 1q, hf, 9i, hj}
1  df  4r  qw  nc  g7  {g7, 4r, nc, df, qw}
2  gh  tg  e4  xf  ce  {xf, gh, ce, e4, tg}
3  rv  2s  7u  7y  x3  {2s, x3, 7y, rv, 7u}

Or, as suggested by @jezrael, with to_numpy(), which the pandas documentation recommends over .values:

df['F'] = [set(i) for i in df.to_numpy()]
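Both variants above read every column of df, so running them a second time would also fold the existing F column into the sets. A minimal sketch that restricts the comprehension to the five source columns follows. The list name cols is an assumption for illustration.

```python
from pandas import DataFrame

df = DataFrame({"A": ['sd', 'df', 'gh', 'rv'],
                "B": ['hj', '4r', 'tg', '2s'],
                "C": ['hf', 'qw', 'e4', '7u'],
                "D": ['1q', 'nc', 'xf', '7y'],
                "E": ['9i', 'g7', 'ce', 'x3']})

# Hypothetical list naming the columns that feed the sets.
cols = ['A', 'B', 'C', 'D', 'E']

# One set per row, built only from the selected columns.
df['F'] = [set(row) for row in df[cols].to_numpy()]

# Re-running is safe, because "F" itself is never read.
df['F'] = [set(row) for row in df[cols].to_numpy()]

print(type(df.loc[0, 'F']))    # <class 'set'>
print(type(df.loc[0, 'A']))    # <class 'str'>
```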

Performance timings on the example DataFrame:

@mozway's solution:

In [1078]: %timeit df.apply(set, axis=1)
395 µs ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

My solution:

In [1079]: %timeit [set(i) for i in df.values]
7.3 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
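The timings above were taken on a 4-row frame, so it may be worth repeating the comparison on more data. A rough sketch using the standard timeit module is below. The 10,000-row frame big is an assumption made by repeating the example rows, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd
from pandas import DataFrame

df = DataFrame({"A": ['sd', 'df', 'gh', 'rv'],
                "B": ['hj', '4r', 'tg', '2s'],
                "C": ['hf', 'qw', 'e4', '7u'],
                "D": ['1q', 'nc', 'xf', '7y'],
                "E": ['9i', 'g7', 'ce', 'x3']})

# Hypothetical larger frame: the 4 example rows repeated 2500 times.
big = pd.concat([df] * 2500, ignore_index=True)

# Both approaches should produce the same sets, row for row.
via_apply = list(big.apply(set, axis=1))
via_listcomp = [set(row) for row in big.to_numpy()]
same_result = via_apply == via_listcomp

# Time three runs of each approach.
t_apply = timeit.timeit(lambda: big.apply(set, axis=1), number=3)
t_listcomp = timeit.timeit(lambda: [set(row) for row in big.to_numpy()], number=3)

print(f"same result: {same_result}")
print(f"apply: {t_apply:.3f} s, list comprehension: {t_listcomp:.3f} s")
```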