How can I locate common substrings in a pandas DataFrame column?

I'm new to Python and would appreciate any tips on how to locate common substrings across a column of my DataFrame.

I have a dataset of video channels and the users who watched videos on them. I want to find out which channels share the same audience and get the IDs of those shared watchers.

My data looks like this:

import pandas as pd

d = {'channel': [1, 2, 3], 'users': [['uid01', 'uid03'], ['uid02', 'uid03', 'uid07'], ['uid06', 'uid01']]}
df = pd.DataFrame(data=d)

In the real data the number of IDs is much higher, reaching several hundred for most channels. There are only 9 channels.

I expect a separate column for each channel, where each value holds the IDs of the users who watched both that column's channel and the row's channel. The expected result looks like this:

result = {'channel':[1, 2, 3], 
      'users':[['uid01', 'uid03'], ['uid02', 'uid03', 'uid07'], ['uid06', 'uid01']], 
      '1': [['uid01', 'uid03'], ['uid03'], ['uid01']],
      '2': [['uid03'], ['uid02', 'uid03', 'uid07'], []],
      '3': [['uid01'], [], ['uid06', 'uid01']]}
df_result = pd.DataFrame(data=result)

My first idea was to solve it with set intersection, but I'm at a loss as to how to do it. So far, all I have managed is a function that adds a single column:

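import numpy as np

def intersection(df, c):
    # Intersect every row's users with the users of row c.
    intersecting = []
    for n in range(len(df)):
        intersecting.append(set(df.users[n]).intersection(df.users[c]))
    col = np.array(intersecting)
    # The column name is still hardcoded, so only one column is ever added.
    df['1'] = col
    return df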

However, this function only works when I call it directly as a standalone function; it fails when I use it with pandas apply(). I would much appreciate your suggestions!

CodePudding user response:

You want to create one new column for every row (that is, every channel) of the original dataframe, so the most straightforward approach is to iterate over the rows and create these new columns one by one:

for c, u in zip(df.channel, df.users):
    df[str(c)] = df.users.apply(lambda x: sorted(list(set(x).intersection(set(u)))))
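
For completeness, here is a minimal self-contained sketch of the same loop on the sample data from the question. It assumes every entry in users is a list of hashable IDs; the names user_sets and audience are my own and not part of the original code. Converting each list to a set once up front avoids rebuilding the sets for every pair, which may help with several hundred IDs per channel.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "channel": [1, 2, 3],
    "users": [["uid01", "uid03"], ["uid02", "uid03", "uid07"], ["uid06", "uid01"]],
})

# Build each row's set of user IDs once (user_sets is an illustrative name).
user_sets = df["users"].apply(set)

# Add one column per channel holding the sorted IDs shared with that channel.
for channel, audience in zip(df["channel"], user_sets):
    df[str(channel)] = user_sets.apply(lambda s: sorted(s & audience))

print(df)
```

Because the IDs are sorted, the order inside each list can differ from the order in the users column. An empty list means the two channels have no viewers in common.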