I'm new to Python and would much appreciate any tips on how to find the common elements across the lists in a column of my dataframe.
I have a dataset with video-channels and users who watched videos there, and I want to find out which channels share the same audience and locate the IDs of the watchers.
My data looks like this:
import pandas as pd

d = {'channel': [1, 2, 3], 'users': [['uid01', 'uid03'], ['uid02', 'uid03', 'uid07'], ['uid06', 'uid01']]}
df = pd.DataFrame(data=d)
In the real data the number of IDs is much higher, going up to several hundred for most of the channels. The number of channels is limited to just 9.
I expect to have separate columns for each channel, with values represented by IDs of the users who watched both of the channels. The expected result is like this:
result = {'channel': [1, 2, 3],
          'users': [['uid01', 'uid03'], ['uid02', 'uid03', 'uid07'], ['uid06', 'uid01']],
          '1': [['uid01', 'uid03'], ['uid03'], ['uid01']],
          '2': [['uid03'], ['uid02', 'uid03', 'uid07'], []],
          '3': [['uid01'], [], ['uid06', 'uid01']]}
df_result = pd.DataFrame(data=result)
My first idea was to solve it with set intersection, but I'm at a loss as to how to do it. So far all I have managed is a function that adds a single column:
import numpy as np

def intersection(df, c):
    intersecting = []
    for n in range(0, 3):
        # users of row n who also appear in row c
        intersecting.append(set(df.users[n]).intersection(df.users[c]))
    col = np.array(intersecting)
    df['1'] = col
    return df
However, this function only works when I call it directly as a standalone function, and fails to execute when I use it with Pandas apply(). I would much appreciate your suggestions!
CodePudding user response:
You want to create a new column for every row of the original dataframe, so the most straightforward approach is to iterate over the rows and create these new columns one by one:
for c, u in zip(df.channel, df.users):
    df[str(c)] = df.users.apply(lambda x: sorted(list(set(x).intersection(set(u)))))
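For reference, here is a self-contained sketch of the same idea on the sample data from the question. It converts each user list to a set once, which may help when the lists hold several hundred IDs. The helper name add_shared_audience is made up for this illustration, and the sketch assumes every value in users is a list of hashable IDs.

```python
import pandas as pd

# Sample data from the question.
d = {
    "channel": [1, 2, 3],
    "users": [["uid01", "uid03"], ["uid02", "uid03", "uid07"], ["uid06", "uid01"]],
}
df = pd.DataFrame(data=d)


def add_shared_audience(frame):
    """Return a copy of frame with one extra column per channel.

    Hypothetical helper: the column named after channel c holds, for each
    row, the sorted IDs that the row's channel shares with channel c.
    """
    out = frame.copy()
    # Build each set once instead of inside every comparison.
    user_sets = [set(users) for users in out["users"]]
    for channel, other in zip(out["channel"], user_sets):
        shared = [sorted(s & other) for s in user_sets]
        # Wrap in a Series so pandas stores the lists as objects, row by row.
        out[str(channel)] = pd.Series(shared, index=out.index)
    return out


df_result = add_shared_audience(df)
print(df_result)
```

Because the IDs are sorted, the order inside each list can differ from the order in the original users column, for example ['uid01', 'uid06'] instead of ['uid06', 'uid01'].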