I have this data.
df1 = pd.DataFrame({"user": [1942, 95870, 85228, 6636],
"0": [1524, 8788, 9899, 27172],
"1": [1333, 4476, 78783, 90832],
"2": [2021, 2022, 34522, 38479]})
df1.set_index(df1['user'])
print(df1)
user 0 1 2
0 1942 1524 1333 2021
1 95870 8788 4476 2022
2 85228 9899 78783 34522
3 6636 27172 90832 38479
df1
and df2
have the same number of rows (real data has 1 million) and each row for df2
consist of several numbers in square brackets, separated by a comma as follows:
df2 = pd.DataFrame({
"0": [[1123, 2021, 8788]]})
The numbers in square brackets vary from 1 to 20, in this case it's only 3. However, the number rows for df2
and df1
are the same.
print(df2)
0
0 [1123, 2021, 8788]
What I would like to do is to pick a user from df1
, for example user1942
and compare if any of the three numbers in that row (1524, 1333, 2021)
are in any of the corresponding square brackets in df2 [1123, 2021, 8788]
. A boolean of 1 if True or 0 otherwise, or any other representation will be helpful.
Update: The comparison should be row by row. That is one row in df1
versus it's corresponding row in df2
.
The output could look like this, or anything closer: Output: Since 2021 is in df2, the answer is True. Then move to the next user in df1, compare that user's 3 numbers with the corresponding row numbers in df2, etc.
user status
1942 1
...
I know that there are a number of questions about comparing multiple columns but I could not find something reproducible or similar to this context.
CodePudding user response:
Assuming the indices of df1 are the same of the ones of df2 and that the relation between the rows of the two is 1:1, I would do something like this:
df3 = pd.DataFrame()
for index, row in df1.iterrows():
to_find = [row[1], row[2], row[3]]
to_check = df2.iloc[index][0]
if [True for i in to_find if i in to_check]:
df3 = df3.append([1])
else:
df3 = df3.append([0])
you create a list for the values in df1 and then ckeck if any of the values appear in the list in the corresponding df2 row, if at least one value matches you append 1 to the resulting df3, otherwise 0. Always assuming your relation is 1:1, now in df3 you have the index corresponding to the user selected and the boolean value to verify the match.
CodePudding user response:
I have yet to find a very elegant solution without relying on for loops, but if we convert the dataframes to np.arrays (or even lists) first, we can get a very decent workaround.
First, we manipulate df1
to obtain organized records
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"user": [1123, 95870, 85228, 6636],
"0": [1524, 8788, 9899, 27172],
"1": [1333, 4476, 78783, 90832],
"2": [2021, 2022, 34522, 38479]})
df1 = df1.set_index('user', drop=True)
print(df1)
0 1 2
user
1123 1524 1333 2021
95870 8788 4476 2022
85228 9899 78783 34522
6636 27172 90832 38479
Then, we do the same to df2
. The proposed method works even if the two dataframes have a different number of columns.
df2 = pd.DataFrame([[1123, 2021, 8788, 6636],
[1333, 2023, 4477, 78783],
[1524, 2023, 9899, 27172],
[2021, 2023, 345233,38479]]
)
df2 = pd.DataFrame(df2.to_numpy().tolist())
df2 = df2.set_axis(df1.index, axis=0)
print(df2)
0 1 2 3
user
1123 1123 1333 8788 6636
95870 1333 2023 4477 78783
85228 1524 2023 9899 27172
6636 2021 2023 345233 38479
Finally, we produce the desired values using a list comprehension that iterates over rows to check whether they have elements in common.
inter = [np.isin(arr[0], arr[1]).any() for arr in zip(df1.to_numpy(), df2.to_numpy())]
print(inter)
[True, False, True, True]