Home > front end >  Comparing multiple columns to a list with multiple values
Comparing multiple columns to a list with multiple values

Time:12-04

I have this data.

df1 = pd.DataFrame({"user": [1942,  95870, 85228,  6636],
                   "0": [1524, 8788, 9899, 27172],
                   "1": [1333, 4476, 78783, 90832],
                   "2": [2021, 2022, 34522, 38479]})
df1.set_index(df1['user'])
print(df1)

    user      0      1      2
0   1942   1524   1333   2021
1  95870   8788   4476   2022
2  85228   9899  78783  34522
3   6636  27172  90832  38479

df1 and df2 have the same number of rows (real data has 1 million) and each row for df2 consist of several numbers in square brackets, separated by a comma as follows:

df2 = pd.DataFrame({
                   "0": [[1123, 2021, 8788]]})

The numbers in square brackets vary from 1 to 20, in this case it's only 3. However, the number rows for df2 and df1 are the same.

print(df2)
           0
0  [1123, 2021, 8788]

What I would like to do is to pick a user from df1, for example user1942 and compare if any of the three numbers in that row (1524, 1333, 2021) are in any of the corresponding square brackets in df2 [1123, 2021, 8788]. A boolean of 1 if True or 0 otherwise, or any other representation will be helpful.

Update: The comparison should be row by row. That is one row in df1 versus it's corresponding row in df2.

The output could look like this, or anything closer: Output: Since 2021 is in df2, the answer is True. Then move to the next user in df1, compare that user's 3 numbers with the corresponding row numbers in df2, etc.

user      status
1942        1
...

I know that there are a number of questions about comparing multiple columns but I could not find something reproducible or similar to this context.

CodePudding user response:

Assuming the indices of df1 are the same of the ones of df2 and that the relation between the rows of the two is 1:1, I would do something like this:

df3 = pd.DataFrame()
  for index, row in df1.iterrows():
  to_find = [row[1], row[2], row[3]]
  to_check = df2.iloc[index][0]
  if [True for i in to_find if i in to_check]:
    df3 = df3.append([1])
  else:
    df3 = df3.append([0])

you create a list for the values in df1 and then ckeck if any of the values appear in the list in the corresponding df2 row, if at least one value matches you append 1 to the resulting df3, otherwise 0. Always assuming your relation is 1:1, now in df3 you have the index corresponding to the user selected and the boolean value to verify the match.

CodePudding user response:

I have yet to find a very elegant solution without relying on for loops, but if we convert the dataframes to np.arrays (or even lists) first, we can get a very decent workaround. First, we manipulate df1 to obtain organized records

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"user": [1123,  95870, 85228,  6636],
                   "0": [1524, 8788, 9899, 27172],
                   "1": [1333, 4476, 78783, 90832],
                   "2": [2021, 2022, 34522, 38479]})
df1 = df1.set_index('user', drop=True)
print(df1)
           0      1      2
user                      
1123    1524   1333   2021
95870   8788   4476   2022
85228   9899  78783  34522
6636   27172  90832  38479

Then, we do the same to df2. The proposed method works even if the two dataframes have a different number of columns.

df2 = pd.DataFrame([[1123, 2021, 8788, 6636],
                    [1333, 2023, 4477, 78783],
                    [1524, 2023, 9899, 27172],
                    [2021, 2023, 345233,38479]]
                    )

df2 = pd.DataFrame(df2.to_numpy().tolist())
df2 = df2.set_axis(df1.index, axis=0)
print(df2)
          0     1       2      3
user                            
1123   1123  1333    8788   6636
95870  1333  2023    4477  78783
85228  1524  2023    9899  27172
6636   2021  2023  345233  38479

Finally, we produce the desired values using a list comprehension that iterates over rows to check whether they have elements in common.

inter = [np.isin(arr[0], arr[1]).any() for arr in zip(df1.to_numpy(), df2.to_numpy())]
print(inter)
[True, False, True, True]

  • Related