Home > database >  Joining lists with pandas
Joining lists with pandas

Time:11-17

I have a large messy SQL database which I have started cleaning up and I need to check if user IDs exist in the tables, preferably in one view. I have started with left joins, listing all user IDs but the performance is really poor with 5 tables in my case.

So I have decided to use python and get the IDs in lists and join them together in a pandas dataframe. My only problem is that I cannot really figure out how to do left joins with python, the same way I would do with SQL.

I have 10 lists with user IDs and ideally, I need a dataframe where the index is all user IDs and the columns represent the tables indicating whether the user ID can be found (YES / NO).

Here is what I managed to do:

users = pd.DataFrame(users)
table1 = pd.DataFrame(table1)

users = users.merge(table, how='left', indicator='table1')

Then I would need to loop through all the lists and merge them with users, rename the indicator texts like left_only = 'No' both = 'Yes' and reindex() the final table.

Any suggestions on how to do this in a more elegant way?

CodePudding user response:

Okay so I figured it out. The dataset I wanted to have would have looked like this:

users   table1   table2   table3
-----   ------   ------   ------
1       True     None     None
2       True     True     None
3       None     None     True
4       None     None     None
5       True     True     True

There is no need to use pandas, it's better to use plain python to merge the columns.

users = [1, 2, 3, 4, 5]
tables = [table1, table2, table3]

data = {
   'users': users
   }

for table in tables:
   column = []

   for user in users:
      if user in table:
          column.append(True)
      else:
          column.append(None)

   data[table] = column

And then to put it into a dataframe

df = pd.DataFrame.from_dict(data)
df.sort_values('users').reset_index(drop=True)
  • Related