Home > Net >  Merging two datasets based on the 2 columns, or finding the missing values in first dataframe and fi
Merging two datasets based on the 2 columns, or finding the missing values in first dataframe and fi

Time:02-22

I have 2 pandas data-frames, with 2 columns namely index and date. Some of the dates are missing from the first data-frame, and those values can be obtained from the second data-frame corresponding to the index. I tried using pd.concat, pd.merge and pd.join etc but those doesn't seem to give me the results that I want. Here is the table.

df1 = data-frame 1

df2 = data-frame 2

CodePudding user response:

Have you tried df1 = df1.update(df2)?

Although the update funtion will not increase the size of df1, it only updates the missing values or the values that were already there.

CodePudding user response:

Since there was no reproducible dataframe, I tried the code below with running on generated data, but I think it would work fine for your code too:

import pandas as pd
df1 = pd.DataFrame({"date": [None, None, None, "01/01/2022"], "index":[402,402,403,404]})
df2 = pd.DataFrame({"date": ["16/05/2020", "18/07/2021", "13/08/2022", "26/07/2020"], "index":[402,405,403,404]})
df1.set_index("index", inplace=True)
df2.set_index("index", inplace=True)
for index, row in df1.iterrows():
  if row["date"] != row["date"] or row["date"] == None:
    df1.loc[index , "date"] = df2.loc[index]["date"]
df1

Output

index date
402 16/05/2020
402 16/05/2020
403 13/08/2022
404 01/01/2022

Note that row["date"] != row["date"] is used when a value of a cell is nan and has the type of float. nan values are not equal even to themselves!

CodePudding user response:

You can try this solution:

import pandas as pd
import numpy as np

# initialize list of lists
df1 = [[402, '15/05/2020'], [408, np.nan], [408, '14/05/2020']]
df2 = [[402, '16/05/2020'], [408, '10/05/2020'], [409, '13/05/2020']]

# Create the pandas DataFrame
df1 = pd.DataFrame(df1, columns=['index', 'date'])
df2 = pd.DataFrame(df2, columns=['index', 'date'])

df1.set_index("index", inplace=True)
df2.set_index("index", inplace=True)
for index, row in df1.iterrows():
    if row["date"] != row["date"]:
        row["date"] = df2.loc[index]["date"]

Output:

index            
402    15/05/2020
408    10/05/2020
408    14/05/2020

With this solution only the rows whose date is nan or null are updated with the corresponding value on the other dataframe.

  • Related