Home > Blockchain >  How to merge my dataframe to compensate for missing data without adding new columns?
How to merge my dataframe to compensate for missing data without adding new columns?

Time:04-01

I am trying to manipulate excel sheet data to automate a process on excel(not a developer), i do have 2 dataframes:

one looking like the below(only difference is more columns)

Date                           Val1    Val2
    0   2020-09-29 13:22:57       5.34      3.2
    1   2020-09-29 13:23:12        4.5       Nan
    2   2020-09-29 13:23:44        Nan      56.4
    3   2020-09-29 13:24:01        24        0.3

as we notice for the one above the index are in order, and all the fields have dates populated but do not necessarily have all the other columns populated.

the second dataframe have the following characteristics, equal or more rows without any additional dates, also no duplicate dates but the additional rows are empty ones(NaT for Date and Nan for all other columns), the index of df2 are also not in order due to other processes:

 Date                          Val1    Val2
0   2020-09-29 13:22:57         Nan      Nan
5   Nat                         Nan      Nan
1   2020-09-29 13:23:12         4.5       Nan
4    NaT                       Nan       Nan
6    Nat                       Nan       Nan
2   2020-09-29 13:23:44        Nan       Nan
3   2020-09-29 13:24:01        24        0.3

what i basically need is to check for the matching dates, and if a date in df2 matching a date in df1 to populate the same exact value for the entire row of that date in df2 as in df without changing the location of the empty rows in df2 or adding columns:

expected output:

 Date                             Val1         Val2
    0   2020-09-29 13:22:57       5.34        3.2
    5   Nat                         Nan      Nan
    1   2020-09-29 13:23:12         4.5       Nan
    4    NaT                       Nan       Nan
    6    Nat                       Nan       Nan
    2   2020-09-29 13:23:44        Nan       56.4
    3   2020-09-29 13:24:01        24        0.3

I've tried multiple approaches including:

data_frames = [df,df_2]

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Date'],
                                                   how='outer'), data_frames)
print(df_merged)

and also:

 df_f = pd.merge(df, df_2, on='Date', how='outer').fillna(method='ffill')

also tried changing the how to inner, left, right and so on but did not get the result i want i'd just get combined columns.

Edit:

df1 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', '2020-09-29 13:23:12', '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
                    'Val1': [5.34, 4.5, np.nan, 24],
                    'Val2': [3.2, np.nan, 56.4, 0.3]})

df2 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', np.nan,  '2020-09-29 13:23:12', np.nan, np.nan, '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
                    'Val1': [5.34, np.nan, 4.5, np.nan, np.nan, np.nan, 24],
                    'Val2': [3.2, np.nan, np.nan, np.nan, np.nan, 56.4, 0.3]},
                   index=[0,5,1,4,6,2,3])

f_f1 = df1.merge(df2["Date"], on="Date", how="right").set_index(df2.index)
print(f_f1)

CodePudding user response:

IIUC, try:

#convert to datetime if needed
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Date"] = pd.to_datetime(df2["Date"])

f_f1 = df1.merge(df2["Date"], on="Date", how="right").set_index(df2.index)

>>> df_f
                  Date   Val1  Val2
0  2020-09-29 13:22:57   5.34   3.2
5                  NaN    NaN   NaN
1  2020-09-29 13:23:12   4.50   NaN
4                  NaN    NaN   NaN
6                  NaN    NaN   NaN
2  2020-09-29 13:23:44    NaN  56.4
3  2020-09-29 13:24:01  24.00   0.3
Inputs:
df1 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', '2020-09-29 13:23:12', '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
                    'Val1': [5.34, 4.5, np.nan, 24],
                    'Val2': [3.2, np.nan, 56.4, 0.3]})

df2 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', np.nan,  '2020-09-29 13:23:12', np.nan, np.nan, '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
                    'Val1': [5.34, np.nan, 4.5, np.nan, np.nan, np.nan, 24],
                    'Val2': [3.2, np.nan, np.nan, np.nan, np.nan, 56.4, 0.3]},
                   index=[0,5,1,4,6,2,3])
  • Related