How to merge two dataframes without generating extra rows in the result?

I am merging two dataframes as shown below, but the result contains duplicate rows and no longer keeps the finer-grained times of the first dataframe.

import pandas as pd

dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}

dict2 = {
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

df1["time"] = df1["time"].str[:-4]

result = pd.merge(df1, df2, on="time", how="left")

This produces 8 rows instead of the 6 I expect. I strip the last 4 characters (the dot and the three digits) from the time column in df1 so that it matches the time format in df2.

    time  value counts  growth
0  15:09     10     fg     1.0
1  15:09     10     mn     3.0
2  15:09     20     fg     1.0
3  15:09     20     mn     3.0
4  15:10     30     gl     6.0
5  15:11     40    NaN     NaN
6  15:12     50    NaN     NaN
7  15:12     60    NaN     NaN

The join produces duplicated rows.
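
The extra rows come from the repeated key: after truncation, `15:09` occurs twice in both frames, so the left join pairs each of the two df1 rows with each of the two df2 rows (2 x 2 = 4 rows for that key, 8 in total). As a minimal sketch, the `validate` argument of `pd.merge` can make pandas raise an error instead of silently multiplying rows. The frames below repeat the question's data with the times already truncated; the names `dup_keys` and `merge_rejected` are only illustrative.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10", "15:11", "15:12", "15:12"],
    "value": [10, 20, 30, 40, 50, 60],
})
df2 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6],
})

# Count how many rows share each key on the right-hand side.
dup_keys = df2["time"].value_counts()
print(dup_keys[dup_keys > 1])

# validate="many_to_one" requires the right-hand keys to be unique,
# so pandas raises MergeError here instead of returning the 8-row result.
try:
    pd.merge(df1, df2, on="time", how="left", validate="many_to_one")
    merge_rejected = False
except MergeError as err:
    merge_rejected = True
    print("merge rejected:", err)
```

This only detects the problem. Resolving it still needs a second key (such as the order of occurrence) to tell the repeated rows apart.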

Is it possible to join the dataframes while keeping the time column of df1, where events are ordered with finer time granularity? Is there a way to partially match the time values of the two dataframes and merge on that? The ideal result would look like the following:

        time  value counts  growth
0  15:09.123     10     fg     1.0
1  15:09.234     20     mn     3.0
2  15:10.123     30     gl     6.0
3  15:11.123     40    NaN     NaN
4  15:12.123     50    NaN     NaN
5  15:12.987     60    NaN     NaN

CodePudding user response：

Here is one way to do it.

Assumption: for each time with the fractional part removed, df1 and df2 list their rows in the same order, so the n-th row for that time in df1 pairs with the n-th row in df2.

# create a copy of the time without the fractional part (".123")
df1['time2'] = df1['time'].str[:-4]

# number the rows within each truncated time in df1: 0, 1, 2, ...
df1['seq'] = df1.groupby('time2')['time2'].cumcount()

# number the rows within each time in df2 the same way
df2['seq'] = df2.groupby('time').cumcount()

# merge on the truncated time plus the sequence number,
# then drop the helper columns
pd.merge(df1,
         df2,
         left_on=['time2', 'seq'],
         right_on=['time', 'seq'],
         how='left',
         suffixes=(None, '_y')).drop(columns=['time2', 'seq'])

        time  value time_y counts  growth
0  15:09.123     10  15:09     fg     1.0
1  15:09.234     20  15:09     mn     3.0
2  15:10.123     30  15:10     gl     6.0
3  15:11.123     40    NaN    NaN     NaN
4  15:12.123     50    NaN    NaN     NaN
5  15:12.987     60    NaN    NaN     NaN
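
The steps above can be packaged into a small helper that also drops df2's truncated time column, so the output has the layout asked for in the question. This is only a sketch under the same assumption (rows pair up by order of occurrence within each truncated time). The function name `merge_by_occurrence`, its `strip` parameter, and the `_key` / `_seq` helper columns are made up for illustration.

```python
import pandas as pd

def merge_by_occurrence(left, right, key="time", strip=4):
    """Left-join `right` onto `left`, pairing the n-th row of each
    truncated key in `left` with the n-th row of that key in `right`."""
    left = left.copy()
    right = right.copy()
    # Truncated key, e.g. "15:09.123" -> "15:09" when strip=4.
    left["_key"] = left[key].str[:-strip]
    # Position of each row within its key: 0, 1, 2, ...
    left["_seq"] = left.groupby("_key").cumcount()
    right["_seq"] = right.groupby(key).cumcount()
    # Rename the right-hand key so no duplicate "time" column survives.
    right = right.rename(columns={key: "_key"})
    merged = left.merge(right, on=["_key", "_seq"], how="left")
    return merged.drop(columns=["_key", "_seq"])

df1 = pd.DataFrame({
    "time": ["15:09.123", "15:09.234", "15:10.123",
             "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60],
})
df2 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6],
})

result = merge_by_occurrence(df1, df2)
print(result)
```

Because this is a left join, any df2 row whose position within its time exceeds the number of df1 rows for that time is silently dropped, so it is worth checking the group sizes if that can happen in real data.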

CodePudding user response：

Merge on the 'time' column while preserving the order of df1.
Assumption: the rows of df1 and df2 are in order of occurrence.

import pandas as pd

dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}

dict2 = {
    "time": ["15:09", "15:09", "15:11"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

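# Keep the full timestamp and store the truncated value in a separate key column.
df1["key"] = df1["time"].str[:-4]

df_list = []
for key in df1["key"].unique():
    # Rows of each frame that share this truncated time, renumbered from 0.
    tmp_df1 = df1[df1["key"] == key].reset_index(drop=True)
    tmp_df2 = df2[df2["time"] == key].drop(columns="time").reset_index(drop=True)
    # Pair the rows by their position within the group.
    df_list.append(pd.merge(tmp_df1, tmp_df2, left_index=True, right_index=True, how="left"))

print(pd.concat(df_list, ignore_index=True).drop(columns="key"))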

CodePudding user response：

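The code is doing exactly what it is supposed to do: a left join pairs every row of df1 with every row of df2 that has the same key, and after truncation the key 15:09 is repeated in both.

I think it would be better to find a way to add the missing decimals to the times in dict2 instead of stripping digits from dict1, because truncating throws away the information that tells the rows apart.

With full timestamps on both sides, the result is exactly what you wanted.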

import pandas as pd

dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}

dict2 = {
    "time": ["15:09.123", "15:09.234", "15:10.123"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)


print(df1)
print(df2)


result = pd.merge(df1, df2, on="time", how="left")

print(result)

Output

        time  value
0  15:09.123     10
1  15:09.234     20
2  15:10.123     30
3  15:11.123     40
4  15:12.123     50
5  15:12.987     60
        time counts  growth
0  15:09.123     fg       1
1  15:09.234     mn       3
2  15:10.123     gl       6
        time  value counts  growth
0  15:09.123     10     fg     1.0
1  15:09.234     20     mn     3.0
2  15:10.123     30     gl     6.0
3  15:11.123     40    NaN     NaN
4  15:12.123     50    NaN     NaN
5  15:12.987     60    NaN     NaN
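
If df2 only ever arrives with truncated times, one possible way to recover the missing decimals is to borrow the full timestamps from df1 by order of occurrence. This is a sketch, not something the question's data source guarantees: it assumes the n-th `15:09` row in df2 belongs to the n-th `15:09.xxx` row in df1. The names `lookup`, `seq`, `full_time` and `df2_full` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "time": ["15:09.123", "15:09.234", "15:10.123",
             "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60],
})
df2 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6],
})

# Lookup table: truncated time + position within that time -> full timestamp.
keys = df1["time"].str[:-4]
lookup = pd.DataFrame({
    "time": keys,
    "seq": df1.groupby(keys).cumcount(),
    "full_time": df1["time"],
})

# Give df2 the same position counter, then swap in the full timestamps.
df2["seq"] = df2.groupby("time").cumcount()
df2_full = (
    df2.merge(lookup, on=["time", "seq"], how="left")
       .drop(columns=["time", "seq"])
       .rename(columns={"full_time": "time"})
)

# With full timestamps on both sides, a plain left merge is one-to-one.
result = pd.merge(df1, df2_full, on="time", how="left")
print(result)
```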