I am merging two dataframes as shown below, but the merge generates duplicate rows and the result does not keep the ordering of the first dataframe.
import pandas as pd
dict1 = {
"time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
"value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
"time": ["15:09", "15:09", "15:10"],
"counts": ["fg", "mn", "gl"],
"growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
result = pd.merge(df1, df2, on="time", how="left")
This generates a result of 8 rows! I am removing the fractional part (the dot and 3 digits) from the time column in df1 to match the time in df2.
    time  value counts  growth
0  15:09     10     fg     1.0
1  15:09     10     mn     3.0
2  15:09     20     fg     1.0
3  15:09     20     mn     3.0
4  15:10     30     gl     6.0
5  15:11     40    NaN     NaN
6  15:12     50    NaN     NaN
7  15:12     60    NaN     NaN
There are duplicated rows due to the join.
Is it possible to join the dataframes based on the time column in df1, where events are ordered at a finer time granularity? Is there a way to partially match the time column values of two dataframes and merge? The ideal result would look like the following:
        time  value counts  growth
0  15:09.123     10     fg     1.0
1  15:09.234     20     mn     3.0
2  15:10.123     30     gl     6.0
3  15:11.123     40    NaN     NaN
4  15:12.123     50    NaN     NaN
5  15:12.987     60    NaN     NaN
CodePudding user response:
Here is one way to do it.
Assumption: the number of rows for a given time (without the fractional part) is the same in df1 and df2.
# uses df1 and df2 as first built in the question (df1['time'] still has the fractional part)
# create the truncated time (fractional part removed) to match df2
df1['time2'] = df1['time'].str[:-4]
# number the rows within each truncated time in df1: 0, 1, 2, ...
df1['seq'] = df1.groupby('time2')['time2'].cumcount()
# number the rows within each time in df2 the same way
df2['seq'] = df2.groupby('time').cumcount()
# merge on the truncated time and the sequence number, then drop the helper columns
pd.merge(df1,
         df2,
         left_on=['time2', 'seq'],
         right_on=['time', 'seq'],
         how='left',
         suffixes=(None, '_y')).drop(columns=['time2', 'seq'])
        time  value time_y counts  growth
0  15:09.123     10  15:09     fg     1.0
1  15:09.234     20  15:09     mn     3.0
2  15:10.123     30  15:10     gl     6.0
3  15:11.123     40    NaN    NaN     NaN
4  15:12.123     50    NaN    NaN     NaN
5  15:12.987     60    NaN    NaN     NaN
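If the leftover time_y column is unwanted, the same idea can be wrapped in a small function that leaves the inputs untouched. This is only a sketch: merge_by_occurrence, _key and _seq are hypothetical names, not part of pandas, and it assumes the fractional part is always the last 4 characters of df1's time and that rows in both dataframes are in order of occurrence.

```python
import pandas as pd


def merge_by_occurrence(left, right, left_time="time", right_time="time"):
    """Pair the n-th row of each truncated time in `left` with the n-th row
    of the same time in `right`. Hypothetical helper, not a pandas API."""
    left = left.copy()
    right = right.copy()
    # Assumption: the fraction is always 4 characters long, e.g. ".123".
    left["_key"] = left[left_time].str[:-4]
    right["_key"] = right[right_time]
    # Occurrence counter within each key: 0, 1, 2, ...
    left["_seq"] = left.groupby("_key").cumcount()
    right["_seq"] = right.groupby("_key").cumcount()
    merged = left.merge(
        right.drop(columns=[right_time]),  # avoids a duplicated time column
        on=["_key", "_seq"],
        how="left",
    )
    return merged.drop(columns=["_key", "_seq"])


df1 = pd.DataFrame({
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60],
})
df2 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6],
})

result = merge_by_occurrence(df1, df2)
print(result)
```

Because each (key, sequence) pair occurs at most once in df2, the left merge cannot multiply rows, so the result has one row per row of df1 with the full-resolution time kept.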
CodePudding user response:
Merge on column 'time' with preserved order
Assumption: Data from df1 and df2 are in order of occurrence
import pandas as pd
dict1 = {
"time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
"value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
"time": ["15:09", "15:09", "15:11"],
"counts": ["fg", "mn", "gl"],
"growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
df1_keys = df1["time"].unique()
df_list = list()
for key in df1_keys:
    # rows of each dataframe for this time, re-indexed from 0
    tmp_df1 = df1[df1["time"] == key]
    tmp_df1 = tmp_df1.reset_index(drop=True)
    tmp_df2 = df2[df2["time"] == key]
    tmp_df2 = tmp_df2.reset_index(drop=True)
    # pair rows by position within the key
    df_list.append(pd.merge(tmp_df1, tmp_df2, left_index=True, right_index=True, how="left"))
print(pd.concat(df_list, axis=0))
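Pairing rows by position within each time, as the loop above does with how="left", silently drops any df2 rows beyond the number of df1 rows for that time. A quick way to check this before merging is to compare the row counts per key. This is a minimal sketch using this answer's sample data; the names left_rows, right_rows, counts and overflow are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60],
})
df2 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:11"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6],
})

# Rows per truncated time in df1 and per time in df2.
left_rows = df1["time"].str[:-4].value_counts()
right_rows = df2["time"].value_counts()

# Align on the union of keys; a key missing on one side counts as 0 rows.
counts = (
    pd.concat([left_rows, right_rows], axis=1, keys=["df1_rows", "df2_rows"])
    .fillna(0)
    .astype(int)
)

# Keys where df2 has more rows than df1: the extra df2 rows would be lost.
overflow = counts[counts["df2_rows"] > counts["df1_rows"]]
print(counts)
print(overflow)
```

An empty overflow frame means every df2 row finds a partner in df1; it does not verify that the rows are actually in order of occurrence, which remains an assumption.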
CodePudding user response:
The code is doing exactly what it is supposed to do: after truncation, "15:09" appears twice in each dataframe, so the merge returns every combination of those rows.
I think it's better to find a way to add the missing decimals to the times in dict2 instead of removing the fractional part from dict1, because that avoids losing data.
That way, the result will be exactly what you wanted.
import pandas as pd
dict1 = {
"time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
"value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
"time": ["15:09.123", "15:09.234", "15:10.123"],
"counts": ["fg", "mn", "gl"],
"growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
print(df1)
print(df2)
result = pd.merge(df1, df2, on="time", how="left")
print(result)
Output
        time  value
0  15:09.123     10
1  15:09.234     20
2  15:10.123     30
3  15:11.123     40
4  15:12.123     50
5  15:12.987     60
        time counts  growth
0  15:09.123     fg       1
1  15:09.234     mn       3
2  15:10.123     gl       6
        time  value counts  growth
0  15:09.123     10     fg     1.0
1  15:09.234     20     mn     3.0
2  15:10.123     30     gl     6.0
3  15:11.123     40    NaN     NaN
4  15:12.123     50    NaN     NaN
5  15:12.987     60    NaN     NaN
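For comparison, the 8 rows in the question can be predicted from the key counts of the truncated data, and the validate argument of pd.merge can turn that kind of silent row multiplication into an error. The following is a sketch on the question's truncated times; left_n, right_n, expected_rows and rejected are illustrative names only.

```python
import pandas as pd
from pandas.errors import MergeError

# The question's data after df1["time"] has been truncated.
df1 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10", "15:11", "15:12", "15:12"],
    "value": [10, 20, 30, 40, 50, 60],
})
df2 = pd.DataFrame({
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6],
})

# Rows per key on each side. A left merge emits, for every key,
# (rows in df1) * (rows in df2), or just the df1 rows when df2 has none.
left_n = df1["time"].value_counts()
right_n = df2["time"].value_counts().reindex(left_n.index, fill_value=0)
expected_rows = int((left_n * right_n.clip(lower=1)).sum())

result = pd.merge(df1, df2, on="time", how="left")
print(expected_rows, len(result))

# validate="many_to_one" requires the keys in df2 to be unique,
# so the duplicated "15:09" makes pandas raise instead of multiplying rows.
rejected = False
try:
    pd.merge(df1, df2, on="time", how="left", validate="many_to_one")
except MergeError as exc:
    rejected = True
    print("merge rejected:", exc)
```

With unique full-resolution times in df2, as in the code above, the same validate check passes and the merge stays at 6 rows.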