Concatenating Two Columns in Pandas Using the Index-CodePudding

I have three Pandas data frames consisting of the id of patients being used as their index, and time series values of different patients in order of their time of measurement. All of these patients have the same number of measurements (there are two values for each patient). I want to create a third data frame which just concatenates these data frames. Catch: Not all patients are represented in all data frames. The final data frame should contain only the patients represented in ALL three data frames. An example for the data frames (please note there's three in total):

id	value1
1	80
1	78
2	76
2	79

id	value2
2	65
2	67
3	74
3	65

What I'm trying to create:

id	value1	value2
2	76	65
2	79	67

I tried:

data = pd.merge(A, B, on="stay_id")

But the result is:

id	value1	value2
2	76	65
2	76	67
2	79	65
2	79	67

So the first value gets repeated along the axis. I also tried:

complete = A.copy()
complete["B" = B["value2"]

Does this ensure the values being matched for the id?

CodePudding user response：

I think what you are looking for is to have all entries in a single axis but only for common values for column "stay_id". In that case first find the common indexes and the use concat after filtering the data frames by the common values found for the column "stay_id"

df1 =pd.DataFrame({"stay_id":[2,2,3,3], "value":[65,67,74,65]})
df2 =pd.DataFrame({"stay_id":[1,1,2,2],"value":[80,78,76,79]})
frames = [df1, df2]

common_ids = list(set.intersection(*(set(df["stay_id"]) for df in frames)))
frames_filer = [df[df["stay_id"].isin(common_ids)] for df in frames]

out = pd.concat(frames_filer)
print(out)

Output:


    stay_id     value
0      2        65
1      2        67
2      2        76
3      2        79

CodePudding user response：

If I understand correctly, first start by making the dataframes have the same columns names by using pandas.DataFrame.set_axis and then, concatenate those dataframes with the help of pandas.concat. Finally, use a boolean mask to keep only the rows with an id figuring in all the dataframes.

Considering there is a third dataframe (called dfC), you can try the code below :

id	value3
2	72
2	83
4	78
4	76

list_df = [dfA, dfB, dfC]
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df], ignore_index=True)
out = out[out.id.isin(list(set.intersection(*(set(df["id"]) for df in list_df))))]

`>>> print(out)`

   id  value
2   2     76
3   2     79
4   2     65
5   2     67
8   2     72
9   2     83