I have got 7 dataframes with same column names in python, but row indices are largely different and only few of the row indices match.
df1
col1 col2 col3
a 123 456 786
b 121 454 787
c 111 444 777
df2
col1 col2 col3
d 13 46 86
e 11 44 87
c 11 44 77
df3
col1 col2 col3
d 1231 4567 1786
g 1214 4546 1787
h 1115 4445 1777
c 12 12 10
Answer should be:
newdf
newcol col1 col2 col3 dfcol
c 111 444 777 1
c 11 44 77 2
c 12 12 10 3
d 13 46 86 2
d 1231 4567 1786 3
It is similar to this How to get the common index of two pandas dataframes? but not exactly.
CodePudding user response:
IIUC
This is my beginner way of doing it.
import pandas as pd
Create dataframe
df1 = pd.DataFrame({
'id': ['a', 'b', 'c'],
'col1': [123, 121, 111],
'col2': [456, 454, 444],
'col3': [786, 787, 777],
})
df2 = pd.DataFrame({
'id': ['d', 'e', 'c'],
'col1': [13, 46, 86],
'col2': [11, 44, 87],
'col3': [11, 44, 77],
})
df3 = pd.DataFrame({
'id': ['d', 'g', 'h', 'c'],
'col1': [1231, 1214, 1115, 12],
'col2': [4567, 4546, 4445, 12],
'col3': [1786, 1787, 1777, 10],
})
Create dataframe list as well as giving name to each dataframe
df_collection = [df1, df2, df3]
df1.name = 1
df2.name = 2
df3.name = 3
Create list:
id_col = [[df.loc[i, 'id'], df.loc[i, 'col1'], df.loc[i, 'col2'], df.loc[i, 'col3'], df.name] for df in df_collection for i in range(len(df))]
Construct new dataframe:
df = pd.DataFrame(id_col)
df.columns = ['id', 'col1', 'col2', 'col3', 'df']
Remove rows with only 1 occurence:
new_df = df[df.groupby('id').id.transform('count') > 1].sort_values('id')
This will give us the following:
id col1 col2 col3 df
2 c 111 444 777 1
5 c 86 87 77 2
9 c 12 12 10 3
3 d 13 11 11 2
6 d 1231 4567 1786 3