Python merge files into one and regroup rows based on previous labels in their own file

Time:03-24

I have multiple files that need to be merged into one big file, with their group IDs renumbered so that they stay unique in the merged file. See the following example:

dataframe 1: df1.txt

name group_id
A    1
B    1
C    1
D    2   

dataframe 2: df2.txt

name group_id
E    1
F    1
G    2
H    3   

dataframe 3: df3.txt

name group_id
I    1
J    2 
K    2
L    3   

The final output is expected to be:

name group_id
A    1
B    1
C    1
D    2   
E    3
F    3
G    4
H    5
I    6
J    7 
K    7
L    8

The group_id column should be relabeled in the final data frame, while still preserving the grouping each row had in its original file.

For the code, I've tried to merge them into one but can't figure out how to label the group id part.

df = pd.concat(map(pd.read_csv, glob.glob('df*.txt')))

Thanks in advance!

Test data code if it can save you some time:

df1 = pd.DataFrame([['A', 1], ['B', 1], ['C', 1], ['D', 2]], columns=('name','group_id'))
df2 = pd.DataFrame([['E', 1], ['F', 1], ['G', 2], ['H', 3]], columns=('name','group_id'))
df3 = pd.DataFrame([['I', 1], ['J', 2], ['K', 2], ['L', 3]], columns=('name','group_id'))

Answer:

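We have to reassign the group numbers after concatenation. This can be done with groupby followed by ngroup, which numbers the groups from 0 to (number of groups - 1); adding 1 makes the labels start at 1. The first grouping key, (df.index == 0).cumsum(), goes up by one every time the index restarts at 0, so it tells us which file each row came from. Grouping on that counter together with group_id keeps equal group_ids from different files apart.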
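As a quick illustration of the numbering, here is a minimal sketch on two tiny made-up frames (`a`, `b`, `file_no` and `labels` are names chosen for this example, not part of the question). It assumes each frame keeps its default 0-based index, which is what makes the `index == 0` test work.

```python
import pandas as pd

# Two small frames standing in for two input files (illustrative data only)
a = pd.DataFrame({'group_id': [1, 1, 2]})
b = pd.DataFrame({'group_id': [1, 2]})

# After concat the index is 0, 1, 2, 0, 1: it restarts at 0 for each frame
df = pd.concat([a, b])

# Counter that increases by one at every index restart, i.e. one value per source frame
file_no = (df.index == 0).cumsum()

# ngroup() numbers each distinct (file_no, group_id) pair starting from 0
labels = df.groupby([file_no, 'group_id']).ngroup()

print(file_no.tolist())
print(labels.tolist())
```

Note that ngroup numbers the groups in sorted key order by default, so within one file the new labels follow the sorted old group_ids rather than their order of appearance.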

import glob
import pandas as pd
df = pd.concat(map(pd.read_csv, sorted(glob.glob('df*.txt'))))
df['group_id'] = df.groupby([(df.index == 0).cumsum(), 'group_id']).ngroup().add(1)

Result

print(df)

  name  group_id
0    A         1
1    B         1
2    C         1
3    D         2
0    E         3
1    F         3
2    G         4
3    H         5
0    I         6
1    J         7
2    K         7
3    L         8
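The approach above relies on every file's index starting at 0. A possible variant that does not depend on that is to let pd.concat tag each row with the position of its source frame through the keys argument. The sketch below uses the test frames from the question; the level names 'source' and 'row' and the variables `frames`, `merged` and `result` are assumptions made for this example.

```python
import pandas as pd

# Test data from the question
df1 = pd.DataFrame([['A', 1], ['B', 1], ['C', 1], ['D', 2]], columns=('name', 'group_id'))
df2 = pd.DataFrame([['E', 1], ['F', 1], ['G', 2], ['H', 3]], columns=('name', 'group_id'))
df3 = pd.DataFrame([['I', 1], ['J', 2], ['K', 2], ['L', 3]], columns=('name', 'group_id'))
frames = [df1, df2, df3]

# keys= adds an outer index level recording which frame each row came from
merged = pd.concat(frames, keys=range(len(frames)), names=['source', 'row'])

# Number each distinct (source, group_id) pair; +1 so labels start at 1
merged['group_id'] = merged.groupby(['source', 'group_id']).ngroup() + 1

# Drop the helper index levels to get a flat 0..n-1 index
result = merged.reset_index(drop=True)
print(result)
```

When reading from disk, the same idea should work by building `frames` from the sorted file list, since glob does not promise any particular file order.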