I have multiple files that need to be merged into one big file, with their group IDs renumbered so that they stay unique in the merged file. See the following example:
dataframe 1: df1.txt
name group_id
A 1
B 1
C 1
D 2
dataframe 2: df2.txt
name group_id
E 1
F 1
G 2
H 3
dataframe 3: df3.txt
name group_id
I 1
J 2
K 2
L 3
The final output is expected to be:
name group_id
A 1
B 1
C 1
D 2
E 3
F 3
G 4
H 5
I 6
J 7
K 7
L 8
The group_id column should be relabeled in the final data frame so that IDs are unique across files, while rows that shared a group_id within one file still share one.
For the code, I've tried to merge them into one but can't figure out how to label the group id part.
df = pd.concat(map(pd.read_csv, glob.glob('df*.txt')))
Thanks in advance!
Test data code if it can save you some time:
df1 = pd.DataFrame([['A', 1], ['B', 1], ['C', 1], ['D', 2]], columns=('name','group_id'))
df2 = pd.DataFrame([['E', 1], ['F', 1], ['G', 2], ['H', 3]], columns=('name','group_id'))
df3 = pd.DataFrame([['I', 1], ['J', 2], ['K', 2], ['L', 3]], columns=('name','group_id'))
CodePudding user response:
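We have to reassign the group numbers after concatenation. This can be done with groupby and ngroup, which numbers the groups from 0 to (number of groups - 1). Grouping by (df.index == 0).cumsum() together with group_id keeps groups from different files separate, because each file's index restarts at 0 and the running count therefore increases by one at the start of every file. Adding 1 makes the new IDs start at 1.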
import glob
import pandas as pd
df = pd.concat(map(pd.read_csv, sorted(glob.glob('df*.txt'))))
df['group_id'] = df.groupby([(df.index == 0).cumsum(), 'group_id']).ngroup().add(1)
Result
print(df)
name group_id
0 A 1
1 B 1
2 C 1
3 D 2
0 E 3
1 F 3
2 G 4
3 H 5
0 I 6
1 J 7
2 K 7
3 L 8
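For anyone who wants to check the idea without the files, here is a minimal sketch built on the test frames from the question. It is a variation rather than the exact code above: it assumes the frames are passed to pd.concat with keys=, so the source frame is recorded in an index level (named 'file' here, a name chosen only for illustration) instead of being inferred from the index restarting at 0.

```python
import pandas as pd

# Test frames copied from the question.
df1 = pd.DataFrame([['A', 1], ['B', 1], ['C', 1], ['D', 2]], columns=['name', 'group_id'])
df2 = pd.DataFrame([['E', 1], ['F', 1], ['G', 2], ['H', 3]], columns=['name', 'group_id'])
df3 = pd.DataFrame([['I', 1], ['J', 2], ['K', 2], ['L', 3]], columns=['name', 'group_id'])

# keys= tags every row with the position of its source frame; the level
# names 'file' and 'row' are illustrative, not part of the original answer.
df = pd.concat([df1, df2, df3], keys=[0, 1, 2], names=['file', 'row'])

# Group by (source frame, original group_id). ngroup() numbers the groups
# from 0 in sorted key order, so add 1 to get 1-based ids.
df['group_id'] = df.groupby(['file', 'group_id']).ngroup() + 1

# Drop the helper index levels to get a flat frame.
result = df.reset_index(drop=True)
print(result)
```

This variant may be useful when some of the input frames do not carry a default 0-based index, since the index-based marker in the answer relies on exactly one row per file having index 0.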