I am a Python noob. I have an unstructured text file that I'm trying to capture to a dataframe and export to excel. I need to merge 38 to 36, 45 to 43, and 79 to 78 filling in the empty space with the data on the merging column.
Dummy Dataset
0 | 5 | 36 | 38 | 43 | 45 | 78 | 79 | |
---|---|---|---|---|---|---|---|---|
1 | A | 01JUN2022 | 1.2 | B | 1.2 | |||
2 | C | 01JUN2022 | 1.4 | D | 1.4 | |||
3 | E | 01JUN2022 | 1.5 | F | 1.6 | |||
4 | G | 01JUN2022 | 1.7 | H | 1.7 | |||
5 | I | 01JUN2022 | 1.4 | J | 1.8 | |||
6 | K | 01JUN2022 | 1.7 | L | 1.3 | |||
1 | A | 01JUN2022 | 1.2 | B | 1.2 | |||
2 | C | 01JUN2022 | 1.4 | D | 1.4 | |||
3 | E | 01JUN2022 | 1.5 | F | 1.6 | |||
4 | G | 01JUN2022 | 1.7 | H | 1.7 | |||
5 | I | 01JUN2022 | 1.4 | J | 1.8 | |||
6 | K | 01JUN2022 | 1.7 | L | 1.3 |
Required output
0 | 5 | 36 | 43 | 79 | |
---|---|---|---|---|---|
1 | A | 01JUN2022 | 1.2 | B | 1.2 |
2 | C | 01JUN2022 | 1.4 | D | 1.4 |
3 | E | 01JUN2022 | 1.5 | F | 1.6 |
4 | G | 01JUN2022 | 1.7 | H | 1.7 |
5 | I | 01JUN2022 | 1.4 | J | 1.8 |
6 | K | 01JUN2022 | 1.7 | L | 1.3 |
1 | A | 01JUN2022 | 1.2 | B | 1.2 |
2 | C | 01JUN2022 | 1.4 | D | 1.4 |
3 | E | 01JUN2022 | 1.5 | F | 1.6 |
4 | G | 01JUN2022 | 1.7 | H | 1.7 |
5 | I | 01JUN2022 | 1.4 | J | 1.8 |
6 | K | 01JUN2022 | 1.7 | L | 1.3 |
CodePudding user response:
Would start by converting ''
to NaN
as follows
df = df.replace(r'^\s*$', np.nan, regex=True)
Then one can use pandas.Series.combine_first
df['36'] = df['36'].combine_first(df['38'])
df['43'] = df['43'].combine_first(df['45'])
df['79'] = df['79'].combine_first(df['78'])
[Out]:
id 0 5 36 38 43 45 78 79
0 1 A 01JUN2022 1.2 1.2 B B NaN 1.2
1 2 C 01JUN2022 1.4 1.4 D D NaN 1.4
2 3 E 01JUN2022 1.5 NaN F NaN 1.6 1.6
3 4 G 01JUN2022 1.7 NaN H NaN 1.7 1.7
4 5 I 01JUN2022 1.4 NaN J NaN 1.8 1.8
5 6 K 01JUN2022 1.7 NaN L NaN 1.3 1.3
6 1 A 01JUN2022 1.2 1.2 B B NaN 1.2
7 2 C 01JUN2022 1.4 1.4 D NaN 1.4 1.4
8 3 E 01JUN2022 1.5 1.5 F NaN 1.6 1.6
9 4 G 01JUN2022 1.7 NaN H NaN 1.7 1.7
10 5 I 01JUN2022 1.4 NaN J J NaN 1.8
11 6 K 01JUN2022 1.7 NaN L NaN NaN 1.3
Finally, one can drop the columns that one doesn't want or select the one's to display as follows
df = df[['0', '5', '36', '43', '79']]
[Out]:
0 5 36 43 79
0 A 01JUN2022 1.2 B 1.2
1 C 01JUN2022 1.4 D 1.4
2 E 01JUN2022 1.5 F 1.6
3 G 01JUN2022 1.7 H 1.7
4 I 01JUN2022 1.4 J 1.8
5 K 01JUN2022 1.7 L 1.3
6 A 01JUN2022 1.2 B 1.2
7 C 01JUN2022 1.4 D 1.4
8 E 01JUN2022 1.5 F 1.6
9 G 01JUN2022 1.7 H 1.7
10 I 01JUN2022 1.4 J 1.8
11 K 01JUN2022 1.7 L 1.3
and this gives the desired output.
Notes:
There are additional ways to remove
NaN
and combine the columns. For that, see: How to remove nan value while combining two column in Panda Data frame?There are also additional ways to delete columns in Python. If one wants to learn more about it, see: Delete a column from a Pandas DataFrame