I have a DataFrame that was imported from a json file. Part of the data in the json file includes alternate spellings for some string/categorical properties resulting in columns with similar names where values are populated in one, the other, or neither. I want to be able to combine the columns with alternate spellings into a single column. Ideally I would go to the source, the json file, and fix it. In this situation I am not able to change the source, I can only use it.
Ideally this doesn't happen with looping because for the real DataFrame there is a lot of data and that would be too slow.
Is condition
df_is = pd.DataFrame({"C1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"C2": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
"C3": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
"C4": ["S", np.NaN, "F", np.NaN, np.NaN, "S", "F", np.NaN, np.NaN, np.NaN],
"c4": [np.NaN, "S", np.NaN, np.NaN, "S", np.NaN, np.NaN, np.NaN, np.NaN, "F"]})
display(df_is)
C1 C2 C3 C4 c4
0 1 a A S NaN
1 2 b B NaN S
2 3 c C F NaN
3 4 d D NaN NaN
4 5 e E NaN S
5 6 f F S NaN
6 7 g G F NaN
7 8 h H NaN NaN
8 9 i I NaN NaN
9 10 j J NaN F
Should be condition
C1 C2 C3 C4
0 1 a A S
1 2 b B S
2 3 c C F
3 4 d D NaN
4 5 e E S
5 6 f F S
6 7 g G F
7 8 h H NaN
8 9 i I NaN
9 10 j J F
CodePudding user response:
Here is a solution. In a real solution I would probably overwrite column with the right spelling and then drop the other, but for this version I made a new column to show the old and combined columns side by side.
Note that the fill_value setting is important in dealing with NaN in the original columns when there are string values because trying to add a float like np.NaN to a string will result in a TypeError.
Code
df_solution["C4_Fixed"] = df_solution["C4"].add(df_solution["c4"], fill_value="")
display(df_solution)
Output
C1 C2 C3 C4 c4 C4_Fixed
0 1 a A S NaN S
1 2 b B NaN S S
2 3 c C F NaN F
3 4 d D NaN NaN NaN
4 5 e E NaN S S
5 6 f F S NaN S
6 7 g G F NaN F
7 8 h H NaN NaN NaN
8 9 i I NaN NaN NaN
9 10 j J NaN F F
Code with Column Replacement
df_solution["C4"] = df_solution["C4"].add(df_solution["c4"], fill_value="")
df_solution.drop(["c4"], axis=1, inplace= True)
display(df_solution)
Output with Column Replacement
C1 C2 C3 C4
0 1 a A S
1 2 b B S
2 3 c C F
3 4 d D NaN
4 5 e E S
5 6 f F S
6 7 g G F
7 8 h H NaN
8 9 i I NaN
9 10 j J F
CodePudding user response:
I'd do:
df_is.loc[df_is.C4.isna(), "C4"] = df_is.c4
Result:
C1 C2 C3 C4 c4
0 1 a A S NaN
1 2 b B S S
2 3 c C F NaN
3 4 d D NaN NaN
4 5 e E S S
5 6 f F S NaN
6 7 g G F NaN
7 8 h H NaN NaN
8 9 i I NaN NaN
9 10 j J F F