Combining Pandas DataFrame Columns with Alternate Spelling-CodePudding

I have a DataFrame that was imported from a json file. Part of the data in the json file includes alternate spellings for some string/categorical properties resulting in columns with similar names where values are populated in one, the other, or neither. I want to be able to combine the columns with alternate spellings into a single column. Ideally I would go to the source, the json file, and fix it. In this situation I am not able to change the source, I can only use it.

Ideally this doesn't happen with looping because for the real DataFrame there is a lot of data and that would be too slow.

Is condition

df_is = pd.DataFrame({"C1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "C2": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
        "C3": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
        "C4": ["S", np.NaN, "F", np.NaN, np.NaN, "S", "F", np.NaN, np.NaN, np.NaN],
        "c4": [np.NaN, "S", np.NaN, np.NaN, "S", np.NaN, np.NaN, np.NaN, np.NaN, "F"]})
display(df_is)

    C1  C2  C3  C4  c4
0   1   a   A   S   NaN
1   2   b   B   NaN S
2   3   c   C   F   NaN
3   4   d   D   NaN NaN
4   5   e   E   NaN S
5   6   f   F   S   NaN
6   7   g   G   F   NaN
7   8   h   H   NaN NaN
8   9   i   I   NaN NaN
9   10  j   J   NaN F

Should be condition

    C1  C2  C3  C4
0   1   a   A   S
1   2   b   B   S
2   3   c   C   F
3   4   d   D   NaN
4   5   e   E   S
5   6   f   F   S
6   7   g   G   F
7   8   h   H   NaN
8   9   i   I   NaN
9   10  j   J   F

CodePudding user response：

Here is a solution. In a real solution I would probably overwrite column with the right spelling and then drop the other, but for this version I made a new column to show the old and combined columns side by side.

Note that the fill_value setting is important in dealing with NaN in the original columns when there are string values because trying to add a float like np.NaN to a string will result in a TypeError.

Code

df_solution["C4_Fixed"] = df_solution["C4"].add(df_solution["c4"], fill_value="")
display(df_solution)

Output

    C1  C2  C3  C4  c4  C4_Fixed
0   1   a   A   S   NaN S
1   2   b   B   NaN S   S
2   3   c   C   F   NaN F
3   4   d   D   NaN NaN NaN
4   5   e   E   NaN S   S
5   6   f   F   S   NaN S
6   7   g   G   F   NaN F
7   8   h   H   NaN NaN NaN
8   9   i   I   NaN NaN NaN
9   10  j   J   NaN F   F

Code with Column Replacement

df_solution["C4"] = df_solution["C4"].add(df_solution["c4"], fill_value="")
df_solution.drop(["c4"], axis=1, inplace= True)
display(df_solution)

Output with Column Replacement

    C1  C2  C3  C4
0   1   a   A   S
1   2   b   B   S
2   3   c   C   F
3   4   d   D   NaN
4   5   e   E   S
5   6   f   F   S
6   7   g   G   F
7   8   h   H   NaN
8   9   i   I   NaN
9   10  j   J   F

CodePudding user response：

I'd do:

df_is.loc[df_is.C4.isna(), "C4"] = df_is.c4

Result:

   C1 C2 C3   C4   c4
0   1  a  A    S  NaN
1   2  b  B    S    S
2   3  c  C    F  NaN
3   4  d  D  NaN  NaN
4   5  e  E    S    S
5   6  f  F    S  NaN
6   7  g  G    F  NaN
7   8  h  H  NaN  NaN
8   9  i  I  NaN  NaN
9  10  j  J    F    F