Jupyter notebooks / Pandas dataframe how to deal with multiple columns with similar values-CodePudding

I am looking for aspect on ways how to solve issue when I have data frame where I have collected data from multiple sources. For example I can have different filters like G,Z,Y but because I am collecting data from multiple different sources I might have multiple different Y columns for example Y1, Y2, Y3.

Then I will calculate a column that is G-Y.

id	G	Y1	Y2	Y3	Y-G
1	20	null	-99	0	NA
2	15	null	19	0	4
3	17	18	-99	19	1
4	19	19	-99	0	0

So first there can be multiple zero values like -99, 0 or NULL that all needs to be addressed. I would run the whole file and turn -99 and NULL values to 0 for example. But how?

Then the tricky part would be prioritize the correct pick of Y value if multiple values are available. For example, I want to use Y1 if I have that available, but If I don't have Y1 I will use Y2 or Y3 to calculate Y-G. (row3).

How would you approach this?

CodePudding user response：

You can perform the datacleaning step with a simple replace:

cleaned = df.replace(['null', -99, 0], nan).drop(columns=['Y-G'])
   id   G    Y1    Y2    Y3
0   1  20   NaN   NaN   NaN
1   2  15   NaN  19.0   NaN
2   3  17  18.0   NaN  19.0
3   4  19  19.0   NaN   NaN

Then you can use a filter & backfill across the columns to get the earliest seen value of Y. Then take that output and subtract column G from it for your calculated column:

cleaned['Y-G'] = cleaned.filter(like='Y').bfill(axis=1).iloc[:, 0] - cleaned['G']

print(cleaned)
   id   G    Y1    Y2    Y3  Y-G
0   1  20   NaN   NaN   NaN  NaN
1   2  15   NaN  19.0   NaN  4.0
2   3  17  18.0   NaN  19.0  1.0
3   4  19  19.0   NaN   NaN  0.0