I am looking for aspect on ways how to solve issue when I have data frame where I have collected data from multiple sources. For example I can have different filters like G,Z,Y but because I am collecting data from multiple different sources I might have multiple different Y columns for example Y1, Y2, Y3.
Then I will calculate a column that is G-Y.
id | G | Y1 | Y2 | Y3 | Y-G |
---|---|---|---|---|---|
1 | 20 | null | -99 | 0 | NA |
2 | 15 | null | 19 | 0 | 4 |
3 | 17 | 18 | -99 | 19 | 1 |
4 | 19 | 19 | -99 | 0 | 0 |
So first there can be multiple zero values like -99, 0 or NULL that all needs to be addressed. I would run the whole file and turn -99 and NULL values to 0 for example. But how?
Then the tricky part would be prioritize the correct pick of Y value if multiple values are available. For example, I want to use Y1 if I have that available, but If I don't have Y1 I will use Y2 or Y3 to calculate Y-G. (row3).
How would you approach this?
CodePudding user response:
You can perform the datacleaning step with a simple replace:
cleaned = df.replace(['null', -99, 0], nan).drop(columns=['Y-G'])
id G Y1 Y2 Y3
0 1 20 NaN NaN NaN
1 2 15 NaN 19.0 NaN
2 3 17 18.0 NaN 19.0
3 4 19 19.0 NaN NaN
Then you can use a filter & backfill across the columns to get the earliest seen value of Y. Then take that output and subtract column G from it for your calculated column:
cleaned['Y-G'] = cleaned.filter(like='Y').bfill(axis=1).iloc[:, 0] - cleaned['G']
print(cleaned)
id G Y1 Y2 Y3 Y-G
0 1 20 NaN NaN NaN NaN
1 2 15 NaN 19.0 NaN 4.0
2 3 17 18.0 NaN 19.0 1.0
3 4 19 19.0 NaN NaN 0.0