I have a dataframe as shown below. They are ordered Ascendingly by Column A and B. Only Occurrences >= 10 are valid, thus for rows with occurrences with less than 10, I want to replace their values with the next/closest valid row.
Column A | Column B | Occurrences | Value |
---|---|---|---|
Cell 1 | Cell 2 | 1 | 0 |
Cell 1 | Cell 3 | 2 | 0 |
Cell 1 | Cell 4 | 10 | 5 |
Cell 1 | Cell 5 | 1 | 1 |
Cell 1 | Cell 6 | 12 | 4 |
Cell 2 | Cell 1 | 1 | 7 |
Here is what the final dataframe should look like. I would like to do this in Bigquery but if its not possible, python would work as well.
Column A | Column B | Occurrences | Value |
---|---|---|---|
Cell 1 | Cell 2 | 1 | 5 |
Cell 1 | Cell 3 | 2 | 5 |
Cell 1 | Cell 4 | 10 | 5 |
Cell 1 | Cell 5 | 1 | 4 |
Cell 1 | Cell 6 | 12 | 4 |
Cell 2 | Cell 1 | 1 | 4 |
I have the dataframe all set up, but just having trouble figuring out the logic to apply this.
Logic:
- Start from the top and go through each row to check number of occurrences.
- If occurrences <10, look for the next valid row and take that value replace the non-valid row.
- If the last row is non-valid, it should take the value from previous row that is valid.
CodePudding user response:
Something like this should work in Python:
import pandas as pd
import numpy as np
# example dataframe
dict = {'Column A': ['Cell 1', 'Cell 1', 'Cell 1', 'Cell 1', 'Cell 1', 'Cell 2'],
'Column B': ['Cell 2', 'Cell 3', 'Cell 4', 'Cell 5', 'Cell 6', 'Cell 1'],
'Occurrences': [1, 2, 10, 1, 12, 1],
'Value': [0, 0, 5, 1, 4, 7]}
df = pd.DataFrame(dict)
# if Occurrences < 10, set Value to nan
df.loc[df['Occurrences'] < 10, 'Value'] = np.nan
# use interpolate and fillna to set missing (nan) values to nearest non-missing row value
# (first use bfill or "backfill" method, so take the value from the "next" valid row; then apply the ffill or
# "forwardfill" method for any rows at the end that were not filled)
df.loc[:, 'Value'] = df['Value'].fillna(method='bfill').fillna(method='ffill')
Note that if you just want to fill these values with the nearest row value, you can use