How to update records to their parent record's values on multiple conditions?-CodePudding

Let's say I have a dataframe like this:

import pandas as pd

df = pd.DataFrame([[1,2,3,"P", 1, "A", "SOMETHING"],
                  [1,2,3,"C", 0, "B", "NOTHING"],
                  [1,2,3,"C", 0, "B", "SOMETHING"],
                  [4,5,6,"P", 1, "A", "SOMETHING"],
                  [4,5,6,"C", 1, "A", "NOTHING"]],
 columns=["ID1", "ID2", "ID3", "FLAG", "CONDITION_1", "CONDITION_2", "DATA_FIELD"])

   ID1  ID2  ID3 FLAG  CONDITION_1 CONDITION_2 DATA_FIELD
0    1    2    3    P            1           A  SOMETHING
1    1    2    3    C            0           B    NOTHING
2    1    2    3    C            0           B  SOMETHING
3    4    5    6    P            1           A  SOMETHING
4    4    5    6    C            1           A    NOTHING

There is a FLAG column which has 2 types of values:

P: Parent
C: Child

When the ID1, ID2, ID3 values are repeated, this means those records are connected. There is always one P and can be any number of C in the FLAG column.

What I want to achieve is to update the all child record's DATA_FIELD value to the parent record's value if the following conditions met: CONDITION_1 == 0 AND CONDITION_2 == "B"

This would give the following result:

   ID1  ID2  ID3 FLAG  CONDITION_1 CONDITION_2 DATA_FIELD
0    1    2    3    P            1           A  SOMETHING
1    1    2    3    C            0           B  SOMETHING
2    1    2    3    C            0           B  SOMETHING
3    4    5    6    P            1           A  SOMETHING
4    4    5    6    C            1           A    NOTHING

What I had in mind is to sort the values in an ascending order by ID1,ID2,ID3 and descending by FLAG. After that I could loop through the dataframe line by line, check if the FLAG is a P and store the 3 IDs and the DATA_FIELD value in a dictionary or something. On the next line I need to check if the keys are the same, then if the conditions are True, then update the DATA_FIELD. Not sure however if this is the best solution:

df.sort_values(["ID1", "ID2", "ID3", "FLAG"],
                                ascending=[True, True, True, False], inplace=True)
df.reset_index(drop=True, inplace=True)
parent_keys = None
for index, row in df.iterrows():
    if row["FLAG"] == "P":
        parent_keys = f"{row['ID1']}{row['ID2']}{row['ID3']}"
        parent_data_field_value = row["DATA_FIELD"]
    if row["FLAG"] == "C":
        if parent_keys:
            child_keys = f"{row['ID1']}{row['ID2']}{row['ID3']}"
            if child_keys == parent_keys:
                if row["CONDITION_1"] == 0 and row["CONDITION_2"] == "B":
                    df.loc[index, "DATA_FIELD"] = parent_data_field_value

CodePudding user response：

You can avoid sorting the dataframe, and create a dictionary of key values for the parent ids and then modify the data field to update children accordingly while respecting the conditions, the code to achieve this would look like this :

import pandas as pd

df = pd.DataFrame([[1,2,3,"P", 1, "A", "SOMETHING"],
                  [1,2,3,"C", 0, "B", "NOTHING"],
                  [1,2,3,"C", 0, "B", "SOMETHING"],
                  [4,5,6,"P", 1, "A", "SOMETHING"],
                  [4,5,6,"C", 1, "A", "NOTHING"]],
 columns=["ID1", "ID2", "ID3", "FLAG", "CONDITION_1", "CONDITION_2", "DATA_FIELD"])

parents_values = {f"{row['ID1']}{row['ID2']}{row['ID3']}" : row["DATA_FIELD"] for index,row in df.loc[df["FLAG"] == "P",["ID1","ID2","ID3","DATA_FIELD"]].iterrows()}

def update_child(row):
    if row["FLAG"] == "C" and row["CONDITION_1"] == 0 and row["CONDITION_2"] == "B":
        return parents_values[f"{row['ID1']}{row['ID2']}{row['ID3']}"] 
    return row["DATA_FIELD"]

df["DATA_FIELD"] = df.apply(lambda x: update_child(x), axis=1)

CodePudding user response：

Rather than using an intermediary dictionary you can use a DataFrame directly by setting your keys as the index.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1,2,3,"P", 1, "A", "TO_PROPAGATE"],
   ...:                    [1,2,3,"C", 0, "B", "NOTHING"],
   ...:                    [1,2,3,"C", 0, "B", "SOMETHING"],
   ...:                    [4,5,6,"P", 1, "A", "SOMETHING"],
   ...:                    [4,5,6,"C", 1, "A", "NOTHING"]],
   ...:          columns=["ID1", "ID2", "ID3",
   ...:                   "FLAG", "CONDITION_1", "CONDITION_2", "DATA_FIELD"])

In [3]: # use ID<n> columns as index
   ...: df = df.set_index(['ID1', 'ID2', 'ID3'])

In [4]: # isolate parent `DATA_FIELD`s for later use as replacements
   ...: s_parent_data_field = df[df['FLAG'] == 'P']['DATA_FIELD']

In [5]: # identify rows meeting condition (eval used for illustration)
   ...: meets_condition = df.eval('FLAG == "C" & CONDITION_1 == 0 & CONDITION_2 == "B"')

In [6]: # replace child DATA_FIELD where condition is met
   ...: df['DATA_FIELD'] = df['DATA_FIELD'].mask(meets_condition, s_parent_data_field)

In [7]: df = df.reset_index()
   ...: df
Out[7]:
   ID1  ID2  ID3 FLAG  CONDITION_1 CONDITION_2    DATA_FIELD
0    1    2    3    P            1           A  TO_PROPAGATE
1    1    2    3    C            0           B  TO_PROPAGATE
2    1    2    3    C            0           B  TO_PROPAGATE
3    4    5    6    P            1           A     SOMETHING
4    4    5    6    C            1           A       NOTHING

Note that this assumes there exists a 'P' record corresponding to each 'C' record. o.w. this will blow up.