Pandas compare 2 dataframe columns and make them same length-CodePudding

I want to compare two pandas list columns and compare their lengths and if one column has greater length than the other, I want to pad it with None value. Is there a pandas function that can do that?

Example, if I have this df below:

df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road'], ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962']]
    }
)

After applying said function, I want my resulting df to look like:

result_df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road', None], ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962', None, None]]
    }
)

Any missing element from either column gets padded with a None value. How can I do this?

Please advise.

CodePudding user response：

One can use pandas.DataFrame.apply with a custom lambda function as follows

df['house_id'] = df.apply(lambda x: x['house_id']   [None] * (len(x['address']) - len(x['house_id'])), axis=1)

[Out]:
                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                      [789 Road]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

Then for the column address one can use a similar logic

df['address'] = df.apply(lambda x: x['address']   [None] * (len(x['house_id']) - len(x['address'])), axis=1)

[Out]:

                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                [789 Road, None]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

Notes:

Basically what the lambda function does, for the first case, is to add a None value to the column house_id until the length of the house_id column is equal to the length of the address column.

CodePudding user response：

For improve performance avoid DataFrame.apply, here is solution with list comprehension and custom function, also is possible working with multiple columns defined in list cols:

cols = ['address','house_id']

def f(x):
    max_len = len(max(x, key=len))
    return [y   [None] * (max_len - len(y)) for y in x]


df[cols] = [f(x)  for x in df[cols].to_numpy()]
print (df)

                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                [789 Road, None]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road'], 
                    ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962']],
        "house_id1": [['456'], ['789', '987'], []]
    }
)

cols = ['address','house_id', 'house_id1']

def f(x):
    max_len = len(max(x, key=len))
    return [y   [None] * (max_len - len(y)) for y in x]


df[cols] = [f(x)  for x in df[cols].to_numpy()]
print (df)

                          address           house_id           house_id1
0            [123 Road, 456 Road]         [123, 456]         [456, None]
1                [789 Road, None]         [789, 987]          [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]  [None, None, None]

CodePudding user response：

A somewhat simpler, even though less elegant way is this:

import pandas as pd

df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road'], ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962']]
    }
)

for index in range(len(df)):
    len_diff=abs(len(df.loc[index,"address"])-len(df.loc[index,"house_id"]))
    if len_diff!=0:
        if len(df.loc[index,"address"])<len(df.loc[index,"house_id"]):
            df.loc[index,"address"]=df.loc[index,"address"] ["None"]*len_diff
        else:
            df.loc[index,"house_id"]=df.loc[index,"house_id"] ["None"]*len_diff
print(df)

result_df
                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                [789 Road, None]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

CodePudding user response：

What about something like this?

dft = df.T   # transpose
for col in dft.columns:
    dft[col] = dft[col].apply(pd.Series).values.tolist()
result_df = dft.T    # transpose back

Basically, the .apply(pd.Series) makes multiple columns out of one list column, padding with nan's as necessary. Since we want to pad row-wise, the transpose is necessary.