Home > Software design >  Pandas compare 2 dataframe columns and make them same length
Pandas compare 2 dataframe columns and make them same length

Time:10-12

I want to compare two pandas list columns and compare their lengths and if one column has greater length than the other, I want to pad it with None value. Is there a pandas function that can do that?

Example, if I have this df below:

df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road'], ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962']]
    }
)

After applying said function, I want my resulting df to look like:

result_df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road', None], ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962', None, None]]
    }
)

Any missing element from either column gets padded with a None value. How can I do this?

Please advise.

CodePudding user response:

One can use pandas.DataFrame.apply with a custom lambda function as follows

df['house_id'] = df.apply(lambda x: x['house_id']   [None] * (len(x['address']) - len(x['house_id'])), axis=1)

[Out]:
                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                      [789 Road]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

Then for the column address one can use a similar logic

df['address'] = df.apply(lambda x: x['address']   [None] * (len(x['house_id']) - len(x['address'])), axis=1)

[Out]:

                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                [789 Road, None]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

Notes:

  • Basically what the lambda function does, for the first case, is to add a None value to the column house_id until the length of the house_id column is equal to the length of the address column.

CodePudding user response:

For improve performance avoid DataFrame.apply, here is solution with list comprehension and custom function, also is possible working with multiple columns defined in list cols:

cols = ['address','house_id']

def f(x):
    max_len = len(max(x, key=len))
    return [y   [None] * (max_len - len(y)) for y in x]


df[cols] = [f(x)  for x in df[cols].to_numpy()]
print (df)

                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                [789 Road, None]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road'], 
                    ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962']],
        "house_id1": [['456'], ['789', '987'], []]
    }
)

cols = ['address','house_id', 'house_id1']

def f(x):
    max_len = len(max(x, key=len))
    return [y   [None] * (max_len - len(y)) for y in x]


df[cols] = [f(x)  for x in df[cols].to_numpy()]
print (df)

                          address           house_id           house_id1
0            [123 Road, 456 Road]         [123, 456]         [456, None]
1                [789 Road, None]         [789, 987]          [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]  [None, None, None]

CodePudding user response:

A somewhat simpler, even though less elegant way is this:

import pandas as pd

df = pd.DataFrame(
    {
        "address": [['123 Road', '456 Road'], ['789 Road'], ['962 Road', '875 Road', '777 Road']],
        "house_id": [['123', '456'], ['789', '987'], ['962']]
    }
)

for index in range(len(df)):
    len_diff=abs(len(df.loc[index,"address"])-len(df.loc[index,"house_id"]))
    if len_diff!=0:
        if len(df.loc[index,"address"])<len(df.loc[index,"house_id"]):
            df.loc[index,"address"]=df.loc[index,"address"] ["None"]*len_diff
        else:
            df.loc[index,"house_id"]=df.loc[index,"house_id"] ["None"]*len_diff
print(df)

result_df
                          address           house_id
0            [123 Road, 456 Road]         [123, 456]
1                [789 Road, None]         [789, 987]
2  [962 Road, 875 Road, 777 Road]  [962, None, None]

CodePudding user response:

What about something like this?

dft = df.T   # transpose
for col in dft.columns:
    dft[col] = dft[col].apply(pd.Series).values.tolist()
result_df = dft.T    # transpose back

Basically, the .apply(pd.Series) makes multiple columns out of one list column, padding with nan's as necessary. Since we want to pad row-wise, the transpose is necessary.

  • Related