ValueError: Columns must be same length as key - when working on one row only-CodePudding

I am trying to add columns to the DataFrame based on the splitting of one column. With two rows, everything works and the empty column takes the value 'None'. The problem when I only have one row and the DataFrame cannot be expanded, and I would expect it to also be assigned the value 'None'.

Working example:

>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname_user','dbname']})
>>> df
          auth
0  dbname_user
1       dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
>>> df
          auth      db login
0  dbname_user  dbname  user
1       dbname  dbname  None   <--- as expected, 'None' value is assigned

Problematic example:

>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname']})
>>> df
     auth
0  dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

I would expect the same as in the working example, where the value for the second column is 'None'. Unfortunately I cannot dynamically expand the number of columns using list comprehension. The number of columns must be fixed.

CodePudding user response：

This should do the trick:

import pandas as pd
import numpy as np


df_example = pd.DataFrame({'auth': ['dbname']})
df_example[["db", "login"]] = (
    # This np.where adds a "_" to each string that don't have "_".
    pd.Series(
        np.where(
            df_example["auth"].str.contains("_"),
            df_example["auth"],
            df_example["auth"]   "_"
        )
    )
    # Normal `.str.split` like you were doing.
    .str.split("_", n=1, expand=True)
    # `.applymap` converts empty strings ("") into None.
    .applymap(lambda value: None if value == "" else value)
)
print(df_example)
# Prints:
#      auth      db login
# 0  dbname  dbname  None

Explanation

The error you're getting (ValueError: Columns must be same length as key), happens because, after splitting the column "auth", you end up with only one value of length = 1. expand=True won't help you here, because all the values from split have length = 1. Your first example works, because when pandas split the first value dbname_user, it had length = 2, therefore the remaining values got expanded to this same length. In other words, expand=True makes all the returning values have the same length as the value with the greatest length:


# Series with first value containing no "_",
# second value containing one "_",
# and third value containing two "_".
example_2 = pd.Series(['dbname', 'dbname_user', 'dbname_user_2'])

# Applying `.str.split`, without setting parameter `n`
split = example_2.str.split("_", expand=True)
# Get the maximum length of the split (max_len = 3)
max_len = example_2.str.split("_").str.len().max()

print('Max Length:', max_len)
print('Number of columns:', split.shape[1])
print('max_len == split.shape[1]:', max_len == split.shape[1])
print('\n', split)
# Prints:
# Max Length: 3
# Number of columns: 3
# max_len == split.shape[1]: True
# 
#         0     1     2
# 0  dbname  None  None
# 1  dbname  user  None
# 2  dbname  user     2

CodePudding user response：

You can try this piece of code which should work :

def underscore_split(auth):
    if "_" in auth:
        return auth.split("_", 1)
    else:
        return [auth, None]


df[['db', 'login']] = [underscore_split(x) for x in df["auth"]]