I am trying to add columns to the DataFrame based on the splitting of one column. With two rows, everything works and the empty column takes the value 'None'. The problem when I only have one row and the DataFrame cannot be expanded, and I would expect it to also be assigned the value 'None'.
Working example:
>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname_user','dbname']})
>>> df
auth
0 dbname_user
1 dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
>>> df
auth db login
0 dbname_user dbname user
1 dbname dbname None <--- as expected, 'None' value is assigned
Problematic example:
>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname']})
>>> df
auth
0 dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3643, in __setitem__
self._setitem_array(key, value)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
check_key_length(self.columns, key, value)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
I would expect the same as in the working example, where the value for the second column is 'None'. Unfortunately I cannot dynamically expand the number of columns using list comprehension. The number of columns must be fixed.
CodePudding user response:
This should do the trick:
import pandas as pd
import numpy as np
df_example = pd.DataFrame({'auth': ['dbname']})
df_example[["db", "login"]] = (
# This np.where adds a "_" to each string that don't have "_".
pd.Series(
np.where(
df_example["auth"].str.contains("_"),
df_example["auth"],
df_example["auth"] "_"
)
)
# Normal `.str.split` like you were doing.
.str.split("_", n=1, expand=True)
# `.applymap` converts empty strings ("") into None.
.applymap(lambda value: None if value == "" else value)
)
print(df_example)
# Prints:
# auth db login
# 0 dbname dbname None
Explanation
The error you're getting (ValueError: Columns must be same length as key
), happens because, after splitting the column "auth"
, you end up with only one value of length = 1. expand=True
won't help you here, because all the values from split have length = 1. Your first example works, because when pandas split the first value dbname_user
, it had length = 2, therefore the remaining values got expanded to this same length. In other words, expand=True
makes all the returning values have the same length as the value with the greatest length:
# Series with first value containing no "_",
# second value containing one "_",
# and third value containing two "_".
example_2 = pd.Series(['dbname', 'dbname_user', 'dbname_user_2'])
# Applying `.str.split`, without setting parameter `n`
split = example_2.str.split("_", expand=True)
# Get the maximum length of the split (max_len = 3)
max_len = example_2.str.split("_").str.len().max()
print('Max Length:', max_len)
print('Number of columns:', split.shape[1])
print('max_len == split.shape[1]:', max_len == split.shape[1])
print('\n', split)
# Prints:
# Max Length: 3
# Number of columns: 3
# max_len == split.shape[1]: True
#
# 0 1 2
# 0 dbname None None
# 1 dbname user None
# 2 dbname user 2
CodePudding user response:
You can try this piece of code which should work :
def underscore_split(auth):
if "_" in auth:
return auth.split("_", 1)
else:
return [auth, None]
df[['db', 'login']] = [underscore_split(x) for x in df["auth"]]