my_df = pd.DataFrame(data = { 'col0': ['1%2%', '3%4%'], 'col1': [np.nan, np.nan], 'col2': [' 1-2', ' 2-4'], 'col3': [np.nan, ' 2-4'], 'col4': [np.nan, np.nan] })
my_df[['col0a', 'col0b']] = my_df['col0'].fillna('%').str.split('%', expand = True).iloc[:, 0:2]
my_df[['col1a', 'col1b']] = my_df['col1'].fillna('%').str.split('%', expand = True).iloc[:, 0:2]
my_df[['col2a', 'col2b']] = my_df['col2'].str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df[['col3a', 'col3b']] = my_df['col3'].str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df[['col4a', 'col4b']] = my_df['col4'].str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df
This throws an error on the line my_df[['col4a', 'col4b']]
since the entire column is NaN
values. For col1
, there is no error thrown since we fillna with %
, and the split then splits on the %
resulting in two columns with empty strings, which is exactly what we want.
What fillna()
can we add to col4
such that it won't throw an error? The split for col4 is more complex (looking for the pattern of #-#) than the split for col1. We've tried fillna(' -')
but this does not work. In general, we are looking for a solution that outputs empty strings for col4a
and col4b
when col4
has NaN
as its value.
Edit: worded differently, I need a string for fillna that, when passed to .split('(?<=[\d]|K)(?=[ -])', expand=True)
, would result in 2 empty strings, although perhaps I can replace with something like fillna( 9999-9999)
and then replace 9999 with empty strings...
CodePudding user response:
A (probably) better solution is to use .str.extract
instead of .str.split
, which is better-suited for your purposes:
my_df = pd.DataFrame(data = { 'col0': ['1%2%', '3%4%'], 'col1': [np.nan, np.nan], 'col2': [' 1-2', ' 2-4'], 'col3': [np.nan, ' 2-4'], 'col4': [np.nan, np.nan] })
my_df = my_df.fillna('')
pat1 = r'(\d %)(\d %)'
pat2 = r'([ -](?:[\d] |K))([ -](?:[\d] |K))'
my_df[['col0a', 'col0b']] = my_df['col0'].str.extract(pat1)
my_df[['col1a', 'col1b']] = my_df['col1'].str.extract(pat1)
my_df[['col2a', 'col2b']] = my_df['col2'].str.extract(pat2)
my_df[['col3a', 'col3b']] = my_df['col3'].str.extract(pat2)
my_df[['col4a', 'col4b']] = my_df['col4'].str.extract(pat2)
my_df = my_df.fillna('')
Output:
>>> my_df
col0 col1 col2 col3 col4 col0a col0b col1a col1b col2a col2b col3a col3b col4a col4b
0 1%2% 1-2 1% 2% 1 -2
1 3%4% 2-4 2-4 3% 4% 2 -4 2 -4
CodePudding user response:
Not perfect, but this 2-line solution is better than the current 8-line solution I have that uses an if: else: block.
my_df[['col4a', 'col4b']] = my_df['col4'].fillna(' 999999-999999').str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df[['col4a', 'col4b']] = my_df[['col4a', 'col4b']].replace(' 999999', '').replace('-999999', '')