Python pandas use fillna() to avoid error splitting on NaN value-CodePudding

my_df = pd.DataFrame(data = { 'col0': ['1%2%', '3%4%'], 'col1': [np.nan, np.nan], 'col2': [' 1-2', ' 2-4'], 'col3': [np.nan, ' 2-4'], 'col4': [np.nan, np.nan] })

my_df[['col0a', 'col0b']] = my_df['col0'].fillna('%').str.split('%', expand = True).iloc[:, 0:2]
my_df[['col1a', 'col1b']] = my_df['col1'].fillna('%').str.split('%', expand = True).iloc[:, 0:2]
my_df[['col2a', 'col2b']] = my_df['col2'].str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df[['col3a', 'col3b']] = my_df['col3'].str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df[['col4a', 'col4b']] = my_df['col4'].str.split('(?<=[\d]|K)(?=[ -])', expand=True)

my_df

This throws an error on the line my_df[['col4a', 'col4b']] since the entire column is NaN values. For col1, there is no error thrown since we fillna with %, and the split then splits on the % resulting in two columns with empty strings, which is exactly what we want.

What fillna() can we add to col4 such that it won't throw an error? The split for col4 is more complex (looking for the pattern of #-#) than the split for col1. We've tried fillna(' -') but this does not work. In general, we are looking for a solution that outputs empty strings for col4a and col4b when col4 has NaN as its value.

Edit: worded differently, I need a string for fillna that, when passed to .split('(?<=[\d]|K)(?=[ -])', expand=True), would result in 2 empty strings, although perhaps I can replace with something like fillna( 9999-9999) and then replace 9999 with empty strings...

CodePudding user response：

A (probably) better solution is to use .str.extract instead of .str.split, which is better-suited for your purposes:

my_df = pd.DataFrame(data = { 'col0': ['1%2%', '3%4%'], 'col1': [np.nan, np.nan], 'col2': [' 1-2', ' 2-4'], 'col3': [np.nan, ' 2-4'], 'col4': [np.nan, np.nan] })
my_df = my_df.fillna('')

pat1 = r'(\d %)(\d %)'
pat2 = r'([ -](?:[\d] |K))([ -](?:[\d] |K))'

my_df[['col0a', 'col0b']] = my_df['col0'].str.extract(pat1)
my_df[['col1a', 'col1b']] = my_df['col1'].str.extract(pat1)
my_df[['col2a', 'col2b']] = my_df['col2'].str.extract(pat2)
my_df[['col3a', 'col3b']] = my_df['col3'].str.extract(pat2)
my_df[['col4a', 'col4b']] = my_df['col4'].str.extract(pat2)
my_df = my_df.fillna('')

Output:

>>> my_df
   col0 col1  col2  col3 col4 col0a col0b col1a col1b col2a col2b col3a col3b col4a col4b
0  1%2%        1-2               1%    2%                 1    -2                        
1  3%4%        2-4   2-4         3%    4%                 2    -4     2    -4

CodePudding user response：

Not perfect, but this 2-line solution is better than the current 8-line solution I have that uses an if: else: block.

my_df[['col4a', 'col4b']] = my_df['col4'].fillna(' 999999-999999').str.split('(?<=[\d]|K)(?=[ -])', expand=True)
my_df[['col4a', 'col4b']] = my_df[['col4a', 'col4b']].replace(' 999999', '').replace('-999999', '')