Home > database >  replacing strings in pandas dataframes with lists of strings
replacing strings in pandas dataframes with lists of strings

Time:10-13

I have a pandas dataframe read from file, some of whose columns contain strings, some of which in turn contain substrings separated by semicolons. My goal is to turn the semicolon-separated substrings into lists of strings and put those back in the dataframe.

When I use df.iloc[-1][-1] = df.iloc[-1][-1].split(';'); on a cell that contains a string with semicolons, there's no error but the value df.iloc[-1][-1] is not changed.

When I use

newval          = df.iloc[-1,-1];              newval
newval          = df.iloc[-1,-1].split( ';' ); newval
df.iloc[-1][-1] = newval;                      df.iloc[-1][-1]

It shows the original string for the first line and the list of substrings for the second, but then the original string again for the third. It looks as if nothing has been assigned -- but there was no error message either.

My first guess was that it was not allowed to put a list of strings in a cell that contains strings but a quick test showed me that that is OK:

>>> df = pd.DataFrame([["a", "a;b"], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B']);
>>> df
     A        B
1    a      a;b
2  a;A  a;b;A;B
>>> for row in range ( df.shape [ 0 ] ):
...     for col in range ( df.shape [ 1 ] ):
...         value = df.iloc[row][col];        
...         if ( type ( value ) == str ):
...             value = value.split( ';' );
...             df.iloc[row][col] = value;      
>>> df
        A             B
1     [a]        [a, b]
2  [a, A]  [a, b, A, B]

So I'm puzzled why (i) the assignment works in the example but not for my CSV-imported dataframe, and (ii) why python does not give an error message?

CodePudding user response:

Honestly, you can simplify your code avoiding the loops with a simple applymap. Loops should be avoided with pandas. Here applymap won't necessarily be faster, but it's definitely much easier to use and understand.

out = df.applymap(lambda x: x.split(';'))

output:

        A             B
1     [a]        [a, b]
2  [a, A]  [a, b, A, B]
why your approach failed

You're using df.iloc[row][col] = value which can cause setting value on a copy, you should use df.iloc[row, col] = value instead. Did you get a SettingWithCopyWarning?

not all values are strings:
df.applymap(lambda x: x.split(';') if isinstance(x, str) else x)

Example:

df = pd.DataFrame([["a", 2], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B'])
df.applymap(lambda x: x.split(';') if isinstance(x, str) else x)

        A             B
1     [a]             2
2  [a, A]  [a, b, A, B]
  • Related