How to Isolate numbers from a column and create 3 new columns?-CodePudding

I am trying to access a column, filter its numbers and then split in 3 columns. But i have been only getting errors. I am trying this:

dsc = df["Descricao"].str.findall("\d ")
dsc

The Output:
0                   []
1       [475, 2000, 3]
2        [65, 2000, 2]
3        [51, 2000, 3]
4       [320, 2000, 3]
             ...      
2344               NaN
2345    [480, 2000, 1]
2346     [32, 2000, 6]
2347    [250, 2000, 1]
2348               NaN
Name: Descricao, Length: 2349, dtype: object

Then, I am trying to split and everytime i get this kind of error:

df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
df.head(5)

The Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15388/2481153233.py in <module>
----> 1 df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
      2 df.head(5)

~\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3598             self._setitem_frame(key, value)
   3599         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3600             self._setitem_array(key, value)
   3601         elif isinstance(value, DataFrame):
   3602             self._set_item_frame_value(key, value)

~\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
   3637         else:
   3638             if isinstance(value, DataFrame):
-> 3639                 check_key_length(self.columns, key, value)
   3640                 for k1, k2 in zip(key, value.columns):
   3641                     self[k1] = value[k2]

~\anaconda3\lib\site-packages\pandas\core\indexers.py in check_key_length(columns, key, value)
    426     if columns.is_unique:
    427         if len(value.columns) != len(key):
--> 428             raise ValueError("Columns must be same length as key")
    429     else:
    430         # Missing keys in columns are represented as -1

ValueError: Columns must be same length as key

I think there is something to do with str.findall generating a list of lists. Does anybody know how can I solve this? For information, all my columns are objects.

CodePudding user response：

You could try this:

dsc = pd.DataFrame(df["Descricao"].str.findall("\d ").tolist(), columns=['Larg','comp', 'qtd'])

df = pd.concat([df, dsc], axis=1)

Note that this may not work if there are more than three columns at any point (I assume this will not be the case, given your attempt).

This method came from here.

CodePudding user response：

In the general case, some of the inputs may not have strings that parse to 3 numerical values.

Here is a way to do what the question asks while filling the new columns for any unusual rows with NaNs. If the desired behavior for non-standard rows is different, the logic can be adjusted as needed.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Descricao' : ['', '475,2000,3', '65,2000,2', np.nan, 'abc,def,ghi', '1,2', '1']
})
print(f"\nInput dataframe:\n{df}")

df2 = df["Descricao"].str.findall("\d ").to_frame()
print(f"\nDataframe with lists of 3 where possible:\n{df2}")

df2["Descricao"] = df2.apply(lambda x: 
    x["Descricao"] 
        if (len(x["Descricao"]) if isinstance(x["Descricao"], list) else 0) == 3 else 
    [np.NaN]*3, 
    axis=1)
print(f"\nDataframe with lists include NaNs for incomplete data:\n{df2}")

df2[['Larg','comp', 'qtd']] = pd.DataFrame(df2["Descricao"].tolist(), columns=['Larg','comp', 'qtd'])
df2 = df2.drop(['Descricao'], axis=1)
print(f"\nResult dataframe with NaNs for incomplete inputs:\n{df2}")

Sample Output:


Input dataframe:
     Descricao
0
1   475,2000,3
2    65,2000,2
3          NaN
4  abc,def,ghi
5          1,2
6            1

Dataframe with lists of 3 where possible:
        Descricao
0              []
1  [475, 2000, 3]
2   [65, 2000, 2]
3             NaN
4              []
5          [1, 2]
6             [1]

Dataframe with lists include NaNs for incomplete data:
         Descricao
0  [nan, nan, nan]
1   [475, 2000, 3]
2    [65, 2000, 2]
3  [nan, nan, nan]
4  [nan, nan, nan]
5  [nan, nan, nan]
6  [nan, nan, nan]

Result dataframe with NaNs for incomplete inputs:
  Larg  comp  qtd
0  NaN   NaN  NaN
1  475  2000    3
2   65  2000    2
3  NaN   NaN  NaN
4  NaN   NaN  NaN
5  NaN   NaN  NaN
6  NaN   NaN  NaN