I am trying to access a column, filter its numbers and then split in 3 columns. But i have been only getting errors. I am trying this:
dsc = df["Descricao"].str.findall("\d ")
dsc
The Output:
0 []
1 [475, 2000, 3]
2 [65, 2000, 2]
3 [51, 2000, 3]
4 [320, 2000, 3]
...
2344 NaN
2345 [480, 2000, 1]
2346 [32, 2000, 6]
2347 [250, 2000, 1]
2348 NaN
Name: Descricao, Length: 2349, dtype: object
Then, I am trying to split and everytime i get this kind of error:
df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
df.head(5)
The Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15388/2481153233.py in <module>
----> 1 df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
2 df.head(5)
~\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3598 self._setitem_frame(key, value)
3599 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3600 self._setitem_array(key, value)
3601 elif isinstance(value, DataFrame):
3602 self._set_item_frame_value(key, value)
~\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
3637 else:
3638 if isinstance(value, DataFrame):
-> 3639 check_key_length(self.columns, key, value)
3640 for k1, k2 in zip(key, value.columns):
3641 self[k1] = value[k2]
~\anaconda3\lib\site-packages\pandas\core\indexers.py in check_key_length(columns, key, value)
426 if columns.is_unique:
427 if len(value.columns) != len(key):
--> 428 raise ValueError("Columns must be same length as key")
429 else:
430 # Missing keys in columns are represented as -1
ValueError: Columns must be same length as key
I think there is something to do with str.findall generating a list of lists. Does anybody know how can I solve this? For information, all my columns are objects.
CodePudding user response:
You could try this:
dsc = pd.DataFrame(df["Descricao"].str.findall("\d ").tolist(), columns=['Larg','comp', 'qtd'])
df = pd.concat([df, dsc], axis=1)
Note that this may not work if there are more than three columns at any point (I assume this will not be the case, given your attempt).
This method came from here.
CodePudding user response:
In the general case, some of the inputs may not have strings that parse to 3 numerical values.
Here is a way to do what the question asks while filling the new columns for any unusual rows with NaNs. If the desired behavior for non-standard rows is different, the logic can be adjusted as needed.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Descricao' : ['', '475,2000,3', '65,2000,2', np.nan, 'abc,def,ghi', '1,2', '1']
})
print(f"\nInput dataframe:\n{df}")
df2 = df["Descricao"].str.findall("\d ").to_frame()
print(f"\nDataframe with lists of 3 where possible:\n{df2}")
df2["Descricao"] = df2.apply(lambda x:
x["Descricao"]
if (len(x["Descricao"]) if isinstance(x["Descricao"], list) else 0) == 3 else
[np.NaN]*3,
axis=1)
print(f"\nDataframe with lists include NaNs for incomplete data:\n{df2}")
df2[['Larg','comp', 'qtd']] = pd.DataFrame(df2["Descricao"].tolist(), columns=['Larg','comp', 'qtd'])
df2 = df2.drop(['Descricao'], axis=1)
print(f"\nResult dataframe with NaNs for incomplete inputs:\n{df2}")
Sample Output:
Input dataframe:
Descricao
0
1 475,2000,3
2 65,2000,2
3 NaN
4 abc,def,ghi
5 1,2
6 1
Dataframe with lists of 3 where possible:
Descricao
0 []
1 [475, 2000, 3]
2 [65, 2000, 2]
3 NaN
4 []
5 [1, 2]
6 [1]
Dataframe with lists include NaNs for incomplete data:
Descricao
0 [nan, nan, nan]
1 [475, 2000, 3]
2 [65, 2000, 2]
3 [nan, nan, nan]
4 [nan, nan, nan]
5 [nan, nan, nan]
6 [nan, nan, nan]
Result dataframe with NaNs for incomplete inputs:
Larg comp qtd
0 NaN NaN NaN
1 475 2000 3
2 65 2000 2
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN