Pandas groupby unable to concat strings due to TypeError

I've hit a dead end trying to group records in my df and stack the values from one of the columns. I have a df of ~390k records shaped like this:

df = pd.DataFrame({
    'Województwo': {14: 'ŁÓDZKIE', 15: 'ŁÓDZKIE'},
    'Powiat': {14: 'bełchatowski', 15: 'bełchatowski'},
    'Gmina': {14: 'Bełchatów', 15: 'Bełchatów'},
    'Miejscowość (GUS)': {14: 'Bełchatów', 15: 'Bełchatów'},
    'Ulica (cecha)': {14: 'al.', 15: 'al.'},
    'Ulica (nazwa)': {14: 'Aleja ks. Kardynała Stefana Wyszyńskiego', 15: 'Aleja ks. Kardynała Stefana Wyszyńskiego'},
    'Kod pocztowy (PNA)': {14: '97-400', 15: '97-402'},
    'Kod województwa': {14: 'vosti_province_lodzkie',15: 'vosti_province_lodzkie'},
    'Kod powiatu': {14: 'district_lodzkie_belchatowski',15: 'district_lodzkie_belchatowski'},
    'Kod gminy': {14: 'commune_belchatowski_belchatow',15: 'commune_belchatowski_belchatow'},
    'Kod miejscowości': {14: 'town_belchatow_belchatow',15: 'town_belchatow_belchatow'},
    'Kod cechy adresu': {14: 'address_prefix_al', 15: 'address_prefix_al'},
    'Kod adresu': {14: 'address_belchatow_aleja_ks_kardynala_stefana_wyszynskiego',15:'address_belchatow_aleja_ks_kardynala_stefana_wyszynskiego'}})

I want to get rid of the duplicates while stacking up the values in the column "Kod pocztowy (PNA)". To do so, I came up with this line:

db_miasto = pd.DataFrame(db.groupby(['Województwo', 'Powiat', 'Gmina', 'Miejscowość (GUS)', 'Kod gminy', 'Kod miejscowości'], as_index=False)['Kod pocztowy (PNA)'].apply(lambda x: ",".join(x)))

It works on the example records and gives me back this df:

final_df = pd.DataFrame(
    {'Województwo': {0: 'ŁÓDZKIE'},
    'Powiat': {0: 'bełchatowski'},
    'Gmina': {0: 'Bełchatów'},
    'Miejscowość (GUS)': {0: 'Bełchatów'},
    'Kod gminy': {0: 'commune_belchatowski_belchatow'},
    'Kod miejscowości': {0: 'town_belchatow_belchatow'},
    'Kod pocztowy (PNA)': {0: '97-400,97-402'}})

and that is exactly the result I'm expecting. However...

If I run that same line on the entire df of 394k records, I hit an error:

TypeError: sequence item 8: expected str instance, float found
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/opt/anaconda3/envs/Productive24_SPK/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
   1252             try:
-> 1253                 result = self._python_apply_general(f, self._selected_obj)
   1254             except TypeError:

/opt/anaconda3/envs/Productive24_SPK/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1286         """
-> 1287         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1288 

/opt/anaconda3/envs/Productive24_SPK/lib/python3.8/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    819             group_axes = group.axes
--> 820             res = f(group)
    821             if not _is_indexed_like(res, group_axes, axis):

/var/folders/w1/ghc7r0mx6mj933lyktx82w480000gn/T/ipykernel_79776/2296150702.py in <lambda>(x)
----> 1 db_miasto = pd.DataFrame(db.groupby(['Województwo', 'Powiat', 'Gmina', 'Miejscowość (GUS)', 'Kod gminy', 'Kod miejscowości'], as_index=False)['Kod pocztowy (PNA)'].apply(lambda x: ",".join(x)))

TypeError: sequence item 8: expected str instance, float found

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/var/folders/w1/ghc7r0mx6mj933lyktx82w480000gn/T/ipykernel_79776/2296150702.py in <module>
----> 1 db_miasto = pd.DataFrame(db.groupby(['Województwo', 'Powiat', 'Gmina', 'Miejscowość (GUS)', 'Kod gminy', 'Kod miejscowości'], as_index=False)['Kod pocztowy (PNA)'].apply(lambda x: ",".join(x)))

/opt/anaconda3/envs/Productive24_SPK/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
   1262 
   1263                 with group_selection_context(self):
-> 1264                     return self._python_apply_general(f, self._selected_obj)
   1265 
   1266         return result

/opt/anaconda3/envs/Productive24_SPK/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1285             data after applying f
   1286         """
-> 1287         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1288 
   1289         return self._wrap_applied_output(

/opt/anaconda3/envs/Productive24_SPK/lib/python3.8/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    818             # group might be modified
    819             group_axes = group.axes
--> 820             res = f(group)
    821             if not _is_indexed_like(res, group_axes, axis):
    822                 mutated = True

/var/folders/w1/ghc7r0mx6mj933lyktx82w480000gn/T/ipykernel_79776/2296150702.py in <lambda>(x)
----> 1 db_miasto = pd.DataFrame(db.groupby(['Województwo', 'Powiat', 'Gmina', 'Miejscowość (GUS)', 'Kod gminy', 'Kod miejscowości'], as_index=False)['Kod pocztowy (PNA)'].apply(lambda x: ",".join(x)))

TypeError: sequence item 8: expected str instance, float found

I have tried changing the lambda to .apply(lambda x: ",".join(str(x))). It goes through the whole file without an error, but returns this:

wrong_df = pd.DataFrame({
    'Województwo': {0: 'ŁÓDZKIE'},
    'Powiat': {0: 'bełchatowski'},
    'Gmina': {0: 'Bełchatów'},
    'Miejscowość (GUS)': {0: 'Bełchatów'},
    'Kod gminy': {0: 'commune_belchatowski_belchatow'},
    'Kod miejscowości': {0: 'town_belchatow_belchatow'},
    'Kod pocztowy (PNA)': {0: '1,4, , , , ,9,7,-,4,0,0,\n,1,5, , , , ,9,7,-,4,0,2,\n,N,a,m,e,:, ,(,Ł,Ó,D,Z,K,I,E,,, ,b,e,ł,c,h,a,t,o,w,s,k,i,,, ,B,e,ł,c,h,a,t,ó,w,,, ,B,e,ł,c,h,a,t,ó,w,,, ,c,o,m,m,u,n,e,_,b,e,l,c,h,a,t,o,w,s,k,i,_,b,e,l,c,h,a,t,o,w,,, ,t,o,w,n,_,b,e,l,c,h,a,t,o,w,_,b,e,l,c,h,a,t,o,w,),,, ,d,t,y,p,e,:, ,o,b,j,e,c,t'}})

and that's some utter BS...

I have no idea how to interpret "sequence item 8" or how to check which part of the df that is... I've checked the types of the columns and they are supposedly strings, not floats:

Województwo           object
Powiat                object
Gmina                 object
Miejscowość (GUS)     object
Ulica (cecha)         object
Ulica (nazwa)         object
Kod pocztowy (PNA)    object
Kod województwa       object
Kod powiatu           object
Kod gminy             object
Kod miejscowości      object
Kod cechy adresu      object
Kod adresu            object
dtype: object

I'm really running short on time with this project, hence coming here for help. Any idea how to fix the issue, or is there another way to stack up duplicate rows in the way described?

Thanks in advance!

CodePudding user response：

You can first convert the column to string:

df["Kod pocztowy (PNA)"] = df["Kod pocztowy (PNA)"].astype(str)
db_miasto = pd.DataFrame(
    df.groupby(
        [
            "Województwo",
            "Powiat",
            "Gmina",
            "Miejscowość (GUS)",
            "Kod gminy",
            "Kod miejscowości",
        ],
        as_index=False,
    )["Kod pocztowy (PNA)"].apply(",".join)
)

Prints:

  Województwo        Powiat      Gmina Miejscowość (GUS)                       Kod gminy          Kod miejscowości Kod pocztowy (PNA)
0     ŁÓDZKIE  bełchatowski  Bełchatów         Bełchatów  commune_belchatowski_belchatow  town_belchatow_belchatow      97-400,97-402
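
One caveat worth checking: if the floats in the column are missing values (NaN), converting everything to string may leave the literal text nan inside the joined result, depending on the pandas version. A minimal sketch of an alternative that skips missing values within each group instead, assuming that is the desired behaviour. The toy frame and the name joined are made up for illustration, and only one grouping column is used to keep it short:

```python
import numpy as np
import pandas as pd

col = "Kod pocztowy (PNA)"

# Hypothetical toy data: one group contains a missing postal code.
df = pd.DataFrame({
    "Gmina": ["Bełchatów", "Bełchatów", "Bełchatów", "Zelów"],
    col: ["97-400", np.nan, "97-402", "97-425"],
})

# Drop missing values inside each group, then join what is left.
# map(str, ...) also guards against any other non-string values.
joined = (
    df.groupby("Gmina")[col]
      .agg(lambda s: ",".join(map(str, s.dropna())))
      .reset_index()
)
print(joined)
```

The same lambda should work with the full list of grouping columns from the question.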

CodePudding user response：

It's a type mismatch. The main issue is that the column you want to join contains values that are not strings.

As you can see, in your example the PNA is a string:

{14: '97-400', 15: '97-402'},
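
In the full data that is evidently not true for every row. A dtype of object only means "arbitrary Python objects", so the column can mix strings with floats, and a missing value is usually stored as the float NaN. Note also that "sequence item 8" counts positions inside the group being joined (the ninth value of that group), not rows of the whole frame. A rough sketch for locating the offending rows, using a made-up three-row sample rather than the real data:

```python
import numpy as np
import pandas as pd

col = "Kod pocztowy (PNA)"

# Hypothetical sample: dtype is object, yet one value is a float NaN.
df = pd.DataFrame({col: ["97-400", np.nan, "97-402"]}, dtype=object)

# The dtype only says "object"; count the actual Python types instead.
type_counts = df[col].apply(type).value_counts()
print(type_counts)

# Rows whose value is not a str are the ones that break ",".join.
bad_rows = df[~df[col].apply(lambda v: isinstance(v, str))]
print(bad_rows)
```

Running the same two checks on the real column should show whether the floats are missing values or genuine numbers.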

So cast the column to string:

df["Kod pocztowy (PNA)"] = df["Kod pocztowy (PNA)"].astype(str)

After that, the groupby operation shouldn't throw the error.
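
As for the ",".join(str(x)) attempt in the question: str(x) builds the printed representation of the whole group (index, values, Name and dtype lines), and join then walks over that text one character at a time. Converting each element separately avoids this. A small illustration with a hypothetical two-element Series:

```python
import pandas as pd

# Hypothetical group, mirroring the two example rows from the question.
s = pd.Series(["97-400", "97-402"], index=[14, 15])

# str(s) is one big string, so join puts a comma between every character.
per_char = ",".join(str(s))

# Converting element by element and then joining gives the intended result.
per_value = ",".join(map(str, s))

print(per_value)
```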