I have a dataframe that looks like -
block lat lon
0 0 112 50
1 0 112 50
2 0 112 50
3 1 105 20
4 1 105 20
5 2 130 30
and I want to first group by block and then apply a function to the lat/lon columns, e.g.
df['location_id'] = df.groupby('block').apply(lambda x: get_location_id(x['lat'], x['lon']))
For each lat/lon pair my function returns an ID, and I want a new column containing that ID. When I try the above it doesn't work, and the lambda complains if I pass axis=1 or similar. The result I want looks like this:
block lat lon location_id
0 0 112 50 1
1 0 112 50 1
2 0 112 50 1
3 1 105 20 23
4 1 105 20 23
5 2 130 30 15
I'd like to avoid just applying the function to the ungrouped dataframe because my dataset is quite large and that will be slow.
Edit: the function takes in a lat and lon and returns a single string ID.
CodePudding user response:
Depending on how "large" your "large" dataset is, there might be different solutions, and I'm not 100% certain you can (or should) do what you want with groupby. I'd suggest the following (and I'm relatively sure it will work even in distributed environments):
- Create a new dataframe with only the non-duplicated block, lat, lon combinations.
- Apply your function to that (much smaller) dataframe.
- INNER join that dataframe back to your original dataframe on those three columns, as sketched below. (While this feels bad, inner joins are usually done via hash joins and are usually quite fast even on Spark clusters.)
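A minimal sketch of that recipe in pandas, assuming get_location_id takes a lat and lon and returns a single string ID (the function body here is just a hypothetical stand-in for your real lookup):

import pandas as pd

# Hypothetical stand-in for the real lookup function described in the question.
def get_location_id(lat, lon):
    return f"id_{lat}_{lon}"

df = pd.DataFrame({
    'block': [0, 0, 0, 1, 1, 2],
    'lat':   [112, 112, 112, 105, 105, 130],
    'lon':   [50, 50, 50, 20, 20, 30],
})

# 1. Keep only the unique block/lat/lon combinations.
unique = df.drop_duplicates(subset=['block', 'lat', 'lon']).copy()

# 2. Call the (expensive) function once per unique combination.
unique['location_id'] = unique.apply(
    lambda row: get_location_id(row['lat'], row['lon']), axis=1)

# 3. Inner join the IDs back onto the original dataframe.
df = df.merge(unique, on=['block', 'lat', 'lon'], how='inner')
print(df)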
Edit: Pandas is quite fine handling datasets that take gigabytes of RAM as long as you have the RAM, so just applying the function to the whole dataframe might be a lot more viable than you think.
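For reference, that direct per-row version (no groupby or dedup needed) would look roughly like this; whether it's fast enough depends mostly on how expensive get_location_id itself is:

df['location_id'] = df.apply(
    lambda row: get_location_id(row['lat'], row['lon']), axis=1)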