I'm data wrangling at the moment and I've a csv file, from kaggle in Indian Food Prices which I've converted into a dataframe of which one of the columns 'Market' has some categorical location data. For an ML problem I need latitude and longitude data which I've managed to get via the open streetmap api - however as the original dataframe had 131000 rows I created a separate dataframe with only single instances of the location and this new dataframe has 165 rows.
I now need to merge the two dataframes, but figure out a loop that'll populate all the rows from the original dataframe with 131000 rows with the lat and long data from the smaller dataframe of 165 rows, but have the lat and long match the location in the 'Market' column.
Any advice would be gratefully received,
Here's an attempt at trying to achieve the above
def using_where(ndf):
ndf['Lat-Long'] = np.where(df['Market']='Delhi', '28.6517178, 77.2219388'
Here's the head of my large data frame 'ndf'
<bound method NDFrame.head of Unnamed: 0 Date Market Category
\
0 1 1994-01-15 Delhi cereals and tubers
1 2 1994-01-15 Delhi cereals and tubers
2 3 1994-01-15 Delhi miscellaneous food
3 4 1994-01-15 Delhi oil and fats
4 5 1994-01-15 Ahmedabad cereals and tubers
... ... ... ... ...
139529 139530 2021-09-15 Kharagpur pulses and nuts
139530 139531 2021-09-15 Kharagpur pulses and nuts
139531 139532 2021-09-15 Kharagpur pulses and nuts
139532 139533 2021-09-15 Kharagpur vegetables and fruits
139533 139534 2021-09-15 Kharagpur vegetables and fruits
Commodity Unit PriceFlag PriceType Currency Price USD_Price
0 Rice KG actual Retail INR 8.0 0.2545
1 Wheat KG actual Retail INR 5.0 0.1590
2 Sugar KG actual Retail INR 13.5 0.4294
3 Oil (mustard) KG actual Retail INR 31.0 0.9860
4 Rice KG actual Retail INR 6.8 0.2163
... ... ... ... ... ... ... ...
139529 Lentils (masur) KG actual Retail INR 110.0 1.4972
139530 Lentils (moong) KG actual Retail INR 120.0 1.6333
139531 Lentils (urad) KG actual Retail INR 115.0 1.5653
139532 Onions KG actual Retail INR 30.0 0.4083
139533 Tomatoes KG actual Retail INR 40.0 0.5444
Here's the head of my small dataframe 'df'
<bound method NDFrame.head of Market geocoded
0 Delhi (28.6517178, 77.2219388)
4 Ahmedabad (23.0216238, 72.5797068)
8 Shimla (31.1041526, 77.1709729)
11 Bengaluru (12.9767936, 77.590082)
14 Bhopal (23.2584857, 77.401989)
... ... ...
136823 Dantewada (18.8640648, 81.38339468738648)
136970 Selamba -1
137053 Bodeli (22.2748105, 73.7166363)
137326 Dhanbad (23.7952809, 86.4309638)
137389 Jamshedpur (22.8015194, 86.2029579)
[165 rows x 2 columns]>
CodePudding user response:
I think you can just use merge()
, unless I'm missing something:
ndf = pd.merge(ndf, df, how='inner', on='Market')
Here's a full code example with test case:
import pandas as pd
ndf = pd.DataFrame({'Date':['1994-01-15']*5 ['2021-09-15']*5, 'Market':'Delhi,Delhi,Delhi,Delhi,Ahmedabad,Kharagpur,Kharagpur,Kharagpur,Kharagpur,Kharagpur'.split(','),
'Category':'cereals and tubers,cereals and tubers,miscellaneous food,oil and fats,cereals and tubers,pulses and nuts,pulses and nuts,pulses and nuts,vegetables and fruits,vegetables and fruits'.split(',')})
df = pd.DataFrame({'Market':'Delhi,Ahmedabad,Shimla,Bengaluru,Bhopal,Kharagpur'.split(','),
'geocoded':[(28.6517178, 77.2219388),(23.0216238, 72.5797068),(31.1041526, 77.1709729),(12.9767936, 77.590082),(23.2584857, 77.401989),(22.22, 73.73)]})
ndf = pd.merge(ndf, df, how='inner', on='Market')
print(ndf)
Output:
Date Market Category geocoded
0 1994-01-15 Delhi cereals and tubers (28.6517178, 77.2219388)
1 1994-01-15 Delhi cereals and tubers (28.6517178, 77.2219388)
2 1994-01-15 Delhi miscellaneous food (28.6517178, 77.2219388)
3 1994-01-15 Delhi oil and fats (28.6517178, 77.2219388)
4 1994-01-15 Ahmedabad cereals and tubers (23.0216238, 72.5797068)
5 2021-09-15 Kharagpur pulses and nuts (22.22, 73.73)
6 2021-09-15 Kharagpur pulses and nuts (22.22, 73.73)
7 2021-09-15 Kharagpur pulses and nuts (22.22, 73.73)
8 2021-09-15 Kharagpur vegetables and fruits (22.22, 73.73)
9 2021-09-15 Kharagpur vegetables and fruits (22.22, 73.73)