let me rephrase my question:
I have the following dataset:
data = {
'globalId': {0: 4388064, 1: 4388200, 2: 4399344, 3: 4400638, 4: 4401765, 5: 4401831},
'publicatieDatum': {0: '2018-07-31', 1: '2018-09-24', 2: '2018-08-02', 3: '2018-08-04', 4: '2018-08-05', 5: '2018-08-06'},
'postcode': {0: '1774PG', 1: '7481LK', 2: '1068MS', 3: '5628EN', 4: '7731TV', 5: '5971CR'},
'koopPrijs': {0: 139000.0, 1: 209000.0, 2: 267500.0, 3: 349000.0, 4: 495000.0, 5: 162500.0}
}
df = pd.DataFrame(data)
print(df)
This is the code for the municipality name of the zipcode/postal code of the 1st row:
nomi.query_postal_code(["1774"])
Now, I want to add a column called 'Gemeente'.
This can be retreived using the following formule:
>>> nomi.query_postal_code(["postcode"])
The postcode above should indicate the 4 numbers of the postcode within the postcode column.
I have 2 questions:
How can i add a code that calculates the gemeente for all rows in the above dataframe, based on the 'postcode', as specified above.
How can this code be written that it only selects the first 4 digits in the postcode column.
Apologies and thanks!
CodePudding user response:
nomi['is_1774']=[v == '1774' for v in nomi['postal_code']]
CodePudding user response:
I'm assuming that you are working with the pgeocode library?
You could try:
import pandas as pd
import pgeocode
nomi = pgeocode.Nominatim('nl')
data = {
'globalId': {0: 4388064, 1: 4388200, 2: 4399344, 3: 4400638, 4: 4401765, 5: 4401831},
'publicatieDatum': {0: '2018-07-31', 1: '2018-09-24', 2: '2018-08-02', 3: '2018-08-04', 4: '2018-08-05', 5: '2018-08-06'},
'postcode': {0: '1774PG', 1: '7481LK', 2: '1068MS', 3: '5628EN', 4: '7731TV', 5: '5971CR'},
'koopPrijs': {0: 139000.0, 1: 209000.0, 2: 267500.0, 3: 349000.0, 4: 495000.0, 5: 162500.0}
}
df = pd.DataFrame(data)
df["postal_code"] = df.postcode.str[:4]
df = df.merge(
nomi.query_postal_code(set(df.postal_code))[["postal_code", "state_name"]],
on="postal_code", how="left"
)
Result:
globalId publicatieDatum postcode koopPrijs postal_code state_name
0 4388064 2018-07-31 1774PG 139000.0 1774 Noord-Holland
1 4388200 2018-09-24 7481LK 209000.0 7481 Overijssel
2 4399344 2018-08-02 1068MS 267500.0 1068 Noord-Holland
3 4400638 2018-08-04 5628EN 349000.0 5628 Noord-Brabant
4 4401765 2018-08-05 7731TV 495000.0 7731 Overijssel
5 4401831 2018-08-06 5971CR 162500.0 5971 Limburg
Seems to be the better way. What field do you need?
Or do:
import pandas as pd
import pgeocode
nomi = pgeocode.Nominatim('nl')
data = ... as above ...
df = pd.DataFrame(data)
df["postal_code"] = df.postcode.str[:4]
df = df.merge(
nomi.query_postal_code(set(df.postal_code)), on="postal_code", how="left"
)
Result:
globalId publicatieDatum postcode ... latitude longitude accuracy
0 4388064 2018-07-31 1774PG ... 52.8472 4.9702 6.0
1 4388200 2018-09-24 7481LK ... 52.1536 6.7528 6.0
2 4399344 2018-08-02 1068MS ... 52.3592 4.8052 6.0
3 4400638 2018-08-04 5628EN ... 51.4801 5.4724 6.0
4 4401765 2018-08-05 7731TV ... 52.5240 6.4208 6.0
5 4401831 2018-08-06 5971CR ... 51.4198 6.1408 6.0
[6 rows x 16 columns]
Then just drop the columns you don't need.