Create a new column in pd dataframe using a formula-CodePudding

let me rephrase my question:

I have the following dataset:

data = {
    'globalId': {0: 4388064, 1: 4388200, 2: 4399344, 3: 4400638, 4: 4401765, 5: 4401831},
    'publicatieDatum': {0: '2018-07-31', 1: '2018-09-24', 2: '2018-08-02', 3: '2018-08-04', 4: '2018-08-05', 5: '2018-08-06'}, 
    'postcode': {0: '1774PG', 1: '7481LK', 2: '1068MS', 3: '5628EN', 4: '7731TV', 5: '5971CR'},
    'koopPrijs': {0: 139000.0, 1: 209000.0, 2: 267500.0, 3: 349000.0, 4: 495000.0, 5: 162500.0}
}

df = pd.DataFrame(data)
print(df)

This is the code for the municipality name of the zipcode/postal code of the 1st row:

nomi.query_postal_code(["1774"])

Now, I want to add a column called 'Gemeente'.

This can be retreived using the following formule:

>>> nomi.query_postal_code(["postcode"])

The postcode above should indicate the 4 numbers of the postcode within the postcode column.

I have 2 questions:

How can i add a code that calculates the gemeente for all rows in the above dataframe, based on the 'postcode', as specified above.
How can this code be written that it only selects the first 4 digits in the postcode column.

Apologies and thanks!

CodePudding user response：

nomi['is_1774']=[v == '1774' for v in nomi['postal_code']]

CodePudding user response：

I'm assuming that you are working with the pgeocode library?

You could try:

import pandas as pd
import pgeocode

nomi = pgeocode.Nominatim('nl')

data = {
    'globalId': {0: 4388064, 1: 4388200, 2: 4399344, 3: 4400638, 4: 4401765, 5: 4401831},
    'publicatieDatum': {0: '2018-07-31', 1: '2018-09-24', 2: '2018-08-02', 3: '2018-08-04', 4: '2018-08-05', 5: '2018-08-06'}, 
    'postcode': {0: '1774PG', 1: '7481LK', 2: '1068MS', 3: '5628EN', 4: '7731TV', 5: '5971CR'},
    'koopPrijs': {0: 139000.0, 1: 209000.0, 2: 267500.0, 3: 349000.0, 4: 495000.0, 5: 162500.0}
}
df = pd.DataFrame(data)

df["postal_code"] = df.postcode.str[:4]
df = df.merge(
    nomi.query_postal_code(set(df.postal_code))[["postal_code", "state_name"]],
    on="postal_code", how="left"
)

Result:

   globalId publicatieDatum postcode  koopPrijs postal_code     state_name
0   4388064      2018-07-31   1774PG   139000.0        1774  Noord-Holland
1   4388200      2018-09-24   7481LK   209000.0        7481     Overijssel
2   4399344      2018-08-02   1068MS   267500.0        1068  Noord-Holland
3   4400638      2018-08-04   5628EN   349000.0        5628  Noord-Brabant
4   4401765      2018-08-05   7731TV   495000.0        7731     Overijssel
5   4401831      2018-08-06   5971CR   162500.0        5971        Limburg

Seems to be the better way. What field do you need?

Or do:

import pandas as pd
import pgeocode

nomi = pgeocode.Nominatim('nl')

data = ... as above ...
df = pd.DataFrame(data)

df["postal_code"] = df.postcode.str[:4]
df = df.merge(
    nomi.query_postal_code(set(df.postal_code)), on="postal_code", how="left"
)

Result:

   globalId publicatieDatum postcode  ...  latitude longitude accuracy
0   4388064      2018-07-31   1774PG  ...   52.8472    4.9702      6.0
1   4388200      2018-09-24   7481LK  ...   52.1536    6.7528      6.0
2   4399344      2018-08-02   1068MS  ...   52.3592    4.8052      6.0
3   4400638      2018-08-04   5628EN  ...   51.4801    5.4724      6.0
4   4401765      2018-08-05   7731TV  ...   52.5240    6.4208      6.0
5   4401831      2018-08-06   5971CR  ...   51.4198    6.1408      6.0

[6 rows x 16 columns]

Then just drop the columns you don't need.