Home > Back-end >  How to edit a string type column of a python dataframe according to rule?
How to edit a string type column of a python dataframe according to rule?

Time:03-18

For a university work, we have some geographic coordinates of world locations in a pandas dataframe:

      df = pd.DataFrame({'NAME': ['Paris', 'New York', 'Rio', 'Airport GRU', 'ORLY'],
               'GEO': ['POINT (2.31647 48.85)',
                       'POINT (-73.993457389558 40.731499671618)',
                       'POINT (-43.2 -22.9)',
                       'POINT (-46.47313507388693 -23.429382262746415)',
                       'GEOMETRYCOLLECTION EMPTY']})

      print(df)

        NAME                      GEO
       Paris                   POINT (2.31647 48.85)
       New York                POINT (-73.993457389558 40.731499671618)
        Rio                    POINT (-43.2 -22.9)
      Airport GRU              POINT (-46.47313507388693 -23.429382262746415)
        ORLY                   GEOMETRYCOLLECTION EMPTY

I would like to edit the 'GEO' column. Initially, I would like to disregard the word 'POINT', then I would like to put it in (Latitude, Longitude) format, as the column is in POINT (Longitude Latitude) format (and it's also being a comma).

To solve this, I created two separate columns to store the LAT and LONG (this part is working):

       df2 = df.join(df['GEO'].str.extract(r'(?P<LONG>-?\d \.\d ) (?P<LAT>-?\d \.\d )').astype(float))

      print(df2)

         NAME            GEO                                          LONG      LAT
        Paris       POINT (2.31647 48.85)                           2.316470   48.85000
        New York    POINT (-73.993457389558 40.731499671618)       -73.993457  40.731500
        Rio         POINT (-43.2 -22.9)                            -43.200000  -22.900000
        Airport GRU POINT (-46.47313507388693 -23.429382262746415) -46.473135  -23.429382
        ORLY        GEOMETRYCOLLECTION EMPTY                          NaN         NaN

However, when I try to create a new column to receive the format: (LAT, LONG). The code doesn't work:

      df2['Result'] = "("   df2['LAT']   ","   df2['LONG']   ")"

The error appears: "UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U1'), dtype('float64')) -> None"

I would like the output to be:

     NAME            GEO
   Paris            (48.85, 2.31647)
   New York         (40.731499671618, -73.993457389558)
   Rio              (-22.9, -43.2)
   Airport GRU      (-23.429382262746415, -46.47313507388693)
   ORLY             GEOMETRYCOLLECTION EMPTY]})

CodePudding user response:

You can use str.replace:

df['GEO'] = df['GEO'].str.replace('POINT\s \((.*)\s (.*)\)', r'(\2, \1)', regex=True)
print(df)

# Output
          NAME                                        GEO
0        Paris                           (48.85, 2.31647)
1     New York        (40.731499671618, -73.993457389558)
2          Rio                             (-22.9, -43.2)
3  Airport GRU  (-23.429382262746415, -46.47313507388693)
4         ORLY                   GEOMETRYCOLLECTION EMPTY

CodePudding user response:

We can try using str.extract here:

df[["LONG", "LAT"]] = df["GEO"].str.extract(r'POINT \((-?\d (?:\.\d )?) (-?\d (?:\.\d )?)\)')

For entries which don't match the POINT format, the above would assign a missing value. You may replace this with whatever placeholder you want.

  • Related