Extract matrix from a dataframe by value from columns

Time:02-23

I am trying to do something that may be a little hard to explain, so I will try to be very specific.

I have a pandas dataframe like this:

Locality                          Count       Lat.        Long.
Krasnodar                         Russia      44          39
Tirana                            Albania     41.33       19.83
Areni                             Armenia     39.73       45.2
Kars                              Armenia     40.604517   43.100758
Brunn Wolfholz                    Austria     48.120396   16.291722
Kleinhadersdorf Flur Marchleiten  Austria     48.663197   16.589687
Jalilabad district                Azerbaijan  39.3607139  48.4613556
Zeyem Chaj                        Azerbaijan  40.9418889  45.8327778
Jalilabad district                Azerbaijan  39.5186111  48.65

And a file cities.txt with the names of some countries:

Albania 
Armenia
Austria
Azerbaijan

And so on. Next, I convert the Lat. and Long. values to radians and then, using the values from the list, do something like:

with open('cities.txt') as file:
  lines=file.readlines()
  x=np.where(df['Count'].eq(lines),pd.DataFrame(
  dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373,
    columns=df.Locality.unique(), index=df.Locality.unique()))

Here pd.DataFrame(dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373, columns=df.Locality.unique(), index=df.Locality.unique()) converts the Lat. and Long. values (in radians) into distances in km and builds a distance matrix, as a dataframe, for each line (country).

In the end I will have many 2D matrices (in theory), grouped by country, and I want to apply this:

>>>Russia.min()
0
>>>Russia.max()
5

to get the .min() and .max() values of each matrix and save the results in cities.txt as

Country     Max.Dist.  Min.Dist.
Albania     5          1
Armenia     10         9
Austria     5          3
Azerbaijan  0          0

Unfortunately, 1) I'm stuck on the first part, where I get the error ValueError: Lengths must be equal. 2) Is it possible to have these matrices grouped by country, and 3) how can I save my .min() and .max() values?

CodePudding user response:

I am not sure exactly what you want as the minimum. In this solution, the minimum is 0 if the country has only one city; otherwise it is the shortest distance between two cities within the country. Also, the file cities.txt seems to be just a filter. I didn't implement that, but it seems straightforward.
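As a rough sketch of that filter step, and of where the ValueError in the question likely comes from: Series.eq() given a list compares element by element, so a list whose length differs from the column's raises "Lengths must be equal". A membership test with isin() avoids this. The small dataframe and the lines list below are made-up stand-ins for the question's df and for what file.readlines() would return.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe (only the columns needed here).
df = pd.DataFrame({
    "Locality": ["Krasnodar", "Tirana", "Areni", "Kars"],
    "Count": ["Russia", "Albania", "Armenia", "Armenia"],
})

# Hypothetical stand-in for file.readlines() on a two-line cities.txt;
# note the trailing newlines and the stray space after "Albania".
lines = ["Albania \n", "Armenia\n"]

# .eq() with a list is element-wise, so 4 rows against 2 lines cannot be compared.
try:
    df["Count"].eq(lines)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Strip whitespace and test membership instead.
wanted = [line.strip() for line in lines if line.strip()]
filtered = df[df["Count"].isin(wanted)]
print(filtered)
```

The filtered dataframe can then be passed to groupby() in the same way as the sample data below.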

import numpy as np
import pandas as pd

Here is just some sample data:

cities = pd.read_json("https://raw.githubusercontent.com/lutangar/cities.json/master/cities.json")
cities = cities.sample(10000)

Create and apply a custom aggregate for groupby():

from sklearn.metrics import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

country_groups = cities.groupby("country")

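def city_distances(group):
    # Haversine expects (latitude, longitude) pairs, converted to radians below.
    geo = group[['lat','lng']]

    EARTH_RADIUS = 6371  # mean Earth radius in km

    # Pairwise great-circle distances on the unit sphere, scaled to km.
    haversine_distances = dist.pairwise(np.radians(geo))
    haversine_distances *= EARTH_RADIUS

    distances = {}
    distances['max'] = np.max(haversine_distances)

    # Ignore zero entries (the diagonal, plus any cities at identical
    # coordinates); the minimum stays 0 when a country has only one city.
    nonzero_distances = haversine_distances[np.nonzero(haversine_distances)]
    distances['min'] = 0
    if len(nonzero_distances) > 0:
        distances['min'] = np.min(nonzero_distances)

    return pd.Series(distances)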

country_groups.apply(city_distances)

In my case this prints something like

                 max         min
country                         
AE        323.288482  323.288482
AF       1130.966661   15.435642
AI         12.056890   12.056890
AL        272.300688    3.437074
AM        268.051071    1.328605
...              ...         ...
YE        662.412344   19.103222
YT          3.723376    3.723376
ZA       1466.334609   24.319334
ZM       1227.429001  218.566369
ZW        503.562608   26.316902

[194 rows x 2 columns]
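
To connect this back to the question's own columns (Count, Lat., Long.) and to the last step of saving the result, here is a minimal sketch under a few assumptions. It swaps sklearn's DistanceMetric for a hand-written haversine formula in NumPy so that it runs with only numpy and pandas, and it takes the minimum over all off-diagonal pairs rather than over the nonzero entries, so two localities at identical coordinates would give a minimum of 0. The function names and the output column names are illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # same radius as the code above

# A few rows copied from the question's table.
df = pd.DataFrame({
    "Locality": ["Krasnodar", "Tirana", "Areni", "Kars",
                 "Brunn Wolfholz", "Kleinhadersdorf Flur Marchleiten"],
    "Count": ["Russia", "Albania", "Armenia", "Armenia", "Austria", "Austria"],
    "Lat.": [44, 41.33, 39.73, 40.604517, 48.120396, 48.663197],
    "Long.": [39, 19.83, 45.2, 43.100758, 16.291722, 16.589687],
})

def pairwise_haversine_km(lat_deg, lon_deg):
    """Pairwise great-circle distances in km (haversine formula)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

def min_max_distance(group):
    """Max and min distance between distinct rows of one country."""
    d = pairwise_haversine_km(group["Lat."], group["Long."])
    off_diagonal = d[~np.eye(len(d), dtype=bool)]
    if off_diagonal.size == 0:  # only one locality in this country
        return pd.Series({"Max.Dist.": 0.0, "Min.Dist.": 0.0})
    return pd.Series({"Max.Dist.": off_diagonal.max(),
                      "Min.Dist.": off_diagonal.min()})

result = df.groupby("Count")[["Lat.", "Long."]].apply(min_max_distance)
result.index.name = "Country"

# to_csv() without a path returns the text; pass a filename to write it to disk.
report = result.round(2).to_csv(sep="\t")
print(report)
```

Writing to a separate output file rather than back into cities.txt keeps the list of countries intact for later runs.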