New column in dataframe using multiple rows of existing columns

Time:09-27

My goal is to write an organized file from a tuple of geographic data, including new columns that hold transformations of that data. The input tuple contains the following arrays: lats (latitudes), lons (longitudes), els (elevations), and ts (times in seconds). Each position across the arrays corresponds to one sample. I converted this tuple into a dataframe, and now I want to add columns to it using functions I have already written.

The problem is that these functions take arguments from more than one row of the dataframe. Example:

def stepsize(lat1, long1, lat2, long2):

I want to add a column of "stepsizes" to my dataframe but that involves referencing both the current row and the previous row of data. How should I do this?

I have tried using df.apply(), passing a lambda function. In order to make that work, I defined local variables to hold previous latitude, longitude, and time:

def newdist_row(row):
    if row["time"] < 72889:
        rowprevlat = row['latitude']
        rowprevlong = row['longitude']
        rowprevtime = row['time']
        return np.nan
    else:
        dist = stepsize_feet(row['latitude'], row['longitude'], rowprevlat, rowprevlong)
        rowprevlat = row['latitude']
        rowprevlong = row['longitude']
        rowprevtime = row['time']
        return dist

The apply method I wrote:

contents.apply(lambda row: newdist_row(row), axis=1)

This fails with a "referenced before assignment" error; I tinkered with it and couldn't get it to work. I also tried loops, going as far as simply trying to add a new column holding the previous row's data, but got an "A value is trying to be set on a copy of a slice from a DataFrame" warning.

CodePudding user response:

You can create prev_latitude and prev_longitude columns with shift(), which moves each value down one row.

Given df:

  latitude longitude
0   607385    435618
1   430603    435618
2   430603    435618
3   430603    435618
4   519445    435618

Doing:

df[['prev_latitude', 'prev_longitude']] = df[['latitude', 'longitude']].shift()

Output:

   latitude  longitude  prev_latitude  prev_longitude
0    607385     435618           <NA>            <NA>
1    430603     435618         607385          435618
2    430603     435618         430603          435618
3    430603     435618         430603          435618
4    519445     435618         430603          435618

Now, you should be able to use your existing function:

df['stepsize'] = df.apply(lambda x: stepsize(x.latitude,
                                             x.longitude,
                                             x.prev_latitude,
                                             x.prev_longitude), axis=1)

(after you decide how to handle the first row, which has no previous values...)
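
Putting the two steps together, here is a minimal end-to-end sketch. The stepsize body is a hypothetical stand-in (a haversine distance in feet, using an assumed mean Earth radius), since the original function isn't shown, and the sample coordinates are made up. It returns NaN for the first row, which is one possible way to handle the missing previous values.

```python
import math

import numpy as np
import pandas as pd

# Assumption: mean Earth radius (~6371 km) expressed in feet.
EARTH_RADIUS_FEET = 20_902_231


def stepsize(lat1, long1, lat2, long2):
    """Hypothetical stand-in: great-circle (haversine) distance in feet."""
    # No previous row to compare against: propagate the missing value.
    if pd.isna(lat2) or pd.isna(long2):
        return np.nan
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(long2 - long1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_FEET * math.asin(math.sqrt(a))


# Made-up sample points in decimal degrees.
df = pd.DataFrame({
    "latitude": [40.0000, 40.0010, 40.0010, 40.0025],
    "longitude": [-105.0000, -105.0000, -105.0010, -105.0010],
})

# Each row now also carries the coordinates of the row before it.
df[["prev_latitude", "prev_longitude"]] = df[["latitude", "longitude"]].shift()

# Assigning the result of apply back to the frame creates the new column.
df["stepsize"] = df.apply(
    lambda x: stepsize(x.latitude, x.longitude, x.prev_latitude, x.prev_longitude),
    axis=1,
)
```

Because the new column is assigned directly on df rather than on a filtered slice of it, this pattern should also avoid the "set on a copy of a slice" warning.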


If you're not already using it, you should look into GeoPandas; you may be reinventing the wheel. I don't know what stepsize means for you, but if it's anything like the distance between two points:

import geopandas as gp

geometry = gp.GeoSeries.from_xy(df.longitude, df.latitude)
df = gp.GeoDataFrame(df, geometry=geometry)
df['distance'] = df.geometry.distance(df.geometry.shift())

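Then this computes the distance from each point to the previous one for you. Note that the result is in the units of the coordinates, so with raw latitude/longitude it is in degrees; reproject to a projected coordinate system first if you need feet or meters.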
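
If adding GeoPandas is not an option, the same shift idea also works without apply. The following is a sketch only: it assumes stepsize means a great-circle distance in feet, uses the haversine formula with an assumed mean Earth radius, and runs on made-up sample coordinates. The column name stepsize_feet is illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: mean Earth radius (~6371 km) expressed in feet.
EARTH_RADIUS_FEET = 20_902_231

# Made-up sample points in decimal degrees.
df = pd.DataFrame({
    "latitude": [40.0000, 40.0010, 40.0010, 40.0025],
    "longitude": [-105.0000, -105.0000, -105.0010, -105.0010],
})

# Work in radians; shift() lines each row up with the one before it.
lat = np.radians(df["latitude"])
lon = np.radians(df["longitude"])
prev_lat = lat.shift()
prev_lon = lon.shift()

# Haversine formula applied to whole columns at once.
a = (np.sin((lat - prev_lat) / 2) ** 2
     + np.cos(lat) * np.cos(prev_lat) * np.sin((lon - prev_lon) / 2) ** 2)

# The first row has no predecessor, so its value stays NaN.
df["stepsize_feet"] = 2 * EARTH_RADIUS_FEET * np.arcsin(np.sqrt(a))
```

Operating on whole columns avoids calling a Python function once per row, which tends to matter on long tracks.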
