Home > Software engineering >  Fill pandas column based on range and category of another
Fill pandas column based on range and category of another

Time:08-12

I have two dataframes.

DF1

DF1

DF2

DF2

I want to add a column to DF1, 'Speed', that references the track category, and the LocationFrom and LocationTo range, to result in the below.

enter image description here

I have looked at merge_asof, and IntervalIndex, but unable to figure out how to reference the category before the range.

Thanks.

CodePudding user response:

Check Below code: SQLITE

import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')

DF1.to_sql('DF1', con = conn, index = False)
DF2.to_sql('DF2', con = conn, index = False)

pd.read_sql("""Select DF1.*, DF2.Speed 
              From DF1 
              join DF2 on DF1.Track = Df2.Track 
              AND DF1.Location BETWEEN DF2.LocationFrom and DF2.LocationTo""", con=conn)

Output:

enter image description here

CodePudding user response:

It works, but I would like to see someone do this without a for loop and without creating mini dataframes.

import pandas as pd

data1 = {'Track': list('AAAB'), 'Location': [1, 2, 6, 24]}
df1 = pd.DataFrame(data1)

data2 = {'Track': list('AABB'), 'LocationFrom': [0, 5, 0, 20], 'LocationTo': [5, 10, 20, 50], 'Speed': [45, 50, 80, 100]}
df2 = pd.DataFrame(data2)

speeds = []
for k in range(len(df1)):
    track = df1['Track'].iloc[k]
    location = df1['Location'].iloc[k]
    df1_track = df1.loc[df1['Track'] == track]
    df2_track = df2.loc[df2['Track'] == track]
    speeds.append(df2_track['Speed'].loc[(df2_track['LocationFrom'] <= location) & (location < df2_track['LocationTo'])].iloc[0])
df1['Speed'] = speeds
print(df1)

Output:

  Track  Location  Speed
0     A         1     45
1     A         2     45
2     A         6     50
3     B        24    100

CodePudding user response:

This approach is probably not viable if your tables are large. It creates an intermediate table which has a merge of all pairs of matching Tracks between df1 and df2. Then it removes rows where the location is not between the boundaries. Thanks @Aeronatix for the dfs.

The all_merge intermediate table gets really big really fast. If a1 rows of df1 are Track A, a2 in df2 etc.. then the total rows in all_merge will be a1*a2 b1*b2 c1*c2... z1*z2 which might or might not be gigantic depending on your dataset

all_merge = df1.merge(df2)
results = all_merge[all_merge.Location.between(all_merge.LocationFrom,all_merge.LocationTo)]
print(results)
  • Related