better way to iterate through rows in a data frame and conditionally assign a group

Time:09-20

I have written a function that assigns each row to a latitude category and a longitude category. However, it is far too slow. How can I improve its performance? Here is my code.

import pandas as pd


def assign_segment(use_df: pd.DataFrame,
                   lat_categories: pd.core.indexes.interval.IntervalIndex, 
                   lng_categories: pd.core.indexes.interval.IntervalIndex) -> pd.DataFrame:
    """
    Assign segments based on the latitude and longitude columns of "use_df".

    Parameters
    ----------
    use_df : pd.DataFrame
        Use DataFrame.
    lat_categories : pd.core.indexes.interval.IntervalIndex
        Latitude interval categories.
        (ex.) IntervalIndex([(35.809, 35.816], (35.816, 35.824], 
                             (35.824, 35.832], (35.832, 35.84], (35.84, 35.848]])
    lng_categories : pd.core.indexes.interval.IntervalIndex
        Longitude interval categories.
        (ex.) IntervalIndex([(128.668, 128.685], (128.685, 128.703], 
                             (128.703, 128.72], (128.72, 128.737]])

    Returns
    -------
    use_df : pd.DataFrame
        "use_df" with segments assigned.
    """
    segment = []

    # iterate each row and get the segment according to latitude and longitude
    for idx, row in use_df.iterrows():
        use_lat = row['use_lat']
        use_lng = row['use_lng']

        for lat_idx, lat_category in enumerate(lat_categories):
            if use_lat in lat_category:
                lat_segment = lat_idx + 1   # 1-based latitude segment
                break
        for lng_idx, lng_category in enumerate(lng_categories):
            if use_lng in lng_category:
                lng_segment = lng_idx + 1   # 1-based longitude segment
                break

        num_lng_grid = len(lat_categories)      # grid count used for the digit width (taken from lat_categories)
        lng_num_digits = len(str(num_lng_grid)) # number of digits of num_lng_grid
        segment.append((lat_segment * 10**lng_num_digits) + lng_segment)
        
    # create the segment column with the segment list that we created in this function
    use_df['segment'] = segment

    return use_df

CodePudding user response:

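iterrows() is very slow because it builds a new Series for every row, and some developers argue that it should never be used. Vectorizing the code is the best way to speed it up. As a smaller change, you can iterate over use_df.to_dict('records') instead of iterrows(). tqdm only wraps the loop to show a progress bar; it does not make the loop faster.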

import pandas as pd
from tqdm import tqdm


def assign_segment(use_df: pd.DataFrame, 
                   lat_categories: pd.core.indexes.interval.IntervalIndex, 
                   lng_categories: pd.core.indexes.interval.IntervalIndex) -> pd.DataFrame:
    """
    Assign segments based on the latitude and longitude columns of "use_df".

    Parameters
    ----------
    use_df : pd.DataFrame
        Use DataFrame.
    lat_categories : pd.core.indexes.interval.IntervalIndex
        Latitude interval categories.
        (ex.) IntervalIndex([(35.809, 35.816], (35.816, 35.824], 
                             (35.824, 35.832], (35.832, 35.84], (35.84, 35.848]])
    lng_categories : pd.core.indexes.interval.IntervalIndex
        Longitude interval categories.
        (ex.) IntervalIndex([(128.668, 128.685], (128.685, 128.703], 
                             (128.703, 128.72], (128.72, 128.737]])

    Returns
    -------
    use_df : pd.DataFrame
        "use_df" with segments assigned.
    """
    segment = []

    # iterate each row and get the segment according to latitude and longitude
    for row in tqdm(use_df.to_dict('records')):
        use_lat = row['use_lat']
        use_lng = row['use_lng']

        for lat_idx, lat_category in enumerate(lat_categories):
            if use_lat in lat_category:
                lat_segment = lat_idx + 1   # 1-based latitude segment
                break
        for lng_idx, lng_category in enumerate(lng_categories):
            if use_lng in lng_category:
                lng_segment = lng_idx + 1   # 1-based longitude segment
                break

        num_lng_grid = len(lat_categories)      # grid count used for the digit width (taken from lat_categories)
        lng_num_digits = len(str(num_lng_grid)) # number of digits of num_lng_grid
        segment.append((lat_segment * 10**lng_num_digits) + lng_segment)
        
    # create the segment column with the segment list that we created in this function
    use_df['segment'] = segment

    return use_df

It is worth reading these articles:

How To Make Your Pandas Loop 71803 Times Faster

Stop Using iterrows()
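
To get a rough feel for the difference between the two loops, here is a minimal sketch. The random demo frame, its size, and the helper names with_iterrows and with_records are assumptions made for illustration; the timings depend on the machine and the pandas version, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical demo data: random coordinates inside the example grid.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "use_lat": rng.uniform(35.809, 35.848, 2_000),
    "use_lng": rng.uniform(128.668, 128.737, 2_000),
})


def with_iterrows(df):
    # iterrows() yields (index, Series) pairs, building a Series per row.
    return [(row["use_lat"], row["use_lng"]) for _, row in df.iterrows()]


def with_records(df):
    # to_dict("records") converts the frame once into a list of plain dicts.
    return [(row["use_lat"], row["use_lng"]) for row in df.to_dict("records")]


# Both loops should see exactly the same values.
same = with_iterrows(demo) == with_records(demo)

t_iterrows = timeit.timeit(lambda: with_iterrows(demo), number=3)
t_records = timeit.timeit(lambda: with_records(demo), number=3)
print(f"same values: {same}")
print(f"iterrows: {t_iterrows:.3f}s, to_dict('records'): {t_records:.3f}s")
```

Either way the work is still done row by row in Python, so this only trims the per-row overhead; it is not vectorization.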

CodePudding user response:

IIUC, this is all you need to do here...

lat_segments = lat_categories.get_indexer(df['use_lat'])
lng_segments = lng_categories.get_indexer(df['use_lng'])

num_lng_grid = len(lat_categories)
lng_num_digits = len(str(num_lng_grid))

df['segment'] = (lat_segments * 10**lng_num_digits) + lng_segments
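
Two details are worth keeping in mind when swapping this in for the loop: get_indexer returns 0-based positions, whereas the question's loop numbers segments from 1, and it returns -1 for a value that falls in no interval. The sketch below wraps the same idea in a function and handles both. The function name assign_segment_vectorized, the -1 marker for rows outside the grid, and the demo values are assumptions for illustration. It also takes the digit width from len(lng_categories) rather than len(lat_categories), which is an assumption about the intent; the two give identical results whenever both counts have the same number of digits, as in the docstring example.

```python
import numpy as np
import pandas as pd


def assign_segment_vectorized(use_df, lat_categories, lng_categories):
    """Return a copy of use_df with a 1-based 'segment' column (hypothetical helper)."""
    # 0-based position of the interval containing each value, or -1 if none does.
    lat_pos = lat_categories.get_indexer(use_df["use_lat"].to_numpy())
    lng_pos = lng_categories.get_indexer(use_df["use_lng"].to_numpy())

    # Assumption: size the digit width from the longitude grid so that the
    # longitude part always fits below the latitude part.
    lng_num_digits = len(str(len(lng_categories)))

    # Shift to 1-based numbering to match the loop in the question.
    segment = (lat_pos + 1) * 10 ** lng_num_digits + (lng_pos + 1)

    # Assumption: rows outside the grid are marked with -1.
    outside = (lat_pos == -1) | (lng_pos == -1)

    out = use_df.copy()
    out["segment"] = np.where(outside, -1, segment)
    return out


# Interval categories rebuilt from the docstring example (right-closed by default).
lat_categories = pd.IntervalIndex.from_breaks(
    [35.809, 35.816, 35.824, 35.832, 35.84, 35.848])
lng_categories = pd.IntervalIndex.from_breaks(
    [128.668, 128.685, 128.703, 128.72, 128.737])

demo = pd.DataFrame({
    "use_lat": [35.810, 35.820, 35.845, 36.000],
    "use_lng": [128.670, 128.700, 128.730, 128.700],
})
result = assign_segment_vectorized(demo, lat_categories, lng_categories)
print(result)
```

Returning a copy leaves the caller's frame untouched; assigning to use_df['segment'] in place, as the question does, works as well if that is preferred.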