Home > Mobile >  Create column in Pandas based on dataframe query condition
Create column in Pandas based on dataframe query condition

Time:12-02

I've got a pandas dataframe like this:

    Title           Pgid    Views30 Title_to
===============================================================================
30  Хо_Ен_Чон       9048639 284950  Чон_Хо_Ён
98  Mail.ru_Group   9018641 153082  VK_(компания)
105 Паша_Техник     9070663 143053  Техник,_Паша
303 Freeware        6166716 79998   Бесплатное_программное_обеспечение
399 СССР            1007    69349   Союз_Советских_Социалистических_Республик

The data contains over 1.2 million entries from Wikipedia page data:

  • Title = page title
  • Pgid = page ID
  • Views30 = monthly page views
  • Title_to = title of the page that this page redirects to (or NaN if no redirect)

Now I want to make a new column Pgid_to with the page IDs of the redirect target pages for all pages with Title_to != NaN. That is, collect Pgid from Title = Title_to for all entries.

My current solution is straightforward:

def cond(title_to):
    try:
        # get Pgid of page whose title == title_to
        return df.loc[df['Title'] == title_to, 'Pgid']
    except:
        # return NaN on failure to locate
        return np.NaN
# make new column by applying search element-wise    
df['Pgid_to'] = df['Title_to'].apply(cond)

However, this algorithm is likely to take polynomial time (N^2), which for 1.2 MM entries means 1.4 trillion operations! Is is possible to optimize? Possibly, is there a vectorized solution?

CodePudding user response:

np.where() is vectorized and hopefully will save the day. Kindly try:

df['Pgid_to'] = np.where(df[Title'] == df['Title_co'], 'Pgid',np.nan)

If you want to compare it against nan:

df['Pgid_to'] = np.where(df[Title'].isna(),np.nan, 'Pgid')

CodePudding user response:

OK, found a solution by simply merging the dataframe with itself!

df_merged = df.merge(df, 'left', left_on='Title', 
                     right_on='Title_to', suffixes=(None, '_from'))

This produces a new dataframe with the 'Pgid_from' column (among others), which can then be used to group the data.

  • Related