I've got a pandas dataframe like this:
     Title          Pgid     Views30  Title_to
===============================================================================
30   Хо_Ен_Чон      9048639   284950  Чон_Хо_Ён
98   Mail.ru_Group  9018641   153082  VK_(компания)
105  Паша_Техник    9070663   143053  Техник,_Паша
303  Freeware       6166716    79998  Бесплатное_программное_обеспечение
399  СССР              1007    69349  Союз_Советских_Социалистических_Республик
The data contains over 1.2 million entries of Wikipedia page data:

Title    = page title
Pgid     = page ID
Views30  = monthly page views
Title_to = title of the page that this page redirects to (or NaN if no redirect)
Now I want to make a new column Pgid_to holding the page IDs of the redirect target pages for all pages where Title_to is not NaN. That is, for each entry, collect the Pgid of the row whose Title equals this entry's Title_to.
My current solution is straightforward:
def cond(title_to):
    try:
        # get the Pgid of the page whose Title == title_to
        # (.iloc[0] extracts the scalar; without it a Series is returned)
        return df.loc[df['Title'] == title_to, 'Pgid'].iloc[0]
    except IndexError:
        # no page with that title: return NaN
        return np.nan

# make the new column by applying the search element-wise
df['Pgid_to'] = df['Title_to'].apply(cond)
However, this algorithm takes quadratic time (N^2), which for 1.2 MM entries means about 1.4 trillion operations! Is it possible to optimize this? Is there perhaps a vectorized solution?
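One idea I had is to build the Title-to-Pgid lookup once and apply it with Series.map, which is a single hash lookup per row instead of a full scan (a sketch on invented data, assuming titles are unique):

```python
import numpy as np
import pandas as pd

# Invented sample with the same columns as above
df = pd.DataFrame({
    'Title':    ['A', 'B', 'C'],
    'Pgid':     [1, 2, 3],
    'Title_to': ['C', np.nan, np.nan],  # 'A' redirects to 'C'
})

# Build a Title -> Pgid lookup once, then map it over Title_to.
# This is O(N) with hashing rather than O(N^2) scanning.
lookup = df.set_index('Title')['Pgid']
df['Pgid_to'] = df['Title_to'].map(lookup)
print(df['Pgid_to'].tolist())  # [3.0, nan, nan]
```

Note the result column is float because NaN forces an upcast of the integer IDs.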
CodePudding user response:
np.where()
is vectorized and hopefully will save the day. Kindly try:
df['Pgid_to'] = np.where(df['Title'] == df['Title_to'], df['Pgid'], np.nan)
If you want to compare against NaN instead:
df['Pgid_to'] = np.where(df['Title_to'].isna(), np.nan, df['Pgid'])
CodePudding user response:
OK, found a solution by simply merging the dataframe with itself!
df_merged = df.merge(df, 'left', left_on='Title',
right_on='Title_to', suffixes=(None, '_from'))
This produces a new dataframe with the 'Pgid_from' column (among others), which can then be used to group the data.
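To illustrate the self-merge end to end (column values invented here), each page's Title is matched against every page's Title_to, so the redirecting page's columns come in with the '_from' suffix:

```python
import numpy as np
import pandas as pd

# Invented sample with the same columns as the question
df = pd.DataFrame({
    'Title':    ['A', 'B', 'C'],
    'Pgid':     [1, 2, 3],
    'Title_to': ['C', np.nan, np.nan],  # 'A' redirects to 'C'
})

# Self-merge: for each page, attach the rows of pages that redirect to it
df_merged = df.merge(df, 'left', left_on='Title',
                     right_on='Title_to', suffixes=(None, '_from'))

# Page 'C' is the redirect target of page 'A' (Pgid 1)
print(df_merged.loc[df_merged['Title'] == 'C', 'Pgid_from'].tolist())  # [1.0]
```

This runs as a single hash join rather than N separate scans, which is why it scales to millions of rows.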