Create a new column with loc(), iloc() and apply after using groupby

Given this code:

from bs4 import BeautifulSoup
from lxml import etree
import requests
import pandas as pd
  
URL = "https://boards.4chan.org/x/archive"
  
HEADERS = ({'User-Agent':
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 \
            (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',\
            'Accept-Language': 'en-US, en;q=0.5'})
  
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))

threads = dom.xpath('//tbody/tr')[0:2]
print(len(threads))
threads_count = 0
rows = []

for i in threads:
  thread_ids = i.xpath('.//td[1]')

  for j in thread_ids: 
    thread_id = j.text

    threads_count += 1
    print(f"Currently checking ID = {threads_count}/{(len(threads))}", end="")
    url2 = (f'https://boards.4chan.org/x/thread/{thread_id}')
    webpage = requests.get(url2)
    soup = BeautifulSoup(webpage.content, "html.parser")
    dom = etree.HTML(str(soup))

    threads_containers = dom.xpath('//div[contains(@class,"Container")]')

    for x in threads_containers:
      post_id = x.xpath('.//span[@]/a[@title="Reply to this post"]')[0].text
      content = x.xpath('.//blockquote[@]/descendant::text()')
      
      new_content = []
      for el in content:
        if thread_id in el:
          el = el + " (OP)"
          new_content.append(el + "\n")
        else:
          new_content.append(el + "\n")

      rows.append([thread_id, post_id, ''.join(new_content)])

    print("\r", end="")

df = pd.DataFrame(rows, columns=['Threads IDs', 'Posts IDs', 'Content'])
df

I get the following DataFrame (yours may differ, because the code scrapes the live archive):

Then I use this code:

df1 = df[['Threads IDs', 'Posts IDs']].groupby('Threads IDs').count().rename(columns={'Posts IDs': 'Number of Posts'})
df1

to get the following result:

Now I would like to create a third column named "What" by applying the code below, but for the entire dataframe:

df.loc[df['Threads IDs'] == '31904499', 'Content'].iloc[0]

I tried to play with "apply" and the code above, without success.

To summarize: after using "groupby" to get the new DF with the "Number of Posts" per "Threads IDs", I would like to create a third column, named "What", which contains for each row the first value of "Content" ([0]) for the respective thread ID.

CodePudding user response：

Try this

# df1.index is unique Thread IDs, so map it
df['What'] = df['Threads IDs'].map(df1['Number of Posts'])
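
If what you want in "What" is the first "Content" of each thread rather than the count, the same map idea can probably be pointed at df1 instead. A minimal sketch on made-up data (the toy values and the name first_content are assumptions for illustration, not from the question):

```python
import pandas as pd

# Toy stand-in for the scraped frame; the values are invented
df = pd.DataFrame({
    'Threads IDs': ['100', '100', '100', '200', '200'],
    'Posts IDs': ['100', '101', '102', '200', '201'],
    'Content': ['first of 100', 'reply a', 'reply b', 'first of 200', 'reply c'],
})

# Same grouping as in the question
df1 = (df[['Threads IDs', 'Posts IDs']]
       .groupby('Threads IDs').count()
       .rename(columns={'Posts IDs': 'Number of Posts'}))

# Keep only the first row of each thread, indexed by thread ID
first_content = df.drop_duplicates('Threads IDs').set_index('Threads IDs')['Content']

# df1.index holds the unique thread IDs, so map it onto that Series
df1['What'] = df1.index.map(first_content)
print(df1)
```

This relies on the rows being in scrape order, so that the first row of each thread is the opening post.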

CodePudding user response：

This will group the dataframe by 'Threads IDs' and show the first row for each group

df.groupby('Threads IDs').first()
# Out: 
#              Posts IDs                                   Content
# Threads IDs                                                             
# 31886119     31886119  Are Greys simply humans from the future? \nIf ...
# 31901943     31901943  In the video game Event 0, one of the main thi...

To get a dataframe with counts and the content of the first post for each thread:

df
#     Threads IDs Posts IDs                                            Content
# 0      31886119  31886119  Are Greys simply humans from the future? \nIf ...
# 1      31886119  31886125  >>31886119 (OP)\nYou don't even know if 'Greys...
# 2      31886119  31886142  Probes for higher concious beings. They attach...
# ..          ...       ...                                                ...
# 173    31901943  31904460  >>31902625\n>The moment the replikas started t...
# 174    31901943  31904484  >>31902874\nDo it, Oblivion is always the game...

df.groupby('Threads IDs').agg(['count', 'first'])['Content'] \
  .rename(columns={'count':'Number Of Posts'}) \
  .reset_index()
# Out: 
#   Threads IDs  Number Of Posts                                              first
# 0    31886119              147  Are Greys simply humans from the future? \nIf ...
# 1    31901943               31  In the video game Event 0, one of the main thi...
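
If you want the column to be called "What" straight away, named aggregation should also work. A small sketch on made-up data (the toy values and the name summary are assumptions; the dict unpacking is only there because the output names contain spaces):

```python
import pandas as pd

# Toy stand-in for the scraped frame; the values are invented
df = pd.DataFrame({
    'Threads IDs': ['100', '100', '100', '200', '200'],
    'Posts IDs': ['100', '101', '102', '200', '201'],
    'Content': ['first of 100', 'reply a', 'reply b', 'first of 200', 'reply c'],
})

# One pass over the groups: count the posts and take the first Content
summary = (
    df.groupby('Threads IDs')
      .agg(**{
          'Number of Posts': ('Posts IDs', 'count'),
          'What': ('Content', 'first'),
      })
      .reset_index()
)
print(summary)
```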

CodePudding user response：

After digging through Stack Overflow a little more, I finally found exactly what I wanted:

df1 = df.groupby('Threads IDs')
df2 = df1.agg({'Posts IDs':'count'}).join(df1['Content'].nth(3)).fillna('Not enough data.').rename(columns={"Posts IDs": "Number of Posts"})
df2

This code offers more flexibility.

Result:

             Number of Posts                                                 Content
Threads IDs                                                              
31909432                   8       >>31909432 (OP)\nTypical toxic white chud. \nT...
31910606                   2                                        Not enough data.

This code lets me choose whichever nth() value I want from 'Content' and set the text to show when no data is available.
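
One caveat: as far as I know, from pandas 2.0 on nth() keeps the original row index instead of the thread ID, so the join above may no longer line up. Below is a sketch of the same idea that does not depend on nth(), using cumcount() to number the posts inside each thread. The helper name count_with_nth_content and the toy data are assumptions for illustration:

```python
import pandas as pd

def count_with_nth_content(df, n, fill='Not enough data.'):
    """Hypothetical helper: posts per thread plus the n-th (0-based) Content."""
    counts = (df.groupby('Threads IDs')['Posts IDs']
                .count()
                .rename('Number of Posts'))
    # cumcount() numbers the rows 0, 1, 2, ... inside each thread
    position = df.groupby('Threads IDs').cumcount()
    nth_content = df.loc[position == n].set_index('Threads IDs')['Content']
    # Threads with fewer than n + 1 posts are missing here; reindex adds
    # them back with the fill text
    nth_content = nth_content.reindex(counts.index, fill_value=fill)
    return counts.to_frame().join(nth_content)

# Toy stand-in for the scraped frame; the values are invented
df = pd.DataFrame({
    'Threads IDs': ['100', '100', '100', '200', '200'],
    'Posts IDs': ['100', '101', '102', '200', '201'],
    'Content': ['first of 100', 'reply a', 'reply b', 'first of 200', 'reply c'],
})

df2 = count_with_nth_content(df, 2)
print(df2)
```

With n=2, thread '100' gets its third post and thread '200', which only has two posts, gets the fill text.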