How to scrape all App Store apps on a Google Play Search

Time:04-08

I am trying to use find_all() but seem to be having issues finding the tags for the specific information.

I would love to build a wrapper so I can extract data from the app store, such as title, publisher, etc. (public HTML info).

The code isn't correct, I am aware. The closest thing I could find to a div identifier is "c4".

Any insight helps.

# Imports
import requests
from bs4 import BeautifulSoup

# Data Defining
url = "https://play.google.com/store/search?q=weather app"

# Getting HTML

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
soup.get_text()

results = soup.find_all(id="c4")

I am expecting an output of different weather apps and information:

Weather App 1
Develop Company 1

Google Weather App
Develop Company 2

Bing Weather App
Bing Developers

CodePudding user response:

I'm getting the following output from the URL:

from bs4 import BeautifulSoup
import requests

url='https://play.google.com/store/search?q=weather app'
req=requests.get(url)

soup = BeautifulSoup(req.content, 'html.parser')

cards= soup.find_all("div",class_="vU6FJ p63iDd")

for card in cards:
    app_name= card.find("div",class_="WsMG1c nnK0zc").text
    company = card.find("div",class_="KoLSrc").text
    print("Name: " + app_name)
    print("Company: " + company)

Output:

Name: Weather app
Company: Accurate Weather Forecast & Weather Radar Map  
Name: AccuWeather: Weather Radar
Company: AccuWeather
Name: Weather Forecast - Accurate Local Weather & Widget
Company: Weather Forecast & Widget & Radar
Name: 1Weather Forecasts & Radar
Company: OneLouder Apps
Name: MyRadar Weather Radar
Company: ACME AtronOmatic LLC
Name: Weather data & microclimate : Weather Underground
Company: Weather Underground
Name: Weather & Widget - Weawow
Company: weawow weather app
Name: Weather forecast
Company: smart-pro android apps
Name: The Secret World of Weather: How to Read Signs in Every Cloud, Breeze, Hill, Street, Plant, Animal, and Dewdrop
Company: Tristan Gooley
Name: The Weather Machine: A Journey Inside the Forecast
Company: Andrew Blum
Name: The Mobile Mind Shift: Engineer Your Business to Win in the Mobile Moment
Company: Julie Ask
Name: Together: The Healing Power of Human Connection in a Sometimes Lonely World
Company: Vivek H. Murthy
Name: The Meadow
Company: James Galvin
Name: The Ancient Egyptian Culture Revealed, 2nd edition
Company: Moustafa Gadalla
Name: The Ancient Egyptian Culture Revealed, 2nd edition
Company: Moustafa Gadalla
Name: Chaos Theory
Company: Introbooks Team
Name: Survival Training: Killer Tips for Toughness and Secret Smart Survival Skills       
Company: Wesley Jones
Name: Kiasunomics 2: Economic Insights for Everyday Life
Company: Ang Swee Hoon
Name: Summary of We Are The Weather by Jonathan Safran Foer
Company: QuickRead
Name: Learn Swift by Building Applications: Explore Swift programming through iOS app development
Company: Emil Atanasov
Name: Weather Hazard Warning Application in Car-to-X Communication: Concepts, Implementations, and Evaluations
Company: Attila Jaeger
Name: Mobile App Development with Ionic, Revised Edition: Cross-Platform Apps with Ionic, Angular, and Cordova
Company: Chris Griffith
Name: Good Application Makes a Good Roof Better: A Simplified Guide: Installing Laminated Asphalt Shingles for Maximum Life & Weather Protection
Company: ARMA Asphalt Roofing Manufacturers Association
Name: The Secret World of Weather: How to Read Signs in Every Cloud, Breeze, Hill, Street, Plant, Animal, and Dewdrop
Company: Tristan Gooley
Name: The Weather Machine: A Journey Inside the Forecast
Company: Andrew Blum
Name: Space Physics and Aeronomy, Space Weather Effects and Applications
Company: Book 5
Name: How to Build Android Apps with Kotlin: A hands-on guide to developing, testing, and publishing your first apps with Android
Company: Alex Forrester
Name: Android 6 for Programmers: An App-Driven Approach, Edition 3
Company: Paul J. Deitel

CodePudding user response:

Note: Relying on dynamically generated identifiers such as class names is only partially reliable.

The strategy should therefore be based on more stable identifiers such as tags and their structure or, in some cases, ids:

for e in soup.select('a[href^="/store/apps/details?id"]:has(div[title])'):
    data.append({
        'title': e.select_one('div[title]').get('title'),
        'company':e.find_next('a').text,
        'url':'https://play.google.com' + e.get('href')
    })
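
To illustrate the idea of anchoring on stable attributes instead of class names, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup. The sample markup, the AppLinkParser class and the app_id field are assumptions made up for this illustration; the real page structure may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

# Hypothetical markup, loosely shaped like a search result card.
SAMPLE_HTML = """
<div class="xYz123">
  <a href="/store/apps/details?id=com.example.weather"><div title="Example Weather"></div></a>
  <a href="/store/apps/developer?id=Example+Co">Example Co</a>
  <a href="/store/books/details?id=abc"><div title="A Weather Book"></div></a>
</div>
"""

APP_PREFIX = "/store/apps/details?id"


class AppLinkParser(HTMLParser):
    """Collect app links by href prefix, ignoring class names entirely."""

    def __init__(self):
        super().__init__()
        self.apps = []
        self._current = None  # the app link we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and (attrs.get("href") or "").startswith(APP_PREFIX):
            href = attrs["href"]
            # The package name is the value of the "id" query parameter.
            app_id = parse_qs(urlparse(href).query)["id"][0]
            self._current = {
                "title": None,
                "app_id": app_id,
                "url": "https://play.google.com" + href,
            }
        elif tag == "div" and self._current is not None and attrs.get("title"):
            self._current["title"] = attrs["title"]

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Keep only links that contained a div with a title attribute.
            if self._current["title"] is not None:
                self.apps.append(self._current)
            self._current = None


parser = AppLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.apps)
```

The developer link and the book link are skipped because their href does not start with the app-details prefix, which is the same filtering the CSS selector above performs.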

Example

Also be aware that a real app search should use https://play.google.com/store/search?q=weather&c=apps, and that to get all of the apps you have to deal with dynamically rendered / loaded content and scrolling. That's why this example is based on selenium:

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://play.google.com/store/search?q=weather&c=apps'

# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 10)

# Scroll to the footer link until the scroll position stops changing
while True:
    last_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
    e = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'a[href="https://policies.google.com/privacy"]')))[-1]
    driver.execute_script("arguments[0].scrollIntoView();", e)
    time.sleep(0.5)

    if last_height == driver.execute_script("return window.pageYOffset + window.innerHeight"):
        break

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

data = []

for e in soup.select('a[href^="/store/apps/details?id"]:has(div[title])'):
    data.append({
        'title': e.select_one('div[title]').get('title'),
        'company': e.find_next('a').text,
        'url': 'https://play.google.com' + e.get('href')
    })

# to_csv() writes the file and returns None, so there is nothing to print
pd.DataFrame(data).to_csv('app.csv', index=False)

Output

title | company | url
Weather app | Accurate Weather Forecast & Weather Radar Map | https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel
The Weather Channel - Radar | The Weather Channel | https://play.google.com/store/apps/details?id=com.weather.Weather
AccuWeather: Weather Radar | AccuWeather | https://play.google.com/store/apps/details?id=com.accuweather.android
Weather by WeatherBug | WeatherBug | https://play.google.com/store/apps/details?id=com.aws.android
Weather Forecast - Accurate Local Weather & Widget | Weather Forecast & Widget & Radar | https://play.google.com/store/apps/details?id=com.accurate.weather.forecast.live
The Weather Channel | Weather Group, LLC | https://play.google.com/store/apps/details?id=com.weathergroup.twc
WeatherNation | WeatherNation TV, Inc. | https://play.google.com/store/apps/details?id=com.weathernationtv
1Weather Forecasts & Radar | OneLouder Apps | https://play.google.com/store/apps/details?id=com.handmark.expressweather
Weather data & microclimate : Weather Underground | Weather Underground | https://play.google.com/store/apps/details?id=com.wunderground.android.weather
Weather & Widget - Weawow | weawow weather app | https://play.google.com/store/apps/details?id=com.weawow
Weather forecast | smart-pro android apps | https://play.google.com/store/apps/details?id=com.graph.weather.forecast.channel

...
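
The scrolling logic in this answer boils down to "scroll, wait, stop when the position no longer changes". Below is a minimal sketch of that loop, separated from the browser so it can be tried without Selenium. The function scroll_until_stable, its parameters and the FakePage stand-in are hypothetical names for this illustration; with a real driver the two callables could wrap driver.execute_script(...) calls.

```python
import time


def scroll_until_stable(get_position, scroll_down, pause=0.5, max_rounds=100):
    """Scroll until the reported position stops changing.

    Returns the number of scrolls that actually moved the page.
    max_rounds is a safety cap so the loop cannot run forever.
    """
    moved = 0
    last = get_position()
    for _ in range(max_rounds):
        scroll_down()
        time.sleep(pause)  # give lazily loaded content time to appear
        current = get_position()
        if current == last:  # nothing new loaded, so we reached the end
            break
        last = current
        moved += 1
    return moved


class FakePage:
    """Stand-in for a browser: a page that grows a few times, then stops."""

    def __init__(self, heights):
        self.heights = heights
        self.index = 0

    def position(self):
        return self.heights[self.index]

    def scroll(self):
        self.index = min(self.index + 1, len(self.heights) - 1)


page = FakePage([1000, 2000, 3000, 4000])
rounds = scroll_until_stable(page.position, page.scroll, pause=0)
print(rounds)
```

The cap on rounds is a design choice added here, not part of the original loop; it guards against pages that keep loading content indefinitely.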

CodePudding user response:

Make sure you pass a user-agent in the request headers so the request looks like it comes from a "real" user. Without one, you can sometimes receive different HTML with different elements and selectors, or some sort of error.

Check what your user-agent is and keep it up to date, because websites might block a request if the user-agent is old, e.g. one from Chrome 70.
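
As a rough sketch of the point above, this is how a search request with an explicit user-agent could be prepared with only the standard library. The helper name build_search_request is hypothetical, and the user-agent string is just an example value; nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Example User-Agent string; replace it with the one your own browser sends.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
)


def build_search_request(query, category="apps"):
    """Return a Request for the Play Store search page (nothing is sent here)."""
    # urlencode escapes spaces and other special characters in the query.
    params = urlencode({"q": query, "c": category})
    url = "https://play.google.com/store/search?" + params
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_search_request("weather app")
print(request.full_url)
print(request.get_header("User-agent"))  # urllib stores header names capitalized
```

To actually fetch the page you could pass the object to urllib.request.urlopen(request, timeout=30), or pass the same headers dict to requests.get() as in the code further down.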

Also, have a look at the SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the desired element(s) in your browser.


Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, json, lxml, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "weather",  # search query
    "c": "apps"      # display list of apps
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://play.google.com/store/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

apps_data = []

for app in soup.select(".mpg5gc"):
    title = app.select_one(".nnK0zc").text
    company = app.select_one(".b8cIId.KoLSrc").text
    description = app.select_one(".b8cIId.f5NCO a").text
    app_link = f'https://play.google.com{app.select_one(".b8cIId.Q9MA7b a")["href"]}'
    developer_link = f'https://play.google.com{app.select_one(".b8cIId.KoLSrc a")["href"]}'
    app_id = app.select_one(".b8cIId a")["href"].split("id=")[1]
    developer_id = app.select_one(".b8cIId.KoLSrc a")["href"].split("id=")[1]
    
    try:
        # https://regex101.com/r/SZLPRp/1
        rating = re.search(r"\d{1}\.\d{1}", app.select_one(".pf5lIe div[role=img]")["aria-label"]).group(0)
    except (AttributeError, TypeError, KeyError):
        # no rating element, no aria-label attribute, or no number in the label
        rating = None
    
    thumbnail = app.select_one(".yNWQ8e img")["data-src"]
    
    apps_data.append({
        "title": title,
        "company": company,
        "description": description,
        "rating": float(rating) if rating else None,  # float when a rating was found, else None
        "app_link": app_link,
        "developer_link": developer_link,
        "app_id": app_id,
        "developer_id": developer_id,
        "thumbnail": thumbnail
    })

print(json.dumps(apps_data, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "title": "Weather app",
    "company": "Accurate Weather Forecast & Weather Radar Map",
    "description": "The weather channel, tiempo weather forecast, weather radar & weather map",
    "rating": 4.6,
    "app_link": "https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel",
    "developer_link": "https://play.google.com/store/apps/developer?id=Accurate Weather Forecast & Weather Radar Map",
    "app_id": "com.weather.forecast.weatherchannel",
    "developer_id": "Accurate Weather Forecast & Weather Radar Map",
    "thumbnail": "https://play-lh.googleusercontent.com/GdXjVGXQ90eVNpb1VoXWGT3pff2M9oe3yDdYGIsde7W9h3s2S6FDLfo1uO-gljBZ1QXO=s128-rw"
  },
  {
    "title": "The Weather Channel - Radar",
    "company": "The Weather Channel",
    "description": "Weather Forecast & Snow Radar: local rain tracker, weather maps & alerts",
    "rating": 4.6,
    "app_link": "https://play.google.com/store/apps/details?id=com.weather.Weather",
    "developer_link": "https://play.google.com/store/apps/dev?id=5938833519207566184",
    "app_id": "com.weather.Weather",
    "developer_id": "5938833519207566184",
    "thumbnail": "https://play-lh.googleusercontent.com/RV3DftXlA7WUV7w-BpE8zM0X7Y4RQd2vBvZVv6A01DEGb_eXFRjLmUhSqdbqrEl9klI=s128-rw"
  },
  {
    "title": "Weather - By Xiaomi",
    "company": "Xiaomi Inc.",
    "description": "Always with you, rain or shine. Get temperature, forecast, AQI for you city.",
    "rating": 4.4,
    "app_link": "https://play.google.com/store/apps/details?id=com.miui.weather2",
    "developer_link": "https://play.google.com/store/apps/dev?id=5113340212256272297",
    "app_id": "com.miui.weather2",
    "developer_id": "5113340212256272297",
    "thumbnail": "https://play-lh.googleusercontent.com/sAZ2AZ16r5ThHiYCTWg8x1UUNQOhsxexRaDrDZKDlUy1hoZlggen6QogpJmQk8BwmgI=s128-rw"
  }, ... other results
]
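
The rating step in the code above can be tried in isolation. This is a small sketch, assuming the aria-label text looks something like "Rated 4.6 stars out of five stars" (the exact wording on the live page is an assumption); parse_rating is a hypothetical helper name.

```python
import re

# One digit, a dot, one digit: the same pattern as in the scraper above.
RATING_PATTERN = re.compile(r"\d\.\d")


def parse_rating(label):
    """Return the first 'digit.digit' number in label as a float, else None."""
    if not label:
        return None
    match = RATING_PATTERN.search(label)
    return float(match.group(0)) if match else None


print(parse_rating("Rated 4.6 stars out of five stars"))
print(parse_rating("No rating yet"))
print(parse_rating(None))
```

Checking for a missing label and a missing match explicitly avoids the need for a try/except around the lookup.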

An alternative solution could be to use Google Play Store API from SerpApi. It's a paid API with a free plan.

The difference is that there's no need to create a parser from scratch, maintain it, figure out how to extract the data, or bypass blocks from Google or other search engines.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
    "api_key": "API KEY",      # your serpapi api key
    "engine": "google_play",   # search engine
    "hl": "en",                # language
    "store": "apps",           # apps search
    "gl": "us",                # country to search from; different countries show different results
    "q": "weather"             # search query
}

search = GoogleSearch(params)  # data extraction happens here
results = search.get_dict()    # JSON -> Python dictionary

apps_data = []

for apps in results["organic_results"]:
    for app in apps["items"]:
        apps_data.append({
            "title": app.get("title"),
            "link": app.get("link"),
            "description": app.get("description"),
            "product_id": app.get("product_id"),
            "rating": app.get("rating"),
            "thumbnail": app.get("thumbnail"),
            })

print(json.dumps(apps_data, indent=2, ensure_ascii=False))

Part of the output (contains other data you can see in the Playground):

[
  {
    "title": "Weather app",
    "link": "https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel",
    "description": "The weather channel, tiempo weather forecast, weather radar & weather map",
    "product_id": "com.weather.forecast.weatherchannel",
    "rating": 4.7,
    "thumbnail": "https://play-lh.googleusercontent.com/GdXjVGXQ90eVNpb1VoXWGT3pff2M9oe3yDdYGIsde7W9h3s2S6FDLfo1uO-gljBZ1QXO=s128-rw"
  },
  {
    "title": "The Weather Channel - Radar",
    "link": "https://play.google.com/store/apps/details?id=com.weather.Weather",
    "description": "Weather Forecast & Snow Radar: local rain tracker, weather maps & alerts",
    "product_id": "com.weather.Weather",
    "rating": 4.6,
    "thumbnail": "https://play-lh.googleusercontent.com/RV3DftXlA7WUV7w-BpE8zM0X7Y4RQd2vBvZVv6A01DEGb_eXFRjLmUhSqdbqrEl9klI=s128-rw"
  },
  {
    "title": "AccuWeather: Weather Radar",
    "link": "https://play.google.com/store/apps/details?id=com.accuweather.android",
    "description": "Your local weather forecast, storm tracker, radar maps & live weather news",
    "product_id": "com.accuweather.android",
    "rating": 4.0,
    "thumbnail": "https://play-lh.googleusercontent.com/EgDT3XrIaJbhZjINCWsiqjzonzqve7LgAbim8kHXWgg6fZnQebqIWjE6UcGahJ6yugU=s128-rw"
  },
  {
    "title": "Weather by WeatherBug",
    "link": "https://play.google.com/store/apps/details?id=com.aws.android",
    "description": "The Most Accurate Weather Forecast. Alerts, Radar, Maps & News from WeatherBug",
    "product_id": "com.aws.android",
    "rating": 4.7,
    "thumbnail": "https://play-lh.googleusercontent.com/_rZCkobaGZzXN3iquPr4u2KOe7C-ljnrSkBfw6sVL1kpUfq3sBl5MoRJEisBSnxaD-M=s128-rw"
  }, ... other results
]
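
The nested loop in the code above flattens sections of results into one list of apps. Here is a minimal offline sketch of that step, using made-up sample data shaped like the response described in this answer; flatten_items, the field list and the sample values are assumptions for illustration, not actual API output.

```python
def flatten_items(results, fields=("title", "link", "product_id", "rating")):
    """Flatten results["organic_results"][*]["items"] into one list of dicts."""
    rows = []
    for section in results.get("organic_results", []):
        # A section without an "items" key simply contributes nothing.
        for item in section.get("items", []):
            rows.append({field: item.get(field) for field in fields})
    return rows


# Hypothetical sample shaped like the response used above.
sample = {
    "organic_results": [
        {"title": "Section one", "items": [
            {"title": "App A", "product_id": "com.example.a", "rating": 4.5},
        ]},
        {"items": [
            {"title": "App B", "product_id": "com.example.b"},
        ]},
        {"title": "Section without items"},
    ]
}

rows = flatten_items(sample)
print(rows)
```

Using .get() with defaults means missing sections or fields end up as None instead of raising a KeyError.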

Disclaimer: I work for SerpApi.
