How to convert a web-scraped text table into a pandas DataFrame?

Time:09-20

I am working on a college project where I need to scrape data from the Japan Meteorological Agency website to build an earthquake forecast model.

website : https://www.data.jma.go.jp/eqev/data/daily_map/20220918.html

I was able to scrape the data with Beautiful Soup (the text is in Japanese, though).

code:

table = soup.find("pre").find(text=True)
print(table)

output:

年   月 日 時 分 秒    緯度       経度       深さ(km)  M   震央地名
-----------------------------------------------------------------------------------------
2022  9 18 00:01 33.6  35° 6.6'N 138°11.5'E   12     0.9  静岡県中部                   
2022  9 18 00:02  2.6  35° 6.4'N 138°12.1'E   13     0.0  静岡県中部                   
2022  9 18 00:07  1.3  23° 5.5'N 121°21.2'E    0     3.9  台湾付近                    
2022  9 18 00:07 26.1  37°52.6'N 141°41.9'E   56     1.4  宮城県沖                    
2022  9 18 00:14  1.5  37°29.7'N 137°12.1'E   11     0.4  石川県能登地方                 
2022  9 18 00:17  9.7  37°28.0'N 137°11.5'E   11     0.9  石川県能登地方                 
2022  9 18 00:18  3.0  37°49.6'N 141°36.0'E   49     1.1  福島県沖                    
2022  9 18 00:20 55.5  35°54.3'N 138° 3.7'E   11     0.8  長野県南部                   
2022  9 18 00:21 39.5  32°17.2'N 130°24.3'E    0     0.7  熊本県天草・芦北地方              
2022  9 18 00:23 26.7  37°50.8'N 141°35.4'E   50     1.1  福島県沖                    

2022  9 18 00:26 28.9  36°48.8'N 140°35.6'E    8     0.2  茨城県北部                   
2022  9 18 00:26 35.2  24°39.8'N 125°26.3'E   54     1.9  宮古島近海                   
2022  9 18 00:27 59.3  37°17.9'N 141°48.7'E   40     1.3  福島県沖                    
2022  9 18 00:33  4.8  23° 9.0'N 121°14.8'E    6     4.0  台湾付近                    
2022  9 18 00:36 32.0  26°21.7'N 125°51.8'E   19     4.1  沖縄本島北西沖                 
2022  9 18 00:38 56.7  37°30.3'N 137°12.4'E   12     0.3  石川県能登地方                 
2022  9 18 00:42 51.4  36°41.3'N 141°45.8'E   25     1.2  福島県沖                    
2022  9 18 00:43 44.7  32°30.3'N 130°43.9'E   10     0.6  熊本県熊本地方                 
2022  9 18 00:45 52.0  37°30.3'N 137°13.1'E   11     0.6  石川県能登地方                 
2022  9 18 00:49 20.1  35°10.5'N 139°10.1'E   12    -0.1  相模湾                     

2022  9 18 00:53 54.4  35°30.1'N 135°54.9'E   14     0.7  福井県嶺南                   
2022  9 18 00:55 18.1  37°23.1'N 141°42.6'E   39     1.1  福島県沖                    
2022  9 18 00:56  0.9  37°51.2'N 141°35.1'E   47     0.9  福島県沖                    
2022  9 18 00:59 13.3  39°59.5'N 140°30.1'E    7     0.6  秋田県内陸北部                 
2022  9 18 01:01 46.3  42°57.0'N 143°21.6'E  124     1.6  十勝地方中部                  
2022  9 18 01:02 49.6  43°14.9'N 146°26.4'E   44     2.5  根室半島南東沖                 
2022  9 18 01:05 45.1  35°31.7'N 136°33.5'E   10     0.2  岐阜県美濃中西部                
2022  9 18 01:10 18.9  37°31.3'N 137°13.7'E   11     0.4  能登半島沖                   
2022  9 18 01:10 26.6  37°31.0'N 137°13.5'E   12     0.7  能登半島沖                   
2022  9 18 01:11 10.0  36°48.5'N 140°35.2'E    7     0.2  茨城県北部                   

2022  9 18 01:12 33.2  37°45.9'N 141°37.2'E   58     1.0  福島県沖                    
2022  9 18 01:13 21.7  44°53.8'N 142° 7.1'E    0     0.5  上川地方北部                  
2022  9 18 01:15 28.0  40° 4.4'N 144°31.3'E   42     1.1  三陸沖                     
2022  9 18 01:16 14.3  26°23.1'N 125°50.7'E   15     3.7  沖縄本島北西沖                 
2022  9 18 01:18 45.9  23° 3.7'N 121°20.8'E    0     3.6  台湾付近                    
2022  9 18 01:20 57.8  37°42.9'N 141°31.5'E   53     1.2  福島県沖                    
2022  9 18 01:21 21.1  33°54.1'N 133°50.6'E   11     0.1  徳島県北部                   
2022  9 18 01:21 59.9  39° 1.7'N 140°52.5'E    8     0.0  岩手県内陸南部                 
2022  9 18 01:22 25.2  32° 2.4'N 129°59.8'E    9     0.9  天草灘                     
2022  9 18 01:22 53.4  40°14.6'N 141°13.2'E    6    -0.1  岩手県内陸北部                 

2022  9 18 01:23 56.8  37°30.5'N 137°15.4'E   13     1.1  石川県能登地方                 
2022  9 18 01:27 16.4  34°50.3'N 135°24.3'E    9     0.8  兵庫県南東部                  
2022  9 18 01:33  6.7  35°35.4'N 136°20.6'E   14     0.7  岐阜県美濃中西部                
2022  9 18 01:34 45.3  37°30.1'N 137°13.3'E   11     0.4  石川県能登地方                 
2022  9 18 01:36 47.0  35°10.7'N 137°43.2'E   16     0.2  愛知県東部                   
2022  9 18 01:38 53.0  26°15.5'N 125°53.0'E    8     3.2  沖縄本島北西沖                 
2022  9 18 01:40  2.9  37°12.4'N 141°24.0'E   11     1.3  福島県沖                    
2022  9 18 01:44  7.0  23° 0.9'N 121°24.6'E    2     3.9  台湾付近                    
2022  9 18 01:48  1.1  36° 6.2'N 137°41.7'E    8     0.2  長野県中部                   
2022  9 18 01:54  8.1  43°46.8'N 145° 1.5'E  161     2.5  根室地方北部                  

2022  9 18 01:57 25.1  37°50.1'N 141°50.4'E   35     1.6  宮城県沖                    
2022  9 18 01:58 48.7  33°46.9'N 141°33.8'E   33     3.5  八丈島東方沖                  
2022  9 18 01:59 28.7  39° 3.7'N 140°52.2'E    8     0.1  岩手県内陸南部                 
2022  9 18 01:59 40.6  41°44.9'N 144°13.3'E   24     1.5  十勝沖                     
2022  9 18 02:01 16.5  36°48.0'N 141°21.8'E    8     2.1  茨城県沖                    
2022  9 18 02:01 24.3  35°11.2'N 132°34.7'E   12     0.6  島根県西部                   
2022  9 18 02:02 32.0  34°37.3'N 136°50.1'E   14     0.7  伊勢湾                     
2022  9 18 02:07  2.5  38°48.8'N 142° 1.9'E   53     1.1  宮城県沖                    
2022  9 18 02:12 50.6  37°56.1'N 141°43.3'E   53     1.3  宮城県沖                    
2022  9 18 02:13 32.1  41°41.2'N 143°50.0'E   18     1.7  十勝沖                     

(The output is shortened due to character limit)

I tried to translate it but was unsuccessful, so I decided to extract just the numerical data for predictions.

It would be really helpful if someone could explain how to convert this text table into a DataFrame for predictions.

Thank you!

CodePudding user response:

Try this example (the column headers will probably need rearranging, but I don't read Japanese, so I'm not sure how):

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup


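url = "https://www.data.jma.go.jp/eqev/data/daily_map/20220918.html"

# Download the page and take the text of the first <pre> element
soup = BeautifulSoup(requests.get(url).content, "html.parser")
t = soup.pre.get_text(strip=True)

# Drop the dashed separator line and the blank lines between row groups
data = StringIO(
    "\n".join(
        line
        for line in map(str.strip, t.splitlines())
        if "----" not in line and line != ""
    )
)

# read_fwf infers the fixed-width column boundaries from the text
df = pd.read_fwf(data)
print(df.head(10).to_markdown())  # to_markdown() needs the tabulate package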

Prints:

年 月 時 分 秒 緯度 経度 深さ(km) M 震央地名 Unnamed: 7 Unnamed: 8
0 2022 9 18 00:01 33.6 35° 6.6'N 138°11.5'E 12 0.9 静岡県中部
1 2022 9 18 00:02 2.6 35° 6.4'N 138°12.1'E 13 0 静岡県中部
2 2022 9 18 00:07 1.3 23° 5.5'N 121°21.2'E 0 3.9 台湾付近
3 2022 9 18 00:07 26.1 37°52.6'N 141°41.9'E 56 1.4 宮城県沖
4 2022 9 18 00:14 1.5 37°29.7'N 137°12.1'E 11 0.4 石川県能登地方
5 2022 9 18 00:17 9.7 37°28.0'N 137°11.5'E 11 0.9 石川県能登地方
6 2022 9 18 00:18 3 37°49.6'N 141°36.0'E 49 1.1 福島県沖
7 2022 9 18 00:20 55.5 35°54.3'N 138° 3.7'E 11 0.8 長野県南部
8 2022 9 18 00:21 39.5 32°17.2'N 130°24.3'E 0 0.7 熊本県天草・芦北地方
9 2022 9 18 00:23 26.7 37°50.8'N 141°35.4'E 50 1.1 福島県沖
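
As the output above shows, the inferred headers do not line up with the data. For reference, the Japanese headers mean: 年 year, 月 month, 日 day, 時 hour, 分 minute, 秒 second, 緯度 latitude, 経度 longitude, 深さ(km) depth in km, M magnitude, 震央地名 epicenter name. One possible alternative, sketched below, is to parse each row with a regular expression and assign English column names directly. This is only a sketch: the names `ROW_RE`, `to_decimal`, `parse_table` and the English column names are my own choices, and it assumes every row follows the layout shown in the question. Rows that do not match (for example, if a value were missing) are skipped silently.

```python
import re
from datetime import datetime, timedelta

import pandas as pd

# One data row of the <pre> block, e.g.
# "2022  9 18 00:01 33.6  35° 6.6'N 138°11.5'E   12     0.9  静岡県中部"
# (hypothetical pattern, written against the rows shown in the question)
ROW_RE = re.compile(
    r"^\s*(\d{4})\s+(\d{1,2})\s+(\d{1,2})\s+(\d{2}):(\d{2})\s+(\d+\.\d+)\s+"
    r"(\d+)°\s*(\d+\.\d+)'([NS])\s+(\d+)°\s*(\d+\.\d+)'([EW])\s+"
    r"(\d+)\s+(-?\d+\.\d+)\s+(.+?)\s*$"
)


def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and decimal minutes to signed decimal degrees."""
    value = int(degrees) + float(minutes) / 60
    return -value if hemisphere in "SW" else value


def parse_table(text):
    """Build a DataFrame from the raw text of the <pre> block."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue  # header, separator, blank line, or unexpected layout
        (year, month, day, hour, minute, sec,
         lat_d, lat_m, lat_h, lon_d, lon_m, lon_h,
         depth, mag, place) = m.groups()
        # Whole minutes go into datetime; fractional seconds are added on top
        time = datetime(int(year), int(month), int(day), int(hour), int(minute))
        rows.append({
            "time": time + timedelta(seconds=float(sec)),
            "latitude": to_decimal(lat_d, lat_m, lat_h),
            "longitude": to_decimal(lon_d, lon_m, lon_h),
            "depth_km": int(depth),
            "magnitude": float(mag),
            "epicenter": place,  # left in Japanese
        })
    return pd.DataFrame(rows)


# A few rows copied from the question, including the header and separator
sample = """年   月 日 時 分 秒    緯度       経度       深さ(km)  M   震央地名
-----------------------------------------------------------------------------------------
2022  9 18 00:01 33.6  35° 6.6'N 138°11.5'E   12     0.9  静岡県中部                   
2022  9 18 00:07  1.3  23° 5.5'N 121°21.2'E    0     3.9  台湾付近                    

2022  9 18 00:20 55.5  35°54.3'N 138° 3.7'E   11     0.8  長野県南部                   
2022  9 18 00:49 20.1  35°10.5'N 139°10.1'E   12    -0.1  相模湾                     
"""

df = parse_table(sample)
print(df)
```

With the real page, the same function could be called on `soup.pre.get_text()` from the answer above. Comparing `len(df)` with the number of non-blank data lines is a simple way to check that no rows were dropped by the pattern.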

CodePudding user response:

Just copy the entire table to the clipboard and use this:

import pandas as pd

df = pd.read_clipboard()

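Tip #1: about web scraping, if the data already exists as an HTML table on the website, you can read it directly. Note that read_html returns a list of DataFrames, one per <table> element, so pick the one you need by index. It only parses <table> elements, so it will not pick up text inside a <pre> block like the one in the question.

df = pd.read_html("insert your url")[0]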

Tip #2: websites like this often have an API that returns the data in JSON format. Typically you sign up, get an API key and a password, send a request to their server, and it returns the data you asked for.

df = pd.read_json("file_path.json")
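
To illustrate tip #2, here is a minimal sketch of loading a JSON response into a DataFrame. The payload and its field names (`time`, `depth_km`, `magnitude`) are made up for illustration; I don't know whether this particular site offers such an API or what it would return.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON payload, shaped as a list of records.
# The field names are invented for this example.
payload = """[
  {"time": "2022-09-18T00:01:33.6", "depth_km": 12, "magnitude": 0.9},
  {"time": "2022-09-18T00:07:01.3", "depth_km": 0, "magnitude": 3.9}
]"""

# read_json accepts a file path or a file-like object; StringIO wraps the string
api_df = pd.read_json(StringIO(payload))
print(api_df)
```

Each JSON object becomes one row and each key becomes a column, so the result is ready for filtering or modeling without any text parsing.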