Home > Software engineering >  How to remove unwanted data from a data column using pandas DataFrame
How to remove unwanted data from a data column using pandas DataFrame

Time:12-15

I'm getting date two times using comma separation along with day in date column from the scraped data. My goal is to remove this December 13, 2021Mon, portion and want to create a separate/new column for days and I also wanted to remove the last one column meaning the Volumn column.

Script

import requests
import pandas as pd

isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']

dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))

df = pd.concat(dfs) 
print(df)

Expected Output

        Day     Date        Open   High    Low  Close 
      Monday  Dec 13, 2021  77.77  77.77  77.77  77.77 
      Friday  Dec 10, 2021  77.61  77.61  77.61  77.61  
    Thursday  Dec 09, 2021  77.60  77.60  77.60  77.60  
   Wednesday  Dec 08, 2021  77.47  77.47  77.47  77.47  
     Tuesday  Dec 07, 2021  77.64  77.64  77.64  77.64  

Current output

                                       Date   Open   High    Low  Close Volume
 Monday, December 13, 2021Mon, Dec 13, 2021  77.77  77.77  77.77  77.77  00.00
  Friday, December 10, 2021Fri, Dec 10, 2021  77.61  77.61  77.61  77.61   ----
Thursday, December 09, 2021Thu, Dec 09, 2021  77.60  77.60  77.60  77.60   ----
Wednesday, December 08, 2021Wed, Dec 08, 2021  77.47  77.47  77.47  77.47   ----
 Tuesday, December 07, 2021Tue, Dec 07, 2021  77.64  77.64  77.64  77.64   ----

Thanks in advanced!

CodePudding user response:

I added the necessary steps to your code:

import requests
import pandas as pd

isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']

dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))

df = pd.concat(dfs)

# get the Day column
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')

print(df)

After those three operations, df looks like this:

          Day          Date   Open   High    Low  Close
0      Monday  Dec 13, 2021  77.77  77.77  77.77  77.77
1      Friday  Dec 10, 2021  77.61  77.61  77.61  77.61
2    Thursday  Dec 09, 2021  77.60  77.60  77.60  77.60
3   Wednesday  Dec 08, 2021  77.47  77.47  77.47  77.47
4     Tuesday  Dec 07, 2021  77.64  77.64  77.64  77.64
5      Monday  Dec 06, 2021  77.70  77.70  77.70  77.70
6      Friday  Dec 03, 2021  77.72  77.72  77.72  77.72
...

CodePudding user response:

I would use regex here to split. Then you can combine them and parse anyway you like afterwards:

import requests
import pandas as pd


isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']

dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))

df = pd.concat(dfs) 
print(df)
    


df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})',expand=True)[[0,1]]
df['Date'] = df['Date_alpha']   df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)

Output:

print(df)
                            Date   Open   High    Low  Close Volume
0      Monday, December 13, 2021  77.77  77.77  77.77  77.77   ----
1      Friday, December 10, 2021  77.61  77.61  77.61  77.61   ----
2    Thursday, December 09, 2021  77.60  77.60  77.60  77.60   ----
3   Wednesday, December 08, 2021  77.47  77.47  77.47  77.47   ----
4     Tuesday, December 07, 2021  77.64  77.64  77.64  77.64   ----
5      Monday, December 06, 2021  77.70  77.70  77.70  77.70   ----
6      Friday, December 03, 2021  77.72  77.72  77.72  77.72   ----
7    Thursday, December 02, 2021  77.56  77.56  77.56  77.56   ----
8   Wednesday, December 01, 2021  77.51  77.51  77.51  77.51   ----
9     Tuesday, November 30, 2021  77.52  77.52  77.52  77.52   ----
10     Monday, November 29, 2021  77.37  77.37  77.37  77.37   ----
11     Friday, November 26, 2021  77.44  77.44  77.44  77.44   ----
12   Thursday, November 25, 2021  77.11  77.11  77.11  77.11   ----
13  Wednesday, November 24, 2021  77.10  77.10  77.10  77.10   ----
14    Tuesday, November 23, 2021  77.02  77.02  77.02  77.02   ----
15     Monday, November 22, 2021  77.32  77.32  77.32  77.32   ----
16     Friday, November 19, 2021  77.52  77.52  77.52  77.52   ----
17   Thursday, November 18, 2021  77.38  77.38  77.38  77.38   ----
18  Wednesday, November 17, 2021  77.26  77.26  77.26  77.26   ----
19    Tuesday, November 16, 2021  77.24  77.24  77.24  77.24   ----
20     Monday, November 15, 2021  77.30  77.30  77.30  77.30   ----
0      Monday, December 13, 2021  11.09  11.09  11.09  11.09   ----
1      Friday, December 10, 2021  11.08  11.08  11.08  11.08   ----
2    Thursday, December 09, 2021  11.08  11.08  11.08  11.08   ----
3   Wednesday, December 08, 2021  11.06  11.06  11.06  11.06   ----
4     Tuesday, December 07, 2021  11.08  11.08  11.08  11.08   ----
5      Monday, December 06, 2021  11.09  11.09  11.09  11.09   ----
6      Friday, December 03, 2021  11.08  11.08  11.08  11.08   ----
7    Thursday, December 02, 2021  11.08  11.08  11.08  11.08   ----
8   Wednesday, December 01, 2021  11.05  11.05  11.05  11.05   ----
9     Tuesday, November 30, 2021  11.07  11.07  11.07  11.07   ----
10     Monday, November 29, 2021  11.07  11.07  11.07  11.07   ----
11     Friday, November 26, 2021  11.08  11.08  11.08  11.08   ----
12   Thursday, November 25, 2021  11.04  11.04  11.04  11.04   ----
13  Wednesday, November 24, 2021  11.03  11.03  11.03  11.03   ----
14    Tuesday, November 23, 2021  11.04  11.04  11.04  11.04   ----
15     Monday, November 22, 2021  11.07  11.07  11.07  11.07   ----
16     Friday, November 19, 2021  11.09  11.09  11.09  11.09   ----
17   Thursday, November 18, 2021  11.06  11.06  11.06  11.06   ----
18  Wednesday, November 17, 2021  11.05  11.05  11.05  11.05   ----
19    Tuesday, November 16, 2021  11.05  11.05  11.05  11.05   ----
20     Monday, November 15, 2021  11.05  11.05  11.05  11.05   ----
  • Related