Python regex and how to combine [] with ?

Name            AverageVolume   Revenue     P/E Ratio
3M              5.03M           32.14B      18.74   
Alphabet C      2.41M           161.86B     26.01
Amazon.com      6.01M           280.52B     103.18
Apple           51.02M          267.68B     22.34
Boeing          23.63M          84.82B      20.31
Caterpillar     5.46M           53.80B      11.18
Chevron         14.33M          140.16B     58.9
Cisco           32.32M          51.55B      15.42
Coca-Cola       20.82M          37.27B      23.24
Exxon Mobil     37.47M          255.58B     13.57
Facebook        23.04M          70.70B      29.45
Goldman Sachs   4.51M           53.69B      9.97
Home Depot      6.82M           110.23B     20.43
IBM             7.17M           77.15B      11.19
Intel           33.07M          71.97B      12.77
J&J             11.54M          82.73B      23.72
JPMorgan        22.96M          67.07B      10.75
McDonalds       5.62M           21.08B      23.43
Merck&Co        14.11M          46.84B      21.64
Microsoft       54.66M          134.25B     33.04
Nike            10.38M          41.27B      33.27
Pfizer          34.01M          51.75B      13.15
Procter&Gamble  11.36M          69.59B      11.19
Raytheon Tech.  10.18M          77.05B      10.31
Tesla           20.82M          24.58B      14.41
UnitedHealth    6.24M           246.27B     20.34
Verizon         21.91M          131.87B     12.56
Visa A          13.98M          23.53B      32.21
Walmart         10.10M          523.96B     25.46
Walt Disney     20.03M          75.13B      17.98

I want to capture the names of companies whose average volume starts with an even digit and whose P/E ratio ends with an odd digit. The correct matches are: ['Alphabet C', 'Boeing', 'Facebook', 'Goldman Sachs', 'Home Depot', 'JPMorgan', 'Tesla']

My regex: (.+?)\s+[2468][0-9]?\.[0-9]+M\s+[0-9]+\.[0-9]+B\s+[0-9]+\.[0-9]?[13579]

I am using ? where I expect 0 or 1 digit, but for some reason I am not getting the desired result.

My code:

import re
with open("stocks.txt","r") as f:
    string = f.read()
    print(string)
    t = re.compile(r"(.+?)\s+[2468][0-9]?\.[0-9]+M\s+[0-9]+\.[0-9]+B\s+[0-9]+\.[0-9]?[13579]$")
    result = t.findall(string)
    print(result)

CodePudding user response：

You could read the whole file, and then use:

^(.+?)\s+[2468]\d*\.\d\dM\s+\d+\.\d\dB\s+\d+\.\d[13579]\b

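Note that you need to enable multiline mode with re.M, so that ^ matches at the start of every line rather than only at the start of the string.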
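As a rough sketch of how this could be wired up, the snippet below applies the pattern with re.M to a few rows copied from the table in the question. The names SAMPLE, PATTERN and names are only illustrative, and the inline string stands in for the contents of stocks.txt.

```python
import re

# A subset of the rows from the question, standing in for f.read() on stocks.txt.
SAMPLE = """\
Name            AverageVolume   Revenue     P/E Ratio
3M              5.03M           32.14B      18.74
Alphabet C      2.41M           161.86B     26.01
Amazon.com      6.01M           280.52B     103.18
Boeing          23.63M          84.82B      20.31
Chevron         14.33M          140.16B     58.9
Goldman Sachs   4.51M           53.69B      9.97
Tesla           20.82M          24.58B      14.41
Walt Disney     20.03M          75.13B      17.98
"""

# re.M makes ^ match at the start of each line, so every row is tested on its own.
# Group 1 is the lazily matched company name; [2468] is the even leading digit of
# the volume, and [13579] followed by \b is the odd final digit of the P/E ratio.
PATTERN = re.compile(
    r"^(.+?)\s+[2468]\d*\.\d\dM\s+\d+\.\d\dB\s+\d+\.\d[13579]\b",
    re.M,
)

# With a single capture group, findall returns just the captured names.
names = PATTERN.findall(SAMPLE)
print(names)
```

Because the pattern has exactly one capture group, findall returns a flat list of company names instead of full matches.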

CodePudding user response：

You shouldn't use regex for this. Just read your data into a dataframe (using read_csv) and use pandas boolean indexing:

evenav = df['AverageVolume'].str[0].astype(int) % 2 == 0
oddpe = df['P/E Ratio'].astype(str).str[-1].astype(int) % 2 == 1
df[evenav & oddpe]

Output:

             Name AverageVolume  Revenue  P/E Ratio
1      Alphabet C         2.41M  161.86B      26.01
4          Boeing        23.63M   84.82B      20.31
10       Facebook        23.04M   70.70B      29.45
11  Goldman Sachs         4.51M   53.69B       9.97
12     Home Depot         6.82M  110.23B      20.43
16       JPMorgan        22.96M   67.07B      10.75
24          Tesla        20.82M   24.58B      14.41

Or as a list of company names:

list(df[evenav & oddpe]['Name'].values)
# ['Alphabet C', 'Boeing', 'Facebook', 'Goldman Sachs', 'Home Depot', 'JPMorgan', 'Tesla']
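
If pandas is not available, the same first-digit / last-digit check can be sketched with plain string methods. This is only an illustration under assumptions: the helper name pick_companies is made up, the inline rows stand in for the file contents, and the last three whitespace-separated fields of each row are assumed to be the volume, revenue and P/E ratio.

```python
# A subset of the rows from the question, standing in for the file contents.
ROWS = """\
Name            AverageVolume   Revenue     P/E Ratio
3M              5.03M           32.14B      18.74
Alphabet C      2.41M           161.86B     26.01
Amazon.com      6.01M           280.52B     103.18
Boeing          23.63M          84.82B      20.31
Chevron         14.33M          140.16B     58.9
Goldman Sachs   4.51M           53.69B      9.97
Tesla           20.82M          24.58B      14.41
Walt Disney     20.03M          75.13B      17.98
"""


def pick_companies(text):
    """Return names whose volume starts with an even digit and P/E ends with an odd one."""
    picked = []
    # Skip the header line, then process each non-empty data row.
    for line in text.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Split from the right so that names containing spaces stay in one piece.
        name, volume, _revenue, pe = line.rsplit(None, 3)
        starts_even = int(volume[0]) % 2 == 0
        ends_odd = int(pe[-1]) % 2 == 1
        if starts_even and ends_odd:
            picked.append(name)
    return picked


selected = pick_companies(ROWS)
print(selected)
```

Splitting from the right keeps multi-word names such as "Goldman Sachs" intact, which a plain split() on whitespace would break apart.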