Filter pandas dataframe records based on condition with multiple quantifier regex

I am trying to filter some records from a pandas dataframe. The dataframe, named 'df', consists of two columns, Sl.No. and doc_id (which contains URLs), as follows:

df

Sl.No.                        doc_id  
1.            https://www.durangoherald.com/articles/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/
2.            https://allafrica.com/stories/202206100634.html
3.            https://www.sfgate.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php
4.            https://impakter.com/goldrush-for-fossil-fuels/
5.            https://www.streetinsider.com/Business Wire/EDF Renewables North America Awarded Three Contracts totaling 1 Gigawatt of Solar   Storage in New York/20201448.html
6.            https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product
7.            https://www.breitbart.com:443/europe/2022/06/10/climate-crazy-bojo-ignored-calls-to-cut-green-taxes-to-ease-cost-of-living-pressures
8.            https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html
9.            https://www.mychesco.com/a/news/regional/active-world-club-lists-new-carbon-credits-crypto-currency-token-carbon-coin
10.           http://www.msn.com/en-nz/health/nutrition/scientists-reveal-plans-to-make-plant-based-cheese-out-of-yellow-peas/ar-AAYjzHx
11.           https://www.chron.com/news/article/Ship-owners-sought-CO2-exemption-when-the-sea-17233413.php|https://wtmj.com/national/2022/06/10/ship-owners-sought-co2-exemption-when-the-sea-gets-too-wavy/

I want to filter a few records from the above dataframe. The needed urls are in a list. I have used the following process to subset the dataframe.

   needed_url = ['https://impakter.com/goldrush-for-fossil-fuels/',
                 'https://www.streetinsider.com/Business Wire/EDF Renewables North America Awarded Three Contracts totaling 1 Gigawatt of Solar   Storage in New York/20201448.html',
                 'https://markets.financialcontent.com/stocks/article/nnwire-2022-6-10-greenenergybreaks-fuelpositive-corporation-tsxv-nhhh-otcqb-nhhhf-addressing-price-volatility-supply-uncertainty-with-flagship-product']

  df[df.doc_id.str.contains('|'.join(needed_url),na=False, regex=True)]

But it raises an error:

  error: multiple repeat at position 417

I presume it is caused by the repeated quantifier character ' ' in the following URL:

  https://www.streetinsider.com/Business Wire/EDF Renewables North America Awarded Three Contracts totaling 1 Gigawatt of Solar   Storage in New York/20201448.html

I have tried to escape the ' ' with re.escape(), but with no luck. I am new to regex and would appreciate help solving this. The objective is to filter the dataframe down to the rows whose URL matches one in the list. Thanks in anticipation.

CodePudding user response:

The pandas isin method takes a list as input and keeps the rows whose value in the specified column exactly equals one of the list items, so no regex is needed.

   print(df)
   print(df[df['col2'].isin(needed_url)])

output: df:

  col1                                               col2
0    1  https://www.durangoherald.com/articles/ship-ow...
1    2    https://allafrica.com/stories/202206100634.html
2    3  https://www.sfgate.com/news/article/Ship-owner...
3    4    https://impakter.com/goldrush-for-fossil-fuels/
4    5  https://www.streetinsider.com/Business Wire/ED...
5    6  https://markets.financialcontent.com/stocks/ar...

filtered output:

   col1                                               col2
5    6  https://markets.financialcontent.com/stocks/ar...
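
For reference, here is a self-contained sketch of the same idea using the question's column names (Sl.No. and doc_id) and a shortened sample of its URLs. The three-row dataframe and the name `filtered` are only illustrative. Keep in mind that isin compares whole strings, so a cell holding several URLs joined by '|' (like row 11 in the question) would not match a single URL from the list.

```python
import pandas as pd

# Shortened sample of the question's data (illustrative subset only).
df = pd.DataFrame({
    "Sl.No.": [1, 2, 3],
    "doc_id": [
        "https://allafrica.com/stories/202206100634.html",
        "https://impakter.com/goldrush-for-fossil-fuels/",
        "https://news.yahoo.com/ship-owners-sought-co2-exemption-171700544.html",
    ],
})

needed_url = ["https://impakter.com/goldrush-for-fossil-fuels/"]

# isin builds a boolean mask: True where doc_id equals any list item exactly.
filtered = df[df["doc_id"].isin(needed_url)]
print(filtered)
```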

CodePudding user response:

I didn't quite understand why you joined the needed urls with '|'. These lines returned the needed urls for me:

mask = lambda x: x in needed_url
df[df.doc_id.apply(mask)]
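
If substring matching is really what is wanted (for example, to catch a cell that packs two URLs together with '|'), str.contains can still work, provided each URL is escaped on its own before joining. Joining with '|' builds a regex alternation, and "multiple repeat" is what the re module reports when it finds quantifier characters it cannot parse in sequence. Calling re.escape on the already joined string would also escape the '|' separators, which may be why it seemed not to help. A minimal sketch, assuming hypothetical example.com URLs where a '+++' run stands in for whatever metacharacters were in the real data:

```python
import re

import pandas as pd

# Hypothetical URLs; '+++' stands in for the regex metacharacters
# that triggered the error in the real data (an assumption).
df = pd.DataFrame({
    "doc_id": [
        "https://example.com/a.html",
        "https://example.com/Solar+++Storage/1.html",
        "https://example.com/b.html|https://example.com/c.html",
    ],
})

needed_url = [
    "https://example.com/Solar+++Storage/1.html",
    "https://example.com/c.html",
]

# Escape each URL separately, then join with '|' so that only the
# separators keep their regex meaning.
pattern = "|".join(re.escape(u) for u in needed_url)

# Substring search: a row matches if any needed URL occurs inside doc_id.
matched = df[df["doc_id"].str.contains(pattern, na=False, regex=True)]
print(matched)
```

Escaping also stops '.' in the URLs from matching arbitrary characters. If only exact, whole-cell matches are needed, isin or the lambda above is simpler.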