How do I remove lines that are repeated or contain certain words from a text file?

Time:12-15

I'm trying to remove repeated lines and lines containing certain words from scraped data. I tried several code snippets I found, but none of them work :(

This is my code. Only the first part, which removes the repeated lines, works:

openFile = open("links.txt", "r")
writeFile = open("updatedfile.txt", "w")
# Store traversed lines
tmp = set()
for txtLine in openFile:
    # Check new line
    if txtLine not in tmp:
        writeFile.write(txtLine)
        # Add new traversed line to tmp
        tmp.add(txtLine)
openFile.close()
writeFile.close()

sleep(5)

with open("updatedfile.txt", "r") as fp:
    lines = fp.readlines()

with open("updatedfile.txt", "w") as fp:
    for line in lines:
        if line.strip("\n") != "search":
            fp.write(line)

This is the links.txt file:

https://twitter.com/search?q=#BTC&src=hashtag_click
https://twitter.com/search?q=#ADA&src=hashtag_click
https://twitter.com/search?q=#LTC&src=hashtag_click
https://twitter.com/search?q=#CAKE&src=hashtag_click
https://twitter.com/Marie62943337
https://twitter.com/Marie62943337
https://twitter.com/Fathur0501
https://twitter.com/Fathur0501
https://twitter.com/BogdanMar93
https://twitter.com/BogdanMar93
https://t.[spaced because body cannot contain short url]co/74ZzkVwa2W
https://t. co/Gv2tyiWfAk

I want the output to be:

https://twitter.com/Marie62943337
https://twitter.com/Fathur0501
https://twitter.com/BogdanMar93

Thanks for your help.

CodePudding user response:

Try this code. I think it does what you want:

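with open("test.txt", "r") as fp:
    lines = fp.readlines()
# No explicit fp.close() is needed: the with block closes the file

# Lines already written, used to skip duplicates
unique = set()

with open("test.txt", "w") as fp:
    for line in lines:
        # Keep only twitter.com links that are not search URLs and not yet seen
        if "search" not in line and line not in unique and "twitter.com" in line:
            fp.write(line)
            unique.add(line)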
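
A side note on why the second half of the question's code removes nothing: `line.strip("\n") != "search"` compares the entire line with the word "search", so it is true for every URL. A substring test with `in` is what is needed. Below is a rough sketch of the same idea as a function. `filter_links` and its `banned` and `required` parameters are names made up for illustration, and the sample list is a shortened copy of the question's data. It strips whitespace before comparing, so a last line without a trailing newline is still detected as a duplicate.

```python
def filter_links(lines, banned="search", required="twitter.com/"):
    """Return unique lines that contain `required` and do not contain `banned`."""
    seen = set()
    kept = []
    for raw in lines:
        line = raw.strip()  # ignore the newline and stray spaces when comparing
        if not line or line in seen:
            continue  # skip blank lines and duplicates
        if banned in line or required not in line:
            continue  # skip search URLs and non-twitter.com links
        seen.add(line)
        kept.append(line)
    return kept


sample = [
    "https://twitter.com/search?q=#BTC&src=hashtag_click\n",
    "https://twitter.com/Marie62943337\n",
    "https://twitter.com/Marie62943337\n",
    "https://twitter.com/Fathur0501\n",
    "https://t.co/Gv2tyiWfAk",
]
print(filter_links(sample))
```

Keep in mind that a plain substring test would also drop a profile whose name happens to contain "search", so a stricter value such as "/search?" may be safer, assuming all search URLs look like the ones in the question.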

If you have any questions, please leave a comment below.

CodePudding user response:

Maybe you want a substring test with 'in':

lines = ['https://twitter.com/search?q=#CAKE&src=hashtag_click', 'https://twitter.com/Marie62943337']
for line in lines:
    if 'search' not in line:
        print(line)
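
Since the question mentions several words, the same `in` check can be extended to a list of them with `any()`. This is only a sketch: the `blocked` tuple is a hypothetical word list (I am assuming the short links should go too, since they are missing from the desired output), and `dict.fromkeys` is used here just as one way to drop duplicates while keeping the original order.

```python
lines = [
    "https://twitter.com/search?q=#CAKE&src=hashtag_click",
    "https://twitter.com/Marie62943337",
    "https://twitter.com/Marie62943337",
    "https://t.co/Gv2tyiWfAk",
]
blocked = ("search", "//t.co/")  # hypothetical word list; adjust as needed

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+)
unique = list(dict.fromkeys(lines))

# Keep a line only if none of the blocked words occurs in it
result = [line for line in unique if not any(word in line for word in blocked)]
print(result)
```

When the lines come from `readlines()`, strip them first (for example `line.strip()`), otherwise two identical URLs can differ by a trailing newline and both survive.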