I am using a CSV file to read in terms that I am looking for in a text file.
I would like to use a wildcard or a 'Like' match as a term. I am not looking for a wildcard for the text document names, but for the search terms that are read from the CSV file.
For example, these are the terms in the CSV file to search for in the text files; each term is in its own row.
test* #search for test, tests, testing, etc
project
list
table
chair
Is there a wildcard that can be used in the CSV file so that all variants of a word are returned? I want to place the wildcard in the CSV file itself.
Below is my code. The file it reads the search terms from is contract_search_terms.csv
import csv
import glob
import os
import time

import pandas as pd


def main():
    txt_filepaths = glob.glob("**/*.txt", recursive=True)
    start = time.time()
    results = {}
    new_results = []  # place dictionary values organized per key = value instead of key = tuple of values
    term_filename = open('contract_search_terms.csv', 'r')  # file where the terms to be searched are found
    term_file = csv.DictReader(term_filename)
    search_terms = []  # append terms into a list, this means that we can use several columns and append them all to one list.
    #############search for the terms imported into the list############################################################
    for col in term_file:
        search_terms.append(col['Contract Terms'])  # indicate what columns you want read in
    print(search_terms)  # this is just a check to show what terms are in the list
    for filepath in txt_filepaths:
        print(f"Searching document {filepath}")  # print what file the code is reading
        search_terms = search_terms  # place terms list into search_terms so that the code below can read it when looping through the contracts.
        filename = os.path.basename(filepath)
        found_terms = {}  # dictionary of the terms found
        line_number = {}
        for term in search_terms:
            if term in found_terms.keys():
                continue
            with open(filepath, "r", encoding="utf-8") as fp:
                lines = str(fp.readlines()).split('.')  # turns contract file lines into a list
                for line in lines:
                    if line.find(term) != -1:  # line by line is '-1', paragraph '\n'
                        line_number = lines.index(line)
                        new_results.append(f"'{term}' New_Column '{filename}' New_Column '{line}' New_Column '{line_number}'")  # placing the results from the print statement below into a list
                        print(f"Found '{term}' in document '{filename}' in line '{line_number}'")
                        if term in results.keys():
                            pages = results[''.join(term)].append([filename, line, line_number])
                        else:
                            results[term] = [filename]
    # Place results into dataframe and create a csv file to use as a check if results_reports is not correct
    d2 = pd.DataFrame(new_results, columns=['Results'])  # passing the list to a dataframe and giving it a column title
    d2.to_csv('results.csv', index=True)


if __name__ == "__main__":
    main()
CodePudding user response:
In your particular case (test, tests, testing), the simple in operator might be sufficient, e.g.:
"test" in line
will evaluate to True for all three words, and
if "test" in line:
would do the work.
In more complex cases, you might want to use regular expressions.
CodePudding user response:
To keep the code flow simple, without testing each term for a wildcard and then branching to a different kind of search, I suggest you use the regular expression module (part of the standard Python distribution) for all terms:
import re
and treat the terms as regular expression patterns for the search:
lst_found_terms = re.findall(term, line)
if lst_found_terms != []:
    for found_term in lst_found_terms:
        ...  # handle each found_term here
instead of:
if line.find(term) != -1:
If you are looking for exactly 'test', the regex pattern will simply be the same as in the find() function (i.e. 'test'), and if you want to find all words beginning with 'test', the pattern will be r'\btest\w*'.
In other words, the 'wildcard' for any TERM ending is formed by enclosing the term between the prefix r'\b' and the suffix r'\w*' (stored in the CSV as: \bTERM\w*).
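As a minimal sketch of how patterns stored in the CSV could drive the search: the CSV content and the sample lines below are made-up stand-ins (only the column name 'Contract Terms' comes from the question), and io.StringIO replaces the real file so the example is self-contained. The csv module leaves the backslashes untouched, so \btest\w* arrives as a usable pattern.

```python
import csv
import io
import re

# Stand-in for contract_search_terms.csv; the sample rows are assumptions.
csv_text = "Contract Terms\n\\btest\\w*\nproject\n"
search_terms = [row["Contract Terms"] for row in csv.DictReader(io.StringIO(csv_text))]

# Stand-in for the lines of one text file.
lines = ["The project was tested.", "No match here.", "Testing and a test."]

hits = []  # (pattern, matched text, 1-based line number)
for line_number, line in enumerate(lines, start=1):
    for term in search_terms:
        for found_term in re.findall(term, line):
            hits.append((term, found_term, line_number))

print(hits)
```

Note that 'Testing' in the third line is not reported here, because the search is case sensitive by default.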
Regular expressions also allow a case-insensitive search if you pass the parameter flags=re.I to re.findall().
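A small illustration of the flag, using a made-up sample line (the variable names are only for this sketch):

```python
import re

line = "Test results: TESTING completed, tests passed."

# Default search is case sensitive: only the lowercase word matches.
case_sensitive = re.findall(r"\btest\w*", line)

# With flags=re.I the same pattern matches regardless of case.
case_insensitive = re.findall(r"\btest\w*", line, flags=re.I)

print(case_sensitive)
print(case_insensitive)
```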
The simple condition proposed in another answer, if 'test' in line:, will also evaluate to True for 'attestation' or 'contest'. To avoid this, the regex 'wildcard' sets a word boundary at the beginning of the term (r'\b').
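The difference can be seen on a short word list (the words are just sample data for this sketch):

```python
import re

words = ["test", "testing", "attestation", "contest"]

# Plain substring check: every word contains "test" somewhere.
substring_hits = [w for w in words if "test" in w]

# Word boundary at the start: only words that begin with "test".
boundary_hits = [w for w in words if re.search(r"\btest\w*", w)]

print(substring_hits)
print(boundary_hits)
```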
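Note that leaving the number of extension characters unlimited means the wildcard will also find 'testosterone'. You can limit the number of extending characters by replacing * with {0,3} for at most 3 additional characters and closing the pattern with another r'\b', i.e. r'\btest\w{0,3}\b' (covering tests and testing but not testosterone). Without the closing r'\b' the pattern would still match the leading 'testost' of 'testosterone'.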
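A quick sketch of the bounded quantifier on a made-up sample string. One detail worth checking: with {0,3} alone the regex can still match the first part of a longer word, so this sketch also closes the pattern with a second \b.

```python
import re

text = "test tests testing testosterone"

# Unlimited extension: every word starting with "test" is matched in full.
unbounded = re.findall(r"\btest\w*", text)

# At most 3 extra characters, but no closing boundary: a prefix of the long word still matches.
bounded_open = re.findall(r"\btest\w{0,3}", text)

# At most 3 extra characters and a closing word boundary: the long word is skipped.
bounded_closed = re.findall(r"\btest\w{0,3}\b", text)

print(unbounded)
print(bounded_open)
print(bounded_closed)
```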