I have many CSV files (~25k in total) with a combined size of ~5 GB. These files are on a network path, and I need to search all of them for several strings and save the names of the files in which these strings are found (in an output file, for example).
I've already tried two things:
- On Windows, I've used findstr:
findstr /s "MYSTRING" *.csv > Output.txt
- With Windows PowerShell:
gci -r "." -filter "*.csv" | Select-String "MYSTRING" -list > .\Output.txt
I can also use Python, but I don't think it would be faster.
Is there any other way to speed up this search?
To clarify: the files do not share a common structure. They are CSV files, but they could just as well be plain TXT files.
CodePudding user response:
You can use pandas to go through large CSV files. Use the read_csv() function to read the contents of a CSV file, then the query() method to filter the rows, and then to_csv() to export the results to a separate CSV file.
import pandas as pd
df = pd.read_csv('csv_file.csv')
result = df.query('column_name == "filtered_strings"')
result.to_csv('filtered_result.csv', index=False)
Hopefully this helps you.
CodePudding user response:
One of the fastest ways to search text files in PowerShell is the switch statement with the parameters -File FILENAME -Regex:
Get-ChildItem -Recurse -Filter *.csv -PV file | ForEach-Object {
    switch -File $_.Fullname -Regex {
        'MYSTRING|ANOTHERSTRING' { $file.FullName; break }
    }
} | Set-Content output.txt
This outputs the full paths of files that contain the substring "MYSTRING" or "ANOTHERSTRING".
switch -File $_.Fullname -Regex reads the current file line by line, applying the regular expression to each line. We use break to stop searching as soon as the first match has been found.
The parameter -PV file (alias of -PipelineVariable) for Get-ChildItem is used so that we have access to the current file path inside the switch statement. Inside the switch statement, $_ denotes the current line being tested, so it hides the $_ of the ForEach-Object command. Using -PV, we provide another name for the $_ variable of ForEach-Object.
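Since the question mentions Python as an option, the same early-exit idea (read each file line by line and stop at the first hit) can be sketched there too. This is only a minimal sketch, not a benchmarked replacement: the function name files_containing and its parameters are made up for illustration, it does a literal, case-sensitive substring test rather than a regular-expression match, and it assumes the files use an ASCII-compatible encoding such as UTF-8. It makes no claim to be faster than the PowerShell version; on a network path, the file I/O will probably dominate either way.

```python
import os


def files_containing(root, needles, suffix=".csv"):
    """Yield the path of every file under root that contains any needle.

    Hypothetical helper: literal, case-sensitive matching on raw bytes.
    """
    # Compare bytes so that no time is spent decoding every line.
    byte_needles = [needle.encode("utf-8") for needle in needles]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                for line in handle:
                    if any(needle in line for needle in byte_needles):
                        yield path
                        break  # first hit is enough, skip the rest of the file
```

Calling files_containing(".", ["MYSTRING", "ANOTHERSTRING"]) and writing each yielded path to output.txt would mirror the PowerShell pipeline above.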