Searching string among 5Gb of text files

Time:02-02

I have about 25k CSV files with a total size of ~5 GB. The files are on a network path. I need to search all of them for several strings and save the names of the files in which the strings are found (in an output file, for example).

I've already tried two things:

  1. With Windows I've used findstr : findstr /s "MYSTRING" *.csv > Output.txt
  2. With Windows PowerShell: gci -r "." -filter "*.csv" | Select-String "MYSTRING" -list > .\Output.txt

I could also use Python, but I doubt it would be any faster.

Is there any other way to speed up this search?


Clarification: the files do not share a common structure. They have a .csv extension, but they could just as well be plain TXT files.

CodePudding user response:

You can use pandas to process large CSV files: read a file with the read_csv() method, filter its rows with the query() method, and then write the matching rows to a separate CSV file with to_csv().

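import pandas as pd

# 'csv_file.csv', 'column_name' and 'filtered_strings' are placeholders
df = pd.read_csv('csv_file.csv')
# keep only the rows whose column_name equals the search string
result = df.query('column_name == "filtered_strings"')
result.to_csv('filtered_result.csv', index=False)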

Hopefully this helps you.

CodePudding user response:

One of the fastest ways to search in text files using PowerShell is switch with parameters -File FILENAME -Regex:

Get-ChildItem -Recurse -Filter *.csv -PV file | ForEach-Object {
    switch -File $_.Fullname -Regex {
        'MYSTRING|ANOTHERSTRING' { $file.FullName; break }
    }
} | Set-Content output.txt

This outputs the full paths of the files that contain the substring "MYSTRING" or "ANOTHERSTRING".

switch -File $_.Fullname -Regex reads the current file line by line, applying the regular expression to each line. We use break to stop searching when the first match has been found.

The parameter -PV file (-PV is an alias of -PipelineVariable) on Get-ChildItem gives us access to the current file inside the switch statement. Within the switch statement, $_ denotes the line currently being tested, so it hides the $_ of the ForEach-Object command. With -PV we give the current pipeline object of Get-ChildItem a second name, $file, that stays visible inside the switch.
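
Since the question mentions Python as an option, here is a rough sketch of the same idea (scan each file line by line and stop at the first hit) using only the standard library. The function name files_containing and its parameters are made up for illustration. The search terms are treated as literal strings rather than regular expressions, and the files are read as raw bytes, which assumes an ASCII-compatible encoding such as UTF-8 or a Windows code page (UTF-16 files would not match). Whether this is faster than the PowerShell version over a network path has not been measured here.

```python
import os
import re


def files_containing(root, patterns, extensions=(".csv",)):
    """Yield paths under root that contain at least one of the patterns.

    Hypothetical helper: patterns are literal strings, not regexes.
    """
    # One compiled alternation, so each line is scanned only once.
    regex = re.compile(b"|".join(re.escape(p.encode("utf-8")) for p in patterns))
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Binary mode avoids decoding errors on files of unknown encoding.
            with open(path, "rb") as handle:
                for line in handle:
                    if regex.search(line):
                        yield path
                        break  # first hit is enough; skip the rest of the file
```

Iterating over files_containing(".", ["MYSTRING", "ANOTHERSTRING"]) and writing each yielded path to a file gives the same kind of list as output.txt above.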
