When processing a big file, some lines are not loaded. This happens, for example, with:
$ cat load.py
import pandas as pd
df = pd.read_csv('big.csv', on_bad_lines='warn')
$
Redirecting stderr to a file:
$ python load.py 2> err.log
$
$ cat err.log
line 19585196: expected 6 fields, saw 7
line 19703832: expected 6 fields, saw 8
line 1117482923: expected 6 fields, saw 9
$
We get the line numbers in err.log (there is obviously nothing to grep for in the data itself).
Then I need to see what these lines with a different structure look like:
$ sed -n '19585196p;19703832p;1117482923p' big.csv
It works very well and is incredibly fast, even with huge files.
My problem is that when I have thousands of lines to extract, bash complains that the argument list is too long.
Let's put the sed command in a file (the real one, with thousands of line numbers, is created programmatically in Python):
$ cat cmd.sed
# cmd
'19585196p;19703832p;111748p'
$
$ sed -f cmd.sed big.csv > /tmp/out
sed: file cmd.sed line 2: unknown command: `''
$
CodePudding user response:
You may use this sed command:
sed -n -f <(sed -n 's/^line \([0-9]*\).*/\1p/p' err.log) big.csv
This uses bash process substitution (<(...)): the output of the command
sed -n 's/^line \([0-9]*\).*/\1p/p' err.log
is seen by the calling sed command as if it were a file. Thus, a temporary file is unnecessary.
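If process substitution is not available, or if you prefer to keep generating the script from Python as in the question, here is a minimal sketch of how the script file could be written. It assumes the log lines start with "line N:" exactly as shown in err.log; the function names build_sed_script and write_sed_script are hypothetical. The key point is that a sed script file takes one command per line without shell quotes, which is what caused the "unknown command" error.

```python
import re
from pathlib import Path

# Assumption: each relevant log line starts with "line <number>",
# e.g. "line 19585196: expected 6 fields, saw 7".
LINE_RE = re.compile(r"^line (\d+)")


def build_sed_script(log_text: str) -> str:
    """Return a sed script with one 'Np' command per reported line number."""
    numbers = set()
    for log_line in log_text.splitlines():
        match = LINE_RE.match(log_line)
        if match:
            numbers.add(int(match.group(1)))
    # One command per line, no quotes: quotes are shell syntax, not sed syntax.
    return "".join(f"{n}p\n" for n in sorted(numbers))


def write_sed_script(log_path: str = "err.log", script_path: str = "cmd.sed") -> None:
    """Read the log and write the sed script next to it (hypothetical helper)."""
    Path(script_path).write_text(build_sed_script(Path(log_path).read_text()))
```

After calling write_sed_script(), the lines can be extracted with sed -n -f cmd.sed big.csv. Note the -n option: without it, sed prints every line of the file in addition to the selected ones.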
CodePudding user response:
I have managed to reproduce your problem: I created a big file containing interesting information on lines 19585196, 19703832 and 111748.
Then I launched your sed command in order to get those lines, based on the line numbers.
In order to find those line numbers, I had launched this command:
grep -n "interesting_information" Book1.txt | awk -F: '{print $1}'
The grep -n makes sure that the lines with the information are preceded by their line number, and the awk command makes sure only those line numbers are shown.
However, if I just launched the following command:
grep "interesting_information" Book1.txt
... then I would not even need to use the line numbers: I would just take the lines containing the information.
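To illustrate the same idea for a CSV file, here is a minimal sketch that selects the lines directly instead of going through line numbers. It assumes the lines of interest are the rows whose field count differs from the expected one (6 in the err.log of the question) and it uses Python's standard csv module rather than pandas; the function name rows_with_unexpected_width is hypothetical.

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def rows_with_unexpected_width(
    lines: Iterable[str], expected: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line number, fields) for every row whose field count differs."""
    reader = csv.reader(lines)
    for row in reader:
        # Blank lines come back as empty rows; skip them.
        if row and len(row) != expected:
            # line_num is the number of physical lines read so far.
            yield reader.line_num, row
```

It could be used as: open big.csv with open("big.csv", newline=""), pass the file object and expected=6 to the function, and print each yielded pair. The file is read as a stream, so it does not need to fit in memory.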
This, as you can imagine, is just an illustration of the XY problem as mentioned by Tripleee. Can you show how you obtained those line numbers? That will most probably reveal a much simpler way to solve your issue.
(Please don't add this as a comment, but edit your question accordingly.)