I have downloaded a Wikipedia dump and am trying to read it line by line. When decoding it as UTF-8, I get the following error:
12633: FormatException: Unfinished UTF-8 octet sequence (at offset 65536)
Stacktrace :#0 _Utf8Decoder.convertSingle (dart:convert-patch/convert_patch.dart:1789:7)
#1 Utf8Decoder.convert (dart:convert/utf.dart:351:42)
#2 Utf8Codec.decode (dart:convert/utf.dart:63:20)
#3 _MapStream._handleData (dart:async/stream_pipe.dart:213:31)
#4 _ForwardingStreamSubscription._handleData (dart:async/stream_pipe.dart:153:13)
#5 _RootZone.runUnaryGuarded (dart:async/zone.dart:1618:10)
#6 _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:341:11)
#7 _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271:7)
#8 _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:774:19)
#9 _StreamController._add (dart:async/stream_controller.dart:648:7)
#10 _StreamController.add (dart:async/stream_controller.dart:596:5)
#11 _FileStream._readBlock.<anonymous closure> (dart:io/file_impl.dart:98:19)
<asynchronous suspension>
The error occurs at this line:
ar جزر_غالاباغوس 1 0
So I tried re-saving the file with UTF-8 encoding, but that does not seem to work.
This is my code:
final filePath = p.join(
  Directory.current.path,
  'bin\\migrate_most_views\\data\\pageviews-20220416-170000',
);
final file = File(filePath);
logger.stderr('exporting pageviews...');
StreamSubscription? reader;
int lineNumer = 0;
reader = file.openRead().map(utf8.decode).transform(LineSplitter()).listen(
  (line) {
    final page = MostViewedPageDaily.fromLine(line);
    db.collection('page_views').insert(page.toMap());
    lineNumer++;
    if (lineNumer % 1000 == 0) {
      logger.stdout('inserting at line $lineNumer');
    }
  },
  onDone: () {
    logger.stdout('Reader read $lineNumer lines');
    reader?.cancel();
    exit(0);
  },
  onError: (error, stackTrace) {
    final message = '$lineNumer: $error\n\nStacktrace :$stackTrace';
    logger.stdout(logger.ansi.error(message));
    exit(1);
  },
  cancelOnError: true,
);
What can I do?
I downloaded the file from here
https://dumps.wikimedia.org/other/pageviews/2022/2022-04/pageviews-20220417-010000.gz
Answer:
You should use file.openRead().transform(utf8.decoder) instead of file.openRead().map(utf8.decode). (Also note the argument difference: utf8.decoder is a Utf8Decoder object, and utf8.decode is a method tear-off.)
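As a rough sketch of what the corrected pipeline could look like in isolation: the function name countNonBlankLines and the command-line handling are made up for illustration and are not part of the question's code. It only uses File.openRead, utf8.decoder and LineSplitter, and replaces listen with an await for loop.

```dart
import 'dart:convert';
import 'dart:io';

/// Hypothetical helper: counts the non-blank lines of a UTF-8 text file.
/// The bytes are decoded as one continuous stream, so a multi-byte
/// character that straddles two read blocks is still decoded correctly.
Future<int> countNonBlankLines(String path) async {
  final lines = File(path)
      .openRead()
      .transform(utf8.decoder) // stateful decoder, keeps partial sequences
      .transform(const LineSplitter());

  var count = 0;
  await for (final line in lines) {
    if (line.trim().isEmpty) continue;
    // Per-line processing (parsing, inserting into a database) would go here.
    count++;
  }
  return count;
}

Future<void> main(List<String> args) async {
  if (args.isEmpty) {
    stderr.writeln('usage: dart run <script> <file>');
    exitCode = 64;
    return;
  }
  print(await countNonBlankLines(args.first));
}
```

With await for, errors from the stream surface as ordinary exceptions, so a try/catch around the loop can take the place of the onError callback.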
The Stream.map documentation specifically discusses this:
"Unlike transform, this method does not treat the stream as chunks of a single value. Instead each event is converted independently of the previous and following events, which may not always be correct. For example, UTF-8 encoding, or decoding, will give wrong results if a surrogate pair, or a multibyte UTF-8 encoding, is split into separate events, and those events are attempted encoded or decoded independently."
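To make the difference concrete, here is a small sketch that reproduces the problem without a file. It assumes a single line whose two-byte character 'ج' is deliberately split across two stream events, which imitates a file read block ending in the middle of a character. The helper name decodeLines and the sample line are invented for the demonstration.

```dart
import 'dart:async';
import 'dart:convert';

/// Hypothetical helper: decodes byte chunks as one continuous UTF-8
/// stream and splits the result into lines.
Future<List<String>> decodeLines(Stream<List<int>> chunks) {
  return chunks
      .transform(utf8.decoder)
      .transform(const LineSplitter())
      .toList();
}

Future<void> main() async {
  // 'ج' is encoded as two bytes; cutting at index 4 separates them.
  final List<int> bytes = utf8.encode('ar ج 1 0\n');
  final List<List<int>> chunks = [bytes.sublist(0, 4), bytes.sublist(4)];

  // Decoding each event on its own fails on the incomplete sequence.
  try {
    await Stream<List<int>>.fromIterable(chunks).map(utf8.decode).toList();
  } on FormatException catch (e) {
    print('map(utf8.decode) failed: ${e.message}');
  }

  // The stateful decoder carries the partial character over to the next event.
  final lines = await decodeLines(Stream<List<int>>.fromIterable(chunks));
  print(lines);
}
```

The stream is typed as Stream<List<int>> on purpose: that is what File.openRead returns, and it matches the input type that utf8.decoder expects.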