How can I prevent FormatException: Unfinished UTF-8 octet sequence

I have downloaded a Wikipedia dump and am trying to read it line by line, but when decoding it as UTF-8 I get the following error:

12633: FormatException: Unfinished UTF-8 octet sequence (at offset 65536)

Stacktrace :#0      _Utf8Decoder.convertSingle (dart:convert-patch/convert_patch.dart:1789:7)
#1      Utf8Decoder.convert (dart:convert/utf.dart:351:42)
#2      Utf8Codec.decode (dart:convert/utf.dart:63:20)
#3      _MapStream._handleData (dart:async/stream_pipe.dart:213:31)
#4      _ForwardingStreamSubscription._handleData (dart:async/stream_pipe.dart:153:13)
#5      _RootZone.runUnaryGuarded (dart:async/zone.dart:1618:10)
#6      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:341:11)
#7      _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271:7)
#8      _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:774:19)
#9      _StreamController._add (dart:async/stream_controller.dart:648:7)
#10     _StreamController.add (dart:async/stream_controller.dart:596:5)
#11     _FileStream._readBlock.<anonymous closure> (dart:io/file_impl.dart:98:19)
<asynchronous suspension>

That is this line

ar جزر_غالاباغوس 1 0

So I tried re-saving the file with UTF-8 encoding, but that does not seem to work.

This is my code

final filePath = p.join(
    Directory.current.path,
    'bin\\migrate_most_views\\data\\pageviews-20220416-170000',
  );
  final file = File(filePath);

  logger.stderr('exporting pageviews...');

  StreamSubscription? reader;
  int lineNumer = 0;
  reader = file.openRead().map(utf8.decode).transform(LineSplitter()).listen(
    (line) {
      final page = MostViewedPageDaily.fromLine(line);
      db.collection('page_views').insert(page.toMap());

      lineNumer++;
      if (lineNumer % 1000 == 0) {
        logger.stdout('inserting at line $lineNumer');
      }
    },
    onDone: () {
      logger.stdout('Reader read $lineNumer lines');
      reader?.cancel();
      exit(0);
    },
    onError: (error, stackTrace) {
      final message = '$lineNumer: $error\n\nStacktrace :$stackTrace';
      logger.stdout(logger.ansi.error(message));
      exit(1);
    },
    cancelOnError: true,
  );

What can I do?

I downloaded the file from here

https://dumps.wikimedia.org/other/pageviews/2022/2022-04/pageviews-20220417-010000.gz

CodePudding user response：

You should use file.openRead().transform(utf8.decoder) instead of file.openRead().map(utf8.decode). (Also note the argument difference: utf8.decoder is a Utf8Decoder object, and utf8.decode is a method tear-off.)
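As a rough sketch of what that change could look like, assuming the file is already decompressed plain text: the helper name readLines and the command-line argument handling below are made up for illustration and are not part of the original code.

```dart
import 'dart:convert';
import 'dart:io';

/// Streams the lines of a UTF-8 text file.
/// (readLines is a hypothetical helper name.)
Stream<String> readLines(File file) {
  return file
      .openRead()
      // utf8.decoder is applied to the stream as a whole, so a multibyte
      // character split across two chunks is still decoded correctly.
      .transform(utf8.decoder)
      .transform(const LineSplitter());
}

Future<void> main(List<String> args) async {
  // Assumption: the path to the pageviews file is the first argument.
  final file = File(args.first);
  var lineNumber = 0;
  await for (final line in readLines(file)) {
    lineNumber++;
    if (lineNumber % 1000 == 0) {
      stdout.writeln('at line $lineNumber: $line');
    }
  }
  stdout.writeln('Reader read $lineNumber lines');
}
```

The same two transform calls can replace .map(utf8.decode).transform(LineSplitter()) in the listen-based code from the question; the await for loop here is just one way to consume the stream.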

The Stream.map documentation specifically discusses this:

Unlike transform, this method does not treat the stream as chunks of a single value. Instead each event is converted independently of the previous and following events, which may not always be correct. For example, UTF-8 encoding, or decoding, will give wrong results if a surrogate pair, or a multibyte UTF-8 encoding, is split into separate events, and those events are attempted encoded or decoded independently.
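To see the difference without the real file, here is a small self-contained sketch. It takes the line from the question, cuts the bytes in the middle of a two-byte Arabic character to imitate a chunk boundary, and then decodes the two chunks both ways. The function name decodeLines and the second sample line are invented for the example.

```dart
import 'dart:convert';

/// Decodes byte chunks as one continuous UTF-8 stream, then splits into lines.
/// (decodeLines is a hypothetical helper name.)
Future<List<String>> decodeLines(Stream<List<int>> chunks) {
  return chunks
      .transform(utf8.decoder)
      .transform(const LineSplitter())
      .toList();
}

Future<void> main() async {
  final bytes = utf8.encode('ar جزر_غالاباغوس 1 0\nen Main_Page 2 0\n');

  // 'ar ' is 3 bytes and the first Arabic letter takes 2 bytes,
  // so cutting at index 4 leaves half a character in each chunk.
  final first = bytes.sublist(0, 4);
  final second = bytes.sublist(4);

  // Decoding a chunk on its own (what .map(utf8.decode) does) fails.
  try {
    utf8.decode(first);
  } on FormatException catch (e) {
    print('per-chunk decode failed: ${e.message}');
  }

  // Decoding the chunks as one stream works.
  final lines = await decodeLines(
    Stream<List<int>>.fromIterable([first, second]),
  );
  for (final line in lines) {
    print(line);
  }
}
```

file.openRead() hands out the file in fixed-size blocks without regard to character boundaries, which would explain why the error in the question is reported at offset 65536.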