Identifying partial character encoding/compression in text content

I have a CSV (extracted from BZ2) where only some values are encoded:

hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0

The |, 0 and 1 characters clearly appear as intended, but the other values look encoded. In fact, they look like text-compression replacements, which could mean the values were compressed individually before the CSV as a whole was compressed to BZ2.

I get the same results whether I extract the BZ2 with 7-Zip and open the CSV in a text editor, open it with Python's bz2 module, or read it with pandas read_csv:

import bz2

with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd

contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")

How can I identify which encoding to decode these values with?

Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main

Source file: test-balanced.csv.bz2

First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh

I asked the original authors of the CSV/dataset, but they didn't respond, which is understandable.

Answer:

From readme.txt:

File Guide:

raw/key.csv: column key for raw/sarc.csv

raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json

*/comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format

*/*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
post_id comment_id … comment_id|response_id … response_id|label … label
where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response.
Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces.
The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.
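
In other words, the values are not encoded at all: they are IDs, and each row can be split with plain string operations. A minimal sketch, assuming Python 3.9+; the names SarcRow and parse_row are hypothetical, and the sample rows are copied from the question:

```python
from typing import NamedTuple


class SarcRow(NamedTuple):
    chain: list[str]      # post_id followed by 0 or more comment_ids
    responses: list[str]  # response_ids replying to the last chain element
    labels: list[int]     # 1 = sarcastic, 0 = not; aligned with responses


def parse_row(line: str) -> SarcRow:
    """Split one 'chain|responses|labels' row into its three entries."""
    chain, responses, labels = line.strip().split("|")
    row = SarcRow(chain.split(), responses.split(),
                  [int(x) for x in labels.split()])
    # The readme says responses and labels always have the same length.
    if len(row.responses) != len(row.labels):
        raise ValueError(f"responses/labels length mismatch: {line!r}")
    return row


# Sample rows copied from the question.
sample = [
    "hqa1x|c1xiujs c1xj4e2|1 0",
    "hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0",
]

for line in sample:
    row = parse_row(line)
    # Pair each response_id with its label.
    print(row.chain, dict(zip(row.responses, row.labels)))
```

The length check is only a sanity guard based on the readme's description; it may be dropped if the files are trusted.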

Converting the readme's description into a Python snippet that looks up each ID of the first CSV row in comments.json:

import pandas as pd
import json
from pprint import pprint

# Each row has three '|'-delimited entries: comment chain, responses, labels.
file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts', 'responses', 'labels'],
                       encoding='utf-8')

# comments.json maps every post/comment/response ID to its text and metadata.
file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

indent = ' ' * 30  # padding so the headings stand out from the pprint output

# Elements within an entry are space-delimited; look up each ID of the first row.
print(f'{indent} First csv line decoded:')
for post_id in data_csv['posts'][0].split(' '):
    print(f'{indent} post_id: {post_id}')
    pprint(data_json[post_id])

for response_id in data_csv['responses'][0].split(' '):
    print(f'{indent} response_id: {response_id}')
    pprint(data_json[response_id])

Note that the files were (manually) downloaded from the pol directory because of their manageable size (pol: contains the subset of the main dataset corresponding to comments in /r/politics).

Result: D:\bat\SO\71596864.py

                               First csv line decoded:
                               post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
                               response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
                               response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}
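
The run above prints the comments but leaves the labels column unused, and it works on an already extracted CSV. A rough sketch of reading the .bz2 directly with the standard bz2 module and attaching each label to its response text follows. The function name iter_labeled_responses is hypothetical, the comments dict is a stand-in for comments.json trimmed to the two responses shown above, and the demo writes a one-row sample file to a temporary directory so that it runs without the dataset:

```python
import bz2
import tempfile
from pathlib import Path


def iter_labeled_responses(csv_bz2_path, comments):
    """Yield (response_id, label, text) for every response in the file."""
    # Text mode ("rt") decompresses and decodes line by line.
    with bz2.open(csv_bz2_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            _chain, responses, labels = line.split("|")
            for response_id, label in zip(responses.split(), labels.split()):
                # Missing IDs give text=None instead of raising KeyError.
                text = comments.get(response_id, {}).get("text")
                yield response_id, int(label), text


# Stand-in for comments.json (only the 'text' field is kept).
comments = {
    "c1xiujs": {"text": "And we're upset since the Democrats would *never* "
                        "try something as sneaky as this, right?"},
    "c1xj4e2": {"text": "Oooh baby you caught me red handed Creepin' on the "
                        "senate floor Picture this we were makin' up "
                        "candidates Being huge election whores"},
}

with tempfile.TemporaryDirectory() as tmp:
    # One row copied from the question, compressed the same way as the dataset.
    sample_path = Path(tmp) / "sample.csv.bz2"
    with bz2.open(sample_path, mode="wt", encoding="utf-8") as f:
        f.write("hqa1x|c1xiujs c1xj4e2|1 0\n")
    labeled = list(iter_labeled_responses(sample_path, comments))

for response_id, label, text in labeled:
    print(response_id, "sarcastic" if label == 1 else "not sarcastic", text)
```

For the real files, the comments argument would be the dictionary loaded from comments.json with json.load, as in the snippet above.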