I have a CSV (extracted from BZ2) where only some values are encoded:
hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0
The |, 0 and 1 characters definitely appear as intended, but the other values are clearly encoded. In fact, they look like text-compression replacements, which could mean the CSV had its values compressed and was then also compressed as a whole to BZ2.
I get the same results whether I extract the BZ2 with 7zip and open the CSV in a text editor, open it with the Python bz2 module, or read it with Pandas and read_csv:
import bz2
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd
contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")
How can I identify which type of encoding to decode with?
Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main
Source file: test-balanced.csv.bz2
First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh
I asked the original authors of the CSV/dataset, but they didn't respond, which is understandable.
CodePudding user response:
From readme.txt:
File Guide:
- raw/key.csv: column key for raw/sarc.csv
- raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
- */comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
- */*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
    post_id comment_id … comment_id|response_id … response_id|label … label
  where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response.
Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces.
The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.
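As a minimal sketch of that row format (not code from the dataset authors), a single line can be split with plain string operations. The names SarcRow and parse_row are made up for illustration, and the sample line is one of the rows quoted in the question.

```python
from typing import List, NamedTuple


class SarcRow(NamedTuple):
    """One CSV row, split according to the readme (hypothetical container)."""
    post_id: str
    comment_ids: List[str]
    response_ids: List[str]
    labels: List[int]


def parse_row(line: str) -> SarcRow:
    # Three entries delimited by '|': comment chain, responses, labels.
    chain, responses, labels = line.rstrip("\n").split("|")
    # The chain is a post_id followed by zero or more comment_ids.
    post_id, *comment_ids = chain.split(" ")
    response_ids = responses.split(" ")
    label_values = [int(value) for value in labels.split(" ")]
    # The readme says responses and labels have the same number of elements.
    if len(response_ids) != len(label_values):
        raise ValueError("responses and labels differ in length")
    return SarcRow(post_id, comment_ids, response_ids, label_values)


row = parse_row("hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0")
print(row.post_id, row.comment_ids, row.response_ids, row.labels)
```

So the values are not encoded text at all: each one is an ID to be looked up in comments.json.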
Converting the readme description above into a Python code snippet:
import pandas as pd
import json
from pprint import pprint

# The CSV has no header row; its three columns are delimited by '|'.
file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts','responses','labels'],
                       encoding='utf-8')

# comments.json maps each ID to the comment's text and metadata.
file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

# Look up every ID of the first CSV row in the JSON dictionary.
print(f'{chr(0x20)*30} First csv line decoded:')
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])
for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])
Note that the files were (manually) downloaded from the pol directory because of their acceptable size (pol: contains subset of main dataset corresponding to comments in /r/politics).
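If you'd rather not extract the archive first, the same split should also work straight from the .bz2 file. The following is only a sketch using the standard library; read_rows is a hypothetical helper name, and the tiny sample file is built from two rows quoted in the question so that the snippet runs without the real download.

```python
import bz2
import csv
import os
import tempfile


def read_rows(path):
    """Yield (chain_ids, response_ids, labels) from a '|'-delimited .csv.bz2."""
    # Mode "rt" makes bz2.open decompress and decode to text in one step.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for chain, responses, labels in csv.reader(f, delimiter="|"):
            yield chain.split(), responses.split(), [int(x) for x in labels.split()]


# Stand-in data: two rows copied from the question, not the real file.
sample = "hoxvh|c1x6nos c1x6e26|0 1\nhpibh c1xcjy1|c1xe4yn c1xd1gh|1 0\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv.bz2")
    with bz2.open(sample_path, mode="wt", encoding="utf-8") as f:
        f.write(sample)
    # Consume the generator while the temporary file still exists.
    rows = list(read_rows(sample_path))

for chain_ids, response_ids, labels in rows:
    print(chain_ids, response_ids, labels)
```

To use it on the real data, pass the path of the downloaded test-balanced.csv.bz2 to read_rows instead of the temporary sample.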
Result of running the pandas snippet above (D:\bat\SO\71596864.py):
First csv line decoded:
post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}
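Since the labels line up with the response_ids by position, the two can be zipped together to get (text, label) pairs. A rough sketch, assuming the comments.json structure shown in the output above; labelled_responses is a hypothetical helper, and the dictionary here holds only the two responses printed above instead of the full file.

```python
def labelled_responses(line, comments):
    """Return [(response_text, label), ...] for one CSV row."""
    _chain, responses, labels = line.strip().split("|")
    # The n-th label belongs to the n-th response_id.
    return [
        (comments[response_id]["text"], int(label))
        for response_id, label in zip(responses.split(), labels.split())
    ]


# Stand-in for json.load() on comments.json, trimmed to the 'text' field.
comments = {
    "c1xiujs": {
        "text": "And we're upset since the Democrats would *never* try "
                "something as sneaky as this, right?",
    },
    "c1xj4e2": {
        "text": "Oooh baby you caught me red handed Creepin' on the senate "
                "floor Picture this we were makin' up candidates Being huge "
                "election whores",
    },
}

# Second row quoted in the question (first row of the pol file used above).
pairs = labelled_responses("hqa1x|c1xiujs c1xj4e2|1 0", comments)
for text, label in pairs:
    print(label, text)
```

For this row the first response (c1xiujs) carries label 1, i.e. it is the one marked as sarcastic.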