Let's assume I have a .txt file consisting of millions of lines, each ending with a space and a period (" ."):
dog breed poodle .
dog breed golden retriever .
cat breed some breed .
dog weight 10 .
cat weight 15 .
fish location sea .
cat location house .
From left to right, let's call the columns subject, property, and value. My goal is to create, for each unique property, a new table with two columns, namely subject and value. Given the example above, the output would be:
breed
dog poodle
dog golden retriever
cat some breed
weight
dog 10
cat 15
location
fish sea
cat house
I need some ideas on what data structure to choose to build these tables. For example, one approach could be a dictionary, where each key represents a property, and the table of subject and value pairs for that property could be a pandas DataFrame.
Since I don't think that pandas is a good choice for this task, I am also looking for an efficient library (if any?) to achieve this goal with proper performance.
CodePudding user response:
The most straightforward approach is to just read the file and create a dictionary of lists of (subject, value) tuples, with property as the key. If the file were in a more standard CSV format this would be easier, but it's not too hard to parse with careful use of split and join:
>>> tables = {}
>>> with open("animals.txt") as f:
... for line in f:
...         subject, prop, *v = line.strip().split(' ')  # first two fields, rest belongs to the value
...         value = ' '.join(v[:-1])  # rejoin multi-word values, dropping the trailing "."
... tables.setdefault(prop, []).append((subject, value))
...
>>> tables
{'breed': [('dog', 'poodle'), ('dog', 'golden retriever'), ('cat', 'some breed')], 'weight': [('dog', '10'), ('cat', '15')], 'location': [('fish', 'sea'), ('cat', 'house')]}
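If you then want an actual table per property, the dictionary converts cheaply. Below is a minimal sketch, assuming the tables dict built above and hypothetical per-property output filenames like breed.csv; it writes each property to its own CSV file with the standard csv module, or alternatively wraps each list in a pandas DataFrame:
import csv
import pandas as pd

# Write one CSV file per property, e.g. breed.csv, weight.csv, location.csv
for prop, rows in tables.items():
    with open(f"{prop}.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["subject", "value"])  # header row
        writer.writerows(rows)                 # the (subject, value) tuples

# Or, if you do want DataFrames, build one per property:
frames = {prop: pd.DataFrame(rows, columns=["subject", "value"])
          for prop, rows in tables.items()}
print(frames["breed"])
The dict of lists stays the cheap, append-friendly structure while you stream through the file; converting to DataFrames or files only once at the end avoids per-line pandas overhead.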