Efficient way to iterate over a vast number of rows


Let's assume I have a .txt file consisting of millions of lines, each terminated by a period:

dog breed poodle .
dog breed golden retriever .
cat breed some breed .
dog weight 10 .
cat weight 15 .
fish location sea .
cat location house .

From left to right, let's call the columns subject property value. My goal is to create for each unique property a new table, with 2 columns, namely subject and value. Given the example above, the output would be:

breed

dog poodle
dog golden retriever
cat some breed

weight

dog 10
cat 15

location

fish sea
cat house

I need some ideas on what data structure to choose to build these tables. For example, one approach could be a dictionary, where each key represents a property, and for each table, consisting of subject and value, we could use pandas.

Since I don't think that pandas is a good choice for this task, I am also looking for an efficient library (if any) that can achieve this goal with good performance.
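If pandas is ruled out, one lightweight alternative is the standard library alone: group the rows in a defaultdict keyed by property and write each group out with the csv module. This is only a sketch under the assumption that the goal is one tab-separated output file per property; the sample lines and the .tsv filenames are hypothetical:

```python
import csv
from collections import defaultdict

# Hypothetical in-memory sample; in practice, iterate over the real file
# with: for line in open("animals.txt")
lines = [
    "dog breed poodle .",
    "dog breed golden retriever .",
    "dog weight 10 .",
]

tables = defaultdict(list)
for line in lines:
    # First token is the subject, second the property, the rest (minus the
    # trailing ".") is the value, which may itself contain spaces.
    subject, prop, *rest = line.split()
    tables[prop].append((subject, " ".join(rest[:-1])))

# Write each property's table to its own tab-separated file.
for prop, rows in tables.items():
    with open(f"{prop}.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["subject", "value"])
        writer.writerows(rows)
```

Writing the tables out as they are built keeps memory bounded by the grouping dictionary rather than by any third-party table object.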

CodePudding user response:

The most straightforward approach is to just read the file and build a dictionary keyed by property, mapping each property to a list of (subject, value) tuples. If the file were in a more standard CSV format this would be easier, but it's not too hard to parse with careful use of split and join:

>>> tables = {}
>>> with open("animals.txt") as f:
...     for line in f:
...         subject, prop, *v = line.strip().split(' ')
...         value = ' '.join(v[:-1])
...         tables.setdefault(prop, []).append((subject, value))
...
>>> tables
{'breed': [('dog', 'poodle'), ('dog', 'golden retriever'), ('cat', 'some breed')], 'weight': [('dog', '10'), ('cat', '15')], 'location': [('fish', 'sea'), ('cat', 'house')]}
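If pandas tables are wanted after all, a dictionary like the one above converts directly into one two-column DataFrame per property. A minimal sketch, using a small in-memory sample (hypothetical, standing in for the real file) so it is self-contained:

```python
import pandas as pd

# Hypothetical sample mirroring the file's format; in practice, build
# `tables` from the real file as shown above.
lines = [
    "dog breed poodle .",
    "dog weight 10 .",
    "cat weight 15 .",
]

tables = {}
for line in lines:
    subject, prop, *rest = line.split()
    tables.setdefault(prop, []).append((subject, " ".join(rest[:-1])))

# One two-column DataFrame per property.
frames = {
    prop: pd.DataFrame(rows, columns=["subject", "value"])
    for prop, rows in tables.items()
}
print(frames["weight"])
```

Building the plain dictionary first and converting once at the end avoids the cost of appending to a DataFrame row by row, which is slow for millions of lines.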