I have an array called "data" which contains the following information.
[['amazon',
'phone',
'serious',
'mind',
'blown',
'serious',
'enjoy',
'use',
'applic',
'full',
'blown',
'websit',
'allow',
'quick',
'track',
'packag',
'descript',
'say'],
['would',
'say',
'app',
'real',
'thing',
'show',
'ghost',
'said',
'quot',
'orang',
'quot',
'ware',
'orang',
'cloth',
'app',
'adiquit',
'would',
'recsmend',
'want',
'talk',
'ghost'],
['love',
'play',
'backgammonthi',
'game',
'offer',
'varieti',
'difficulti',
'make',
'perfect',
'beginn',
'season',
'player'],
I would like to save in a list the values that make up at least 1% of all the elements in this array.
The closest approximation I have found is the following, but it does not return what I need. Any ideas?
import numpy as np
import numpy_indexed as npi
idx = [np.ones(len(a))*i for i, a in enumerate(tokens_list_train)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(tokens_list_train))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)
CodePudding user response:
Let's see: we can remove the nesting using itertools.chain.from_iterable, but we also need the total length, which we can compute by wrapping the list in another generator to avoid looping twice, and we need to count the repetitions, which is done with a Counter.
from collections import Counter
from itertools import chain
total_length = 0
def sum_sublist_length(some_list):  # to sum the lengths of the sub-lists
    global total_length
    for value in some_list:
        total_length += len(value)  # accumulate the running total
        yield value
# my_list is the nested list from the question
counts = Counter(chain.from_iterable(sum_sublist_length(my_list)))
items = [item for item in counts if counts[item]/total_length >= 0.01]
print(items)
['amazon', 'phone', 'serious', 'mind', 'blown', 'enjoy', 'use', 'applic', 'full', 'websit', 'allow', 'quick', 'track', 'packag', 'descript', 'say', 'would', 'app', 'real', 'thing', 'show', 'ghost', 'said', 'quot', 'orang', 'ware', 'cloth', 'adiquit', 'recsmend', 'want', 'talk', 'love', 'play', 'backgammonthi', 'game', 'offer', 'varieti', 'difficulti', 'make', 'perfect', 'beginn', 'season', 'player']
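A variant of the same idea that avoids the global counter, as a minimal sketch: since the Counter already holds every count, the total can be taken from it afterwards. The helper name `frequent_items` and its `min_share` parameter are made up for illustration, and the `data` below is a shortened stand-in for the array in the question.

```python
from collections import Counter
from itertools import chain


def frequent_items(nested, min_share=0.01):
    """Return the items whose share of all elements is at least min_share."""
    counts = Counter(chain.from_iterable(nested))
    total = sum(counts.values())  # total number of elements across all sub-lists
    if total == 0:
        return []
    return [item for item, n in counts.items() if n / total >= min_share]


# Shortened, hypothetical sample in the same shape as the question's data
data = [["amazon", "phone", "serious", "serious"], ["would", "say", "app", "app"]]
print(frequent_items(data))                  # default 1% threshold
print(frequent_items(data, min_share=0.25))  # stricter threshold
```

Note that the sample in the question has only 18 + 21 + 12 = 51 tokens, so a single occurrence is already 1/51, roughly 1.96%. That is why the output above contains every distinct token: the 1% threshold only starts filtering once the array holds more than 100 elements.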
CodePudding user response:
Here's another way to generate a list of elements that appear 1% or more of the time, using a pandas.Series:
import pandas as pd
# == Define `flatten` function to combine objects with multi-level nesting =======
def flatten(iterable, base_type=None, levels=None):
    """Flatten an iterable with multiple levels of nesting.
    >>> iterable = [(1, 2), ([3, 4], [[5], [6]])]
    >>> list(flatten(iterable))
    [1, 2, 3, 4, 5, 6]
    Binary and text strings are not considered iterable and
    will not be collapsed.
    To avoid collapsing other types, specify *base_type*:
    >>> iterable = ['ab', ('cd', 'ef'), ['gh', 'ij']]
    >>> list(flatten(iterable, base_type=tuple))
    ['ab', ('cd', 'ef'), 'gh', 'ij']
    Specify *levels* to stop flattening after a certain level:
    >>> iterable = [('a', ['b']), ('c', ['d'])]
    >>> list(flatten(iterable)) # Fully flattened
    ['a', 'b', 'c', 'd']
    >>> list(flatten(iterable, levels=1)) # Only one level flattened
    ['a', ['b'], 'c', ['d']]
    """
    def walk(node, level):
        # Stop descending at the level limit, at strings/bytes, or at base_type
        if (
            ((levels is not None) and (level > levels))
            or isinstance(node, (str, bytes))
            or ((base_type is not None) and isinstance(node, base_type))
        ):
            yield node
            return
        try:
            tree = iter(node)
        except TypeError:
            # Not iterable, so it is a leaf
            yield node
            return
        else:
            for child in tree:
                yield from walk(child, level + 1)
    yield from walk(iterable, 0)
# == Problem Solution ==========================================================
# 1. Flatten the array into a single-level list of elements, then convert it
#    to a `pandas.Series`. Here `array` is the nested list from the question.
series_array = pd.Series(list(flatten(array)))
# 2. Get the total number of elements in the flattened list.
element_count = len(series_array)
# 3. Use `pandas.Series.value_counts()` to count the number of times each
#    element appears, then divide each count by the total number of
#    elements in the flattened list (`element_count`).
elements = (
    (series_array.value_counts()/element_count)
    # 4. Use `pandas.Series.loc` to keep only the values that appear at
    #    least 1% of the time.
    .loc[lambda value: value >= 0.01]
    # 5. Select the elements, and convert the results to a list.
    .index.to_list()
)
print(elements)
['would', 'serious', 'blown', 'quot', 'orang', 'app', 'ghost', 'say', 'use', 'adiquit', 'enjoy', 'said', 'cloth', 'thing', 'applic', 'talk', 'player', 'track', 'recsmend', 'beginn', 'packag', 'allow', 'perfect', 'want', 'real', 'love', 'full', 'show', 'play', 'make', 'backgammonthi', 'mind', 'amazon', 'game', 'difficulti', 'offer', 'descript', 'websit', 'quick', 'season', 'phone', 'varieti', 'ware']
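If the array only has one level of nesting, as in the question, the same result can probably be reached with less code. This is a rough sketch, not a drop-in replacement: it assumes a single nesting level, relies on `value_counts(normalize=True)` to return relative frequencies directly, and uses a shortened, made-up `data` sample with a 25% threshold so that the filter actually removes something.

```python
from itertools import chain

import pandas as pd

# Shortened, hypothetical sample in the same shape as the question's data
data = [["amazon", "phone", "serious", "serious"], ["would", "say", "app", "app"]]

# Flatten one level of nesting and wrap the tokens in a Series
tokens = pd.Series(list(chain.from_iterable(data)))

# normalize=True gives each token's share of all elements instead of a raw count
shares = tokens.value_counts(normalize=True)

# Keep the tokens whose share reaches the threshold (0.01 for the 1% in the question)
threshold = 0.25
elements = shares[shares >= threshold].index.to_list()
print(sorted(elements))
```

The order returned by `value_counts` for tokens with equal counts should not be relied on, which is why the sketch sorts the result before printing.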