Collect in a list the values that appear in an array at least a given percentage of the time

I have an array called "data" which contains the following information.

[['amazon',
  'phone',
  'serious',
  'mind',
  'blown',
  'serious',
  'enjoy',
  'use',
  'applic',
  'full',
  'blown',
  'websit',
  'allow',
  'quick',
  'track',
  'packag',
  'descript',
  'say'],
 ['would',
  'say',
  'app',
  'real',
  'thing',
  'show',
  'ghost',
  'said',
  'quot',
  'orang',
  'quot',
  'ware',
  'orang',
  'cloth',
  'app',
  'adiquit',
  'would',
  'recsmend',
  'want',
  'talk',
  'ghost'],
 ['love',
  'play',
  'backgammonthi',
  'game',
  'offer',
  'varieti',
  'difficulti',
  'make',
  'perfect',
  'beginn',
  'season',
  'player'],

I would like to collect in a list the values that account for at least 1% of all the elements in this array.

The closest approach I have found is the following, but it does not return what I need. Any ideas?

import numpy as np
import numpy_indexed as npi

idx = [np.ones(len(a))*i for i, a in enumerate(tokens_list_train)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(tokens_list_train))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)

CodePudding user response：

Let's see: we can remove the nesting with itertools.chain.from_iterable. We also need the total number of elements, which we can accumulate in a generator that wraps the outer list, so the data is only traversed once. Finally, we need to count the repetitions, which is done with a Counter.

from collections import Counter
from itertools import chain

total_length = 0
def sum_sublist_length(some_list):  # to sum the lengths of the sub-lists
    global total_length
    for value in some_list:
        total_length += len(value)
        yield value
        
# my_list is the nested list from the question (called "data" there)
counts = Counter(chain.from_iterable(sum_sublist_length(my_list)))
items = [item for item in counts if counts[item]/total_length >= 0.01]
print(items)

['amazon', 'phone', 'serious', 'mind', 'blown', 'enjoy', 'use', 'applic', 'full', 'websit', 'allow', 'quick', 'track', 'packag', 'descript', 'say', 'would', 'app', 'real', 'thing', 'show', 'ghost', 'said', 'quot', 'orang', 'ware', 'cloth', 'adiquit', 'recsmend', 'want', 'talk', 'love', 'play', 'backgammonthi', 'game', 'offer', 'varieti', 'difficulti', 'make', 'perfect', 'beginn', 'season', 'player']
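
Every distinct token shows up in that output because the three rows shown hold 18 + 21 + 12 = 51 tokens in total, so a token that occurs only once already has a share of 1/51, roughly 1.96%, which is above the 1% threshold. Below is a minimal sketch of the same counting idea without the global variable. It assumes the data consists of just the three rows visible in the question; `frequent_items` and `min_share` are names made up for this sketch.

```python
from collections import Counter
from itertools import chain

# Assumption: only the three rows visible in the question.
data = [
    ['amazon', 'phone', 'serious', 'mind', 'blown', 'serious', 'enjoy',
     'use', 'applic', 'full', 'blown', 'websit', 'allow', 'quick', 'track',
     'packag', 'descript', 'say'],
    ['would', 'say', 'app', 'real', 'thing', 'show', 'ghost', 'said', 'quot',
     'orang', 'quot', 'ware', 'orang', 'cloth', 'app', 'adiquit', 'would',
     'recsmend', 'want', 'talk', 'ghost'],
    ['love', 'play', 'backgammonthi', 'game', 'offer', 'varieti',
     'difficulti', 'make', 'perfect', 'beginn', 'season', 'player'],
]


def frequent_items(nested, min_share=0.01):
    """Return the items whose share of all tokens is at least min_share."""
    counts = Counter(chain.from_iterable(nested))
    total = sum(counts.values())  # equals the summed lengths of the sub-lists
    return [item for item, n in counts.items() if n / total >= min_share]


items = frequent_items(data)
print(len(items), items)
```

Raising the threshold shows the filter at work: with `min_share=0.03` only the tokens that occur at least twice in these 51 tokens remain.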

CodePudding user response：

Here's another way to build a list of the elements that appear 1% or more of the time, using a pandas.Series:


import pandas as pd


# == Define `flatten` function to combine objects with multi-level nesting =======
def flatten(iterable, base_type=None, levels=None):
    """Flatten an iterable with multiple levels of nesting.

        >>> iterable = [(1, 2), ([3, 4], [[5], [6]])]
        >>> list(flatten(iterable))
        [1, 2, 3, 4, 5, 6]

    Binary and text strings are not considered iterable and
    will not be collapsed.

    To avoid collapsing other types, specify *base_type*:

        >>> iterable = ['ab', ('cd', 'ef'), ['gh', 'ij']]
        >>> list(flatten(iterable, base_type=tuple))
        ['ab', ('cd', 'ef'), 'gh', 'ij']

    Specify *levels* to stop flattening after a certain level:

        >>> iterable = [('a', ['b']), ('c', ['d'])]
        >>> list(flatten(iterable))  # Fully flattened
        ['a', 'b', 'c', 'd']
        >>> list(flatten(iterable, levels=1))  # Only one level flattened
        ['a', ['b'], 'c', ['d']]

    """
    def walk(node, level):
        if (
            ((levels is not None) and (level > levels))
            or isinstance(node, (str, bytes))
            or ((base_type is not None) and isinstance(node, base_type))
        ):
            yield node
            return
        try:
            tree = iter(node)
        except TypeError:
            yield node
            return
        else:
            for child in tree:
                yield from walk(child, level + 1)
    yield from walk(iterable, 0)


# == Problem Solution ==========================================================
# 1. Flatten the array into a single level list of elements, then convert it
#    to a `pandas.Series`.
series_array = pd.Series(list(flatten(array)))

# 2. Get the total number of elements in flattened list
element_count = len(series_array)

# 3. Use `pandas.Series.value_counts()` to count the number of times each
#    element appears, then divide each count by the total number of
#    elements in the flattened list (`element_count`).
elements = (
    (series_array.value_counts()/element_count)
    # 4. Use `pandas.Series.loc` to keep only the values that appear at
    #    least 1% of the time.
    .loc[lambda value: value >= 0.01]
    # 5. Select the elements, and convert results to a list
    .index.to_list()
)
print(elements)
['would', 'serious', 'blown', 'quot', 'orang', 'app', 'ghost', 'say', 'use', 'adiquit', 'enjoy', 'said', 'cloth', 'thing', 'applic', 'talk', 'player', 'track', 'recsmend', 'beginn', 'packag', 'allow', 'perfect', 'want', 'real', 'love', 'full', 'show', 'play', 'make', 'backgammonthi', 'mind', 'amazon', 'game', 'difficulti', 'offer', 'descript', 'websit', 'quick', 'season', 'phone', 'varieti', 'ware']
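
If the data is only nested one level deep, as in the question, the division step can probably be folded into the counting, since `value_counts` accepts `normalize=True` and then returns relative frequencies directly. A minimal sketch, using a small made-up sample in place of the real array (the sample values and the name `shares` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical small sample standing in for the question's nested list.
array = [['say', 'app', 'say'], ['app', 'ghost'], ['say']]

# Flatten one level of nesting with a comprehension, then let pandas
# compute each token's share of all tokens.
shares = pd.Series(
    [token for row in array for token in row]
).value_counts(normalize=True)

# Keep the tokens whose share is at least 1%.
elements = shares[shares >= 0.01].index.to_list()
print(elements)
```

The general-purpose `flatten` helper above is still the safer choice when the depth of nesting is not known in advance.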