how to loop through and match two date columns and extract the associated codes to a new list-CodePudding

I have a pandas df with a code col and a date col (type: object) with 289k entries. each date have multiple codes, thus fx 10 rows with the same date and different codes in the next col and then 20 rows with a new date with new codes etc.. I also have an ndarray containing dates (type: str) with 103 entries. I want to match all dates from my ndarray with the df and every time a match is found I want to extract all associated "codes" from that specific date to a new list. I have tried many different things but without much success.

filtered_codes = []
for j in raw_data.Dates:
    for q in reb_dates:
        if j == q:
            filtered_codes.append(raw_data.codes)

filtered_codes = []
 for j in reb_dates:
     for q in raw_data.Dates:
         if raw_data['Dates'][q] == reb_dates[j]:
             filtered_codes.append(raw_data.codes[q])

Running the first one gives me a list of series. All lists are identical and the series contains as many entries as my raw_data.

If I run the second example I get this error:

Traceback (most recent call last):
  File "C:\Users\xxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index_class_helper.pxi", line 105, in pandas._libs.index.Int64Engine._check_type
  File "pandas\_libs\index_class_helper.pxi", line 105, in pandas._libs.index.Int64Engine._check_type
KeyError: '2013-02-20'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "C:\Users\xxx.py", line 73, in <module>
    if raw_data['Dates'][q] == reb_dates[j]:
  File "C:\Users\xxx\venv\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "C:\Usersxxx\venv\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Users\xxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: '2013-02-20'

Process finished with exit code 1

Is it the loop causing the problem or the data types? I'm not that experienced in python. Any help is appreiciated

CodePudding user response：

The rule is to avoid as much as possible any Python level loop on a dataframe.

But let us first look at your current codes:

filtered_codes = []
for j in raw_data.Dates:
    for q in reb_dates:
        if j == q:
            filtered_codes.append(raw_data.codes) # Oops !!

You append to the list the full column raw_data.codes and do it for every time a date from reb_dates is present in the Dates column. Not what you want...

filtered_codes = []
 for j in reb_dates:
     for q in raw_data.Dates:
         if raw_data['Dates'][q] == reb_dates[j]:         # Oops
             filtered_codes.append(raw_data.codes[q])     # Oops again...

The first Oops line should be if q == j: because q and j are the actual date values not their indexes in their containers. And as q is a string value representing a date and not an index, raw_data.codes[q] on next line is also an error.

You could easily build a Series where the dates are the index (with a str type so an object dtype) and the values are the set of codes for that date:

codes_per_date = raw_data[raw_data['Dates'].isin(reb_dates)].groupby(
    'Dates')['code'].agg(set)