Home > other >  how to loop through and match two date columns and extract the associated codes to a new list
how to loop through and match two date columns and extract the associated codes to a new list

Time:11-12

I have a pandas df with a code col and a date col (type: object) with 289k entries. each date have multiple codes, thus fx 10 rows with the same date and different codes in the next col and then 20 rows with a new date with new codes etc.. I also have an ndarray containing dates (type: str) with 103 entries. I want to match all dates from my ndarray with the df and every time a match is found I want to extract all associated "codes" from that specific date to a new list. I have tried many different things but without much success.

filtered_codes = []
for j in raw_data.Dates:
    for q in reb_dates:
        if j == q:
            filtered_codes.append(raw_data.codes)

filtered_codes = []
 for j in reb_dates:
     for q in raw_data.Dates:
         if raw_data['Dates'][q] == reb_dates[j]:
             filtered_codes.append(raw_data.codes[q])

Running the first one gives me a list of series. All lists are identical and the series contains as many entries as my raw_data.

If I run the second example I get this error:

Traceback (most recent call last):
  File "C:\Users\xxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index_class_helper.pxi", line 105, in pandas._libs.index.Int64Engine._check_type
  File "pandas\_libs\index_class_helper.pxi", line 105, in pandas._libs.index.Int64Engine._check_type
KeyError: '2013-02-20'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "C:\Users\xxx.py", line 73, in <module>
    if raw_data['Dates'][q] == reb_dates[j]:
  File "C:\Users\xxx\venv\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "C:\Usersxxx\venv\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Users\xxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: '2013-02-20'

Process finished with exit code 1

Is it the loop causing the problem or the data types? I'm not that experienced in python. Any help is appreiciated

CodePudding user response:

The rule is to avoid as much as possible any Python level loop on a dataframe.

But let us first look at your current codes:

filtered_codes = []
for j in raw_data.Dates:
    for q in reb_dates:
        if j == q:
            filtered_codes.append(raw_data.codes) # Oops !!

You append to the list the full column raw_data.codes and do it for every time a date from reb_dates is present in the Dates column. Not what you want...

filtered_codes = []
 for j in reb_dates:
     for q in raw_data.Dates:
         if raw_data['Dates'][q] == reb_dates[j]:         # Oops
             filtered_codes.append(raw_data.codes[q])     # Oops again...

The first Oops line should be if q == j: because q and j are the actual date values not their indexes in their containers. And as q is a string value representing a date and not an index, raw_data.codes[q] on next line is also an error.

You could easily build a Series where the dates are the index (with a str type so an object dtype) and the values are the set of codes for that date:

codes_per_date = raw_data[raw_data['Dates'].isin(reb_dates)].groupby(
    'Dates')['code'].agg(set)
  • Related