I have a pandas df with a code col and a date col (type: object) with 289k entries. each date have multiple codes, thus fx 10 rows with the same date and different codes in the next col and then 20 rows with a new date with new codes etc.. I also have an ndarray containing dates (type: str) with 103 entries. I want to match all dates from my ndarray with the df and every time a match is found I want to extract all associated "codes" from that specific date to a new list. I have tried many different things but without much success.
filtered_codes = []
for j in raw_data.Dates:
for q in reb_dates:
if j == q:
filtered_codes.append(raw_data.codes)
filtered_codes = []
for j in reb_dates:
for q in raw_data.Dates:
if raw_data['Dates'][q] == reb_dates[j]:
filtered_codes.append(raw_data.codes[q])
Running the first one gives me a list of series. All lists are identical and the series contains as many entries as my raw_data.
If I run the second example I get this error:
Traceback (most recent call last):
File "C:\Users\xxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 105, in pandas._libs.index.Int64Engine._check_type
File "pandas\_libs\index_class_helper.pxi", line 105, in pandas._libs.index.Int64Engine._check_type
KeyError: '2013-02-20'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "C:\Users\xxx.py", line 73, in <module>
if raw_data['Dates'][q] == reb_dates[j]:
File "C:\Users\xxx\venv\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
return self._get_value(key)
File "C:\Usersxxx\venv\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\xxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: '2013-02-20'
Process finished with exit code 1
Is it the loop causing the problem or the data types? I'm not that experienced in python. Any help is appreiciated
CodePudding user response:
The rule is to avoid as much as possible any Python level loop on a dataframe.
But let us first look at your current codes:
filtered_codes = []
for j in raw_data.Dates:
for q in reb_dates:
if j == q:
filtered_codes.append(raw_data.codes) # Oops !!
You append to the list the full column raw_data.codes
and do it for every time a date from reb_dates
is present in the Dates column. Not what you want...
filtered_codes = []
for j in reb_dates:
for q in raw_data.Dates:
if raw_data['Dates'][q] == reb_dates[j]: # Oops
filtered_codes.append(raw_data.codes[q]) # Oops again...
The first Oops line should be if q == j:
because q
and j
are the actual date values not their indexes in their containers. And as q
is a string value representing a date and not an index, raw_data.codes[q]
on next line is also an error.
You could easily build a Series where the dates are the index (with a str type so an object dtype) and the values are the set of codes for that date:
codes_per_date = raw_data[raw_data['Dates'].isin(reb_dates)].groupby(
'Dates')['code'].agg(set)