How to identify array records that has value in dataframe-CodePudding

my dataframe df looks like this

       RepID   Arr
0      3       [E83, J99]
1      5       [P81, G39, E34]
2      6       [R19]
...
...
1857   8885    [G39, F14, T22]

I want to find all records that have the value "G39" in the array

       RepID   Arr
1      5       [P81, G39, E34]
1857   8885    [G39, F14, T22]

I tried this but it did not work

np.where(df["Arr"]=="G39")

I get empty

(array([], dtype=int64),)

How to get that list?

CodePudding user response：

You can use list comprehension to iterate through array column and filter and collect dataframe records based on condition:

import pandas as pd

data = [
  (3, ["E83", "J99"]),
  (5, ["P81", "G39", "E34"]),
  (6, ["R19"]),
  (8885, ["G39", "F14", "T22"]),
]

df = pd.DataFrame(data=data, columns=["RepID", "Arr"])

df[[("G39" in typ_arr) for typ_arr in df["Arr"]]]
>>     RepID               Arr
>> 1       5   [P81, G39, E34]
>> 3    8885   [G39, F14, T22]

CodePudding user response：

The code below creates both, a dictionary and a pandas data frame with the requested records:

import pandas as pd
from pprint import pprint
df_print="""\
       RepID   Arr
0      3       [E83, J99]
1      5       [P81, G39, E34]
2      6       [R19]
3   8885       [G39, F14, T22]"""
open('df_print', 'w').write(df_print)
df = pd.read_table('df_print', sep=r'\s\s\s*', index_col = 0)

dct_rows = {}
for index, row in df.iterrows():
   if "G39" in row["Arr"]:
       dct_rows[index] = { 'RepID' : row['RepID'], "Arr" : row["Arr"] }
print("dct_rows = \\\n", end='')
pprint(dct_rows)
print("---")
df_G39 = pd.DataFrame.from_dict(dct_rows, orient='index')
print("df_G39_print = \"\"\"\\")
print(df_G39, end='"""\n')

printing:

dct_rows = \
{1: {'Arr': '[P81, G39, E34]', 'RepID': 5},
 3: {'Arr': '[G39, F14, T22]', 'RepID': 8885}}
---
df_G39_print = """\
   RepID              Arr
1      5  [P81, G39, E34]
3   8885  [G39, F14, T22]"""

Notice that the code reads the data from the printed version of a pandas DataFrame and prints ready to use Python code assigning values to Python variables for further use.

The code here is not using the pandas way of doing things. If you prefer the pandas way use the much better pandas suitable and elegant code provided by Azhar Khan in the other answer. The short code of the pandas way comes at the cost of learning to understand how it works. For these ones who want to understand it I will provide below some explanations:

data = [
  (3, ["E83", "J99"]),
  (5, ["P81", "G39", "E34"]),
  (6, ["R19"]),
  (8885, ["G39", "F14", "T22"]),
]
df = pd.DataFrame(data=data, columns=["RepID", "Arr"])

The code above requires total manual reformatting of the provided printed output of the dataframe in order to get these data into the df variable, what means a bit of work with the keyboard not limited to copy/paste.

Now let's check out the genius pandas way one-liner:

dfG39 = df[ [ ("G39" in row) for row in df["Arr"] ] ]

which is a bit hard to understand for someone not familiar with pandas or numpy. How can it work without throwing an Error using a list comprehension as index to a table?

It works because numpy arrays and pandas dataframes accept indices which are not accepted by a Python list or a dictionary and can be next to integer values and slices also for example a list of True/False values used to filter out entries from an iterable called dataframe in pandas and array in numpy.

Printing single steps of executing the one-liner explains the short code:

print(df["Arr"])
print([row for row in df["Arr"]])
print([ ("G39" in row) for row in df["Arr"] ])
print( df[ [False, True, False, True] ] )
# gives: 
printed_output = """\
Name: Arr, dtype: object
[['E83', 'J99'], ['P81', 'G39', 'E34'], ['R19'], ['G39', 'F14', 'T22']]
[False, True, False, True]
   RepID              Arr
1      5  [P81, G39, E34]
3   8885  [G39, F14, T22]"""

As demonstrated with the above code indexing a table in pandas can be done with a list of True/False values and be used to filter out records at indices where the indices list is set to True. For a Python user without numpy or pandas experience a big surprise and challenge for understanding.