Home > Enterprise >  Python Dataframe process two columns of lists and find minimum
Python Dataframe process two columns of lists and find minimum

Time:05-02

I have a data frame consisting of lists as elements. I want to subtract a value from each list and find the index of the minimum. I want to find the value corresponding to each list in another column.

My code:

df = pd.DataFrame({'A':[[1,2,3],[1,3,5,6]]})
df

      A              B
0   [1, 2, 3]       [10, 20, 30]
1   [1, 3, 5, 6]    [10, 30, 50, 60]  

# lets subtract 2 from A, find index of minimum in this result and find corresponding element in the B column
val = 2
df['A_new_min'] = (df['A'].map(np.array)-val).map(abs).map(np.argmin)
df['B_new'] = df[['A_new_min','B']].apply(lambda x: x[1][x[0]],axis=1) 

Present solution: It produces a correct solution but I don't to want to store the A_new_min and it is unnecessary. I am looking if it is possible to get this result in one line of code?

df = 
    A               B                 A_new_min     B_new
0   [1, 2, 3]       [10, 20, 30]        1           20
1   [1, 3, 5, 6]    [10, 30, 50, 60]    0           10

Expected solution: How can I obtain the below solution directly without having to create an additional and unnecessary column A_new_min? In simple words, I would like to

df = 
    A               B                 B_new
0   [1, 2, 3]       [10, 20, 30]        20
1   [1, 3, 5, 6]    [10, 30, 50, 60]    10

CodePudding user response:

With apply:

df["B_new"] = df.apply(lambda row: row["B"][np.argmin(abs(np.array(row["A"])-val))], axis=1)

>>> df
              A                 B  B_new
0     [1, 2, 3]      [10, 20, 30]     20
1  [1, 3, 5, 6]  [10, 30, 50, 60]     10

CodePudding user response:

IMO, the most efficient is to use only a list comprehension.

B_new only:

df['B_new'] = [b[min(range(len(a)), key=lambda x: abs(a[x]-val))]
               for a,b in zip(df['A'], df['B'])]

output:

              A                 B  B_new
0     [1, 2, 3]      [10, 20, 30]     20
1  [1, 3, 5, 6]  [10, 30, 50, 60]     10

both columns:

df2 = pd.DataFrame([[(i:=min(range(len(a)), key=lambda x: abs(a[x]-val))), b[i]]
                    for a,b in zip(df['A'], df['B'])], columns=['A_new_min', 'B_new'])

df.join(df2)

output:

              A                 B  A_new_min  B_new
0     [1, 2, 3]      [10, 20, 30]          1     20
1  [1, 3, 5, 6]  [10, 30, 50, 60]          0     10
timings (on 200k rows)
# @mozway (option #1)
290 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @enke (list comprehension)
340 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @enke list comprehension   numpy
968 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @not_speshal
4.12 s ± 246 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

One option is to use a list comprehension:

df['new'] = [arr[i] for i, arr in zip(df['A'].map(np.array).sub(2).abs().map(np.argmin), df['B'])]

Another option is to never convert to numpy array at all and stick with lists:

df['new'] = [b[min(enumerate([abs(x-2) for x in a]), key=lambda x:x[1])[0]] for a,b in zip(df['A'], df['B'])]

Output:

              A                 B  new
0     [1, 2, 3]      [10, 20, 30]   20
1  [1, 3, 5, 6]  [10, 30, 50, 60]   10
  • Related