how could I create a mean value column with the multiple values column-CodePudding

I've created a new column with str.findall() method because the original value is a sentence, like "Preis ab 23,48 EUR pro Person/Tag" so I extracted the number ['23,48'] as a new value, but some sentences contain two prices, Like "from €83 to €120", so I will get the new value ['83' , '120'], now I want to replace the row with two values with the average, but I can't find a solution plan

The extract method is as follows:

df['pricerange_All'] = df.pricerange.str.findall(r'([0-9,.] (?:\.[0-9,.] )?)')

and the dataframe which looks like this:

       pricerange_All 
0         ['16.51']
1         ['17,61']
2         ['16,96']
3         ['13,70']
4         ['17,85']
.
100       ['690.94', '690.76']

as you can see some of the values have two prices

I've try the solution :

df2= pd.DataFrame([pd.Series(x) for x in df.pricerange_All])
df2.columns = ['price_{}'.format(x 1) for x in df2.columns]

but it didn't work

How to separate the column or create a new column that save the mean value of each row? like:

         mean_price    
0         16.51 
1         17,61 
2         16,96   
3         13,70   
4         17,85  
.           .
100       690.85

I've racked my brains and really need to solve this as soon as possible. Thanks.

Really appreciate for the reply.

CodePudding user response：

If you do in fact want every comma and decimal to remain the same, then the final column must be of a string type because you can't have floats and strings together.

You could use pd.apply and convert the elements of any lists longer than 1 to floats, take the mean, and convert the result back to a string type. For example:

df['pricerange_All_new'] = df['pricerange_All'].apply(lambda row: row[0] if len(row) == 1 else str(np.mean([float(p.replace(',','.')) for p in row])))

Result:

df
     pricerange_All pricerange_All_new
0           [16.51]              16.51
1           [17,61]              17,61
2           [16,96]              16,96
3           [13,70]              13,70
4           [17,85]              17,85
5  [690.94, 690.76]             690.85

CodePudding user response：

Assuming you have a dataframe (df) that looks like this:

df = pd.DataFrame({ 'pricerange_All': [['16.51'], ['17,61'], ['16,96'], ['13,70'], ['17,85'], ['690.94', '690.76']]})

print(df)

----------------------------

     pricerange_All
0           [16.51]
1           [17,61]
2           [16,96]
3           [13,70]
4           [17,85]
5  [690.94, 690.76]

You could use a lambda function to apply the mean function from the statistics package to each:

Code:

from statistics import mean 

df.pricerange_All = df.pricerange_All.apply(lambda x: mean(list(map(float, [item.replace(",", ".") for item in x]))))
print(df)

Output:

   pricerange_All
0           16.51
1           17.61
2           16.96
3           13.70
4           17.85
5          690.85

Alternative (keep commas):

If you want to keep the commas, you could something like this instead:

df.pricerange_All = df.pricerange_All.apply(lambda x: mean(list(map(float, [item.replace(",", ".") for item in x]))) if len(x) > 1 else x[0])
print(df)

Output:

  pricerange_All
0          16.51
1          17,61
2          16,96
3          13,70
4          17,85
5         690.85