I am trying to do the following in Python:
- Import iris data base
- create a new column with the name target_measure that will be the multiplication of the natural log of Sepal.Width times the squeared Petal.Width
- create a new variable called categorical_measure that will clasify the previous column into 3 labes like so: if target_measure<1.5 then it will be: "<1.5", target_measure>=1.5 and target_measure<3.5 then it will "1.5-3.5" any other will be "out of target"
- calculate the mean sepal and petal width grouping by species as well as the count of all labels in the column categorical_measure
- finally filter all rows with "out of target" count is equal or greater than 5
We can download/import the iris dataset here:
data=pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")
My R code goes as follows
library(tidyverse)
data=iris # R's built-in fun to import iris
#desired output
data %>% # this is known as a pipe in R and will exc the lines below feed from the data env object
group_by(Species) %>% #groups by species
mutate(target_measure=log(Sepal.Width)*(Petal.Width)^2)%>% #creates column target_measure
mutate(categorical_measure=case_when(target_measure<1.5~"<1.5", #creates column categorical_measure based on criteria
target_measure>=1.5 & target_measure<3.5~"1.5-3.5",
TRUE~"out of target")) %>%
summarise(mean_of_sepal=mean(Sepal.Width), #calculates mean of sepal.width of grouped data
mean_of_petal=mean(Petal.Width),
'No of 1.5'=sum(categorical_measure=="<1.5"), #calculates count label="<1.5" from column categorical_measure
'No of 1.5-3.5'=sum(categorical_measure=="1.5-3.5"),#calculates count label="1.5-3.5"
'No of out of target'=sum(categorical_measure=="out of target")) %>% #calculates count label="out of target"
filter(`No of out of target`>=5) # filters desired output
code without comments (for faster reading)
data %>%
group_by(Species) %>%
mutate(target_measure=log(Sepal.Width)*(Petal.Width)^2)%>%
mutate(categorical_measure=case_when(target_measure<1.5~"<1.5",
target_measure>=1.5 & target_measure<3.5~"1.5-3.5",
TRUE~"out of target")) %>%
summarise(mean_of_sepal=mean(Sepal.Width),
mean_of_petal=mean(Petal.Width),
'No of 1.5'=sum(categorical_measure=="<1.5"),
'No of 1.5-3.5'=sum(categorical_measure=="1.5-3.5"),
'No of out of target'=sum(categorical_measure=="out of target")) %>%
filter(`No of out of target`>=5)
My desired output is:
# A tibble: 1 x 6
Species mean_of_sepal mean_of_petal `No of 1.5` `No of 1.5-3.5` `No of out of target`
<fct> <dbl> <dbl> <int> <int> <int>
1 virginica 2.97 2.03 0 11 39
Is there a way to achive this level of simplicity in Python?
So far I have come across the pandas library and useful functions such as data.groupby(['species'])
but I alway find in each tutorial or YouTube video that each step is done separately or perhaps creating a function first and then using the .apply fun in Python but I am looking for a solution that will use pipes of some sort of structure alike.
CodePudding user response:
Maybe, using pyjanitor
is easier to you, but a solution based on pandas
is the following, which uses chaining (similar to tidyverse
piping):
df = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")
(df.assign(target_measure=lambda x: np.log(x["sepal_width"]) * (x["petal_width"] ** 2))
.assign(categorical_measure=lambda x: x["target_measure"].map(lambda y: "<1.5" if y < 1.5 else "1.5-3.5" if y < 3.5 else "out of target"))
.groupby("species")
.agg({'sepal_width': [("mean_of_sepal", 'mean')], "petal_width": [('mean_of_petal', 'mean')],
'categorical_measure': [('No of 1.5', lambda x: sum(x == "<1.5")),
('No of 1.5-3.5', lambda x: sum(x == "1.5-3.5")),
('No of out of target', lambda x: sum(x == "out of target"))]}).droplevel(0, axis=1)
.loc[lambda x: x['No of out of target'] >= 5]
.reset_index())
Output:
species mean_of_sepal mean_of_petal No of 1.5 No of 1.5-3.5 \
0 virginica 2.974 2.026 0 11
No of out of target
0 39