Home > front end >  R Tidyverse Alternative Code in Python for data wrangling
R Tidyverse Alternative Code in Python for data wrangling

Time:12-12

I am trying to do the following in Python:

  1. Import iris data base
  2. create a new column with the name target_measure that will be the multiplication of the natural log of Sepal.Width times the squeared Petal.Width
  3. create a new variable called categorical_measure that will clasify the previous column into 3 labes like so: if target_measure<1.5 then it will be: "<1.5", target_measure>=1.5 and target_measure<3.5 then it will "1.5-3.5" any other will be "out of target"
  4. calculate the mean sepal and petal width grouping by species as well as the count of all labels in the column categorical_measure
  5. finally filter all rows with "out of target" count is equal or greater than 5

We can download/import the iris dataset here:

data=pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")

My R code goes as follows

library(tidyverse)


data=iris # R's built-in fun to import iris


#desired output

data %>% # this is known as a pipe in R and will exc the lines below feed from the data env object
  group_by(Species) %>%  #groups by species
  mutate(target_measure=log(Sepal.Width)*(Petal.Width)^2)%>% #creates column target_measure
  mutate(categorical_measure=case_when(target_measure<1.5~"<1.5",  #creates column categorical_measure based on criteria
                                       target_measure>=1.5 & target_measure<3.5~"1.5-3.5",
                                       TRUE~"out of target")) %>% 
  summarise(mean_of_sepal=mean(Sepal.Width), #calculates mean of sepal.width of grouped data
            mean_of_petal=mean(Petal.Width),
            'No of 1.5'=sum(categorical_measure=="<1.5"), #calculates count label="<1.5" from column categorical_measure
            'No of 1.5-3.5'=sum(categorical_measure=="1.5-3.5"),#calculates count label="1.5-3.5" 
            'No of out of target'=sum(categorical_measure=="out of target")) %>%  #calculates count label="out of target"
  filter(`No of out of target`>=5) # filters desired output

code without comments (for faster reading)

data %>% 
  group_by(Species) %>% 
  mutate(target_measure=log(Sepal.Width)*(Petal.Width)^2)%>% 
  mutate(categorical_measure=case_when(target_measure<1.5~"<1.5",
                                       target_measure>=1.5 & target_measure<3.5~"1.5-3.5",
                                       TRUE~"out of target")) %>% 
  summarise(mean_of_sepal=mean(Sepal.Width),
            mean_of_petal=mean(Petal.Width),
            'No of 1.5'=sum(categorical_measure=="<1.5"),
            'No of 1.5-3.5'=sum(categorical_measure=="1.5-3.5"),
            'No of out of target'=sum(categorical_measure=="out of target")) %>% 
  filter(`No of out of target`>=5)

My desired output is:

# A tibble: 1 x 6
  Species   mean_of_sepal mean_of_petal `No of 1.5` `No of 1.5-3.5` `No of out of target`
  <fct>             <dbl>         <dbl>       <int>           <int>                 <int>
1 virginica          2.97          2.03           0              11                    39

Is there a way to achive this level of simplicity in Python?

So far I have come across the pandas library and useful functions such as data.groupby(['species']) but I alway find in each tutorial or YouTube video that each step is done separately or perhaps creating a function first and then using the .apply fun in Python but I am looking for a solution that will use pipes of some sort of structure alike.

CodePudding user response:

Maybe, using pyjanitor is easier to you, but a solution based on pandas is the following, which uses chaining (similar to tidyverse piping):

df = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")

(df.assign(target_measure=lambda x: np.log(x["sepal_width"]) * (x["petal_width"] ** 2))
 .assign(categorical_measure=lambda x: x["target_measure"].map(lambda y: "<1.5" if y < 1.5 else "1.5-3.5" if y < 3.5 else "out of target"))
 .groupby("species")
 .agg({'sepal_width': [("mean_of_sepal", 'mean')], "petal_width": [('mean_of_petal', 'mean')],
       'categorical_measure': [('No of 1.5', lambda x: sum(x == "<1.5")),
                               ('No of 1.5-3.5', lambda x: sum(x == "1.5-3.5")),
                               ('No of out of target', lambda x: sum(x == "out of target"))]}).droplevel(0, axis=1)
 .loc[lambda x: x['No of out of target'] >= 5]
 .reset_index())

Output:

     species  mean_of_sepal  mean_of_petal  No of 1.5  No of 1.5-3.5  \
0  virginica          2.974          2.026          0             11   

   No of out of target  
0                   39  
  • Related