Is there a package in Julia similar to dplyr?

Time:10-13

Suppose I have the following dataframe:

using DataFrames
a = DataFrame(A = randn(1000), B = randn(1000), C = randn(1000));
N = 1000;

I want to divide every numeric column by N. In R, I would do the following (using dplyr):

a <- a %>% mutate_if(is.numeric, function(x) x/N)

Is there something like this in Julia?

(I am trying to avoid for loops, and to do the operation column by column)

CodePudding user response:

The DataFrames.jl documentation has a "Comparison with dplyr" section. There you can see that mutate in dplyr corresponds to transform in DataFrames.jl. transform also offers many ways to select the columns to operate on, which can be used to get the mutate_if functionality.

julia> df = DataFrame(x = [10, 15, 20, 25], y = [12.5, 20, 101, 102], colors = [:red, :blue, :green, :cyan])

4×3 DataFrame
 Row │ x      y        colors 
     │ Int64  Float64  Symbol 
─────┼────────────────────────
   1 │    10     12.5  red
   2 │    15     20.0  blue
   3 │    20    101.0  green
   4 │    25    102.0  cyan

julia> transform(df, Cols(in(names(df, Number))) => ByRow((c...) -> c ./ 5) => identity)
4×3 DataFrame
 Row │ x        y        colors 
     │ Float64  Float64  Symbol 
─────┼──────────────────────────
   1 │     2.0      2.5  red
   2 │     3.0      4.0  blue
   3 │     4.0     20.2  green
   4 │     5.0     20.4  cyan

Cols(in(names(df, Number))) is the column selector here. names(df, Number) returns the names of the columns whose element type is a subtype of Number. Cols then selects the columns whose names are in that list, and the transformation is applied only to those.
ByRow((c...) -> c ./ 5) takes values from those columns in each row, and divides them by 5.
identity just tells transform not to change the column names in the result.

transform above returns the result dataframe, without changing df. You can use
transform!(df, Cols(in(names(df, Number))) => ByRow((c...) -> c ./ 5) => identity)
(note the ! after transform) to do this operation in-place and update df directly instead.
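
As a sketch of a more compact spelling of the same idea, the pair can be broadcast over the numeric column names with .=>, so that each column is transformed on its own. The renamecols = false keyword plays the role of identity above by keeping the original column names. The variable name result is only illustrative.

```julia
using DataFrames

df = DataFrame(x = [10, 15, 20, 25],
               y = [12.5, 20, 101, 102],
               colors = [:red, :blue, :green, :cyan])

# names(df, Number) gives the names of the numeric columns.
# Broadcasting with .=> builds one `column => function` pair per name,
# so the anonymous function receives a single value at a time.
result = transform(df, names(df, Number) .=> ByRow(x -> x / 5); renamecols = false)
```

Like the call above, this returns a new data frame and leaves df untouched; transform! with the same arguments would update df instead.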

CodePudding user response:

transform is very powerful and may feel more natural if you come from a dplyr background, but in this case I feel that a simple loop over the columns, with broadcasting over the items, is more natural.

Don't be afraid of loops in Julia: they are (generally speaking) as fast as vectorised code and can be written very concisely using an array comprehension:

julia> df = DataFrame(x = [10, missing, 20, 25], y = [12.5, 20, 101, 102], colors = [:red, :blue, missing, :cyan])
4×3 DataFrame
 Row │ x        y        colors  
     │ Int64?   Float64  Symbol? 
─────┼───────────────────────────
   1 │      10     12.5  red
   2 │ missing     20.0  blue
   3 │      20    101.0  missing 
   4 │      25    102.0  cyan

julia> [c .= c ./ 5 for c in eachcol(df) if nonmissingtype(eltype(c)) <: Number];

julia> df
4×3 DataFrame
 Row │ x        y        colors  
     │ Int64?   Float64  Symbol? 
─────┼───────────────────────────
   1 │       2      2.5  red
   2 │ missing      4.0  blue
   3 │       4     20.2  missing 
   4 │       5     20.4  cyan

In the example above, the dots indicate broadcasting: each element of column c is replaced by its old value divided by the scalar 5. nonmissingtype accounts for columns that may contain missing data.
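
One caveat: .= writes into the existing column, so the element type stays the same. For an Int64? column this only works when every quotient is a whole number; otherwise Julia throws an InexactError. A minimal sketch of a variant that replaces each column instead is shown below. The value 21 in x is an assumption added here to produce a non-integer quotient, and a plain for loop is used because the comprehension's result array is not needed.

```julia
using DataFrames

df = DataFrame(x = [10, missing, 21, 25],
               y = [12.5, 20, 101, 102],
               colors = [:red, :blue, missing, :cyan])

# names(df, Union{Missing, Number}) also matches numeric columns
# that allow missing values, such as x here.
for name in names(df, Union{Missing, Number})
    # df[!, name] = ... replaces the whole column, so its element type may change
    # (x goes from Int64? to Float64?).
    df[!, name] = df[!, name] ./ 5
end
```

The trade-off is that a new vector is allocated per column, whereas the .= version reuses the existing storage.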
