Suppose I have the following dataframe:
using DataFrames
a = DataFrame(A = randn(1000), B = randn(1000), C = randn(1000));
N = 1000;
I want to divide every numeric column by N. In R I would do the following (using dplyr):
a <- a %>% mutate_if(is.numeric, function(x) x/N)
Is there something like this in Julia?
(I am trying to avoid for loops, and to do the operation column by column)
CodePudding user response:
The DataFrames documentation has a Comparison with dplyr section. There you can see that the mutate functions in dplyr correspond to the transform functions in DataFrames.jl. transform also allows many ways to select the columns to operate on, which can be used for the mutate_if functionality.
julia> df = DataFrame(x = [10, 15, 20, 25], y = [12.5, 20, 101, 102], colors = [:red, :blue, :green, :cyan])
4×3 DataFrame
 Row │ x      y        colors
     │ Int64  Float64  Symbol
─────┼────────────────────────
   1 │    10     12.5  red
   2 │    15     20.0  blue
   3 │    20    101.0  green
   4 │    25    102.0  cyan
julia> transform(df, Cols(in(names(df, Number))) => ByRow((c...) -> c ./ 5) => identity)
4×3 DataFrame
 Row │ x        y        colors
     │ Float64  Float64  Symbol
─────┼──────────────────────────
   1 │     2.0      2.5  red
   2 │     3.0      4.0  blue
   3 │     4.0     20.2  green
   4 │     5.0     20.4  cyan
Cols(in(names(df, Number))) is the column selector here. names(df, Number) returns the names of the columns whose element type is a subtype of Number. Cols then selects the columns whose names are in that set, as the ones to which the transform should be applied.
ByRow((c...) -> c ./ 5) takes the values from those columns in each row and divides them by 5.
identity just tells transform not to change the column names in the result.
transform above returns the result dataframe, without changing df. You can use
transform!(df, Cols(in(names(df, Number))) => ByRow((c...) -> c ./ 5) => identity)
(note the ! after transform) to do this operation in-place and update df directly instead.
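Applied to the data frame from the question, the same idea can be sketched a little more compactly. This is only one possible spelling, not the approach used above: it broadcasts => over the numeric column names so that each column is handled on its own, and it assumes a DataFrames.jl version that supports the renamecols keyword of transform.

```julia
using DataFrames

a = DataFrame(A = randn(1000), B = randn(1000), C = randn(1000))
N = 1000

# Broadcasting `=>` over the numeric column names builds one pair per column.
# Each function receives a whole column and divides it elementwise by N.
# renamecols = false keeps the original column names.
b = transform(a, names(a, Number) .=> (c -> c ./ N); renamecols = false)

# In-place variant: updates `a` itself instead of returning a new data frame.
transform!(a, names(a, Number) .=> (c -> c ./ N); renamecols = false)
```

Here the function sees a full column at a time rather than one row at a time, which is closer to what mutate_if does in dplyr.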
CodePudding user response:
transform is very powerful and may feel more natural if you come from a dplyr background, but in this case I find a simple loop over the columns, broadcasting over their elements, more natural.
Don't be afraid of loops in Julia: they are (generally speaking) as fast as vectorised code and can be written very concisely using an array comprehension:
julia> df = DataFrame(x = [10, missing, 20, 25], y = [12.5, 20, 101, 102], colors = [:red, :blue, missing, :cyan])
4×3 DataFrame
 Row │ x        y        colors
     │ Int64?   Float64  Symbol?
─────┼───────────────────────────
   1 │      10     12.5  red
   2 │ missing     20.0  blue
   3 │      20    101.0  missing
   4 │      25    102.0  cyan
julia> [c .= c ./ 5 for c in eachcol(df) if nonmissingtype(eltype(c)) <: Number];
julia> df
4×3 DataFrame
 Row │ x        y        colors
     │ Int64?   Float64  Symbol?
─────┼───────────────────────────
   1 │       2      2.5  red
   2 │ missing      4.0  blue
   3 │       4     20.2  missing
   4 │       5     20.4  cyan
In the above example, the dots indicate broadcasting (each element of the column c is replaced by its old value divided by the scalar 5), and I have used nonmissingtype to account for columns that may contain missing data.
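One thing to keep in mind: c .= ... writes into the existing column, so an integer column keeps its element type, and a quotient that is not a whole number would raise an InexactError. A possible variant, sketched here with a plain loop and a slightly different input (22 instead of 20, chosen only for illustration), replaces each column instead so that its element type is free to change:

```julia
using DataFrames

df = DataFrame(x = [10, missing, 22, 25],
               y = [12.5, 20, 101, 102],
               colors = [:red, :blue, missing, :cyan])

# Union{Missing, Number} also matches numeric columns that allow missing values.
for name in names(df, Union{Missing, Number})
    # Assigning to df[!, name] replaces the column rather than writing into it,
    # so x can go from Int64? to Float64?.
    df[!, name] = df[!, name] ./ 5
end
```

After the loop, x holds floating-point values (22 / 5 is 4.4) and colors is untouched.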