I was trying to find a specific dataset in RDatasets. Since the package provides 763 datasets, each with its own naming convention, it is hard to tell whether a given dataset is included. For example, I knew there was a dataset about food, but I didn't know its exact name, so I searched the titles to see whether RDatasets provides it:
using RDatasets, DataFrames
# This line retrieves information on all the datasets provided by the package
Rdatasets = RDatasets.datasets();
Rdatasets[occursin.("food", Rdatasets.Title), :]
# 0×5 DataFrame
# Row │ Package Dataset Title Rows Columns
# │ String15 String31 String Int64 Int64
# ─────┴────────────────────────────────────────────
# Then I searched for "Food"
Rdatasets[occursin.("Food", Rdatasets.Title), :]
# 1×5 DataFrame
# Row │ Package Dataset Title Rows Columns
# │ String15 String31 String Int64 Int64
# ─────┼─────────────────────────────────────────────────────────────────────────
# 1 │ Ecdat BudgetFood Budget Share of Food for Spanish… 23972 6
That took two attempts, and I could easily have given up before the second one. How can I find the rows of the Rdatasets DataFrame whose Title column contains the word food, case-insensitively (if there are any)?
CodePudding user response:
RegEx is everyone's friend! Even if you were looking for Iris in the Dataset column, you'd be in trouble, because the names are stored with their original capitalization and the search is case-sensitive. One option is to convert the contents of the column to lower (or upper) case with lowercase.(df.columnname) and then search for the word in the matching case. The other is to use a RegEx and let it handle the letter case. The latter finds a word in any column of any DataFrame, however it is capitalized. For example, if a column contained iris written as "iRiS", it wouldn't be efficient to try Iris, iris, iRis, etc. until one works:
Rdatasets[occursin.(r"(?i)food", Rdatasets.Title), :]
# 1×5 DataFrame
# Row │ Package Dataset Title Rows Columns
# │ String15 String31 String Int64 Int64
# ─────┼─────────────────────────────────────────────────────────────────────────
# 1 │ Ecdat BudgetFood Budget Share of Food for Spanish… 23972 6
In the above, I used a RegEx to match any capitalization of the word food in the Title column of the Rdatasets DataFrame. The (?i) turns case-insensitivity on, and r"" is documented as follows:
help?> r""
@r_str -> Regex
Construct a regex, such as r"^[a-z]*$", without interpolation and unescaping (except for quotation mark "
which still has to be escaped). The regex also accepts one or more flags, listed after the ending quote, to
change its behaviour:
• i enables case-insensitive matching
• m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the
whole string.
• s allows the . modifier to match newlines.
• x enables "comment mode": whitespace is enabled except when escaped with \, and # is treated as
starting a comment.
• a disables UCP mode (enables ASCII mode). By default \B, \b, \D, \d, \S, \s, \W, \w, etc. match
based on Unicode character properties. With this option, these sequences only match ASCII
characters.
See Regex if interpolation is needed.
Examples
≡≡≡≡≡≡≡≡≡≡
julia> match(r"a .*b .*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
This regex has the first three flags enabled.
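The i flag listed in the help text can be used in place of the inline (?i) option. Here is a minimal sketch on a made-up vector of titles (the strings are illustrative stand-ins, not the actual contents of RDatasets):

```julia
# Hypothetical titles standing in for the Title column
titles = ["Budget Share of Food", "Edgar Anderson's Iris Data",
          "FOOD prices", "Motor Trend Car Road Tests"]

# Inline option written inside the pattern
inline_hits = occursin.(r"(?i)food", titles)

# Flag written after the closing quote
flag_hits = occursin.(r"food"i, titles)

# Both are Boolean masks and can be used as a row selector, e.g. df[flag_hits, :]
println(titles[flag_hits])
```

Which form to use is mostly a matter of taste. The inline option keeps everything inside the pattern, while the flag is shorter to type.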
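Note that things can get more complicated, for example when the text contains special characters ($, @, etc.), and in those cases converting the content to upper or lower case alone may not be enough. So I consider RegEx the more flexible option, as long as any regex metacharacters in the search term (such as $) are escaped.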
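For comparison, here is a rough sketch of the lower-casing alternative mentioned in this answer, next to a case-insensitive regex built from a variable. The titles and the search term are made up for illustration, and wrapping the term in \Q...\E (the regex engine's literal quoting) is just one possible way to keep characters such as ( and $ from acting as metacharacters:

```julia
# Made-up titles for illustration only (not actual RDatasets contents)
titles = ["Budget Share of Food", "FOOD (in \$) per household", "Iris"]

# Alternative 1: lower-case the column, then search with a lower-case word
lower_hits = occursin.("food", lowercase.(titles))

# Alternative 2: build a case-insensitive regex from a variable.
# \Q...\E makes the engine read the term literally.
# (This breaks if the term itself contains \E.)
term = "food (in \$)"
literal_hits = occursin.(Regex("\\Q" * term * "\\E", "i"), titles)

println(titles[lower_hits])
println(titles[literal_hits])
```

The r"" macro does not interpolate variables, which is why the sketch uses the Regex constructor with the flags passed as a string.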
CodePudding user response:
According to the Julia documentation, Unicode.normalize(s, casefold=true) performs Unicode case folding, which is intended for case-insensitive string comparison.
Hence you want:
julia> using Unicode
julia> Rdatasets[occursin.(Unicode.normalize("food", casefold=true), Unicode.normalize.(Rdatasets.Title, casefold=true)), :]
1×5 DataFrame
Row │ Package Dataset Title Rows Columns
│ String15 String31 String Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ Ecdat BudgetFood Budget Share of Food for Spanish… 23972 6
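If this comparison is needed in several places, it could be wrapped in a small helper. The sketch below is only an illustration: contains_casefold is a hypothetical name, not a function from Julia or any package, and the titles are made up:

```julia
using Unicode

# Hypothetical helper: true if `needle` occurs in `haystack`, ignoring case.
# Both strings are case-folded before the substring test.
contains_casefold(needle, haystack) =
    occursin(Unicode.normalize(needle, casefold=true),
             Unicode.normalize(haystack, casefold=true))

# Made-up titles for illustration only
titles = ["Budget Share of Food", "Iris", "FOOD prices"]

# Broadcast over the column to get a Boolean mask, usable as df[mask, :]
mask = contains_casefold.("food", titles)
println(titles[mask])
```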