How to search for rows that contain a specific word case-insensitively in a specific column in DataF

I was trying to find a specific dataset in RDatasets. Because the package provides 763 datasets, each with its own naming, it is hard to tell whether a given dataset is included. For example, I knew there was a dataset about food, but I didn't know its exact name. So I searched the titles to find out whether RDatasets provides it:

using RDatasets, DataFrames

# This line retrieves information on all the datasets provided by the package
Rdatasets = RDatasets.datasets();

Rdatasets[occursin.("food", Rdatasets.Title), :]
# 0×5 DataFrame
#  Row │ Package   Dataset   Title   Rows   Columns
#      │ String15  String31  String  Int64  Int64
# ─────┴────────────────────────────────────────────

# Then I searched for "Food"
Rdatasets[occursin.("Food", Rdatasets.Title), :]
# 1×5 DataFrame
#  Row │ Package   Dataset     Title                              Rows   Columns
#      │ String15  String31    String                             Int64  Int64
# ─────┼─────────────────────────────────────────────────────────────────────────
#    1 │ Ecdat     BudgetFood  Budget Share of Food for Spanish…  23972        6

But that took two attempts, and on a longer search I might simply give up. How can I find the rows of the Rdatasets DataFrame whose Title column contains the word "food" case-insensitively (if there are any)?

CodePudding user response：

RegEx is everyone's friend! Even if you were looking for Iris in the Dataset column, you'd be in trouble, because the names are stored case-sensitively. One option is to convert the contents of the column to lower or upper case, e.g. with lowercase.(df.columnname), and then search for the word in the matching case. The other is to use a RegEx and let it handle the casing of the letters. The latter finds the word in a column of any DataFrame regardless of how it is written. For example, if a DataFrame contained iris written as "iRiS", it would be inefficient to try Iris, iris, iRis, etc. until one works:

Rdatasets[occursin.(r"(?i)food", Rdatasets.Title), :]
# 1×5 DataFrame
#  Row │ Package   Dataset     Title                              Rows   Columns
#      │ String15  String31    String                             Int64  Int64
# ─────┼─────────────────────────────────────────────────────────────────────────
#    1 │ Ecdat     BudgetFood  Budget Share of Food for Spanish…  23972        6

In the above, I used a regular expression to match any casing of the word food within the Title column of the Rdatasets DataFrame. The (?i) turns on case-insensitivity, and r"" is:

help?> r""
  @r_str -> Regex


  Construct a regex, such as r"^[a-z]*$", without interpolation and unescaping (except for quotation mark "      
  which still has to be escaped). The regex also accepts one or more flags, listed after the ending quote, to    
  change its behaviour:

    •  i enables case-insensitive matching

    •  m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the
       whole string.

    •  s allows the . modifier to match newlines.

    •  x enables "comment mode": whitespace is enabled except when escaped with \, and # is treated as
       starting a comment.

    •  a disables UCP mode (enables ASCII mode). By default \B, \b, \D, \d, \S, \s, \W, \w, etc. match
       based on Unicode character properties. With this option, these sequences only match ASCII
       characters.

  See Regex if interpolation is needed.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> match(r"a .*b .*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
  RegexMatch("angry,\nBad world")


  This regex has the first three flags enabled.

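Note that things can get more complicated, for example when special characters ($, @, etc.) occur, and in those cases converting the content to upper or lower case wouldn't help! So using RegEx is the safest option. Keep in mind, though, that characters such as $ have a special meaning inside a regex and must be escaped to be matched literally.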
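To make the options in this answer concrete, here is a minimal sketch that runs without any packages. The `titles` vector is a made-up stand-in for a Title column, and `term` is a hypothetical search string. The third variant relies on the `\Q...\E` quoting of the PCRE engine behind Julia's Regex, which is an assumption worth verifying on your Julia version.

```julia
# Made-up titles standing in for a DataFrame's Title column
titles = ["Budget Share of Food for Spanish Households",
          "Edgar Anderson's Iris Data",
          "FOOD prices (\$)"]

# Option 1: lowercase the column, then do a plain substring search
mask_lower = occursin.("food", lowercase.(titles))

# Option 2: case-insensitive regex; the trailing `i` flag has the same
# effect as the inline (?i) used above
mask_regex = occursin.(r"food"i, titles)

# Option 3: build the regex from a variable. \Q...\E quotes the term so
# that characters such as `$` and `(` are matched literally.
term = "prices (\$)"
mask_term = occursin.(Regex("\\Q" * term * "\\E", "i"), titles)
```

Each mask is a Boolean vector, so with a DataFrame `df` the matching rows would be selected as `df[mask_regex, :]`, just like in the example above.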

CodePudding user response：

According to the Julia documentation, Unicode.normalize("", casefold=true) is recommended for case-insensitive comparison.

Hence you want (after loading the standard library with using Unicode):

julia> Rdatasets[occursin.(Unicode.normalize("food",casefold=true), Unicode.normalize.(Rdatasets.Title,casefold=true)),:]
1×5 DataFrame
 Row │ Package   Dataset     Title                              Rows   Columns
     │ String15  String31    String                             Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ Ecdat     BudgetFood  Budget Share of Food for Spanish…  23972        6
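If this comparison is needed in several places, it can be wrapped in small helpers. The following is only a sketch: `fold` and `contains_folded` are hypothetical names, not part of Julia or DataFrames, and `titles` is a made-up stand-in for the Title column.

```julia
using Unicode  # standard library that provides Unicode.normalize

# Hypothetical helper: case-fold a string for comparison
fold(s) = Unicode.normalize(s, casefold=true)

# Hypothetical helper: case-insensitive substring test on folded strings
contains_folded(needle, haystack) = occursin(fold(needle), fold(haystack))

# Made-up titles standing in for a DataFrame's Title column
titles = ["Budget Share of Food for Spanish Households",
          "Edgar Anderson's Iris Data"]

# Broadcast over the column; the needle is treated as a scalar
mask_fold = contains_folded.("FOOD", titles)
```

With a DataFrame `df`, `df[contains_folded.("food", df.Title), :]` would then select the matching rows.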