Is there a way to take my current df and build a new table based off specific criteria?-CodePudding

Here is what I'm trying to accomplish:

I want to create one row per company through summarizing. Company names are in "conm" column I want to only include a company in the table if the aggregated results from that company meet the following criteria:

The company has more than 56 observations (quarters/rows) in the original, Df.
Less than 1.18% of all of the company’s observations have a 4 in the digit after the decimal point of EPS (e.g., in the EPS_diluted_cents column).
More than 11% of all of the company’s observations have a 5 in the digit after the decimal point of EPS (e.g., in the EPS_diluted_cents column).

I'm trying to do it by taking these steps: Create columns in the table for each of the three bullet points above and then use a filter function. I'd like to create a table with one observation for each company and the following columns: the total times the company had a 4 in the digit after the decimal point of EPS (e.g., in the EPS_diluted_cents column), the total times the company had a 5 in the digit after the decimal point of EPS (e.g., in the EPS_diluted_cents column), the total observations (quarters/rows) in the original dataset, the percentage of observations with 4 in the first digit after the decimal, and the percentage of observations with 5 in the first digit after the decimal.

Here is my head(df):

> head(df)
# A tibble: 6 × 10
  tic   conm     fyearq  fqtr curcdq ibadjq xidoq cshfdq EPS_diluted_cents digit_diluted
  <chr> <chr>     <dbl> <dbl> <chr>   <dbl> <dbl>  <dbl>             <dbl> <chr>        
1 AIR   AAR CORP   1998     3 USD     10.3      0   28.0             36.8  7            
2 AIR   AAR CORP   1998     4 USD     11.7      0   27.7             42.4  3            
3 AIR   AAR CORP   1999     1 USD     10.8      0   27.8             38.9  9            
4 AIR   AAR CORP   1999     2 USD     10.9      0   27.5             39.7  6            
5 AIR   AAR CORP   1999     3 USD     11.0      0   27.3             40.2  1            
6 AIR   AAR CORP   1999     4 USD      2.47     0   27.1              9.11 1

Here is my dput():

> dput(head(df,4))
structure(list(tic = structure(c("AIR", "AIR", "AIR", "AIR"), label = "Ticker Symbol", format.sas = "$"), 
    conm = structure(c("AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"
    ), label = "Company Name", format.sas = "$"), fyearq = structure(c(1998, 
    1998, 1999, 1999), label = "Fiscal Year", format.sas = "F"), 
    fqtr = structure(c(3, 4, 1, 2), label = "Fiscal Quarter", format.sas = "F"), 
    curcdq = structure(c("USD", "USD", "USD", "USD"), label = "ISO Currency Code", format.sas = "$"), 
    ibadjq = structure(c(10.278, 11.735, 10.831, 10.906), label = "Income Before Extraordinary Items - Adjusted for Common Stock Equivalents", format.sas = "F"), 
    xidoq = structure(c(0, 0, 0, 0), label = "Extraordinary Items and Discontinued Operations", format.sas = "F"), 
    cshfdq = structure(c(27.965, 27.69, 27.827, 27.489), label = "Com Shares for Diluted EPS", format.sas = "F"), 
    EPS_diluted_cents = structure(c(36.7530842124084, 42.3799205489346, 
    38.9226291012326, 39.6740514387573), label = "Income Before Extraordinary Items - Adjusted for Common Stock Equivalents", format.sas = "F"), 
    digit_diluted = c("7", "3", "9", "6")), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"), label = "Merged Fundamental Quarterly File")

Here is my df(summary):

> summary(df)
     tic                conm               fyearq          fqtr          curcdq              ibadjq         
 Length:384308      Length:384308      Min.   :1998   Min.   :1.000   Length:384308      Min.   : -6654.90  
 Class :character   Class :character   1st Qu.:2002   1st Qu.:2.000   Class :character   1st Qu.:     1.34  
 Mode  :character   Mode  :character   Median :2007   Median :2.000   Mode  :character   Median :     6.77  
                                       Mean   :2008   Mean   :2.479                      Mean   :   101.59  
                                       3rd Qu.:2013   3rd Qu.:3.000                      3rd Qu.:    35.59  
                                       Max.   :2020   Max.   :4.000                      Max.   :127090.00  
     xidoq             cshfdq         EPS_diluted_cents   digit_diluted     
 Min.   :-3437.0   Min.   :    0.00   Min.   :0.000e 00   Length:384308     
 1st Qu.:    0.0   1st Qu.:   10.74   1st Qu.:1.000e 01   Class :character  
 Median :    0.0   Median :   31.47   Median :2.800e 01   Mode  :character  
 Mean   :    3.6   Mean   :  173.83   Mean   :3.830e 03                     
 3rd Qu.:    0.0   3rd Qu.:   93.00   3rd Qu.:5.600e 01                     
 Max.   :76413.2   Max.   :52184.91   Max.   :1.344e 09

CodePudding user response：

This is relatively untested as no companies in your sample data meet the criteria.

We'll define a helper function to get the first digit after the decimal, then filter out companies that don't meet the criteria, and take the distinct values of those that remain:

first_decimal = function(x) {
  floor(((x %% 1)   1e-6)  * 10)
  ## adding 1e-6 to prevent floating point precision issues
}

df %>% 
  group_by(conm) %>%
  filter(
    n() > 56,
    mean(first_decimal(EPS_diluted_cents) == 4) < 0.0118,
    mean(first_decimal(EPS_diluted_cents) == 5) > 0.11,
  ) %>%
  ungroup() %>%
  distinct(conm)

CodePudding user response：

df %>%
  group_by(tic) %>%
  mutate(eps_dig = floor((EPS_diluted_cents %% 1)*10)) %>%        # EDIT
  summarize(obs = n(),
         eps_dig4_share = sum(eps_dig == 4) / obs,
         eps_dig5_share = sum(eps_dig == 5) / obs) %>%  # EDIT #2 had omitted %>%
  filter(obs >= 56,
         eps_dig4_share < 0.0118,
         eps_dig5_share > 0.11)