Here is what I'm trying to accomplish:
I want to create one row per company through summarizing. Company names are in "conm" column I want to only include a company in the table if the aggregated results from that company meet the following criteria:
- The company has more than 56 observations (quarters/rows) in the original, Df.
- Less than 1.18% of all of the company’s observations have a 4 in the digit after the decimal point of EPS (e.g., in the
EPS_diluted_cents
column). - More than 11% of all of the company’s observations have a 5 in the digit after the decimal point of EPS (e.g., in the
EPS_diluted_cents
column).
I'm trying to do it by taking these steps:
Create columns in the table for each of the three bullet points above and then use a filter function. I'd like to create a table with one observation for each company and the following columns: the total times the company had a 4 in the digit after the decimal point of EPS (e.g., in the EPS_diluted_cents
column), the total times the company had a 5 in the digit after the decimal point of EPS (e.g., in the EPS_diluted_cents
column), the total observations (quarters/rows) in the original dataset, the percentage of observations with 4 in the first digit after the decimal, and the percentage of observations with 5 in the first digit after the decimal.
Here is my head(df):
> head(df)
# A tibble: 6 × 10
tic conm fyearq fqtr curcdq ibadjq xidoq cshfdq EPS_diluted_cents digit_diluted
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 AIR AAR CORP 1998 3 USD 10.3 0 28.0 36.8 7
2 AIR AAR CORP 1998 4 USD 11.7 0 27.7 42.4 3
3 AIR AAR CORP 1999 1 USD 10.8 0 27.8 38.9 9
4 AIR AAR CORP 1999 2 USD 10.9 0 27.5 39.7 6
5 AIR AAR CORP 1999 3 USD 11.0 0 27.3 40.2 1
6 AIR AAR CORP 1999 4 USD 2.47 0 27.1 9.11 1
Here is my dput():
> dput(head(df,4))
structure(list(tic = structure(c("AIR", "AIR", "AIR", "AIR"), label = "Ticker Symbol", format.sas = "$"),
conm = structure(c("AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"
), label = "Company Name", format.sas = "$"), fyearq = structure(c(1998,
1998, 1999, 1999), label = "Fiscal Year", format.sas = "F"),
fqtr = structure(c(3, 4, 1, 2), label = "Fiscal Quarter", format.sas = "F"),
curcdq = structure(c("USD", "USD", "USD", "USD"), label = "ISO Currency Code", format.sas = "$"),
ibadjq = structure(c(10.278, 11.735, 10.831, 10.906), label = "Income Before Extraordinary Items - Adjusted for Common Stock Equivalents", format.sas = "F"),
xidoq = structure(c(0, 0, 0, 0), label = "Extraordinary Items and Discontinued Operations", format.sas = "F"),
cshfdq = structure(c(27.965, 27.69, 27.827, 27.489), label = "Com Shares for Diluted EPS", format.sas = "F"),
EPS_diluted_cents = structure(c(36.7530842124084, 42.3799205489346,
38.9226291012326, 39.6740514387573), label = "Income Before Extraordinary Items - Adjusted for Common Stock Equivalents", format.sas = "F"),
digit_diluted = c("7", "3", "9", "6")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"), label = "Merged Fundamental Quarterly File")
Here is my df(summary):
> summary(df)
tic conm fyearq fqtr curcdq ibadjq
Length:384308 Length:384308 Min. :1998 Min. :1.000 Length:384308 Min. : -6654.90
Class :character Class :character 1st Qu.:2002 1st Qu.:2.000 Class :character 1st Qu.: 1.34
Mode :character Mode :character Median :2007 Median :2.000 Mode :character Median : 6.77
Mean :2008 Mean :2.479 Mean : 101.59
3rd Qu.:2013 3rd Qu.:3.000 3rd Qu.: 35.59
Max. :2020 Max. :4.000 Max. :127090.00
xidoq cshfdq EPS_diluted_cents digit_diluted
Min. :-3437.0 Min. : 0.00 Min. :0.000e 00 Length:384308
1st Qu.: 0.0 1st Qu.: 10.74 1st Qu.:1.000e 01 Class :character
Median : 0.0 Median : 31.47 Median :2.800e 01 Mode :character
Mean : 3.6 Mean : 173.83 Mean :3.830e 03
3rd Qu.: 0.0 3rd Qu.: 93.00 3rd Qu.:5.600e 01
Max. :76413.2 Max. :52184.91 Max. :1.344e 09
CodePudding user response:
This is relatively untested as no companies in your sample data meet the criteria.
We'll define a helper function to get the first digit after the decimal, then filter out companies that don't meet the criteria, and take the distinct values of those that remain:
first_decimal = function(x) {
floor(((x %% 1) 1e-6) * 10)
## adding 1e-6 to prevent floating point precision issues
}
df %>%
group_by(conm) %>%
filter(
n() > 56,
mean(first_decimal(EPS_diluted_cents) == 4) < 0.0118,
mean(first_decimal(EPS_diluted_cents) == 5) > 0.11,
) %>%
ungroup() %>%
distinct(conm)
CodePudding user response:
df %>%
group_by(tic) %>%
mutate(eps_dig = floor((EPS_diluted_cents %% 1)*10)) %>% # EDIT
summarize(obs = n(),
eps_dig4_share = sum(eps_dig == 4) / obs,
eps_dig5_share = sum(eps_dig == 5) / obs) %>% # EDIT #2 had omitted %>%
filter(obs >= 56,
eps_dig4_share < 0.0118,
eps_dig5_share > 0.11)