How do I remove data that is not relevant for my research?

I'm very new to R. For an exam, I have chosen to focus on only part of my dataset, which covers US companies. I am only interested in the companies in the "Finance and Insurance" and the "Real Estate and Rental and Leasing" sectors. The sector is indicated by the North American Industry Classification System (NAICS) code: it is the first two digits of the six-digit code.

As I said, I am very new to R, but I have tried for a long time to figure this out. In my head, it would make the most sense to create a new column with a binary variable that indicates whether the company is in one of these two sectors, and then later exclude data on that basis. However, I have not managed to create this new column.

I will be thankful for any help on how to do this. Either for creating the binary variable or just excluding the data that is not relevant.

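#### Packages ####

library(dplyr)    # inner_join(), group_by(), summarise(), %>%
library(ggplot2)  # ggplot(), geom_col(), labs()

#### Data ####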

lobby_clean 

compusat 

politicians


#### Clean the "gvkey" for characters and convert to integers ####

lobby_clean[,c(1)]<-sapply(lobby_clean[,c(1)],as.numeric)

#### Merge the different datasets into one ####

lobby_compusat<-inner_join(lobby_clean, compusat, by ="gvkey")

lobby_compusat_politician <- inner_join(lobby_compusat, politicians, by="gvkey")

#### Group by year ####

mean_expend_by_year <- lobby_compusat_politician %>% 
  group_by(year.x) %>% 
  summarise(mean_expend=mean(expend))

#### Construct a plot of the data showing the development of the lobbying expenditures over the years among all companies ####

lobbying_development <- ggplot(data = mean_expend_by_year, mapping = aes(x = year.x, y = mean_expend)) +
  geom_col() +
  labs(title = "Development in lobbying expenditure over time", x = "Year", y = "Average lobbying expenditures")

show(lobbying_development)

#### Exclude data that does not belong to the relevant sectors ####
#### Relevant sectors are:
#### "Finance and Insurance", code starts with: 52
#### "Real Estate and Rental and Leasing", code starts with: 53

## Create a new column based on the first two digits of "naics" that defines the sector to which the company belongs ##

CodePudding user response：

You are using a combination of tidyverse and base R code but I will give some hints using the tidyverse. Generally it is helpful if you provide a little bit more information for us to work with - even a snippet of your data would help.

To extract the first two digits from the North American Industry Classification code, you can add a mutate statement like

library(tidyverse)
df <- df %>% mutate(sector = str_sub(naics, start = 1, end = 2))

You can then filter to include only the two industries you are interested in

df <- df %>% filter(sector %in% c("52", "53"))
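
Since I don't have your data, here is a minimal sketch of the two steps on a made-up data frame. The name `toy` and all of its values are invented for illustration; only the column names `gvkey` and `naics` come from your question, and I am assuming `naics` holds six-digit codes stored as numbers.

```r
library(dplyr)
library(stringr)

# Toy data: the gvkey and naics values below are invented for illustration only.
toy <- data.frame(
  gvkey = c(1001, 1002, 1003, 1004),
  naics = c(522110, 531120, 334111, 524126)
)

toy_relevant <- toy %>%
  # Convert the code to text explicitly, then keep its first two characters
  mutate(sector = str_sub(as.character(naics), start = 1, end = 2)) %>%
  # Keep only "Finance and Insurance" (52) and "Real Estate and Rental and Leasing" (53)
  filter(sector %in% c("52", "53"))

toy_relevant$gvkey
```

In this toy example, the row with the code starting in 33 is dropped and the other three rows are kept. Note that `sector` is a character column, which is why the filter compares against "52" and "53" in quotes.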

Hopefully that will start you off in the right direction.
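
If you would rather build the binary variable you describe first and exclude rows later, a sketch along these lines may help. Again, `toy` and its values are made up, the column name `relevant_sector` is just a name I picked, and I am assuming `naics` is a six-digit numeric code.

```r
library(dplyr)
library(stringr)

# Toy data: the gvkey and naics values below are invented for illustration only.
toy <- data.frame(
  gvkey = c(1001, 1002, 1003, 1004),
  naics = c(522110, 531120, 334111, 524126)
)

toy_flagged <- toy %>%
  mutate(
    # First two characters of the code, taken as text
    sector = str_sub(as.character(naics), start = 1, end = 2),
    # 1 if the company is in sector 52 or 53, otherwise 0
    relevant_sector = as.integer(sector %in% c("52", "53"))
  )

# Later on, exclude everything that is not flagged
toy_relevant <- toy_flagged %>% filter(relevant_sector == 1)
```

The same pipeline should carry over to your merged data by replacing `toy` with `lobby_compusat_politician`, provided the merged table really has a column called `naics`. Keeping the flag as a column also lets you compare the two sectors against all other companies before you drop anything.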