How to equalize the number of variables from different datasets?

Time:11-07

I have data sets of US stocks from 5 different years: 2014, 2015, 2016, 2017, and 2018. The 2014 set has 901 variables, while 2015 has 1386, 2016 has 1469, and so on. I want to downsize all of them to 901 so I can easily compare them and show the movement of stocks from 2014 to 2018. How can I do this?


CodePudding user response:

The following should get you going, Ayaz.

  1. Define a data frame for 2014 and one for 2015 to simulate your data sets.
  2. Trim the second data frame based on the stock names found in the first data frame. You will notice that the 2015 data has a few extra names and that one name from 2014 ("C") is missing.

Note that we look for the names in stock_2014$STOCK. You may have them defined in another vector or pull them from elsewhere.

Since you mention filtering, I assume you use the tidyverse. You can build your filter criterion on the stock names and use %in% to check for their occurrence.

library(dplyr)

# simulate your data for 2 years 
stock_2014 <- data.frame(STOCK = c("A","B","C","D"), VALUE = c(12,34,56,78))
stock_2015 <- data.frame(STOCK = c("A", "A1","A2", "B", "D","D1","E"), VALUE = c(12,23,34,45,56,89,38))

stock_2015_trimmed <- stock_2015 %>% filter(STOCK %in% stock_2014$STOCK)
stock_2015_trimmed
#   STOCK VALUE
# 1     A    12
# 2     B    45
# 3     D    56
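
Note that after trimming, 2015 has 3 rows while 2014 still has 4, because "C" is missing from 2015. If the goal is the same set of stocks in every year, one option is to keep only the names that occur in all years. The following is a minimal sketch of that idea, not a definitive solution: the list `stocks`, the names `common_stocks` and `stocks_trimmed`, and the 2016 values are made up for illustration. It uses base R only, so it runs without extra packages; `filter(STOCK %in% common_stocks)` from dplyr would do the same job as the row indexing.

```r
# Hypothetical yearly data; the 2016 values are invented for illustration.
stocks <- list(
  `2014` = data.frame(STOCK = c("A", "B", "C", "D"),
                      VALUE = c(12, 34, 56, 78)),
  `2015` = data.frame(STOCK = c("A", "A1", "A2", "B", "D", "D1", "E"),
                      VALUE = c(12, 23, 34, 45, 56, 89, 38)),
  `2016` = data.frame(STOCK = c("A", "B", "C", "D", "F"),
                      VALUE = c(15, 40, 60, 50, 99))
)

# Stock names that are present in every year, not only in 2014.
common_stocks <- Reduce(intersect, lapply(stocks, function(df) df$STOCK))

# Keep only the common stocks in each year's data frame.
stocks_trimmed <- lapply(stocks, function(df) df[df$STOCK %in% common_stocks, ])

# Row counts per year; they should now all be equal.
sapply(stocks_trimmed, nrow)
```

With your real data you would put the five yearly data frames into the list and replace STOCK with whatever column holds the stock names.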