Create loop to subset data by month and year

Time:02-08

UPDATE: I have added the dput() output at the bottom of the post.

I have a large dataset of tweets that I would like to subset by month and year.

data_cleaning$date <- as.Date(data_cleaning$created_at, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"), optional = FALSE)

I used the line of code above to format the date variable in the dataframe below.

id        text                                                                date
1389598   Honored to receive the United Way Award                             2012-05-22
1586410   Joining @TamronHall at 2:15pm today to discuss my bipartisan bill   2013-12-18
138058    Leahy leads 58 senators in lttr to #SecDef Hage                     2013-12-21
1482508   On one anniversary of #Sandy                                        2013-10-29
526459    #SocSec and #Medicare are absolutely vital                          2012-02-02
687900    Check out #MadeInWi BomBoard of Whitewater                          2014-04-14
826551    Congratulations to all of the hard-working @UTSA students           2013-12-21
1409462   Great to see so many proud Colombians                               2012-07-21
1807754   It is our duty to look after our veterans.                          2012-05-28
138057    Leahy presses preserv. of #TotalArmy capability                     2013-12-21

I know how to manually create a subset by month with the following code:

data_cleaning <-  data_cleaning %>% filter(date >= as.Date("2012-01-01"))
data_cleaning <-  data_cleaning %>% filter(date < as.Date("2012-02-01"))

But I have 15 years' worth of tweets, so I would like a loop that subsets the data by month and year and writes a separate file for each. For example, ideally I'd like files called "tweets_Jan2012", "tweets_Feb2012", "tweets_Mar2012", and so on, for each month that appears in the dataset.

Any help would be greatly appreciated.

structure(list(author_id = c("242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999", "242555999", "242555999", "242555999", 
"242555999", "242555999"), text = c("anyon feel unsaf home time alon confidenti help avail even re social distanc https t co jkmayado", 
"actual think rue day decid tri intimid major backfir bravo michael https t co sidonllehr", 
"congratul michaelemann first rate scientist fearless amp resolut remorseless attack fossil fuel forc https t co uyleddecm", 
"today last day rhode island can sign health insur healthsourceri s special enrol period covid look option getcov https t co gkdyasc", 
"april th https t co gskujfum", "translat hey captur court fair squar s want re go threaten point https t co ngtcofm", 
"btw s coincid winner partisan decis big republican donor interest wsjopinion page alway flack", 
"s hard defend dark money prowl around court inde s indefens defens pretend s happen amp just ignor fact overview under problem read https t co bkxvoyyndi", 
"appreci shoutout whitehous effect s control suprem court look instead dark money fedsoc select justic dark money judicialnetwork s role confirm justic amp case win record republican donor https t co cxfqfzf", 
"imposs burger https t co rtbbfqjbap", "question politico playbook ask present answer reconven senat https t co esnzuwqj", 
"tomorrow last day healthsourceri s covid special enrol period uninsur rhode island can sign https t co kyuvaj", 
"signal anyth anyon hello interest fund mitch mcconnel amp parti fund dark money machin surround amp stock suprem court obedi court will unpopular thing can t make elect republican https t co zjpfhvbua", 
"markpatinkin man roll amaz https t co lbqag", "think corpor greedi check million plus dollar incom individu extract averag million care coronavirus bill took hospit got https t co kfuidhx", 
"prioriti protocol senat oper safe push judg big donor fund judici influenc machin republican parti", 
"trump nomin walker second power court nation walker s qualif near press appear defend buddi kavanaugh amp track record zealotri support republican parti polit goal like kill aca https t co nnmkvwgc", 
"mcconnel delay covid relief bill fli kentucki brett kavanaugh celebr yet anoth unqualifi partisan trump judg justin walker https t co ovquouv", 
"small bank still boil mad custom shoulder asid big bank second ppp round power superior autom sba fix happen right amp fair", 
"ad ppp money meet need small busi particular small bank amp credit union got push asid big bank now big bank jam system autom applic frustrat small bank sbagov need fix", 
"truth law got way even republican appoint justic couldn t stomach deliv big win gun manufactur isn t https t co ngtcofm", 
"nra huge republican donor interest didn t get want suprem court right wing donor interest machin kick high gear get justic back line wsjopinion take lead cours https t co goiuipzcb", 
"want abl safe reopen key test https t co ytmjalmvd", "coronavirus warn pdb repeat warn convey presid s daili brief alarm appear fail regist presid routin skip read pdb https t co cfsbemtn", 
"great graphic nytim worth look well done https t co yqevgdh", 
"imagin british paid peopl follow paul rever say british aren t come big hoax fake news can go back bed https t co mvqrxfi", 
"climat scientist paul rever climat chang ring bell loud increas urgenc sinc s https t co eqjtsbknf", 
"mood markwarn tuna melt d recommend tri rhode island s delici seafood instead mayo microwav requir https t co euojwft", 
"thank break true mitch mcconnel special billion dollar special tax provis tuck last covid relief bill benefit millionair amp billionair https t co vimhica", 
"ri great colleg univers deserv support thank brownunivers presid paxson https t co ubqmurtgfi", 
"meanwhil right wing effort exact bring case front group support front group file amicus brief fund manipul crew lot turn case https t co lkhgypp", 
"crack end republican justic gripe court s docket manipul https t co vcaaokwd", 
"true one never know crew strong oppos seema verma unilater hurt rhode island hospit one cmsgov hhsgov abl defend https t co aavdutava", 
"look like sorri episod will end decent note https t co yxgqkl", 
"west warwick teacher make meet hero section https t co edhdkxoc", 
"definit first time histori feder health agenc issu formal warn listen presid s offici https t co jbezonef", 
"good night light adapt coronavirus start way say good night headlight emerg vehicl light hospit children get flashlight flash good night back s beauti thing even without coronavirus https t co mcggau", 
"https t co btrocxmbau", "just anoth public servic whack job creepi dark money land oy https t co sonieqd", 
"first need conquer enemi within corrupt anti scienc anti govern ideolog ordinarili kept aliv fring crank now foment massiv industri resent scientif check public oversight interfer money make https t co xlhcpuetp", 
"heartbreak will back https t co qtggaeopvt", "mitch mcconnel say prioriti covid allow covid legisl bear mind scam flood billion wealthi investor donor make million dollar year covid connect none https t co oeerchqhf", 
"davidcicillin alexthomasdc mzanona sarahnferri heatherscop seungminkim jenhab elistokol markzbarabak mffisher roigfranzia timcarman olgamassov maurajudki laurahayesdc scottagilmor megetz jeffreynpark lisabono afreedma annmmaloney lottieanddoof kfleisher bjeanclement jbonn patijinich jkfrel ericmgarcia emilyaheil mrdanzak juliajfish aaronsfish edmet joelachenbach davidvondrehl abcarianlat chrismegerian davidlaut raineytim ec schneider theodoricmey jdawsey noahbierman ddiamond rachanadixit name perfect movi nomin peopl godfath casablanca blaze saddl butch cassidi sundanc kid darkest hour jimlangevin peterneronha neildsteinberg dawneuer jamesdiossa", 
"republican snuck billion millionairesgiveaway pandem relief bill even though noth fight covid replloyddoggett amp introduc legisl repeal shame tax break amp ensur aid goe american need https t co csyldjqgn", 
"friend talk treat coronavirus lysol inject lung worri https t co ymjlzmrtj", 
"friend exhibit kind behavior worri https t co puoicxw", "pelosi even make mask look cool win mcname getti imag https t co obbxzvujg", 
"good night earth https t co dlcuyunow", "pope franci see natur tragedi earth s respons maltreat ad sin earth neighbor end creator time clean act https t co lrkynccunk", 
"markpatinkin found moment chronicl coronavirus hero anoth amaz piec hope get pulitz https t co kmihfuzuu", 
"serious mitch mcconnel want get state declar bankruptci d better get economist fed treasuri tell look like economi https t co pvhfqoymc", 
"republican gave million dollar tax break millionair vital covid relief packag s time regular american get check", 
"hell republican think jam high incom big donor link covid giveaway follow money https t co xqjpstjnxp", 
"ag threaten governor fight keep citizen safe covid doj serious go sue state tri protect citizen https t co ifzmlkzyd", 
"earth hero earth day thank billmckibben https t co kdvpav", 
"ps corpor america overal noth congress fossil fuel mischief continu big compani thing want climat effort trade associ good net result corpor america oppos congress climat action time fix", 
"corrupt america s govern near evil pollut planet ocean must stop", 
"can make earth day promis reclaim healthi earth reclaim healthi polit expos throw crook dark money fake scienc phoni front group apparatus restor democraci work", 
"fossil fuel industri live subsidi measur intern monetari fund north billion per year u s give industri massiv incent corrupt polit citizen unit robert five gave tool", 
"bipartisanship climat stop dead lost decad fossil fuel fake scienc dark money phoni front group built whole apparatus web denial lie cheat bulli industri scale work", 
"januari infam citizen unit decis present unlimit polit spend inde unlimit anonym polit spend fossil fuel industri instant use new power corrupt polit", 
"earth day s good rememb lot good bipartisan climat legisl senat republican presidenti nomine solid climat platform", 
"serious carbon price stop warm heat amp acidif co", "rememb earth day ocean day can start pass marin plastic bill sos ocean data bill blue globe amp real dollar ocean coastal fund nocsf https t co jryavxcb", 
"re serious solut model tool can help get us safe climat republican even meet won t solv climat without break corrupt grip fossil fuel parti spoiler carbon price work https t co kfhuttiav", 
"happi earthday free globe fossil fuel corrupt https t co rsshjcf", 
"trump administr promis march th near month ago million coronavirus test conduct u s s april st haven t reach million test https t co eoioxyj", 
"today pledg continu fight antisemit amp educ everyon holocaust never happen holocaustremembranceday", 
"get serious renew energi can creat million good stabl job american current work oil amp gas can transit", 
"let s clear s less fossil fuel corpor pillar economi fossil fuel donor central pillar republican fundrais henc parti s creepi crawl obedi fossil fuel interest https t co ddovmoii", 
"senschum just now https t co dlnvvpem", "thank jimlangevin davidcicillin hous stand firm back get way improv posit", 
"sba hospit test s unfortun mitch insist jam us negoti get big victori insist bipartisanship worth wait", 
"faint heart panick misl mitch mcconnel tri jam us sba fund prepar much better bill", 
"throw taxpay dollar behind compani expos economi massiv crash risk realli bad idea s m team brianschatz call federalreserv consid climat risk emerg lend program https t co kojqmrl", 
"still clear direct trump administr money state money educ remain money hospit etc seem like long damn time sinc pass bill s holdup", 
"top straight b giveaway million dollar plus earner republican snuck care s problem need fix money need flow actual small busi worker https t co zhyvmjco", 
"proud polari mep ritin other organ help crisi https t co fwpmrhsap run group busi look purchas suppli mask includ product ri s world class textil compani https t co ktrasoevo", 
"perhap newspap check rhode island s hospit wisdom cut loos negoti go sba https t co nwkjtxjta", 
"republican pretend isn t true preserv flow fossil fuel industri money depend https t co xutocwmwv", 
"happen climat disast take place time coronavirus will learn hear warn https t co gfoxydnra", 
"oh come re put chuck schumer categori mitch mcconnel serious https t co fwirvmjjw", 
"crash number mani marin speci rais price survivor high hunt feroci marin protect essenti stop spiral https t co badbghw", 
"still come learn anyth listen warn https t co bzegjn", "happen happen partisan decis suprem court mani mani advanc big republican donor interest https t co ghwhhwewlw", 
"health care worker plead person protect equip keep amp patient safe need presid talk less amp getuspp https t co uptrrbt", 
"happen trump administr catastroph fail feder coordin leadership everi minut governor health care provid spend tumult wast time better spent problem https t co ikswhfnug", 
"net meter help reduc wast fight climat chang keep cost consum ferc act swift reject shadi front group petit make net meter harder https t co uadxpkivi", 
"hah somebodi els agre corrupt anti govern anti scienc patholog virus bodi polit thank maxboot https t co gmtqmob", 
"earli nation data point troubl racial dispar need research help rais alarm amp protect vulner communiti", 
"proud join senwarren amp timkain legisl ensur hhsgov report vital racial amp demograph data test treatment amp fatal rate covid https t co miqseophbd", 
"anyon care give odd mention climat chang advoc carbon price claim support sincer check bet flunk https t co qobnzxj", 
"markpatinkin fabul job sing unsung hero hero https t co sxygmdreba", 
"consensus price carbon emiss pollut free unfair hinder success climat crisi https t co euqvswdygv", 
"hope panel will good faith bipartisan effort inform scienc best practic", 
"accept presid trump s invit serv bipartisan panel charg determin guidelin safe restart american economi amid covid pandem https t co wiqygnahzh", 
"confront covid amp reviv economi congress pass bill send resourc direct rier can find answer question cash payment health protect small busi relief program amp much websit https t co sjjunovzw https t co odhcokgut", 
"white guy rifl trump sign woman chief execut lock new materi guy https t co mbxsdofp", 
"excel piec immens talent janemayerny dark money fame mitch mcconnel polit dark money oper explain destroy anyth senat stand door corrupt goal https t co eypezjsi", 
"didn t mean us https t co qlxzojkzv"), date = structure(c(18382, 
18382, 18382, 18382, 18382, 18382, 18382, 18382, 18382, 18381, 
18381, 18381, 18381, 18381, 18381, 18380, 18380, 18380, 18380, 
18380, 18380, 18380, 18380, 18380, 18380, 18379, 18379, 18379, 
18379, 18379, 18379, 18379, 18379, 18378, 18378, 18378, 18378, 
18377, 18377, 18377, 18377, 18376, 18376, 18376, 18376, 18376, 
18376, 18376, 18375, 18375, 18375, 18375, 18375, 18375, 18374, 
18374, 18374, 18374, 18374, 18374, 18374, 18374, 18374, 18374, 
18374, 18374, 18373, 18373, 18373, 18373, 18373, 18373, 18373, 
18373, 18373, 18372, 18372, 18372, 18372, 18372, 18371, 18371, 
18371, 18370, 18370, 18370, 18369, 18369, 18369, 18369, 18369, 
18369, 18368, 18368, 18368, 18368, 18368, 18368, 18368, 18368
), class = "Date")), row.names = c(NA, -100L), class = c("data.table", 
"data.frame"))

CodePudding user response:

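# load data.table and convert the data frame by reference
library(data.table)
setDT(data_cleaning)


# create a year-month column such as "2012-05"
data_cleaning[, year_month := format(date, "%Y-%m")]


# split into a named list with one data.table per year-month
split(data_cleaning, data_cleaning$year_month)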

Each element of this list corresponds to one value of the year_month column, and the list names are those values. Please try uploading a reproducible set of data with dput(data_cleaning) for your question.
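To get from the split list to the separate files the question asks for, one option is a plain loop over the list names. The following is a minimal base R sketch, not the answerer's own code: the `tweets` data frame is a made-up stand-in for `data_cleaning`, `out_dir` is an assumed output location, and the "tweets_Jan2012" style label is built with `month.abb` so that it does not depend on the session locale.

```r
# Hypothetical sample data standing in for data_cleaning
tweets <- data.frame(
  id   = c(1389598, 526459, 1586410, 138058),
  date = as.Date(c("2012-05-22", "2012-02-02", "2013-12-18", "2013-12-21"))
)

# Year-month key such as "2012-05"
tweets$year_month <- format(tweets$date, "%Y-%m")

# Named list with one data frame per year-month
tweets_by_month <- split(tweets, tweets$year_month)

# Assumed output location; replace with a real directory
out_dir <- file.path(tempdir(), "tweets_by_month")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (key in names(tweets_by_month)) {
  # Turn "2012-02" into "Feb2012" using base R's English month abbreviations
  label <- paste0(month.abb[as.integer(substr(key, 6, 7))], substr(key, 1, 4))
  out_file <- file.path(out_dir, paste0("tweets_", label, ".csv"))
  write.csv(tweets_by_month[[key]], out_file, row.names = FALSE)
}

# Names of the files that were written
written_files <- sort(list.files(out_dir))
```

With real data, only months that actually occur in the date column produce a file, because `split()` creates one list element per observed year_month value.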

CodePudding user response:

You may want to consider making use of the nest function. Drawing on the following:

But I have 15 years worth of tweets that I would like to create a loop to subset this data by month and year, such that I have separate files for each.¹

You can create a list of nested frames in the following manner:

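library("tidyverse")
# sample_data is the data frame recreated from the dput() output in the question
data_nested <- sample_data %>%
    # label each row with its year and month, e.g. "2020-04"
    mutate(year_month = format(date, "%Y-%m")) %>%
    group_by(year_month) %>%
    # collapse each group into one row holding a nested data frame
    nest()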

You can then save each nested data frame to a separate CSV file, using the year_month value in the file name. The paste0() call can be adjusted if you want to add further elements to the file name.

# write one CSV per nested data frame; walk2() is map2() called only for its side effects
walk2(data_nested$data, data_nested$year_month,
      ~ write_csv(x = .x, file = paste0(.y, ".csv")))

Notes

  • sample_data originates from the provided dput() output; because all of its dates fall in April 2020, this data results in only one file, 2020-04.csv

¹ You may want to explore the group_by and nest functions, as they are fairly efficient and may be more convenient than handling the data via separate files. Finally, for "big data" challenges, solutions like Spark may be more efficient than having to loop over multiple files.
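As an illustration of working with the groups directly instead of writing files, here is a small sketch using dplyr. The `sample_tweets` tibble and the summary columns `n_tweets` and `first_tweet` are invented for the example and are not part of the original data.

```r
library(dplyr)

# Hypothetical sample data; the real data would come from the question's dput() output
sample_tweets <- tibble(
  text = c("a", "b", "c", "d"),
  date = as.Date(c("2012-05-22", "2012-05-28", "2013-12-18", "2013-12-21"))
)

# Summarise per year-month without creating any intermediate files
monthly_counts <- sample_tweets %>%
  mutate(year_month = format(date, "%Y-%m")) %>%
  group_by(year_month) %>%
  summarise(
    n_tweets    = n(),        # number of tweets in the month
    first_tweet = min(date),  # earliest tweet date in the month
    .groups     = "drop"
  )
```

Any per-month analysis can be placed inside `summarise()` in the same way, which avoids reading the monthly files back in later.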
