I have a data frame with a text column that contains shortened URLs. I need to filter out some of these shortened URLs. Which URLs to filter is indicated by the values in another column, which is essentially a nested list inside a data frame column that is itself nested inside a column. The nested list contains mostly NULL values and three different types of data frames, which share some columns. I want to use the values of the shared columns as a filter.
The output of dput(head(data$entities, 10)) looks as follows:
structure(list(mentions = list(structure(list(start = 0L, end = 5L,
username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L),
structure(list(start = c(0L, 17L), end = c(16L, 22L), username = c("Sebasti52801956",
"BILD"), id = c("1506995029205729292", "9204502")), class = "data.frame", row.names = 1:2),
structure(list(start = c(0L, 11L), end = c(10L, 16L), username = c("Nico_FHKo",
"BILD"), id = c("3166027097", "9204502")), class = "data.frame", row.names = 1:2),
structure(list(start = 0L, end = 5L, username = "BILD", id = "9204502"), class = "data.frame", row.names = 1L)),
hashtags = list(NULL, NULL, structure(list(start = c(82L,
106L, 133L), end = c(89L, 116L, 145L), tag = c("grünen",
"Bundestag", "Solidarität")), class = "data.frame", row.names = c(NA,
3L)), NULL, NULL, NULL, NULL, NULL, NULL, NULL), urls = list(
NULL, NULL, NULL, NULL, NULL, NULL, structure(list(start = 6L,
end = 29L, url = "t.co/OC9Vo93CAp", expanded_url = "https://twitter.com/ThomasB59914994/status/1510264984034492417/photo/1",
display_url = "pic.twitter.com/OC9Vo93CAp", media_key = "16_1510264976740691973"), class = "data.frame", row.names = 1L),
structure(list(start = 141L, end = 164L, url = "t.co/fSa4TZmzFt",
expanded_url = "https://www.stern.de/digital/technik/sicher--klein-und-billig---china-baut-den-ersten-thorium-reaktor--30632008.html",
display_url = "stern.de/digital/techni…", images = list(
structure(list(url = c("https://pbs.twimg.com/news_img/1557443927816404992/VfYVrEGt?format=jpg&name=orig",
"https://pbs.twimg.com/news_img/1557443927816404992/VfYVrEGt?format=jpg&name=150x150"
), width = c(1440L, 150L), height = c(810L, 150L
)), class = "data.frame", row.names = 1:2)),
status = 200L, title = "Sicher, klein und billig – China baut den ersten Thorium-Reaktor",
description = "Er ist nicht größer als ein Badezimmer. In China wird ein Thorium-Reaktor in Betrieb genommen. 2030 soll es zur Serienproduktion kommen – die Mini-Reaktoren versprechen CO2-freien Strom ohne die Gefahr eines Gaus.",
unwound_url = "https://www.stern.de/digital/technik/sicher--klein-und-billig---china-baut-den-ersten-thorium-reaktor--30632008.html"), class = "data.frame", row.names = 1L),
NULL, NULL), annotations = list(NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL), cashtags = list(
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
NULL)), row.names = c(NA, 10L), class = "data.frame")
What I want to do now is remove those URLs from the text column that correspond to a display_url value starting with "pic." inside the entities$urls list.
So far I have tried this:
if (data$entities$urls != NULL) {
if (data$entities$urls$display_url == "pic.*\\s*"){
data$text <- gsub("http.*\\s*", "", data$text)
}
}
But I always receive the error
Error in if (replys_bu$entities$urls != NULL) { : argument is of length zero
I would appreciate any help.
CodePudding user response:
Here is a vectorized way of replacing the strings in data$text with new values, depending on the corresponding urls entries not being NULL.
Note that entities$urls must be changed to data$entities$urls.
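# positions of the rows whose urls entry is a data frame (not NULL)
i_not_null <- which(!sapply(entities$urls, is.null))
# extract display_url from each non-NULL entry
displ_url <- sapply(entities$urls[i_not_null], `[[`, 'display_url')
# keep a row index where display_url starts with "pic", 0 otherwise
i_text <- i_not_null * grepl("^pic.*\\s*", displ_url)
# drop the zeros so only matching row indices remain
i_text <- i_text[i_text != 0]
# strip everything from "http" onwards in the matching rows
data$text[i_text] <- gsub("http.*\\s*", "", data$text[i_text])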
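For reference, the error in the question comes from comparing against NULL: any comparison with NULL returns logical(0), which if() cannot evaluate, so is.null() has to be used instead. Below is a minimal, self-contained sketch of the same idea on made-up toy data. The values in data, the has_pic name, and the assumption that the text contains full http(s) links are illustrative, not taken from the original data. Unlike the code above, it uses any() so that urls entries with more than one row are handled as well, and it removes only the links themselves rather than everything after "http".

```r
# Toy data mirroring the nested structure in the question (values are made up).
data <- data.frame(
  text = c("no link here",
           "photo https://t.co/aaa",
           "article https://t.co/bbb"),
  stringsAsFactors = FALSE
)
entities <- data.frame(id = 1:3)
entities$urls <- list(
  NULL,
  data.frame(url = "t.co/aaa", display_url = "pic.twitter.com/aaa",
             stringsAsFactors = FALSE),
  data.frame(url = "t.co/bbb", display_url = "stern.de/digital/techni",
             stringsAsFactors = FALSE)
)
data$entities <- entities

# Comparing anything with NULL gives a zero-length logical, hence the if() error.
length(data$entities$urls != NULL)

# TRUE for rows whose urls entry exists and has a display_url starting with "pic."
has_pic <- vapply(
  data$entities$urls,
  function(u) !is.null(u) && any(startsWith(u$display_url, "pic.")),
  logical(1)
)

# Remove the links (and the whitespace before them) only in those rows.
data$text[has_pic] <- gsub("\\s*https?://\\S+", "", data$text[has_pic])
data$text
```

Using vapply() with logical(1) guarantees one TRUE/FALSE per row, so the result can be used directly as a logical index into data$text.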