My df looks like this: each row holds a word (e.g., "A") and its corresponding embedding (shown here with far fewer dimensions than in the real df):
word embedding
A [-0.0052, 0.0117, -0.0122]
B [-0.0026, 0.0123, -0.0140]
C [-0.0137, -0.0014, 0.0028]
I am struggling to convert the embedding variable into simple numeric vectors like c(-0.0052, 0.0117, -0.0122)
that I can compare to other vectors (e.g., by computing cosine similarity). The closest I got was running this code:
df$embedding <- gsub("\\[|\\]", "", df$embedding)
df <- df %>% rowwise %>%
  mutate(embedding = strsplit(embedding, split = ", "))#,
         #embedding = list(as.numeric(embedding)))
However, I still have lists stored in the embedding variable, as seen in the df structure:
> str(test2)
rowwise_df [3 x 2] (S3: rowwise_df/tbl_df/tbl/data.frame)
$ word : chr [1:3] "A" "B" "C"
$ embedding:List of 3
..$ : chr [1:3] "-0.0052" "0.0117" "-0.0122"
..$ : chr [1:3] "-0.0026" "0.0123" "-0.0140"
..$ : chr [1:3] "-0.0137" "-0.0014" "0.0028"
- attr(*, "groups")= tibble [3 x 1] (S3: tbl_df/tbl/data.frame)
..$ .rows: list<int> [1:3]
.. ..$ : int 1
.. ..$ : int 2
.. ..$ : int 3
.. ..@ ptype: int(0)
Can anyone help me get rid of the lists within my df?
df <- structure(list(word = c("A", "B", "C"), embedding = c("[-0.0052, 0.0117, -0.0122]", "[-0.0026, 0.0123, -0.0140]", "[-0.0137, -0.0014, 0.0028]")), row.names = c(NA, 3L), class = "data.frame")
CodePudding user response:
Using data.table:
df[paste0("embed", 1:3)] <- lapply(
  data.table::tstrsplit(gsub("\\[|\\]| ", "", df$embedding), ","),
  as.numeric
)
# word embedding embed1 embed2 embed3
# 1 A [-0.0052, 0.0117, -0.0122] -0.0052 0.0117 -0.0122
# 2 B [-0.0026, 0.0123, -0.0140] -0.0026 0.0123 -0.0140
# 3 C [-0.0137, -0.0014, 0.0028] -0.0137 -0.0014 0.0028
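For readers without data.table, the same split-and-transpose step can be sketched in base R. This is an illustrative equivalent rather than part of the answer above; the names parts and mat are chosen here for the example, and the column count is taken from the data instead of being hard-coded as 1:3.

```r
df <- data.frame(
  word = c("A", "B", "C"),
  embedding = c("[-0.0052, 0.0117, -0.0122]",
                "[-0.0026, 0.0123, -0.0140]",
                "[-0.0137, -0.0014, 0.0028]")
)

# Strip brackets and spaces, then split each string on commas.
# The result is a list with one character vector per row.
parts <- strsplit(gsub("\\[|\\]| ", "", df$embedding), ",")

# Convert each piece to numeric and stack the pieces row by row:
# one row per word, one column per embedding dimension.
mat <- do.call(rbind, lapply(parts, as.numeric))

# Attach the columns under the same names the data.table version uses.
df[paste0("embed", seq_len(ncol(mat)))] <- as.data.frame(mat)

str(df)
```

This assumes every embedding has the same number of dimensions; rows of unequal length would make rbind recycle values with a warning.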
CodePudding user response:
Use the reticulate package (which requires a Python installation) as shown below:
df %>%
  rowwise() %>%
  mutate(embedding = list(reticulate::py_eval(embedding)))
word embedding
<chr> <list>
1 A <dbl [3]>
2 B <dbl [3]>
3 C <dbl [3]>
If using gsub, then use scan to read the values as numeric:
df %>%
  rowwise() %>%
  mutate(embedding = list(scan(text=gsub("[][]", "", embedding), sep=",")))
word embedding
<chr> <list>
1 A <dbl [3]>
2 B <dbl [3]>
3 C <dbl [3]>
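A data frame cell cannot hold a whole vector unless the column is a list, so the list-column itself is expected; what matters is that each element is now a numeric vector, which [[ extracts. Below is a minimal sketch of the cosine-similarity comparison mentioned in the question. The cosine helper is a hypothetical name written for this example, and base R lapply stands in for the rowwise() pipeline so the snippet runs without dplyr.

```r
df <- data.frame(
  word = c("A", "B", "C"),
  embedding = c("[-0.0052, 0.0117, -0.0122]",
                "[-0.0026, 0.0123, -0.0140]",
                "[-0.0137, -0.0014, 0.0028]")
)

# Parse each bracketed string into a numeric vector (same gsub + scan idea as above).
vecs <- lapply(
  gsub("[][]", "", df$embedding),
  function(s) scan(text = s, sep = ",", quiet = TRUE)
)
names(vecs) <- df$word

# Hypothetical helper: cosine similarity of two numeric vectors of equal length.
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Pull individual vectors out of the list with [[ and compare them.
a <- vecs[["A"]]
b <- vecs[["B"]]
sim_ab <- cosine(a, b)
print(sim_ab)
```

With the dplyr version, the equivalent extraction would be the first element of the embedding column, again via [[.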
If you want them in different columns:
cbind(df, read.table(text=gsub("[][]", "", df$embedding), sep = ','))
word embedding V1 V2 V3
1 A [-0.0052, 0.0117, -0.0122] -0.0052 0.0117 -0.0122
2 B [-0.0026, 0.0123, -0.0140] -0.0026 0.0123 -0.0140
3 C [-0.0137, -0.0014, 0.0028] -0.0137 -0.0014 0.0028
or even, with tidyr's unnest_wider:
df %>%
  rowwise() %>%
  mutate(embed = list(reticulate::py_eval(embedding))) %>%
  unnest_wider(embed, names_sep = '')
# A tibble: 3 x 5
word embedding embed1 embed2 embed3
<chr> <chr> <dbl> <dbl> <dbl>
1 A [-0.0052, 0.0117, -0.0122] -0.0052 0.0117 -0.0122
2 B [-0.0026, 0.0123, -0.0140] -0.0026 0.0123 -0.014
3 C [-0.0137, -0.0014, 0.0028] -0.0137 -0.0014 0.0028
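Once the dimensions sit in separate numeric columns, all pairwise cosine similarities can be computed in one step by treating those columns as a matrix. The following is a sketch under the assumption that every row has the same number of dimensions; emb, unit, and sim are names chosen for this example.

```r
df <- data.frame(
  word = c("A", "B", "C"),
  embedding = c("[-0.0052, 0.0117, -0.0122]",
                "[-0.0026, 0.0123, -0.0140]",
                "[-0.0137, -0.0014, 0.0028]")
)

# Parse the strings into a numeric matrix: one row per word.
emb <- as.matrix(read.table(text = gsub("[][]", "", df$embedding), sep = ","))
rownames(emb) <- df$word

# Scale each row to unit length. The vector of row norms is recycled
# down the columns, so row i is divided by its own norm.
unit <- emb / sqrt(rowSums(emb^2))

# Dot products of unit vectors are cosine similarities.
sim <- unit %*% t(unit)
print(round(sim, 3))
```

The entry in row "A" and column "B" of sim is the cosine similarity between the embeddings of A and B, and the diagonal is 1 up to floating-point error.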