R: Convert word embeddings (strings) to vector in data frame

Time:08-24

My df looks like this: each row holds a word (e.g., "A") and its corresponding embedding (shown here with far fewer dimensions than in the real df):

word  embedding
A     [-0.0052, 0.0117, -0.0122]
B     [-0.0026, 0.0123, -0.0140]
C     [-0.0137, -0.0014, 0.0028]

I am struggling to convert the embedding variable into simple vectors like c(-0.0052, 0.0117, -0.0122) that I can use to compare to other vectors (e.g., computing cosine similarity). The closest I got is to run this code:

library(dplyr)

# Strip the square brackets, then split each string on ", "
df$embedding <- gsub("\\[|\\]", "", df$embedding)
df <- df %>% rowwise() %>%
  mutate(embedding = strsplit(embedding, split = ", ")) #,
#        embedding = list(as.numeric(embedding)))

However, I still have lists stored in the embedding variable, as seen in the df structure:

> str(df)
rowwise_df [3 x 2] (S3: rowwise_df/tbl_df/tbl/data.frame)
 $ word     : chr [1:3] "A" "B" "C"
 $ embedding:List of 3
  ..$ : chr [1:3] "-0.0052" "0.0117" "-0.0122"
  ..$ : chr [1:3] "-0.0026" "0.0123" "-0.0140"
  ..$ : chr [1:3] "-0.0137" "-0.0014" "0.0028"
 - attr(*, "groups")= tibble [3 x 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:3] 
  .. ..$ : int 1
  .. ..$ : int 2
  .. ..$ : int 3
  .. ..@ ptype: int(0) 

Can anyone help me get rid of the lists within my df?

df <- structure(list(word = c("A", "B", "C"), embedding = c("[-0.0052, 0.0117, -0.0122]", "[-0.0026, 0.0123, -0.0140]", "[-0.0137, -0.0014, 0.0028]")), row.names = c(NA, 3L), class = "data.frame")

CodePudding user response:

Using data.table:

df[paste0("embed", 1:3)] <- lapply(
  data.table::tstrsplit(gsub("\\[|\\]| ", "", df$embedding), ","),
  as.numeric
)

#   word                  embedding  embed1  embed2  embed3
# 1    A [-0.0052, 0.0117, -0.0122] -0.0052  0.0117 -0.0122
# 2    B [-0.0026, 0.0123, -0.0140] -0.0026  0.0123 -0.0140
# 3    C [-0.0137, -0.0014, 0.0028] -0.0137 -0.0014  0.0028
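
If you would rather not depend on data.table, the same split can probably be done in base R. The following is only a sketch, assuming every embedding has the same number of dimensions; the names emb_mat, norms and cos_sim are made up for illustration. Keeping the result as a numeric matrix (one row per word) means all pairwise cosine similarities come out of a single matrix product:

```r
df <- data.frame(
  word = c("A", "B", "C"),
  embedding = c("[-0.0052, 0.0117, -0.0122]",
                "[-0.0026, 0.0123, -0.0140]",
                "[-0.0137, -0.0014, 0.0028]")
)

# Strip the brackets, split on commas, convert each piece to numeric.
# as.numeric() tolerates the leading space left after each comma.
parts <- strsplit(gsub("\\[|\\]", "", df$embedding), ",")

# Stack the vectors into a matrix (assumes equal lengths), one row per word.
emb_mat <- do.call(rbind, lapply(parts, as.numeric))
rownames(emb_mat) <- df$word

# Pairwise cosine similarity: dot products divided by the product of norms.
norms <- sqrt(rowSums(emb_mat^2))
cos_sim <- tcrossprod(emb_mat) / outer(norms, norms)
```

cos_sim["A", "B"] then gives the similarity between words A and B, and emb_mat["A", ] is the plain numeric vector for A.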

CodePudding user response:

Use the reticulate package, since each string is already a valid Python list literal (this requires a working Python installation):

df %>%
  rowwise() %>%
  mutate(embedding = list(reticulate::py_eval(embedding)))

  word  embedding
  <chr> <list>   
1 A     <dbl [3]>
2 B     <dbl [3]>
3 C     <dbl [3]>

If using gsub, then use scan to read the values as numeric:

df %>%
  rowwise() %>%
  mutate(embedding = list(scan(text=gsub("[][]", "", embedding), sep=",")))

  word  embedding
  <chr> <list>   
1 A     <dbl [3]>
2 B     <dbl [3]>
3 C     <dbl [3]>
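
Note that a list column is expected here: a data frame cell can only hold a whole vector if the column is a list. Each element of that list is already an ordinary numeric vector, which you pull out with [[ ]]. A minimal base R sketch of the scan approach without dplyr, where cosine_sim and vec_a are hypothetical names used only for this example:

```r
df <- data.frame(
  word = c("A", "B", "C"),
  embedding = c("[-0.0052, 0.0117, -0.0122]",
                "[-0.0026, 0.0123, -0.0140]",
                "[-0.0137, -0.0014, 0.0028]")
)

# Replace the string column with a list column of numeric vectors.
# quiet = TRUE suppresses scan()'s "Read 3 items" messages.
df$embedding <- lapply(
  gsub("[][]", "", df$embedding),
  function(s) scan(text = s, sep = ",", quiet = TRUE)
)

# Hypothetical helper: cosine similarity of two numeric vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# [[ ]] extracts a single element as a plain numeric vector.
vec_a <- df$embedding[[1]]
sim_ab <- cosine_sim(vec_a, df$embedding[[2]])
```

To look up a vector by word rather than by position, something like df$embedding[[which(df$word == "A")]] should work.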

If you want them in different columns:

cbind(df, read.table(text=gsub("[][]", "", df$embedding), sep = ','))
  word                  embedding      V1      V2      V3
1    A [-0.0052, 0.0117, -0.0122] -0.0052  0.0117 -0.0122
2    B [-0.0026, 0.0123, -0.0140] -0.0026  0.0123 -0.0140
3    C [-0.0137, -0.0014, 0.0028] -0.0137 -0.0014  0.0028

or even

library(tidyr)  # provides unnest_wider()

df %>%
   rowwise() %>%
   mutate(embed = list(reticulate::py_eval(embedding))) %>%
   unnest_wider(embed, names_sep = '')

# A tibble: 3 x 5
  word  embedding                   embed1  embed2  embed3
  <chr> <chr>                        <dbl>   <dbl>   <dbl>
1 A     [-0.0052, 0.0117, -0.0122] -0.0052  0.0117 -0.0122
2 B     [-0.0026, 0.0123, -0.0140] -0.0026  0.0123 -0.014 
3 C     [-0.0137, -0.0014, 0.0028] -0.0137 -0.0014  0.0028