TL;DR: I've gotten myself into some trouble. I have a dataframe that I exported into different output files based on a grouping column. Now I want to merge data from those files back into the original dataframe without a common index, knowing only that row order was preserved within each group.
I'm starting with a large dataframe (n > 30,000). One of my columns (text) contains text documents in many different languages. Another column (en) indicates which language each document is in.
MWE:
df <- data.frame(text = c('this is a test', 'questo è un test', 'cest un test', 'another test', 'un autre essai'),
en = c("en", "it", "fr", "en", "fr"))
I wanted to translate them all into English, but needed to do it manually without a Google Translate API. First I did the following to obtain a list of all unique non-English entries:
noneng <- subset(df, en != "en")
noneng <- subset(noneng, !is.na(text))
noneng <- noneng[!duplicated(noneng[, c('text')]), ]
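For the MWE, this leaves just the three non-English rows (roughly what printing noneng should show):
noneng
              text en
2 questo è un test it
3     cest un test fr
5   un autre essai fr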
I then grouped elements of this list by language group and exported them.
library(dplyr)

# export to separate files, one per language group
write_mult = function(x) {
  write.table(x, paste0("translate/untranslated_", unique(x$en), ".txt"))
  return(x)
}

noneng %>%
  group_by(en) %>%
  do(write_mult(.))
I end up with a bunch of delimited .txt files, one for each language group ID. Within each .txt is the complete list of unique rows for that language group.
untranslated_it.txt (1 entry)
untranslated_fr.txt (2 entries)
...
I painstakingly ran these through Google Translate and now have English versions of every row of each .txt. I was planning to replace the untranslated versions in the original dataframe with their translations.
What I (foolishly) didn't think about in advance was how I was going to match the translated versions back to the untranslated ones in the original dataframe without having given them an index/key of some kind.
There must be a way out of this. For example,
- I know the original order of noneng before I exported it.
- group_by and unique must have predictable output based on the order they're given.
So it seems like it should be possible to recover the order of the .txt files even without an index column.
If I knew that order, I could merge them back in positionally, even without an index. But having split the output into multiple .txts has thrown me off, and I'm totally stumped: how would one backward-induct this in R?
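For instance (a rough check using the MWE's en column), the rows that ended up in untranslated_fr.txt should simply be the fr rows of noneng, in their original order:
# the rows written to untranslated_fr.txt, in file order
noneng[noneng$en == "fr", ]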
I hope I'm explaining this clearly. I'd be grateful for any help.
CodePudding user response:
Perhaps something like this:
library(dplyr)
library(readr)   # for read_csv

# helper function to add row numbers per language group
add_row_in_lang <- function(df) {
  df %>% group_by(en) %>% mutate(row = row_number()) %>% ungroup()
}

# add row numbers, then join to a table of the combined translations
df %>%
  add_row_in_lang() %>%
  left_join(
    list.files(path = "translated_csvs", full.names = TRUE) %>%
      purrr::map_dfr(read_csv) %>%
      add_row_in_lang(),
    by = c("en", "row")
  )
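For the MWE, the helper only adds a within-language row counter, which is what makes the positional join possible. Roughly, add_row_in_lang(df) should give:
add_row_in_lang(df)
# A tibble: 5 × 3
  text             en      row
  <chr>            <chr> <int>
1 this is a test   en        1
2 questo è un test it        1
3 cest un test     fr        1
4 another test     en        2
5 un autre essai   fr        2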
Note how I've specified to join just on en and row, since it sounds like in your case those will be the keys.
I'm not sure about the format of your CSVs, so I can't verify or know what tweaks might be required, but if you had the data frames you could do the same thing:
df_fr <- data.frame(text = c('cest un test', 'un autre essai'),
en = c("fr", "fr"), translation = c("This is a test", "Another test"))
df_it <- data.frame(text = c('questo è un test'),
en = c("it"), translation = c("This is a test"))
df_en <- data.frame(text = c('this is a test', 'another test'),
en = c("en"), translation = c("This is a test", "another test"))
df %>%
add_row_in_lang() %>%
left_join(
bind_rows(df_fr, df_it, df_en) %>%
add_row_in_lang(),
by = c("en", "row")
)
# A tibble: 5 × 5
text.x en row text.y translation
<chr> <chr> <int> <chr> <chr>
1 this is a test en 1 this is a test This is a test
2 questo è un test it 1 questo è un test This is a test
3 cest un test fr 1 cest un test This is a test
4 another test en 2 another test another test
5 un autre essai fr 2 un autre essai Another test
Note how in my example here, both tables have the original text -- but I specified that the join should only use en and the row column that we added. Since in my example both data frames have a text column, it outputs as text.x (from the first table) and text.y (from the 2nd one).
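If the translated files are actually the .txt files written by write.table in the question (with a translation column added by hand during the manual translation), the same join should work with read.table in place of read_csv. A rough sketch, assuming those files were saved back into the translate/ directory in write.table's default format:
library(dplyr)
library(purrr)

# read every translated file back in; write.table's defaults mean
# space-separated fields, quoted strings, and row names in an extra first column
translations <- list.files("translate", pattern = "\\.txt$", full.names = TRUE) %>%
  map_dfr(~ read.table(.x, header = TRUE, stringsAsFactors = FALSE))

df %>%
  add_row_in_lang() %>%
  left_join(translations %>% add_row_in_lang(), by = c("en", "row"))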
CodePudding user response:
Not sure what caused me to overthink this. I came up with a rough but straightforward fix. In case it's useful for anyone else, this is what I did (not as clean as Jon's answer, but maybe more comprehensible for newbies).
This step is optional, but first I gave the dataframe an index post hoc and ran the export procedure again. I sampled the untranslated entries to note their indices. Then, I ordered the dataframe by group ID to confirm that the indices do match up (as expected).
noneng <- noneng[order(noneng$en), ]
I noted the group order in the sorted dataframe (alphabetical), imported each of the .txts, and used bind_rows to create a new dataframe, noneng_translated, of equal length. I sorted that dataframe by ID as well. Finally, cbind(noneng, noneng_translated). Simple enough.
# en, it, fr are the imported translated .txt files, one per language;
# coerce every column to character so they can be row-bound cleanly
en <- data.frame(lapply(en, as.character))
it <- data.frame(lapply(it, as.character))
fr <- data.frame(lapply(fr, as.character))
...
noneng_translated <- bind_rows(en, it, fr, ...)
cbind(noneng, noneng_translated)
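One quick sanity check worth adding (a sketch, assuming the translated files kept the original untranslated text column): before the cbind, the text columns of the two pieces should line up exactly.
# TRUE only if the row orders really do match up
all(noneng$text == noneng_translated$text)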
Uglier and riskier than merging on an index would have been, but it did the trick. Curious how others would do this. I'll leave it up in case others in the future make similar mistakes.