I have two csv files, with multiple columns with text. They both have the same text, but in different languages. So for example csv1 would look like:
header1 header2
How are you Good
What day is it Friday
Whats your name Mary
And csv2 would be:
header1 header2
Qué tal estás Bien
Qué dia es Viernes
Cómo te llamas María
Now I want to randomize them both, but I need the translations to stay aligned. In other words, the row order must be the same in both files: if index 1 is randomized to be last in csv1, the same must happen in csv2:
header1 header2
What day is it Friday
Whats your name Mary
How are you Good
header1 header2
Qué dia es Viernes
Cómo te llamas María
Qué tal estás Bien
This is what I have done:
import pandas as pd
df = pd.read_csv('train.csv')
data = df.sample(frac=1)
However, with this code, both csv files end up in different orders. Is there a way to randomize the files while keeping the order of the indexes the same in both?
I apologize if something is not well explained; it's my first time both on this website and coding.
CodePudding user response:
Say you have a file called english.csv with the following contents:
header1 header2
How are you Good
What day is it Friday
What's your name Mary
And a file called spanish.csv with the following contents:
header1 header2
Qué tal estás Bien
Qué dia es Viernes
Cómo te llamas María
If you want to shuffle the rows the same way for both, you could use np.random.permutation to generate one shuffled order of row positions and apply it to both DataFrames:
import pandas as pd
import numpy as np
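# Columns are separated by two or more whitespace characters;
# a regex separator needs the python parsing engine
english_df = pd.read_csv('english.csv', sep=r'\s{2,}', engine='python')
spanish_df = pd.read_csv('spanish.csv', sep=r'\s{2,}', engine='python')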
# Assuming english_df and spanish_df have same number of rows
shuffled_indices_order = np.random.permutation(len(english_df))
shuffled_english_df = english_df.iloc[shuffled_indices_order]
shuffled_spanish_df = spanish_df.iloc[shuffled_indices_order]
shuffled_english_df.to_csv('shuffled_english.csv', index=False)
shuffled_spanish_df.to_csv('shuffled_spanish.csv', index=False)
Possible output after running above:
shuffled_english.csv:
header1,header2
What day is it,Friday
How are you,Good
What's your name,Mary
shuffled_spanish.csv:
header1,header2
Qué dia es,Viernes
Qué tal estás,Bien
Cómo te llamas,María
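If the shuffle should be repeatable between runs, one option is to draw the permutation from a seeded generator. The following is a minimal sketch, not part of the original answer: the helper name shuffle_together and the seed value are made up for illustration, and the DataFrames are built in memory instead of being read from the csv files.

```python
import numpy as np
import pandas as pd


def shuffle_together(df_a, df_b, seed=None):
    """Shuffle two equally long DataFrames with one shared row permutation.

    Hypothetical helper for illustration; `seed` makes the order repeatable.
    """
    if len(df_a) != len(df_b):
        raise ValueError("both DataFrames must have the same number of rows")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df_a))  # one permutation of row positions
    # Apply the same positions to both frames, then renumber the rows
    return (
        df_a.iloc[order].reset_index(drop=True),
        df_b.iloc[order].reset_index(drop=True),
    )


english_df = pd.DataFrame({
    "header1": ["How are you", "What day is it", "What's your name"],
    "header2": ["Good", "Friday", "Mary"],
})
spanish_df = pd.DataFrame({
    "header1": ["Qué tal estás", "Qué dia es", "Cómo te llamas"],
    "header2": ["Bien", "Viernes", "María"],
})

shuffled_english, shuffled_spanish = shuffle_together(english_df, spanish_df, seed=0)
print(shuffled_english)
print(shuffled_spanish)
```

Checking the lengths up front is a design choice: with mismatched files, iloc would otherwise fail with a less obvious out-of-bounds error or silently leave some rows out.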
CodePudding user response:
df1_shuff = df1.sample(frac=1)
df2_shuff = df2.reindex(df1_shuff.index)
Assuming the two dfs started with the same, regular RangeIndex (which you get when doing the pd.read_csv() as the OP does), then the two df_shuff are both shuffled the same way.
I would add that the only additional line required after the OP's code is the following (assuming the other df is named df2; rename as needed):
data2 = df2.reindex(data.index)
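The index-based approach from this answer can be checked on small in-memory frames. This is only a sketch under the same assumption as above: both DataFrames share a default RangeIndex. The sample data is taken from the question, and the random_state value is arbitrary and only there to make the run repeatable.

```python
import pandas as pd

df1 = pd.DataFrame({
    "header1": ["How are you", "What day is it", "Whats your name"],
    "header2": ["Good", "Friday", "Mary"],
})
df2 = pd.DataFrame({
    "header1": ["Qué tal estás", "Qué dia es", "Cómo te llamas"],
    "header2": ["Bien", "Viernes", "María"],
})

# Shuffle the first frame; its index labels record the new row order
df1_shuff = df1.sample(frac=1, random_state=1)

# Reorder the second frame by those same index labels
df2_shuff = df2.reindex(df1_shuff.index)

# Both frames should now list their rows in the same label order
same_order = df1_shuff.index.equals(df2_shuff.index)
print(df1_shuff)
print(df2_shuff)
print(same_order)
```

Note that reindex matches rows by label, so if the two frames had different indexes (for example after filtering one of them), missing labels would show up as rows of NaN instead of raising an error.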