Randomize two csv files but with the indexes in the same order

Time:12-31

I have two csv files, with multiple columns with text. They both have the same text, but in different languages. So for example csv1 would look like:

header1               header2
How are you           Good
What day is it        Friday
Whats your name       Mary

And csv2 would be:

header1               header2
Qué tal estás         Bien
Qué dia es            Viernes
Cómo te llamas        María

Now I want to randomize them both, but I need the translations to still be in the same order. In other words, I need the order of the indexes to be the same: if index 1 is randomized to be the last in csv1, I want the same for csv2:

header1               header2
What day is it        Friday
Whats your name       Mary
How are you           Good


header1               header2
Qué dia es            Viernes
Cómo te llamas        María
Qué tal estás         Bien

This is what I have done:

import pandas as pd

df = pd.read_csv('train.csv')

data = df.sample(frac=1)

However, with this code the two csv files end up in different orders. Is there a way to randomize the files while keeping the order of the indexes the same in both?

I apologize if something is not well explained; it's my first time both on this website and with coding.

CodePudding user response:

Say you have a file called english.csv with the following contents:

header1               header2
How are you           Good
What day is it        Friday
What's your name      Mary

And a file called spanish.csv with the following contents:

header1               header2
Qué tal estás         Bien
Qué dia es            Viernes
Cómo te llamas        María

If you want to shuffle the rows of both files in the same way, you could use np.random.permutation to generate a single shuffled order of row positions and apply it to both:

import pandas as pd
import numpy as np

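# A regex separator (two or more whitespace characters) needs the python engine
english_df = pd.read_csv('english.csv', sep=r'\s{2,}', engine='python')
spanish_df = pd.read_csv('spanish.csv', sep=r'\s{2,}', engine='python')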
# Assuming english_df and spanish_df have same number of rows
shuffled_indices_order = np.random.permutation(len(english_df))
shuffled_english_df = english_df.iloc[shuffled_indices_order]
shuffled_spanish_df = spanish_df.iloc[shuffled_indices_order]
shuffled_english_df.to_csv('shuffled_english.csv', index=False)
shuffled_spanish_df.to_csv('shuffled_spanish.csv', index=False)

Possible output after running above:

shuffled_english.csv:

header1,header2
What day is it,Friday
How are you,Good
What's your name,Mary

shuffled_spanish.csv:

header1,header2
Qué dia es,Viernes
Qué tal estás,Bien
Cómo te llamas,María
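
If the shuffle needs to be reproducible, a seeded generator can produce the permutation instead. The following is only a minimal sketch: it assumes small in-memory DataFrames in place of the two CSV files, and the seed value 0 is arbitrary. It also checks that the two shuffled frames still line up row by row.

```python
import numpy as np
import pandas as pd

# Small in-memory stand-ins for english.csv and spanish.csv (illustrative data).
english_df = pd.DataFrame({
    "header1": ["How are you", "What day is it", "What's your name"],
    "header2": ["Good", "Friday", "Mary"],
})
spanish_df = pd.DataFrame({
    "header1": ["Qué tal estás", "Qué dia es", "Cómo te llamas"],
    "header2": ["Bien", "Viernes", "María"],
})

# A shared permutation only makes sense if both frames have the same length.
assert len(english_df) == len(spanish_df)

# A seeded generator yields the same permutation on every run (seed 0 is arbitrary).
rng = np.random.default_rng(0)
order = rng.permutation(len(english_df))

# Apply the same positional order to both frames.
shuffled_english_df = english_df.iloc[order]
shuffled_spanish_df = spanish_df.iloc[order]

# The original row labels travel with the rows, so equal indexes confirm the pairing.
print(shuffled_english_df.index.equals(shuffled_spanish_df.index))
```

Because .iloc keeps the original row labels, comparing the two indexes is a cheap way to confirm that each English row still sits next to its translation.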

CodePudding user response:

df1_shuff = df1.sample(frac=1)
df2_shuff = df2.reindex(df1_shuff.index)

Assuming the two dfs started with the same, regular RangeIndex (which you get when doing the pd.read_csv() as the OP does), then the two df_shuff are both shuffled the same way.

I would add that the only additional line required after the OP's code is (assuming the other df is named df2, but replace as needed):

data2 = df2.reindex(data.index)
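
Put together, the sample-then-reindex approach might look like the sketch below. It assumes in-memory DataFrames named df1 and df2 (stand-ins for the two CSV files) that share the same default RangeIndex; random_state=42 is an arbitrary choice and only there to make the shuffle repeatable.

```python
import pandas as pd

# Illustrative stand-ins for the two CSV files; both get a default RangeIndex.
df1 = pd.DataFrame({
    "header1": ["How are you", "What day is it", "Whats your name"],
    "header2": ["Good", "Friday", "Mary"],
})
df2 = pd.DataFrame({
    "header1": ["Qué tal estás", "Qué dia es", "Cómo te llamas"],
    "header2": ["Bien", "Viernes", "María"],
})

# Shuffle the first frame; its index now records the shuffled order of row labels.
df1_shuff = df1.sample(frac=1, random_state=42)

# Reorder the second frame to follow exactly the same labels.
df2_shuff = df2.reindex(df1_shuff.index)

# Optionally drop the old labels before writing the results out.
df1_out = df1_shuff.reset_index(drop=True)
df2_out = df2_shuff.reset_index(drop=True)
```

If the two frames do not share the same index labels, reindex fills the missing rows with NaN, so it is worth checking that both files have the same number of rows first.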