split large data frame by unique column ID and print to a new txt file

Time:10-13

I have a large data frame from a single .txt file, with >90,000 rows spanning ~85 unique chr IDs, like so:

region chr midPos Nsites fst
12323 102 37500 20012 0.387763
12323 102 62500 17353 0.359275
12323 102 87500 20059 0.360935
12323 102 112500 20132 0.447558
12323 102 137500 17996 0.476738
12323 102 162500 20275 0.363752
12323 102 187500 13649 0.53061
12323 103 262500 8897 0.629894
12323 103 287500 18855 0.360819
12323 103 312500 19559 0.486737
12323 103 337500 19973 0.501365
12323 104 362500 19102 0.361309
12323 104 387500 18688 0.331698
12323 104 412500 18620 0.548834

What I would like to do is group by chr ID so that I get one .txt file per unique chr ID, where each file contains only the rows belonging to that ID and each file carries the unique chr ID in its name. For example, my toy df would split into 3 files named 102.txt, 103.txt, and 104.txt, each containing only the matching rows. So file 102.txt would look like:

region chr midPos Nsites fst
12323 102 37500 20012 0.387763
12323 102 62500 17353 0.359275
12323 102 87500 20059 0.360935
12323 102 112500 20132 0.447558
12323 102 137500 17996 0.476738
12323 102 162500 20275 0.363752
12323 102 187500 13649 0.53061

I know that in R I can split the df by chr ID with:

df<-read.table('master_dataframe.txt', header=TRUE)
data_list <- split(df, f = df$chr)
head(df)

but how do I write each one out to a file named with its unique ID? Thank you for any suggestions/help!

CodePudding user response:

You could take the unique chr IDs and use a for loop to iterate over them, printing the subset of the data frame for each one:

for (i in unique(df$chr)) {
  print(subset(df, chr == i))
}

CodePudding user response:

OK, turns out this is actually pretty straightforward using Python. Here is what I ended up doing:

import pandas as pd

# The file is whitespace-separated and already has a header row
df = pd.read_csv('inputDataframe.txt', sep=r'\s+')
grouped = df.groupby('chr')

for name in df['chr'].unique():
    temporary_df = grouped.get_group(name)
    print(temporary_df)
    temporary_df.to_csv(f'fst_{name}_raw.txt', sep='\t')

So I took the .txt file, which was structured as above, used a Python script to read it in and group by chr ID, and then the for loop iterates through, grabs each unique chr ID, and prints its rows to a new .txt file. The one wrinkle was that each new chr .txt file gained a new first column holding the row number from the main data frame. I did not figure out how to prevent this in Python, but used some Unix scripting to fix it:

for i in fst_*.txt; do
    name1=${i%.*}
    name2=${name1%_*}
    cut -f2,3,4,5,6 "$name1".txt > "$name2".txt
done
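For what it's worth, the extra column can be avoided at the source: pandas' `to_csv` accepts `index=False`, which skips writing the row index entirely, so no post-hoc `cut` is needed. A minimal sketch of the same grouping step, using a small frame built from the toy values in the question:

```python
import pandas as pd

# A few rows copied from the toy df in the question
df = pd.DataFrame({
    "region": [12323, 12323, 12323, 12323],
    "chr":    [102, 102, 103, 104],
    "midPos": [37500, 62500, 262500, 362500],
    "Nsites": [20012, 17353, 8897, 19102],
    "fst":    [0.387763, 0.359275, 0.629894, 0.361309],
})

# groupby yields (name, sub-frame) pairs directly, one per unique chr ID.
# index=False drops the row-number column from the output files.
for name, group in df.groupby("chr"):
    group.to_csv(f"fst_{name}_raw.txt", sep="\t", index=False)
```

This writes fst_102_raw.txt, fst_103_raw.txt, and fst_104_raw.txt, each with the original five columns and no index column.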

Probably some janky stuff but it got it done...
