I have a large data frame read from a single .txt file, with >90,000 rows spanning ~85 unique chr IDs:
region | chr | midPos | Nsites | fst |
---|---|---|---|---|
12323 | 102 | 37500 | 20012 | 0.387763 |
12323 | 102 | 62500 | 17353 | 0.359275 |
12323 | 102 | 87500 | 20059 | 0.360935 |
12323 | 102 | 112500 | 20132 | 0.447558 |
12323 | 102 | 137500 | 17996 | 0.476738 |
12323 | 102 | 162500 | 20275 | 0.363752 |
12323 | 102 | 187500 | 13649 | 0.53061 |
12323 | 103 | 262500 | 8897 | 0.629894 |
12323 | 103 | 287500 | 18855 | 0.360819 |
12323 | 103 | 312500 | 19559 | 0.486737 |
12323 | 103 | 337500 | 19973 | 0.501365 |
12323 | 104 | 362500 | 19102 | 0.361309 |
12323 | 104 | 387500 | 18688 | 0.331698 |
12323 | 104 | 412500 | 18620 | 0.548834 |
What I would like to do is group by chr ID so that each unique chr ID gets its own .txt file, containing only the rows that belong to that ID and named after it. For example, with my toy df above, I would like to split it into 3 files named 102.txt, 103.txt, and 104.txt, each containing only the matching rows. So file 102.txt would look like:
region | chr | midPos | Nsites | fst |
---|---|---|---|---|
12323 | 102 | 37500 | 20012 | 0.387763 |
12323 | 102 | 62500 | 17353 | 0.359275 |
12323 | 102 | 87500 | 20059 | 0.360935 |
12323 | 102 | 112500 | 20132 | 0.447558 |
12323 | 102 | 137500 | 17996 | 0.476738 |
12323 | 102 | 162500 | 20275 | 0.363752 |
12323 | 102 | 187500 | 13649 | 0.53061 |
I know that in R I can split the df by chr ID with:
df<-read.table('master_dataframe.txt', header=TRUE)
data_list <- split(df, f = df$chr)
head(df)
but how do I print each one out with a unique ID? Thank you for any suggestions/ help!
CodePudding user response:
You could loop over the unique chr IDs, subset the data frame for each one, and write that subset out to its own file:
for (i in unique(df$chr)) {
  write.table(subset(df, chr == i),
              file = paste0(i, ".txt"),
              row.names = FALSE, quote = FALSE)
}
CodePudding user response:
OK, it turns out this is actually pretty straightforward in python. Here is what I ended up doing:
import pandas as pd

df = pd.read_csv('inputDataframe.txt', sep=" ", header=None,
                 names=["region", "chr", "midPos", "Nsites", "fst"])
grouped = df.groupby('chr')
for name in df['chr'].unique():
    temporary_df = grouped.get_group(name)  # all rows for this chr ID
    print(temporary_df)
    temporary_df.to_csv(f'fst_{name}_raw.txt', sep='\t')
So I read the .txt file (structured as above) in with a python script, grouped it by chr ID, and the for loop then grabs each unique chr ID and writes its rows to a new .txt file. The one catch was that each new chr .txt file gained an extra first column holding the row numbers from the main data frame. I did not figure out how to prevent this in python, so I used some unix scripting to fix it:
for i in fst_*.txt ;
do
name1=${i%.*};
name2=${name1%_*};
cut -f2,3,4,5,6 "$name1".txt > "$name2".txt
done
Probably some janky stuff but it got it done...
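As a postscript: that extra row-number column can be avoided in pandas itself by passing `index=False` to `to_csv`, which makes the unix clean-up step unnecessary. A minimal self-contained sketch (the column names and `fst_<chr>_raw.txt` naming follow the answer above; the toy rows here are made up from the question's example):

```python
import pandas as pd

# Toy data frame with the same columns as the master file
df = pd.DataFrame({
    "region": [12323, 12323, 12323],
    "chr":    [102, 102, 103],
    "midPos": [37500, 62500, 262500],
    "Nsites": [20012, 17353, 8897],
    "fst":    [0.387763, 0.359275, 0.629894],
})

# groupby yields (chr ID, sub-frame) pairs, so no separate
# unique()/get_group() step is needed; index=False drops the
# row-number column from the output files
for name, group in df.groupby("chr"):
    group.to_csv(f"fst_{name}_raw.txt", sep="\t", index=False)
```

Iterating over the groupby object directly also avoids looking each group up by key, though both approaches produce the same files.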