I have a large data frame read from a single .txt file, with >90,000 rows spanning ~85 unique chr IDs:
region | chr | midPos | Nsites | fst |
---|---|---|---|---|
12323 | 102 | 37500 | 20012 | 0.387763 |
12323 | 102 | 62500 | 17353 | 0.359275 |
12323 | 102 | 87500 | 20059 | 0.360935 |
12323 | 102 | 112500 | 20132 | 0.447558 |
12323 | 102 | 137500 | 17996 | 0.476738 |
12323 | 102 | 162500 | 20275 | 0.363752 |
12323 | 102 | 187500 | 13649 | 0.53061 |
12323 | 103 | 262500 | 8897 | 0.629894 |
12323 | 103 | 287500 | 18855 | 0.360819 |
12323 | 103 | 312500 | 19559 | 0.486737 |
12323 | 103 | 337500 | 19973 | 0.501365 |
12323 | 104 | 362500 | 19102 | 0.361309 |
12323 | 104 | 387500 | 18688 | 0.331698 |
12323 | 104 | 412500 | 18620 | 0.548834 |
What I would like to do is group by chr ID so that each unique chr ID gets its own .txt file, containing only the rows that belong to that ID and named after it. For example, with my toy df above, I would like to split it into 3 files named 102.txt, 103.txt, and 104.txt, each containing only the matching rows. So file 102.txt would look like:
region | chr | midPos | Nsites | fst |
---|---|---|---|---|
12323 | 102 | 37500 | 20012 | 0.387763 |
12323 | 102 | 62500 | 17353 | 0.359275 |
12323 | 102 | 87500 | 20059 | 0.360935 |
12323 | 102 | 112500 | 20132 | 0.447558 |
12323 | 102 | 137500 | 17996 | 0.476738 |
12323 | 102 | 162500 | 20275 | 0.363752 |
12323 | 102 | 187500 | 13649 | 0.53061 |
I know that in R I can split the df by chr ID with:
df<-read.table('master_dataframe.txt', header=TRUE)
data_list <- split(df, f = df$chr)
head(df)
but how do I print each one out with a unique ID? Thank you for any suggestions/ help!
CodePudding user response:
You could loop over the unique chr IDs, subset the data frame for each one, and write that subset out to its own file:
for (i in unique(df$chr)) {
  write.table(subset(df, chr == i),
              file = paste0(i, ".txt"),
              row.names = FALSE, quote = FALSE)
}
CodePudding user response:
OK, it turns out this is actually pretty straightforward in python. Here is what I ended up doing:
import pandas as pd

df = pd.read_csv('inputDataframe.txt', sep=" ", header=None,
                 names=["region", "chr", "midPos", "Nsites", "fst"])
grouped = df.groupby('chr')
for name in df['chr'].unique():
    temporary_df = grouped.get_group(name)  # all rows for this chr ID
    print(temporary_df)
    temporary_df.to_csv(f'fst_{name}_raw.txt', sep='\t')
So I read the .txt file (structured as above) in with a python script, grouped it by chr ID, and the for loop then grabs each unique chr ID and writes its rows to a new .txt file. The one catch was that each new chr .txt file gained an extra first column holding the row numbers from the main data frame. I did not figure out how to prevent this in python, so I used some unix scripting to fix it:
for i in fst_*.txt ;
do
name1=${i%.*};
name2=${name1%_*};
cut -f2,3,4,5,6 "$name1".txt > "$name2".txt
done
Probably some janky stuff but it got it done...
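As a postscript: that extra row-number column can be avoided in pandas itself by passing `index=False` to `to_csv`, which makes the unix clean-up step unnecessary. A minimal self-contained sketch (the column names and `fst_<chr>_raw.txt` naming follow the answer above; the toy rows here are made up from the question's example):

```python
import pandas as pd

# Toy data frame with the same columns as the master file
df = pd.DataFrame({
    "region": [12323, 12323, 12323],
    "chr":    [102, 102, 103],
    "midPos": [37500, 62500, 262500],
    "Nsites": [20012, 17353, 8897],
    "fst":    [0.387763, 0.359275, 0.629894],
})

# groupby yields (chr ID, sub-frame) pairs, so no separate
# unique()/get_group() step is needed; index=False drops the
# row-number column from the output files
for name, group in df.groupby("chr"):
    group.to_csv(f"fst_{name}_raw.txt", sep="\t", index=False)
```

Iterating over the groupby object directly also avoids looking each group up by key, though both approaches produce the same files.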