I have a .csv file with 893 columns that I need to read into pandas (or R) to analyze. When the spreadsheet is generated it creates duplicate columns that need to be combined into one.
The problem I am having is that when I read the .csv into pandas or R to create a dataframe, a numeric suffix is automatically appended to each additional duplicate column name, which means they can't be grouped easily.
The original data is formatted like this:
****** PYTHON ******
!!!EXAMPLE!!!
import pandas as pd
d = {'Name':["Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"],
"Dates":["2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}
d = pd.DataFrame(d)
d
****** R ******
df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
FYI - this is just an example. The real data will be read in from a .csv file using df = pd.read_csv("dummy.csv")
or equivalent.
Is there any method to either:
- Read in a .csv and maintain the duplicate column names so they can be grouped and analyzed
or
- Group the duplicate column names at the point of reading the .csv in so they can be analyzed as one column
N.B.: Interestingly, I noticed when making the example that it won't even let me create a dataframe with columns of the same name.
CodePudding user response:
The problem is that you are building a dict, and dict keys must be unique: the duplicate "Event" entries overwrite each other, so only the last one survives. The resulting dict (with a single "Event" key) is then handed to pandas as-is and used to create the DataFrame.
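As a minimal illustration (this snippet is not part of the original example), a dict literal with repeated keys silently keeps only the last value:
d = {"Event": [1, 1, 1], "Event": [2, 2, 2]}  # the second "Event" overwrites the first
print(d)  # {'Event': [2, 2, 2]}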
You could instead add the extra columns with a method that explicitly allows duplicates, for example DataFrame.insert with allow_duplicates=True:
import pandas as pd
d = {'Name': ["Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim",
"Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue"],
"Dates": ["2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
"2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
"2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28"],
"Event": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
d = pd.DataFrame(d)
d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
Which gives you:
> Name Dates Event Event Event
0 Jim 2010-1-1 1 1 1
1 Jim 2010-1-2 1 1 1
2 Jim 2010-01-5 1 1 1
3 Jim 2010-01-17 1 1 1
4 Jim 2010-01-20 1 1 1
5 Jim 2010-01-29 1 1 1
6 Jim 2010-02-6 1 1 1
7 Jim 2010-02-9 1 1 1
8 Jim 2010-02-16 1 1 1
9 Jim 2010-02-28 1 1 1
10 Sue 2010-1-1 1 1 1
11 Sue 2010-1-2 1 1 1
12 Sue 2010-01-5 1 1 1
13 Sue 2010-01-17 1 1 1
14 Sue 2010-01-20 1 1 1
15 Sue 2010-01-29 1 1 1
16 Sue 2010-02-6 1 1 1
17 Sue 2010-02-9 1 1 1
18 Sue 2010-02-16 1 1 1
19 Sue 2010-02-28 1 1 1
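If the end goal is to treat all the duplicate "Event" columns as one, a possible next step (just a sketch, assuming the duplicates should be summed; swap in whatever aggregation fits your analysis) is:
# grab every column literally named "Event" and collapse them into a single column
event_total = d.loc[:, d.columns == "Event"].sum(axis=1)
d = d.loc[:, d.columns != "Event"]  # drop the duplicate columns
d["Event"] = event_total            # re-attach the combined column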
Edit:
If the columns already exist but just carry an unwanted numbering (.1, .2, ...), you can strip it with the re module:
import pandas as pd
import re
df = pd.read_csv("example.csv", sep=";")
> a b b.1 b.2
0 a 1 5 7
1 b 2 5 7
2 c 3 6 7
3 d 4 7 7
4 e 5 8 7
df.columns = [re.sub(r"(.*?)(\.\d+)", r"\1", c) for c in df.columns]  # strip the ".1", ".2", ... suffixes
> a b b b
0 a 1 5 7
1 b 2 5 7
2 c 3 6 7
3 d 4 7 7
4 e 5 8 7
CodePudding user response:
I would like to suggest this solution:
import pandas as pd
from io import StringIO
data = StringIO("""Name;Dates;Event;Event;Event
Jim;2010-1-1;1;1;1
Jim;2010-1-2;1;1;1
Jim;2010-01-5;1;1;1
Jim;2010-01-17;1;1;1
Jim;2010-01-20;1;1;1
Jim;2010-01-29;1;1;1
Jim;2010-02-6;1;1;1
Jim;2010-02-9;1;1;1
Jim;2010-02-16;1;1;1
Jim;2010-02-28;1;1;1
Sue;2010-1-1;1;1;1
Sue;2010-1-2;1;1;1
Sue;2010-01-5;1;1;1
Sue;2010-01-17;1;1;1
Sue;2010-01-20;1;1;1
Sue;2010-01-29;1;1;1
Sue;2010-02-6;1;1;1
Sue;2010-02-9;1;1;1
Sue;2010-02-16;1;1;1
Sue;2010-02-28;1;1;1
""")
df = pd.read_csv(data, sep=';')
df.rename(columns={col:col.split('.')[0] for col in df.columns}, inplace=True)
print(df)
Output:
Name Dates Event Event Event
0 Jim 2010-1-1 1 1 1
1 Jim 2010-1-2 1 1 1
2 Jim 2010-01-5 1 1 1
3 Jim 2010-01-17 1 1 1
4 Jim 2010-01-20 1 1 1
5 Jim 2010-01-29 1 1 1
6 Jim 2010-02-6 1 1 1
7 Jim 2010-02-9 1 1 1
8 Jim 2010-02-16 1 1 1
9 Jim 2010-02-28 1 1 1
10 Sue 2010-1-1 1 1 1
11 Sue 2010-1-2 1 1 1
12 Sue 2010-01-5 1 1 1
13 Sue 2010-01-17 1 1 1
14 Sue 2010-01-20 1 1 1
15 Sue 2010-01-29 1 1 1
16 Sue 2010-02-6 1 1 1
17 Sue 2010-02-9 1 1 1
18 Sue 2010-02-16 1 1 1
19 Sue 2010-02-28 1 1 1
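If the aim is to analyze all the Event columns as a single one, a possible follow-up (a sketch only, assuming a long format suits the analysis) is to stack the duplicates into rows:
# collect the duplicate "Event" columns and reshape them into long format
event_cols = df.loc[:, df.columns == "Event"]
long_df = pd.concat(
    [df[["Name", "Dates"]].assign(Event=event_cols.iloc[:, i])
     for i in range(event_cols.shape[1])],
    ignore_index=True,
)
print(long_df.head())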