I have a column in a dataframe whose type is object. This column contains both arrays and strings and its unique values are these:
array(['[nan]', "['3. Medical and Health Sciences']",
"['1. Natural Sciences']", "['2. Engineering and Technology']",
"['1. Natural Sciences' '3. Medical and Health Sciences']",
"['3. Medical and Health Sciences' '1. Natural Sciences']",
"['5. Social Sciences']",
"['1. Natural Sciences' '4. Agricultural Sciences'\n '3. Medical and Health Sciences']",
"['1. Natural Sciences' '6. Humanities']",
"['4. Agricultural Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology']",
"['2. Engineering and Technology' '1. Natural Sciences']",
"['6. Humanities']",
"['3. Medical and Health Sciences' '6. Humanities']",
"['4. Agricultural Sciences' '3. Medical and Health Sciences']",
"['3. Medical and Health Sciences' '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '4. Agricultural Sciences']",
"['2. Engineering and Technology' '3. Medical and Health Sciences']",
"['1. Natural Sciences' '4. Agricultural Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences']",
"['4. Agricultural Sciences' '2. Engineering and Technology']",
"['6. Humanities' '1. Natural Sciences']",
"['6. Humanities' '2. Engineering and Technology']",
"['2. Engineering and Technology' '5. Social Sciences']",
"['1. Natural Sciences' '5. Social Sciences']",
"['3. Medical and Health Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
"['2. Engineering and Technology' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
"['4. Agricultural Sciences' '1. Natural Sciences']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['5. Social Sciences' '1. Natural Sciences']",
"['2. Engineering and Technology' '6. Humanities']",
"['2. Engineering and Technology' '4. Agricultural Sciences']",
"['5. Social Sciences' '4. Agricultural Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences'\n '1. Natural Sciences']",
"['6. Humanities' '3. Medical and Health Sciences']",
"['6. Humanities' '5. Social Sciences']",
"['5. Social Sciences' '3. Medical and Health Sciences']",
"['5. Social Sciences' '2. Engineering and Technology']",
"['5. Social Sciences' '6. Humanities']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
"['2. Engineering and Technology' '4. Agricultural Sciences'\n '1. Natural Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology' '6. Humanities']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '5. Social Sciences']",
"['4. Agricultural Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '3. Medical and Health Sciences']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
"['4. Agricultural Sciences' '5. Social Sciences']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '5. Social Sciences']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '4. Agricultural Sciences']",
"['2. Engineering and Technology' '5. Social Sciences'\n '1. Natural Sciences']",
"['1. Natural Sciences' '6. Humanities' '3. Medical and Health Sciences']",
"['4. Agricultural Sciences' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences' '6. Humanities'\n '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '4. Agricultural Sciences']",
"['6. Humanities' '4. Agricultural Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '5. Social Sciences' '2. Engineering and Technology']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '5. Social Sciences']",
"['6. Humanities' '2. Engineering and Technology' '1. Natural Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '4. Agricultural Sciences']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '5. Social Sciences']",
"['3. Medical and Health Sciences' '6. Humanities'\n '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '1. Natural Sciences' '6. Humanities']",
"['2. Engineering and Technology' '6. Humanities' '1. Natural Sciences']",
"['3. Medical and Health Sciences' '4. Agricultural Sciences'\n '1. Natural Sciences']",
"['1. Natural Sciences' '5. Social Sciences'\n '2. Engineering and Technology']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology' '4. Agricultural Sciences']",
"['1. Natural Sciences' '3. Medical and Health Sciences' '6. Humanities']",
"['1. Natural Sciences' '5. Social Sciences'\n '3. Medical and Health Sciences']",
"['6. Humanities' '3. Medical and Health Sciences' '1. Natural Sciences']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '4. Agricultural Sciences']",
"['1. Natural Sciences' '6. Humanities' '2. Engineering and Technology']",
"['4. Agricultural Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
"['5. Social Sciences' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
"['6. Humanities' '1. Natural Sciences' '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '6. Humanities' '1. Natural Sciences']",
"['5. Social Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['2. Engineering and Technology' '4. Agricultural Sciences'\n '6. Humanities']",
"['1. Natural Sciences' '5. Social Sciences'\n '2. Engineering and Technology' '3. Medical and Health Sciences']",
"['6. Humanities' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['6. Humanities' '5. Social Sciences' '1. Natural Sciences']",
"['5. Social Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
"['6. Humanities' '1. Natural Sciences' '3. Medical and Health Sciences']",
"['5. Social Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
"['6. Humanities' '1. Natural Sciences' '4. Agricultural Sciences']",
"['1. Natural Sciences' '4. Agricultural Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences' '2. Engineering and Technology']",
"['6. Humanities' '3. Medical and Health Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '5. Social Sciences' '1. Natural Sciences']",
"['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']",
"['4. Agricultural Sciences' '2. Engineering and Technology'\n '6. Humanities']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '5. Social Sciences' '2. Engineering and Technology']",
"['2. Engineering and Technology' '1. Natural Sciences' '6. Humanities']",
"['3. Medical and Health Sciences' '6. Humanities'\n '2. Engineering and Technology' '5. Social Sciences'\n '1. Natural Sciences']",
"['4. Agricultural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology' '1. Natural Sciences']",
"['3. Medical and Health Sciences' '2. Engineering and Technology'\n '4. Agricultural Sciences' '1. Natural Sciences']",
"['2. Engineering and Technology' '5. Social Sciences'\n '3. Medical and Health Sciences']",
"['3. Medical and Health Sciences' '4. Agricultural Sciences'\n '2. Engineering and Technology']",
"['6. Humanities' '2. Engineering and Technology'\n '3. Medical and Health Sciences']",
"['4. Agricultural Sciences' '6. Humanities']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '4. Agricultural Sciences' '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '2. Engineering and Technology'\n '6. Humanities']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '3. Medical and Health Sciences' '4. Agricultural Sciences']",
"['1. Natural Sciences' '6. Humanities' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '1. Natural Sciences' '5. Social Sciences']",
"['2. Engineering and Technology' '5. Social Sciences' '6. Humanities']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '3. Medical and Health Sciences' '6. Humanities']",
"['1. Natural Sciences' '4. Agricultural Sciences' '5. Social Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences'\n '2. Engineering and Technology']",
"['5. Social Sciences' '1. Natural Sciences' '4. Agricultural Sciences']",
"['2. Engineering and Technology' '3. Medical and Health Sciences'\n '1. Natural Sciences' '5. Social Sciences']",
"['2. Engineering and Technology' '6. Humanities'\n '3. Medical and Health Sciences' '1. Natural Sciences']",
"['5. Social Sciences' '6. Humanities' '3. Medical and Health Sciences']",
"['2. Engineering and Technology' '6. Humanities'\n '3. Medical and Health Sciences']",
"['5. Social Sciences' '6. Humanities' '1. Natural Sciences']"],
dtype=object)
Maybe a MWE to reproduce the data is this:
# initialize list of lists
data = [[1, [7,3], [1,1], "['5. Social Sciences' '6. Humanities' '1. Natural Sciences']"], [2, [1,5], [2,1], "['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']"], [3, [1,2,6], [2,0,2], '[nan]'],[5, [1,2], [2,0], "['1. Natural Sciences']"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_patents','dist_cited_patents','Fields'])
Now, what I would like to obtain is actually two variables: one is a list of strings (new_var1 ) and the other one is a simple string (new_var2 in the next result db) like this:
docdb. cited_patents dist_cited_patents new_var1
1 [7, 3] [1, 1] ['Social Sciences','Humanities','Natural Sciences']
2 [1, 5] [2, 1] ['Engineering and Technology','Social Sciences', 'Agricultural Sciences']
3 [1, 2, 6] [2, 0, 2] []
5 [1, 2] [2, 0] ['Natural Sciences']
For what concerns var2
docdb. cited_patents dist_cited_patents new_var2
1 [7, 3] [1, 1] 'Social Sciences, Humanities, Natural Sciences'
2 [1, 5] [2, 1] 'Engineering and Technology,Social Sciences,Agricultural Sciences'
3 [1, 2, 6] [2, 0, 2] ''
5 [1, 2] [2, 0] 'Natural Sciences'
Thanks a lot
CodePudding user response:
I would use a regex pattern to find all the matching occurrences:
df['new_var1'] = df['Fields'].str.findall(r"'\d \.\s*(.*?)'")
df['new_var2'] = df['new_var1'].str.join(', ')
docdb cited_patents dist_cited_patents Fields new_var1 new_var2
0 1 [7, 3] [1, 1] ['5. Social Sciences' '6. Humanities' '1. Natural Sciences'] [Social Sciences, Humanities, Natural Sciences] Social Sciences, Humanities, Natural Sciences
1 2 [1, 5] [2, 1] ['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences'] [Engineering and Technology, Social Sciences, Agricultural Sciences] Engineering and Technology, Social Sciences, Agricultural Sciences
2 3 [1, 2, 6] [2, 0, 2] [nan] []
3 5 [1, 2] [2, 0] ['1. Natural Sciences'] [Natural Sciences] Natural Sciences