Home > Blockchain >  Re-arrange string in pandas
Re-arrange string in pandas

Time:10-21

I have a column in a dataframe whose type is object. This column contains both arrays and strings and its unique values are these:

array(['[nan]', "['3. Medical and Health Sciences']",
       "['1. Natural Sciences']", "['2. Engineering and Technology']",
       "['1. Natural Sciences' '3. Medical and Health Sciences']",
       "['3. Medical and Health Sciences' '1. Natural Sciences']",
       "['5. Social Sciences']",
       "['1. Natural Sciences' '4. Agricultural Sciences'\n '3. Medical and Health Sciences']",
       "['1. Natural Sciences' '6. Humanities']",
       "['4. Agricultural Sciences']",
       "['1. Natural Sciences' '2. Engineering and Technology']",
       "['2. Engineering and Technology' '1. Natural Sciences']",
       "['6. Humanities']",
       "['3. Medical and Health Sciences' '6. Humanities']",
       "['4. Agricultural Sciences' '3. Medical and Health Sciences']",
       "['3. Medical and Health Sciences' '2. Engineering and Technology']",
       "['3. Medical and Health Sciences' '4. Agricultural Sciences']",
       "['2. Engineering and Technology' '3. Medical and Health Sciences']",
       "['1. Natural Sciences' '4. Agricultural Sciences']",
       "['3. Medical and Health Sciences' '5. Social Sciences']",
       "['4. Agricultural Sciences' '2. Engineering and Technology']",
       "['6. Humanities' '1. Natural Sciences']",
       "['6. Humanities' '2. Engineering and Technology']",
       "['2. Engineering and Technology' '5. Social Sciences']",
       "['1. Natural Sciences' '5. Social Sciences']",
       "['3. Medical and Health Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
       "['2. Engineering and Technology' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
       "['4. Agricultural Sciences' '1. Natural Sciences']",
       "['3. Medical and Health Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
       "['5. Social Sciences' '1. Natural Sciences']",
       "['2. Engineering and Technology' '6. Humanities']",
       "['2. Engineering and Technology' '4. Agricultural Sciences']",
       "['5. Social Sciences' '4. Agricultural Sciences']",
       "['3. Medical and Health Sciences' '5. Social Sciences'\n '1. Natural Sciences']",
       "['6. Humanities' '3. Medical and Health Sciences']",
       "['6. Humanities' '5. Social Sciences']",
       "['5. Social Sciences' '3. Medical and Health Sciences']",
       "['5. Social Sciences' '2. Engineering and Technology']",
       "['5. Social Sciences' '6. Humanities']",
       "['1. Natural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
       "['2. Engineering and Technology' '4. Agricultural Sciences'\n '1. Natural Sciences']",
       "['1. Natural Sciences' '2. Engineering and Technology' '6. Humanities']",
       "['3. Medical and Health Sciences' '1. Natural Sciences'\n '5. Social Sciences']",
       "['4. Agricultural Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
       "['1. Natural Sciences' '2. Engineering and Technology'\n '3. Medical and Health Sciences']",
       "['2. Engineering and Technology' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
       "['4. Agricultural Sciences' '5. Social Sciences']",
       "['2. Engineering and Technology' '1. Natural Sciences'\n '5. Social Sciences']",
       "['2. Engineering and Technology' '1. Natural Sciences'\n '4. Agricultural Sciences']",
       "['2. Engineering and Technology' '5. Social Sciences'\n '1. Natural Sciences']",
       "['1. Natural Sciences' '6. Humanities' '3. Medical and Health Sciences']",
       "['4. Agricultural Sciences' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
       "['3. Medical and Health Sciences' '5. Social Sciences' '6. Humanities'\n '2. Engineering and Technology']",
       "['3. Medical and Health Sciences' '1. Natural Sciences'\n '4. Agricultural Sciences']",
       "['6. Humanities' '4. Agricultural Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
       "['1. Natural Sciences' '3. Medical and Health Sciences'\n '5. Social Sciences' '2. Engineering and Technology']",
       "['1. Natural Sciences' '2. Engineering and Technology'\n '5. Social Sciences']",
       "['6. Humanities' '2. Engineering and Technology' '1. Natural Sciences']",
       "['1. Natural Sciences' '2. Engineering and Technology'\n '4. Agricultural Sciences']",
       "['1. Natural Sciences' '3. Medical and Health Sciences'\n '5. Social Sciences']",
       "['3. Medical and Health Sciences' '6. Humanities'\n '2. Engineering and Technology']",
       "['3. Medical and Health Sciences' '1. Natural Sciences' '6. Humanities']",
       "['2. Engineering and Technology' '6. Humanities' '1. Natural Sciences']",
       "['3. Medical and Health Sciences' '4. Agricultural Sciences'\n '1. Natural Sciences']",
       "['1. Natural Sciences' '5. Social Sciences'\n '2. Engineering and Technology']",
       "['1. Natural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology' '4. Agricultural Sciences']",
       "['1. Natural Sciences' '3. Medical and Health Sciences' '6. Humanities']",
       "['1. Natural Sciences' '5. Social Sciences'\n '3. Medical and Health Sciences']",
       "['6. Humanities' '3. Medical and Health Sciences' '1. Natural Sciences']",
       "['1. Natural Sciences' '3. Medical and Health Sciences'\n '4. Agricultural Sciences']",
       "['1. Natural Sciences' '6. Humanities' '2. Engineering and Technology']",
       "['4. Agricultural Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
       "['5. Social Sciences' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
       "['6. Humanities' '1. Natural Sciences' '2. Engineering and Technology']",
       "['3. Medical and Health Sciences' '6. Humanities' '1. Natural Sciences']",
       "['5. Social Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
       "['2. Engineering and Technology' '4. Agricultural Sciences'\n '6. Humanities']",
       "['1. Natural Sciences' '5. Social Sciences'\n '2. Engineering and Technology' '3. Medical and Health Sciences']",
       "['6. Humanities' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
       "['4. Agricultural Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
       "['6. Humanities' '5. Social Sciences' '1. Natural Sciences']",
       "['5. Social Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
       "['6. Humanities' '1. Natural Sciences' '3. Medical and Health Sciences']",
       "['5. Social Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
       "['6. Humanities' '1. Natural Sciences' '4. Agricultural Sciences']",
       "['1. Natural Sciences' '4. Agricultural Sciences'\n '2. Engineering and Technology']",
       "['4. Agricultural Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences' '2. Engineering and Technology']",
       "['6. Humanities' '3. Medical and Health Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
       "['4. Agricultural Sciences' '5. Social Sciences' '1. Natural Sciences']",
       "['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']",
       "['4. Agricultural Sciences' '2. Engineering and Technology'\n '6. Humanities']",
       "['3. Medical and Health Sciences' '1. Natural Sciences'\n '5. Social Sciences' '2. Engineering and Technology']",
       "['2. Engineering and Technology' '1. Natural Sciences' '6. Humanities']",
       "['3. Medical and Health Sciences' '6. Humanities'\n '2. Engineering and Technology' '5. Social Sciences'\n '1. Natural Sciences']",
       "['4. Agricultural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology' '1. Natural Sciences']",
       "['3. Medical and Health Sciences' '2. Engineering and Technology'\n '4. Agricultural Sciences' '1. Natural Sciences']",
       "['2. Engineering and Technology' '5. Social Sciences'\n '3. Medical and Health Sciences']",
       "['3. Medical and Health Sciences' '4. Agricultural Sciences'\n '2. Engineering and Technology']",
       "['6. Humanities' '2. Engineering and Technology'\n '3. Medical and Health Sciences']",
       "['4. Agricultural Sciences' '6. Humanities']",
       "['1. Natural Sciences' '3. Medical and Health Sciences'\n '4. Agricultural Sciences' '2. Engineering and Technology']",
       "['3. Medical and Health Sciences' '2. Engineering and Technology'\n '6. Humanities']",
       "['2. Engineering and Technology' '1. Natural Sciences'\n '3. Medical and Health Sciences' '4. Agricultural Sciences']",
       "['1. Natural Sciences' '6. Humanities' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
       "['4. Agricultural Sciences' '1. Natural Sciences' '5. Social Sciences']",
       "['2. Engineering and Technology' '5. Social Sciences' '6. Humanities']",
       "['1. Natural Sciences' '2. Engineering and Technology'\n '3. Medical and Health Sciences' '6. Humanities']",
       "['1. Natural Sciences' '4. Agricultural Sciences' '5. Social Sciences']",
       "['3. Medical and Health Sciences' '5. Social Sciences'\n '2. Engineering and Technology']",
       "['5. Social Sciences' '1. Natural Sciences' '4. Agricultural Sciences']",
       "['2. Engineering and Technology' '3. Medical and Health Sciences'\n '1. Natural Sciences' '5. Social Sciences']",
       "['2. Engineering and Technology' '6. Humanities'\n '3. Medical and Health Sciences' '1. Natural Sciences']",
       "['5. Social Sciences' '6. Humanities' '3. Medical and Health Sciences']",
       "['2. Engineering and Technology' '6. Humanities'\n '3. Medical and Health Sciences']",
       "['5. Social Sciences' '6. Humanities' '1. Natural Sciences']"],
      dtype=object)

Maybe a MWE to reproduce the data is this:

# initialize list of lists
data = [[1, [7,3], [1,1], "['5. Social Sciences' '6. Humanities' '1. Natural Sciences']"], [2, [1,5], [2,1], "['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']"], [3, [1,2,6], [2,0,2], '[nan]'],[5, [1,2], [2,0], "['1. Natural Sciences']"]]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_patents','dist_cited_patents','Fields'])

Now, what I would like to obtain is actually two variables: one is a list of strings (new_var1 ) and the other one is a simple string (new_var2 in the next result db) like this:

docdb. cited_patents dist_cited_patents new_var1
1           [7, 3]  [1, 1]              ['Social Sciences','Humanities','Natural Sciences']
2           [1, 5]  [2, 1]              ['Engineering and Technology','Social Sciences', 'Agricultural Sciences']
3         [1, 2, 6] [2, 0, 2]           []
5          [1, 2]   [2, 0]              ['Natural Sciences']

For what concerns var2

docdb. cited_patents dist_cited_patents new_var2
1           [7, 3]  [1, 1]              'Social Sciences, Humanities, Natural Sciences'
2           [1, 5]  [2, 1]              'Engineering and Technology,Social Sciences,Agricultural Sciences'
3         [1, 2, 6] [2, 0, 2]           ''
5          [1, 2]   [2, 0]              'Natural Sciences'

Thanks a lot

CodePudding user response:

I would use a regex pattern to find all the matching occurrences:

df['new_var1'] = df['Fields'].str.findall(r"'\d \.\s*(.*?)'")
df['new_var2'] = df['new_var1'].str.join(', ')

   docdb cited_patents dist_cited_patents                                                                               Fields                                                              new_var1                                                            new_var2
0      1        [7, 3]             [1, 1]                         ['5. Social Sciences' '6. Humanities' '1. Natural Sciences']                       [Social Sciences, Humanities, Natural Sciences]                       Social Sciences, Humanities, Natural Sciences
1      2        [1, 5]             [2, 1]  ['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']  [Engineering and Technology, Social Sciences, Agricultural Sciences]  Engineering and Technology, Social Sciences, Agricultural Sciences
2      3     [1, 2, 6]          [2, 0, 2]                                                                                [nan]                                                                    []                                                                    
3      5        [1, 2]             [2, 0]                                                              ['1. Natural Sciences']                                                    [Natural Sciences]                                                    Natural Sciences
  • Related