I have been given some data in a text format that I would like to convert into a dataframe:

text <- "
VALUE Ethnic
1 = 'White - British'
2 = 'White - Irish'
9 = 'White - Other'
;
"

I'm looking to convert into a dataframe with a column for the first number and a column for the test in the string. So - in this case, it would be two columns and three rows.

CodePudding user response：

library(tidyr)
library(dplyr)
tibble(text = trimws(text)) %>% 
  separate_rows(text, sep = "\n") %>%
  filter(text != ";") %>% 
  slice(-1) %>% 
  separate(text, into = c("VALUE", "Ethnic"), sep = "\\s =\\s ")

-output

# A tibble: 3 × 2
  VALUE Ethnic           
  <chr> <chr>            
1 1     'White - British'
2 2     'White - Irish'  
3 9     'White - Other'

Or in base R

read.table(text = gsub("=", " ", trimws(text, 
  whitespace = "\n(;\n)*"), fixed = TRUE), header = TRUE)
  VALUE          Ethnic
1     1 White - British
2     2   White - Irish
3     9   White - Other

CodePudding user response：

create the years list

years_list = list(range(1986,2020))

defines the columns separation specified in the layout

columns_width = [(0,2),(2,10),(10,12),(12,24),(24,27),(27,39),(39,49),(49,52),(52,56),(56,69),(69,82), (82,95),(95,108),(108,121),(121,134),(134,147),(147,152),(152,170),(170,188),(188,201), (201,202),(202,210),(210,217),(217,230),(230,242),(242,245)]

defines the english transleted columns according to the layout

columns_header = ['Register Type','Trading Date','BDI Code','Negociation Code','Market Type','Trade Name', 'Specification','Forward Market Term In Days','Currency','Opening Price','Max. Price', 'Min. Price','Mean Price','Last Trade Price','Best Purshase Order Price', 'Best Purshase Sale Price','Numbor Of Trades','Number Of Traded Stocks', 'Volume Of Traded Stocks','Price For Options Market Or Secondary Term Market', 'Price Corrections For Options Market Or Secondary Term Market', 'Due Date For Options Market Or Secondary Term Market','Factor Of Paper Quotatuion', 'Points In Price For Options Market Referenced In Dollar Or Secondary Term', 'ISIN Or Intern Code ','Distribution Number']

create a empty df that will be filled during the iteration below

years_concat = pd.DataFrame()

iterate all years

for year in years_list:

time_serie = pd.read_fwf('/kaggle/input/bmfbovespas-time-series-19862019/COTAHIST_A' str(year) '.txt', 
                     header=None, colspecs=columns_width)

# delete the first and the last lines containing identifiers
# use two comented lines below to see them
# output = pd.DataFrame(np.array([time_serie.iloc[0],time_serie.iloc[-1]]))
# output
time_serie = time_serie.drop(time_serie.index[0])
time_serie = time_serie.drop(time_serie.index[-1])
years_concat = pd.concat([years_concat,time_serie],ignore_index=True)

years_concat.columns = columns_header