I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1
, thing2
, and thing3
are in each line using cellfun
to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3
is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3
is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX'
and then tries to match :
followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp
as regexp(___,'tokens','once')
to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false
. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID
has a length of 1 so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will change every empty cell array in out
with the string Unknown
. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out)
as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
- Is there a way to accomplish what I'm trying to do without using
'UniformOutput',false
? For example, is there a way to haveregexp
pass a custom string if there is no match (e.g. pass'Unknown'
if there is no match)? I can think of one 'cheat', which would be to use the|
operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to'Unknown'
. - If I take the
'UniformOutput',false
approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it throughunique
)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays. - If there is some overall different approach that I'm not thinking of, I'm also open to it.
- Tangential to the main question, I also tried using a single expression to run
regexp
using 3 tokens to pull out the values ofthing1
,thing2
, andthing3
in one pass. This seems to require'UniformOutput',false
even when there are no empty results fromregexp
. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp
sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out
so that the contents of out{1}{1}
and out{3}{1}
are instead just out{1}
and out{3}
. But I'm not sure how to accomplish this if I need 'UniformOutput',false
.
CodePudding user response:
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking on the MATLAB help page here I can see a 'emptymatch'
option, perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the |
character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
CodePudding user response:
As a workaround, I have found that using regexprep
can be used to find entries where thing3
is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.