My Regular Expression Looks Like This (IT Toolbox Blogs)

Oracle has two features that make me doubt my capabilities to understand. One is the MODEL clause. The other is “regular expressions”: both are compact and cryptic… rather remind me of the APL language I studied for a couple of weeks back in college: Conway’s Game of Life in one line of code? Sure, APL can do that! Duke’s understanding of the code in one-million years of study… maybe…


At any rate, a question on AskTom caught my eye:


I have a requirement to extract column names (there could be multiple instances) from a SQL string. The column names will always be enclosed within double quotes. e.g. “le code” = 802 AND “principal amount” > 1200 OR “branch id” = ‘ABC’


Connor McDonald proposed a PL/SQL pipelined-table function for that. Just for giggles, I tried his function as a WITH-function, since I don’t have CREATE authority in production. No dice, at least in 12c.


Rajeshwaran, Jeyabal, a notable contributor at the site, followed up with the question “Any specific reason here to avoid “REGEX” and switching into Pipelined Table functions”? He also provided a quick example:


demo@ORA11G> variable l_str varchar2(100)
demo@ORA11G> exec :l_str := q'| "le code" = 802 AND "principal amount" > 1200 OR "branch id" = 'ABC' |';

PL/SQL procedure successfully completed.

demo@ORA11G> print l_str

 "le code" = 802 AND "principal amount" > 1200 OR "branch id" = 'ABC'

demo@ORA11G> select trim('"' from regexp_substr( :l_str, '"\S+\s+\S+"',1,level))  x
  2  from dual
  3  connect by level <= regexp_count( :l_str, '"\S+\s+\S+"' ) ;

le code
principal amount
branch id


Huh. I had to learn REGEX (regular expressions) when I took the Oracle SQL Expert exam in 2010. I’ve had zero use for it since.


Yet, REGEX is a higher abstraction than most SQL functions. And I do believe that it’s good to break out of the rut, as Edward De Bono argued in a book of his I read many years ago.



I suspected that REGEX are just slow based on my limited experience, but that was long ago and versions far away. Certainly slowness would be difficult to demonstrate using the one SQL snippet provided by the original questioner.


So my first question was: how do I come up with reasonably large set of test data? Easily, I answered myself, there are lots of SQL strings in Oracle’s ALL_VIEWS. 


At first, I started thrashing through converting longs to variable character… but then I remembered that Oracle’s already done that in 12c for DBA_VIEWS. In addition to the TEXT column (which is a LONG data type) there’s also TEXT_VC which is varchar2(4000). Probably truncated, but there are thousand of views owned by SYS, so that should be an ample and realistic data set. 



I didn’t want to just say “REGEX is slow”.  There had to be some basis of comparison. So I had to come up with a non-REGEX query. Here’s what I did, using recursive SQL:


WITH setup2 AS (
select view_name
     , text_vc
     , TRUNC((length(text_vc)-length(replace(text_vc,'"'))) / 2 ) as cnt
  from dba_views_copy
 where owner = 'SYS'
), rCTE2 ( view_name, text_vc, cnt, column_name, occurrence#, pp ) AS (
select view_name
     , text_vc
     , cnt
     , ''
     , 1
     , instr(text_vc,'"',1) pp
  FROM setup2
union all
select view_name
     , text_vc
     , cnt
     , case when mod(occurrence#+1,2) = 0 
             and instr(text_vc,'"',1,occurrence#+1)-pp+1 <= 30
            then substr(text_vc,pp,instr(text_vc,'"',1,occurrence#+1)-pp+1)
        end as column_name
     , occurrence#+1
     , instr(text_vc,'"',1,occurrence#+1) pp
  from rCTE2
 where occurrence# <= cnt*2
) select count(*), count(unique view_name)
    , count(unique column_name)
 from rCTE2 where column_name is not null;

-------- ---------------------- ------------------------
  43,003                   3206                     8753

Elapsed: 00:00:05.10


My REGEX version was slower, but much easier to write. In fact, I used it to debug my “conventional query”:


WITH setup AS (
SELECT view_name, text_vc, regexp_count( text_vc, '"[^"]{1,30}"' ) cnt
  FROM dba_views_copy
 WHERE owner = 'SYS'
), rCTE ( view_name, text_vc, cnt, column_name, column_id ) AS (
SELECT view_name
     , text_vc
     , cnt
     , regexp_substr( text_VC, '"[^"]{1,30}"', 1, 1) as column_name
     , 1
  FROM setup
SELECT view_name
     , text_vc
     , cnt
     , regexp_substr( text_VC, '"[^"]{1,30}"'
            , 1, column_id + 1) as column_name
     , column_id + 1
 WHERE column_id <= cnt
) select count(*)
    , count(unique view_name), count(unique column_name)
 from rCTE
 where column_name is not null;

------- ---------------------- ------------------------
 43,232                   3223                     8872

Elapsed: 00:00:07.55


Sharp-eyed readers will note at least two things:

1. REGEX is only slightly slower

2. The answers aren’t exactly the same.

The latter caught my eye…



Why are there are different results? Here’s a snippet from one of the views with different results:


'xmlns="" xmlns:x=""'


My code looked at “” and knew it was over thirty characters, so it skipped to “”. The output was “”.


The regular expression just found this first match: “xmlns:x=”. Right; my code looked for matched-pairs-of-(“) then tested for size. The regular expression inched along, and found the first pattern. 


It’s an implementation detail, eh?





Here’s fun demonstration of how reasonable the naive results are. I’ll use the results from my conventional query as the example:


SQL> create table ""
  2  ( "" varchar2(35) );

Table created.

Elapsed: 00:00:00.09
SQL> insert into ""
  2  values ( '""');

1 row created.

Elapsed: 00:00:00.10
SQL> select * from "";


Cool, eh?  




BTW, Connor’s pipelined table-function was lots faster than either “pure SQL” approach.

Oracle 12c

Source: SANS ISC SecNewsFeed @ February 16, 2017 at 02:06PM