Matt Bourke

Jun 6 2010

Searching code with regex

Posted by Matt at 2:15 PM
7 comments
- Categories: regex | Coldfusion

Ever had trouble finding where a variable is set, displayed or manipulated in a large application?
I remember how hard it was to hunt code down before I knew any regex. Here is an example of a problem I used to face several years ago.
I would inherit a large mess of an application and come across a variable #abc123# in the code. Not knowing where it was set, I would search in Eclipse for the following string:
<cfset abc123
I expected this to find where the variable was set. When I eventually tracked it down, though, it was often written something like
<cfset  abc123 (notice the 2 spaces between cfset and abc123).

 

Another example: I wanted to find where abc123 is set to session.crap, so I'd search for
<cfset abc123 = session.crap >
and
<cfset abc123 =  session.crap (notice the 2 spaces)
and still have no luck, because one developer would write code like
<cfset abc123 = session.crap >
and another
<cfset abc123=session.crap> (notice the lack of spacing)

 

I've worked on some nasty apps over the years and have lost my fair share of hair hunting stuff down. These days I simply use regex whenever I search through code.

 

Now let's get started on a 5 minute tutorial. Open Eclipse, hit CTRL + F, tick the regular expressions check box and enter the following regex:
<cfset[\t\s]+abc123[\t\s]*?\=(.*?)\>
It should look like this:

 

[Screenshot: regex search]

Now if you hit search, it should find where the variable is being set on your test page. Most likely you'll want to search the entire site, though, in which case use the same regex in the site-wide search screen.
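To see why the regex copes with the spacing variations described earlier, here is a minimal sketch in Python. It uses Python's re module rather than Eclipse's search engine, on the assumption that the two treat this pattern the same way. The sample lines are made up for illustration.

```python
import re

# The regex from the tutorial, unchanged.
PATTERN = re.compile(r"<cfset[\t\s]+abc123[\t\s]*?\=(.*?)\>")

# Hypothetical sample lines with different spacing habits.
samples = [
    '<cfset abc123 = session.crap >',
    '<cfset  abc123 = session.crap>',   # two spaces after cfset
    '<cfset\tabc123=session.crap>',     # tab, no spaces around "="
    '<cfset other = abc123>',           # abc123 is read here, not set
]

def is_assignment(line):
    """Return True when the line sets abc123 via <cfset>."""
    return PATTERN.search(line) is not None

for line in samples:
    # A plain-text search for "<cfset abc123" only finds the first sample.
    print(is_assignment(line), '<cfset abc123' in line, line)
```

The first three samples are reported as assignments by the regex, while the plain-text search only catches the first one.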

 

To adapt this to your own variable, simply replace the text "abc123" in the above regex with the name of your variable.
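If you want to script that substitution, something like the following might do. This is only a sketch: cfset_pattern is a hypothetical helper name, and it leans on Python's re.escape so that dots in names like session.crap are escaped automatically.

```python
import re

def cfset_pattern(variable_name):
    """Build the tutorial's <cfset> regex for any variable name.

    re.escape() escapes the dot in names such as "session.crap",
    so it only matches a literal dot.
    """
    name = re.escape(variable_name)
    return re.compile(r"<cfset[\t\s]+" + name + r"[\t\s]*?\=(.*?)\>")

pattern = cfset_pattern("session.crap")
match = pattern.search('<cfset   session.crap= 42>')
print(match.group(1).strip())  # prints: 42
```

The captured group is whatever sits between the "=" and the closing ">", which is handy when you also want to see what the variable is being set to.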
Let's consider the following scenario.
Your site has a variable session.crap, and you want to find every place it is set in the giant messy application you've just inherited. You search on session.crap and the search returns 500 results in 150 files. Instantly you realise 2 things:
1. this codebase contains a lot of crap
2. the session has crap all through it (pun intended).

Some of the results for this variable in the search are

 

<cfset crud = session.crap + mrVariable />
#session.crap#
<cfset session.crap.crud.fcku.whacked = "yo!" />
<cfscript>
session.crap = 1 + 1;
session.crap= session.crap
session.crap = session.crap
</cfscript>

 

You get my point: there are a bazillion different ways this could appear if you simply search on the name, or search for <cfset session.crap, which of course could contain an unknown amount of whitespace.
The first step of the hunt I would try is
<cfset[\t\s]+session\.crap[\t\s]*?\=(.*?)\>
and for cfscript the below should do.
[\t\s]+session\.crap[\t\s]*?\=(.*?);
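As a rough check, here is a Python sketch that runs both patterns over the sample results listed above, plus one direct assignment (<cfset session.crap = 0 />) that is added purely for illustration. It assumes Python's re module behaves like your editor's engine for these patterns.

```python
import re

CFSET = re.compile(r"<cfset[\t\s]+session\.crap[\t\s]*?\=(.*?)\>")
CFSCRIPT = re.compile(r"[\t\s]+session\.crap[\t\s]*?\=(.*?);")

# Sample results from the post; the fourth line is a hypothetical addition.
source = '''<cfset crud = session.crap + mrVariable />
#session.crap#
<cfset session.crap.crud.fcku.whacked = "yo!" />
<cfset session.crap = 0 />
<cfscript>
session.crap = 1 + 1;
session.crap= session.crap
session.crap = session.crap
</cfscript>
'''

# Tag-based assignments: reads and nested keys are skipped.
tag_hits = [m.group(0) for m in CFSET.finditer(source)]

# Script assignments: the newline before the statement satisfies the
# leading [\t\s]+, and the statement must end with ";" on the same line.
script_hits = [m.group(0).strip() for m in CFSCRIPT.finditer(source)]

print(tag_hits)
print(script_hits)
```

Only the direct assignments come back: the <cfset> pattern ignores the lines that merely read session.crap or set a nested key, and the cfscript pattern skips the two statements that have no terminating semicolon.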

 

Remember to escape the dot with a \, otherwise the dot will match any character,
e.g. sessionscrap would be found just as session.crap would (for more info see my previous blog tutorial). The above of course isn't perfect, but it is one of the many ways to save time without having to run code, debug etc.
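The difference is easy to demonstrate. A small Python sketch (Python's re module is an assumption here, but the dot means the same thing in the common regex flavours):

```python
import re

unescaped = re.compile(r"session.crap")   # "." matches any single character
escaped = re.compile(r"session\.crap")    # "\." matches only a literal dot

for text in ["session.crap", "sessionscrap"]:
    print(text, bool(unescaped.search(text)), bool(escaped.search(text)))
```

The unescaped version matches both strings, while the escaped version only matches session.crap.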

 

Regex can be used in many ways. Let's say we simply wanted to know how many cfsets are in our code base; we could run
<cfset[\t\s]+[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\> (of course we could just search on cfset)
But what if we wanted to know how many of these cfsets are in an XML-compliant format (for whatever reason)?
<cfset[\t\s]+[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\/\>
will accomplish this to a degree.
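Counting with these two patterns could be sketched in Python like this. The sample source and the count_matches helper are made up for illustration. Two things worth knowing: the first pattern also counts the self-closing tags, and because [a-zA-Z_]+[\w\.]+ needs at least two characters, a one-letter variable such as x is not counted.

```python
import re

ANY_CFSET = re.compile(r"<cfset[\t\s]+[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\>")
XML_CFSET = re.compile(r"<cfset[\t\s]+[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\/\>")

# Hypothetical snippet: one tag per line, so the lazy (.*?) cannot
# run on into a neighbouring tag.
source = '''<cfset total = 0>
<cfset session.user = "matt" />
<cfset  items=ArrayNew(1)>
<cfset x = 1>
'''

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))

print(count_matches(ANY_CFSET, source))  # prints: 3 (the x line is missed)
print(count_matches(XML_CFSET, source))  # prints: 1
```

If several tags share a line, the XML-compliant pattern can stretch from an unclosed tag to a later "/>", which is part of why it only works "to a degree".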

 

If I were assigned to clean up a large code base, I could search through it to see which variables have been scoped, for example
<cfset[\t\s]+[a-zA-Z_]+[\w\.]+\.[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\>
or scoped and XML-compliant
<cfset[\t\s]+[a-zA-Z_]+[\w\.]+\.[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\/\>
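For a clean-up job, the scoped pattern can be paired with the general cfset pattern to sort assignments into two piles. A minimal Python sketch, with a hypothetical split_by_scope helper and made-up sample lines:

```python
import re

ANY_CFSET = re.compile(r"<cfset[\t\s]+[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\>")
SCOPED_CFSET = re.compile(
    r"<cfset[\t\s]+[a-zA-Z_]+[\w\.]+\.[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\>"
)

# Hypothetical lines from a template.
lines = [
    '<cfset session.userName = "matt">',
    '<cfset variables.total = 0 />',
    '<cfset userName = "matt">',
    '<cfoutput>#userName#</cfoutput>',
]

def split_by_scope(lines):
    """Return (scoped, unscoped) cfset lines; other lines are ignored."""
    scoped, unscoped = [], []
    for line in lines:
        if SCOPED_CFSET.search(line):
            scoped.append(line)
        elif ANY_CFSET.search(line):
            unscoped.append(line)
    return scoped, unscoped

scoped, unscoped = split_by_scope(lines)
print(len(scoped), len(unscoped))  # prints: 2 1
```

The unscoped pile is the one to review first when tidying up scoping.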

 

If I wanted to follow some basic OO principles, I could start by making sure my CFCs are encapsulated. A simple example would be to search through all CFC files for variables belonging to the request scope.
<cfset[\t\s]+request\.[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\>
By now you know what to do for searching application variables: simply change the "request" above to "application".
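Outside the editor, the same search over every CFC file could be sketched in Python as below. The function names are hypothetical, the files are assumed to be readable as UTF-8, and the match is case sensitive exactly as the regex is written.

```python
import re
from pathlib import Path

def scope_pattern(scope):
    """The tutorial's regex with the scope name swapped in (e.g. "request")."""
    return re.compile(
        r"<cfset[\t\s]+" + re.escape(scope)
        + r"\.[a-zA-Z_]+[\w\.]+[\t\s]*?\=(.*?)\>"
    )

def find_scope_writes(root, scope="request"):
    """Yield (path, line_number, line) for matching lines in *.cfc files."""
    pattern = scope_pattern(scope)
    for path in sorted(Path(root).rglob("*.cfc")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, number, line.strip()
```

Calling list(find_scope_writes("path/to/site")) gives every request-scope assignment found in a CFC, and passing "application" as the second argument switches scope.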

 

Some further examples of regex for searching through/counting CF code:
#[a-zA-Z_]+[\w\.]+\.[a-zA-Z_]+[\w\.]+# (output of scoped variables)
#[a-zA-Z_]+[\w]+# (output of unscoped variables)
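A quick Python sketch of the two output patterns at work. The template line is invented, and Python's re module is assumed to behave like your editor's engine here.

```python
import re

SCOPED_OUTPUT = re.compile(r"#[a-zA-Z_]+[\w\.]+\.[a-zA-Z_]+[\w\.]+#")
UNSCOPED_OUTPUT = re.compile(r"#[a-zA-Z_]+[\w]+#")

# Hypothetical template line with one scoped and one unscoped output.
template = (
    "<cfoutput>Hello #session.userName#, "
    "you have #messageCount# messages</cfoutput>"
)

scoped = SCOPED_OUTPUT.findall(template)
unscoped = UNSCOPED_OUTPUT.findall(template)

print(scoped)    # prints: ['#session.userName#']
print(unscoped)  # prints: ['#messageCount#']
```

Neither pattern will pick up expressions inside the hashes, such as function calls, since only word characters and dots are allowed between them.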

 

Searching for variables through code can be much faster than debugging. I hope others can use some of the above regular expressions in their daily development. If you have some helpful regex you use often, or if you think you can improve on mine, please post it below.

 


Comments


Tony wrote on 06/07/10 7:38 AM

Looks like great stuff Mat. Just what I was looking for.

Just one thing. We have some really 'unique' coding styles to deal with. One of my main bugbears is unscoped variables. So someone will call #randomVar# somewhere in the code, but it will actually be coming from something like #session.randomVar#. Can you update your example to show how you might be able to ignore any scoping of the variable names?

Eric Cobb wrote on 06/07/10 8:16 AM

Thank you!!! You have no idea how much easier you just made my life!

Matt wrote on 06/07/10 8:32 AM

Hi Tony,
If I understand you correctly the following should be good.
<cfset[\t\s]+([a-zA-Z_]+[\w\.]+\.)?myvar[\t\s]*?\=(.*?)\>

Just replace myvar with whatever your variable is.
<cfset[\t\s]+([a-zA-Z_]+[\w\.]+\.)?myvar(\.[a-zA-Z_]+[\w\.]+)?[\t\s]*?\=(.*?)\>
will also find <cfset url.myvar.test = x > in case you want to go deeper.

Let me know if I've understood you correctly.


Matt wrote on 06/07/10 8:44 AM

Hi Eric,

Great to hear I've helped someone. Be sure to also use the one I just wrote above for Tony; it will find the first 3 of the below but not the last.
<cfset session.myvar = "hey">
<cfset url.myvar = "hey">
<cfset myvar = "hey">
<cfset session.whacked = "hey">

Tony wrote on 06/07/10 8:49 AM

Nice work Mat, thanks for all your help.

Neil wrote on 06/10/10 3:44 AM

Hi Matt
I tried out the regex in my code and it worked fine. But do we need the "\t" bit? I tried it without that and it still returned matches.

Matt wrote on 06/10/10 3:53 AM

Hi Neil,
You're correct, all that is needed is the \s. The \t was for tabs, but in most implementations \s already covers tabs, so feel free to remove the \t.
Cheers

© 2010 Matt Bourke