Isolating File Extensions Using Regular Expressions In SQL Server 2000
One of our our older applications is still running on SQL Server
2000 -- Yes, I know, it's three revisions and nine years behind but
I am told there is a valid reason why it hasn't been upgraded yet.
And no. I don't know what that reason is -- While the application
has it's own dedicated support team I am sometimes called upon to
lend assistance when an issue requires a more experienced skill
set. Today was one of those days.
One of the database tables is used to store virtual references
to files that end users have uploaded. Unfortunately there is no
validation applied to the filenames and users are free to upload
whatever they want. Although this is fine for files being uploaded
it can present problems for subsequent GETs. If a file has a badly
named extension then no appropriate mimetype can be determined.
With no appropriate mimetype a web browser will simply present a
download dialog box instead of opening the file with a registered
handler or plug-in. It should be noted that this is not the fault
of the end user. To them it makes perfect sense to name a video
file named mymovie.mov as mymovie.mov copy.
My main task was to parse hundreds of thousands of table rows to
get a count of all the unique file extensions in use. With that
information we could determine the scope of the problem and
prioritize the work effort required to fix it. My first thought of
course was to simply use the CLR to create a wrapper around a
regular expression. Facepalm! This application is running on SQL
Server 2000, ergo no .NET, ergo no CLR, ergo "now what?"
I had a faint recollection of doing something almost a decade
ago with SQL Server 2000 and COM Automation. After some digging
through Google as well as my own memory I came across sp_OACreate() and
the companion sp_OASetProperty(),
sp_OAMethod() and
sp_OADestroy().
These stored procedures allow communication with COM and
subsequently the VBScript Regular Expression object. Creating an
instance of RegExp in a stored procedure using dynamic SQL is as
simple as
1:
DECLARE @ObjectId Int
2:
DECLARE @ReturnVal Int
3:
EXEC @ReturnVal = sp_OACreate 'VBScript.RegExp', @ObjectId OUT
There exists many examples on the web for implementing Regular
Expressions this way. I rolled my own proof of concept so that I
made sure I fully understood how the sp_OA* stored procedures work
then resorted to Google to locate a finished solution. See
http://www.simple-talk.com/sql/t-sql-programming/tsql-regular-expression-workbench/
for a great implementation of User Defined Functions (UDF)
supporting matching, finding and replacing.
Once the UDFs were added to the SQL Server 2000 database it was
time to work on the query that would return all the unique
extensions. That was accomplished with a simple cursor that called
RegExFind on each iteration. The RegEx pattern was \.[\w\s\d]*.$
and it was applied to the file path obtained from the table
containing the file references. The query is reproduced below for
anyone interested.
The pattern \.[\w\s\d]*.$ means match all characters between the
last period in the path (assumed to be the extension delimiter) and
the end of the line. Referring to the example earlier this would
match the .mov in mymovie.mov as well as the .mov copy in
mymovie.mov copy.
1:
DROP
TABLE
2: #Extensions
3:
4:
CREATE
TABLE
5: #Extensions (
6: Extension VARCHAR(255)
7: )
8:
9:
DECLARE MY_CURSOR CURSORFOR
10:
SELECT
11: filePath
12:
FROM
13: FilesTable
14:
WHERE
15: FilesTable.[Type] = 10
16:
AND filePath ISNOTNULL
17:
18:
OPEN My_Cursor
19:
DECLARE @VAR1 VARCHAR(255)
20:
DECLARE @Ext VARCHAR(50)
21:
22:
FETCH
NEXT
FROM
23: MY_Cursor
24:
INTO
25: @VAR1
26:
WHILE(@@FETCH_STATUS <> -1)
27:
BEGIN
28:
IF(@@FETCH_STATUS <> -2)
29:
SET @Ext = (
30:
SELECT
31:
TOP 1 value
32:
FROM
33: dbo.regexfind('\.[\w\s\d]*.$', @Var1, 1, 1)
34:
WHERE (
35: LENGTH > 0
36: )
37: )
38:
IF (
39:
SELECT
40: Extension
41:
FROM
42: #Extensions
43:
WHERE (
44: Extension = @Ext
45: )
46: ) ISNULLBEGIN
47: INSERT INTO
48: #Extensions (
49: Extension
50: )
51:
VALUES (
52: @Ext
53: )
54:
END
55:
FETCH
NEXT
FROM
56: MY_CURSOR
57:
INTO
58: @VAR1
59:
END
60:
CLOSE MY_CURSOR
61:
DEALLOCATE MY_CURSOR
62:
63:
SELECT
64:
DISTINCT Extension
65:
FROM
66: #Extensions
67:
Now that the analysis query has been run we are evaluating the
results to determine the best remediation solution. It will
probably come down to a replacement of obvious malformed extensions
with the correct extension. I will use a variation of the query
above substituting RegExFind with RegExReplace.
One important note when using this method is that is quite
costly in terms of performance. In our case it is being applied as
part of a one time data maintenance and cleanup operation. I would
not recommend using COM Automation like this in a high throughput
stored procedure without careful thought first. At this stage it
is probably well worth upgrading to SQL Server 2008 and using the
CLR.