Monday, September 28, 2015

Wildcard Searches

http://www.sqlservercentral.com/articles/T-SQL/130558/


Introduction
When dealing with strings, more often than not, we don’t know the exact contents of the column or sometimes we’re not sure on the spelling. Can you spell Schwarzenegger correctly without help every single time? Probably not. Fortunately for us, we have wildcards available to search for patterns.

The available options

There are two options available in T-SQL to search for patterns in strings:
  • The LIKE operator
  • The PATINDEX function
Both are used in different situations. The LIKE operator is meant for comparisons, to be sure that a pattern exists or not in a string. The PATINDEX function will return the starting position of a pattern in a string or zero if the string does not contain a pattern. Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article. I’ll be focusing on the LIKE operator, but the concepts translate directly to the PATINDEX function.
There are also four different wildcards:
  • Percent character (%): Matches any string of zero or more characters.
  • Underscore (_): Matches any single character.
  • Square brackets ([]): Matches a single character contained between them either as a range, a set or a combination of both.
  • Square brackets negative match ([^]): Matches any single character not contained between them either as a range, a set or a combination of both.
These wildcards can be combined to generate different patterns to search within strings.

Basic searches

For this article, I’m using a simple table which you can recreate to follow the examples, either by copying the code or by downloading it from the Resources section.
USE tempdb;  --Be sure to test in a safe environment
GO

CREATE TABLE dbo.LIKETest(
    id      int IDENTITY, --CONSTRAINT PK_LIKETest PRIMARY KEY NONCLUSTERED,
    Name    varchar(1000),
    phone   varchar(1000),
    last_movie_release  date,
    amount  varchar(1000),
    comments varchar(1000)
);
CREATE CLUSTERED INDEX IX_LIKETest ON dbo.LIKETest(Name);

INSERT INTO dbo.LIKETest(Name, phone, last_movie_release, amount, comments)
VALUES
    ( 'Bruce Wayne',     'Confidential',   '20120720', '35131'        , 'Reach at email: bwayne@WayneIndustries.com'),
    ( 'Clark Kent',      '8457390095',     '20130614', '58455.64'     , 'Work email: ckent@daily_planet.com'),
    ( 'Richard Grayson', '212-555-0187212-555-0187 FREE',   '19970620', '.63521'       , 'Known as Dick Grayson'),
    ( 'Diana Prince',    '849-555-0139',   NULL      , '58485.'       , 'Amazon princess, treat with respect'),
    ( 'J''onn J''onzz',  'N/A',            NULL      , '-15612'       , 'Last Martian'),
    ( 'Barry Allen',     '(697) 555-0142', NULL      , '-1.5413'      , 'Too fast'),
    ( 'Reed Richards',   '917-330-2568917-330-2568 FREE',   '20150807', '-4156-15'     , NULL),
    ( 'Susan Storm',     '917-970-0138917-970-0138 FREE',   '20150807', '156.516.51'   , NULL),
    ( 'Johnny Storm',    '917-913-0172917-913-0172 FREE',   '20150807', '665465-'      , NULL),
    ( 'Ben Grimm',       '917-708-0141917-708-0141 FREE',   '20150807', 'One Thousand' , NULL),
    ( 'Peter Parker',    '917-919-0140917-919-0140 FREE',   '20140502', '56E6546'      , 'With great power comes great responsibility'),
    ( 'Tony Stark',      '492-167-0139',   '20130503', '$'            , ''),
    ( 'Wade Wilson',     '692-257-1937',   NULL      , 'ss'           , 'Just 50% hero'),
    ( 'Bruce Banner',    '781-167-4628781-167-4628 FREE',   '20080613', 'FFFFFF'       , 'sdo@a#%^add34.voi');

Starting by ‘some string’

This is the best option if you want to make an optimal use of indexes. It’s the only option that can lead to an index seek operation. Here are five basic examples that show this option.
--A) Return all rows when the name starts by B   
SELECT *
 FROM dbo.LIKETest
 WHERE Name LIKE 'B%';

--B) Return all rows when the phone starts by 917
SELECT *
 FROM dbo.LIKETest
 WHERE phone LIKE '917%';

--C) Return all rows when the name starts by any character between A and L
SELECT *
 FROM dbo.LIKETest 
 WHERE Name LIKE '[A-D]%';

--D) Return all rows when the name starts by the characters C, D or W
SELECT * 
 FROM dbo.LIKETest 
 WHERE Name LIKE N'[CDW]%';

--E) Return all rows when the last_movie_release starts by 2015
SELECT * 
 FROM dbo.LIKETest 
 WHERE last_movie_release LIKE '2015%';

As we can see, we can use one or more characters, a range of characters, a set of characters or even a year from a date column. However, one of these things is not like the others, one of these things just doesn't belong.
The first 4 queries generate a very similar execution plan: A SELECT and an Index Seek.

The fifth changes the Index Seek for an Index Scan and depending on your version of SQL Server, it will show a warning when expanded will denote an implicit conversion.

A few points illustrated by this last query:
  • You can use LIKE to search within columns from different data types.
  • To be able to compare, SQL Server implicitly converts the different type to a character string.
  • Whenever a conversion to a column is needed, the expression becomes non-SARGable.
  • If the column was a datetime instead of a date, there would be no results as the strings wouldn’t start with the year.

Ends with ‘some string’

This one is commonly used when searching for extensions in a column containing file names, but it’s not restricted to it. In this example, I’m querying for a last name.
SELECT * 
 FROM dbo.LIKETest
 WHERE Name LIKE '%Parker';

Contains ‘some string’
Using leading and trailing wildcards can help you to find a string within another string.
SELECT *
 FROM dbo.LIKETest
 WHERE Name LIKE '%Richard%';

Contains ‘exact word’

If you run the previous query, you’ll find out that it returns Richard Grayson as well as Reed Richards. To get the exact word you need additional validations.
Some people just add spaces to both sides of the word:
SELECT *
 FROM dbo.LIKETest
 WHERE Name LIKE '% Richard %';

That helps only if the word is surrounded by spaces and it’s not at the beginning or the end of the string. In this case, it no longer returns any rows.
To allow characters other than spaces, we change them to a negative match [^A-Za-z]. This will include all the characters that aren’t part of those ranges. It’s a good idea to include both ranges to avoid problems with case sensitive collations.
To allow that the word occurs at the beginning or end of the string, we just add characters to both sides of the column. This was already a non-SARGable condition, so it won’t impact performance. The condition will then look like this:
SELECT *
 FROM dbo.LIKETest
 WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';

The N character is ‘some character’

When using the % wildcard, we don’t know the number of characters. To know if a character or strings occur at a certain position, we can use the underscore to accept any character. If you’re using several underscores, you can use REPLICATE() to make it easier to count the number of underscores. The second parameter should be the position of the searched string minus 1.
These examples show how to search the names with a letter n in the fourth position either from left to right or right to left.
SELECT *
 FROM dbo.LIKETest
 WHERE Name LIKE REPLICATE('_', 3) + 'n%';

SELECT *
 FROM dbo.LIKETest
 WHERE Name LIKE '%n' + REPLICATE('_', 3);

Only valid strings
I know most of us work on perfectly designed databases that won’t contain tables like the one on our example. But from time to time, we need to validate the information entered in a column. This will get all the rows where the column amount can be converted to a decimal.
SELECT *
 FROM dbo.LIKETest
 WHERE amount NOT LIKE '%[^-0-9.]%' --Only digits, decimal points and minus signs
 AND amount NOT LIKE '%[.]%[.]%' --Only one decimal point allowed
 AND amount NOT LIKE '_%[-]%'; --Minus sign should only appear at the beginning of the string

Of course, you can continue to include more validations such as length, but this would give you an idea. To bring the other set of rows (the invalid ones), you can remove the NOT and change the ANDs for ORs.
SELECT *
 FROM dbo.LIKETest
 WHERE amount LIKE '%[^-0-9.]%' --Only digits, decimal points and minus signs
 OR amount LIKE '%[.]%[.]%' --Only one decimal point allowed
 OR amount LIKE '_%[-]%'; --Minus sign should only appear at the beginning of the string

This is explained in more detail here: http://www.sqlservercentral.com/articles/ISNUMERIC()/71512/
There are plenty of examples on how to do other validations. Some patterns include the following:
'[1-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' Basic validation for date as YYYYMMDD. The first digit can’t be zero. The fifth digit, first for month, should be zero or one. The seventh digit, first for day, should be between zero and 3.
'[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]' Similar to the previous one but with format YYYY-MM-DD
'#[A-F0-9][A-F0-9][A-F0-9][A-F0-9][A-F0-9][A-F0-9]' Hexadecimal RGB code.

Common problems

Here are some of the common problems that developers find with wildcard searches.

Not returning all the results

A common problem is believing that using LIKE ‘%’ will return all the rows. This isn’t true when NULLs are present in the column used to compare as you can see with the following query.
SELECT *
 FROM dbo.LIKETest
 WHERE comments LIKE '%';

Need to include characters used as wildcards.

When you need to search for a value that includes a percentage or an underscore, there are 2 options: Including the ESCAPE option (only available when using LIKE) and including them in square brackets.
SELECT *
 FROM dbo.LIKETest
 WHERE comments LIKE '%[0-9]~%%' ESCAPE '~';

SELECT *
 FROM dbo.LIKETest
 WHERE comments LIKE '%[0-9][%]%';

Ignoring trailing (or leading) spaces
This one gave me some headaches when I started. When using LIKE or PATINDEX all the characters matter, even the trailing spaces. Remember that a char or nchar data type includes all the trailing spaces, the best option to store the pattern is in a varchar or nvarchar to prevent the problem showed below.
DECLARE @Name char(50) = 'Bruce';

SET @Name = RTRIM( @Name) + '%';

SELECT @Name;

SELECT *
 FROM dbo.LIKETest
 WHERE Name LIKE @Name;

With the variable as it is, it searches for names that start by Bruce and end with 44 spaces. Changing the variable type to varchar will return the 2 expected rows.

Thinking that because something complies with a rule, it won’t bring invalid values

The most common case is when people think that using LIKE ‘%[0-9]%’ will return only digits because they’re looking for numeric values. The truth is that it will return any rows where the string includes at least one digit. We’ve seen how to solve this problem, but there are others which appear to be more complicated such as looking for emails in a free text column.
SELECT *
 FROM dbo.LIKETest
 WHERE comments LIKE '%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%';
The validation can get very complicated very easily and still get false positives or false negatives.

Going further

There are lots of things that can be done with patterns, like a pattern based splitter. I tried to cover some basics, but it’s impossible to cover them all. Please share your examples in the comments to know what you have done with the wildcards in SQL.


Wildcard searches sql

USE tempdb; --Be sure to test in a safe environment
GO
CREATE TABLE dbo.LIKETest(
id int IDENTITY, --CONSTRAINT PK_LIKETest PRIMARY KEY NONCLUSTERED,
Name varchar(1000),
phone varchar(1000),
last_movie_release date,
amount varchar(1000),
comments varchar(1000)
);
CREATE CLUSTERED INDEX IX_LIKETest ON dbo.LIKETest(Name);
INSERT INTO dbo.LIKETest(Name, phone, last_movie_release, amount, comments)
VALUES
( 'Bruce Wayne', 'Confidential', '20120720', '35131' , 'Reach at email: bwayne@WayneIndustries.com'),
( 'Clark Kent', '8457390095', '20130614', '58455.64' , 'Work email: ckent@daily_planet.com'),
( 'Richard Grayson', '212-555-0187', '19970620', '.63521' , 'Known as Dick Grayson'),
( 'Diana Prince', '849-555-0139', NULL , '58485.' , 'Amazon princess, treat with respect'),
( 'J''onn J''onzz', 'N/A', NULL , '-15612' , 'Last Martian'),
( 'Barry Allen', '(697) 555-0142', NULL , '-1.5413' , 'Too fast'),
( 'Reed Richards', '917-330-2568', '20150807', '-4156-15' , NULL),
( 'Susan Storm', '917-970-0138', '20150807', '156.516.51' , NULL),
( 'Johnny Storm', '917-913-0172', '20150807', '665465-' , NULL),
( 'Ben Grimm', '917-708-0141', '20150807', 'One Thousand' , NULL),
( 'Peter Parker', '917-919-0140', '20140502', '56E6546' , 'With great power comes great responsibility'),
( 'Tony Stark', '492-167-0139', '20130503', '$' , ''),
( 'Wade Wilson', '692-257-1937', NULL , 'ss' , 'Just 50% hero'),
( 'Bruce Banner', '781-167-4628', '20080613', 'FFFFFF' , 'sdo@a#%^add34.voi');
----------------------------------------------------------------------------------------
-- Basic Searches
----------------------------------------------------------------------------------------
--A) Return all rows when the name starts by B
SELECT * FROM dbo.LIKETest WHERE Name LIKE 'B%';
--B) Return all rows when the phone starts by 917
SELECT * FROM dbo.LIKETest WHERE phone LIKE '917%';
--C) Return all rows when the name starts by any character between A and L
SELECT * FROM dbo.LIKETest WHERE Name LIKE '[A-D]%';
--D) Return all rows when the name starts by the characters C, D or W
SELECT * FROM dbo.LIKETest WHERE Name LIKE N'[CDW]%';
--E) Return all rows when the last_movie_release starts by 2015
SELECT * FROM dbo.LIKETest WHERE last_movie_release LIKE '2015%';
--F) Ends with ‘some string’
SELECT * FROM dbo.LIKETest WHERE Name LIKE '%Parker';
--G) Contains ‘some string’
SELECT * FROM dbo.LIKETest WHERE Name LIKE '%Richard%';
--H) Contains ‘exact word’
SELECT * FROM dbo.LIKETest WHERE Name LIKE '% Richard %';
SELECT * FROM dbo.LIKETest WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';
--I) The N character is ‘some character’
SELECT * FROM dbo.LIKETest WHERE Name LIKE REPLICATE('_', 3) + 'n%';
SELECT * FROM dbo.LIKETest WHERE Name LIKE '%n' + REPLICATE('_', 3);
--J) Only valid strings
SELECT *
FROM dbo.LIKETest
WHERE amount NOT LIKE '%[^-0-9.]%' --Only digits, decimal points and minus signs
AND amount NOT LIKE '%[.]%[.]%' --Only one decimal point allowed
AND amount NOT LIKE '_%[-]%'; --Minus sign should only appear at the beginning of the string
--Or invalid strings
SELECT *
FROM dbo.LIKETest
WHERE amount LIKE '%[^-0-9.]%' --Only digits, decimal points and minus signs
OR amount LIKE '%[.]%[.]%' --Only one decimal point allowed
OR amount LIKE '_%[-]%'; --Minus sign should only appear at the beginning of the string
----------------------------------------------------------------------------------------
-- Common problems
----------------------------------------------------------------------------------------
--K) Not returning all the results
SELECT * FROM dbo.LIKETest WHERE comments LIKE '%';
--L) Need to include characters used as wildcards
SELECT * FROM dbo.LIKETest WHERE comments LIKE '%[0-9]~%%' ESCAPE '~';
SELECT * FROM dbo.LIKETest WHERE comments LIKE '%[0-9][%]%';
--M) Ignoring trailing (or leading) spaces
DECLARE @Name char(50) = 'Bruce';
SET @Name = RTRIM( @Name) + '%';
SELECT @Name;
SELECT * FROM dbo.LIKETest WHERE Name LIKE @Name;
--N) Thinking that because something complies with a rule, it won’t bring invalid values
SELECT *
FROM dbo.LIKETest
WHERE comments LIKE '%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%';
 
GO
DROP TABLE dbo.LIKETest;