Friday, February 24, 2012

COOK-ONEIL vs. Cook-oneil

SQL Server 2000

Does anybody know of a good T-SQL engine for de-mangling names?

We're getting names in all caps and need to put them into Proper Case. This is, of course, trivial until you meet the real world where Miss Cook marries Mr. O'Neill and decides to hyphenate. Then of course there was Ms. MacArthur who just shouldn't exist at all, even before she marries Mr. O'Flanahan and has a brood of MacArthur-O'Flanahans.

Surely there is something out there that fixes this?

Thanks for any recommendations!As nonstandard capitialization is by definition non-standard, the only definitive answer is a whole flock of secretaries that either know the answers or will ask the appropriate people to get them. Names are like laws, they are arbitrary, so no logic in the universe will help you reason them out.

That being said, there are a number of packages that do a good job at guessing. It isn't even hard to write a good "case fixer" that will get about 98% of the names right. The problem still goes back to the arbitrary nature of names, so you'll never know which ones are wrong without manual intervention!

-PatP|||It may be the case that it isn't very hard to write something in TSql that will do the job, 98% of the time. However, the reason I was asking it that I have to have it complete and tested (as part of a whole pile of other work) by Monday.

Hence, as a vb programmer slogging through TSQL, I was hoping for hyperlinks.|||For what it is worth, here is an Access VB function I wrote some time back to deal with this issue.

You can either convert the logic to SQL, or put it in an Access ADP file linked to your database for a one-time run.

--------------
Function PROPERCAP(TEXT_STRING)
'B. Lindman
'1997
'Returns the name of a person or business with proper capitalization.

Dim TRIMMED_STRING As String
Dim PROPER_STRING As String
Dim CHARACTER_NUMBER As Integer
Dim STRING_LENGTH As Integer
Dim EXTRA_SPACE
Dim NEXT_CAP

If Len(TEXT_STRING) > 0 Then
TRIMMED_STRING = Trim(TEXT_STRING)
End If
STRING_LENGTH = Len(TRIMMED_STRING)
NEXT_CAP = True

If STRING_LENGTH > 0 Then
For CHARACTER_NUMBER = 1 To STRING_LENGTH

'CHECK FOR EXTRA SPACE CHARACTERS
EXTRA_SPACE = False
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = " " Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 1, 1) = " " Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 2, 1) <> "." Then EXTRA_SPACE = True
End If
End If

'BUILD PROPER STRING
If Not EXTRA_SPACE Then
Select Case NEXT_CAP
Case True
PROPER_STRING = PROPER_STRING & UCase(Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1))
Case False
PROPER_STRING = PROPER_STRING & LCase(Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1))
End Select
End If

'DETERMINE CAPITALIZATION STATUS OF NEXT CHARACTER
NEXT_CAP = False
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = "." Then NEXT_CAP = True
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = " " Then NEXT_CAP = True
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = "-" Then NEXT_CAP = True
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = "(" Then NEXT_CAP = True
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = "/" Then NEXT_CAP = True
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = "I" Then
If CHARACTER_NUMBER > 3 Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 1, 1) = " " Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER + 1, 1) = "I" Then
NEXT_CAP = True
End If
Else
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 1, 1) = "I" Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 2, 1) = " " Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER + 1, 1) = "I" Then
NEXT_CAP = True
End If
End If
End If
End If
End If
End If
If Mid(TRIMMED_STRING, CHARACTER_NUMBER, 1) = "c" Then
If CHARACTER_NUMBER > 1 Then
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 1, 1) = "M" Then
If CHARACTER_NUMBER - 2 = 0 Then
NEXT_CAP = True
Else
If Mid(TRIMMED_STRING, CHARACTER_NUMBER - 2, 1) = " " Then
NEXT_CAP = True
End If
End If
End If
End If
End If

Next CHARACTER_NUMBER

End If

If Len(PROPER_STRING) > 0 Then
PROPERCAP = PROPER_STRING
Else
PROPERCAP = Null
End If

End Function|||Hey, does B. stand for Bob?|||No! I am the ORIGINAL blindman!|||Oh, and I am an airplane|||Originally posted by rdjabarov
Oh, and I am an airplane I'd always wondered about that. Thanks for clearing it up.

-PatP|||If an 85-90% solution will do, you can also use:CREATE TABLE names (
name VARCHAR(50)
)

INSERT INTO names ([name])
SELECT 'TILLEY O''TOOL'
UNION ALL SELECT 'PHRED PHARQUAR'
UNION ALL SELECT 'JOHN KISS-BUTTS'
UNION ALL SELECT 'RENE MACDONALD'
UNION ALL SELECT 'CRAIG MCDERMOT'

UPDATE names
SET [name] = Upper(Left(LTrim([name]), 1))
+ Lower(SubString(LTrim(Rtrim([name])), 2, 8000))

DECLARE @.cName VARCHAR(50)
, @.i INT

DECLARE zFixes CURSOR FOR SELECT [name]
FROM names
WHERE [name] LIKE '%[^A-Za-z][a-z]%'

OPEN zFixes
FETCH zFixes INTO @.cName

WHILE 0 = @.@.fetch_status
BEGIN
SET @.i = PatIndex('%[^A-Za-z][a-z]%', @.cName)
WHILE 0 < @.i
BEGIN
SET @.cName = SubString(@.cName, 1, @.i - 1)
+ Upper(SubString(@.cName, @.i, 2))
+ SubString(@.cName, @.i + 2, 8000)

SET @.i = PatIndex('%[^A-Za-z][a-z]%'
, @.cName COLLATE Latin1_General_BIN )
END

UPDATE names SET [name] = @.cName WHERE CURRENT OF zFixes
FETCH zFixes INTO @.cName
END

CLOSE zFixes
DEALLOCATE zFixes

SELECT * FROM names

DROP TABLE namesIt doesn't deal with exceptions, although you can add them several ways at the end. The biggest problem with figuring out how you want to process the exceptions is that nearly all of the methods are mutually exclusive, using one precludes the use of others. You have to know the data you want to fix (the exceptions you want to handle) in order to pick the right algorithm for coping with them.

-PatP|||Thanks Pat,

I'll have to look at it tomorrow. I spent all afternoon converting/writing a vb fix in vb that used only functions that also work in vbScript. Worked well in vb.

When I moved it into my vbScript module it flopped. What becomes 'O'Neill - McRoberts' in vb ends up as O'neil-mcroberts' in vbScript. Grrrrrrr!|||Fly djabarov.

Do you offer frequent flier miles?|||Hmmm, I think you already know my response, so let's keep it out of here...unless you want a piece of me?|||Originally posted by rdjabarov
Hmmm, I think you already know my response, so let's keep it out of here...unless you want a piece of me? Great zot! Did you mean to leave that one hanging wide open like that? Talk about a perfect opportunity for multiple straight lines!

I suppose I have to behave, but it is hard to pass up temptation like that!

-PatP|||Ain't it amazing how fast we end up in the gutter...

Nice code pat...I got one here or abother forummm...|||Here's my Attempt

http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=26584|||Thanks very much for your help. Since vb is what I write fastest I've got something working in vbScript that hits the data as it comes in. It handles everything except the dreaded 'Mac' issue.

I will, however, put your lovely examples into my source code stash for the next time.

If for any reason you would like my not so lovely, but fairly functional example just let me know--but I assume all of you SQL gurus don't mess around with these evil DTS packages much.

Thanks again!|||Originally posted by Fly Girl
If for any reason you would like my not so lovely, but fairly functional example just let me know--but I assume all of you SQL gurus don't mess around with these evil DTS packages much.

Thanks again! Sure, I'd like to see your solution. Maybe I can offer some comments, and for sure it will give me insight into how to interpret similar requests in the future.

-PatP|||The code is below. It isn't particularly spiff--and I would do it character by character if I had to do it again, but it seems to handle most of our current name issues.

Note also that I've done limited testing -- just passed it off for full testing on about 48000 names.

Function NameCase(ByVal strLastName)

' Find out if there is an apostrophe or hyphen in there
intH = InStr(1, strLastName, "-")
intA = InStr(1, strLastName, "'")

' Begin tearing the name apart
arStr = Split(strLastName)

For n = 0 To UBound(arStr)

If Len(arStr(n)) > 0 Then

' Look for the 'Mc' thing
strMc = Mid(arStr(n), 1, 2)

If strMc = "MC" Or strMc = "Mc" Or strMc = "mc" Then

bolMc = True
arStr(n) = "Mc" & UCase(Mid(arStr(n), 3, 1)) & LCase(Mid(arStr(n), 4))
Else
bolMc = False

End If

If intH > 0 Then
' Check for a hyphen in this chunk of the name
arHyphen = Split(arStr(n), "-")

If UBound(arHyphen) > 0 Then

For i = 0 To UBound(arHyphen)

If Len(arHyphen(i)) > 0 Then

If bolMc Then
If i = 0 Then
bolHyphen = 1
bolMc = False
strNameH = arHyphen(i)
Else
arHyphen(i) = UCase(Left(arHyphen(i), 1)) & LCase(Mid(arHyphen(i), 2))
strNameH = strNameH & "-" & arHyphen(i)
End If
Else
If i = 0 Then
arHyphen(i) = UCase(Left(arHyphen(i), 1)) & LCase(Mid(arHyphen(i), 2))
bolHyphen = 1
strNameH = arHyphen(i)
Else
arHyphen(i) = UCase(Left(arHyphen(i), 1)) & LCase(Mid(arHyphen(i), 2))
strNameH = strNameH & "-" & arHyphen(i)
End If
End If
End If

Next
Else
bolHyphen = False

End If

End If

If intA > 0 Then
' Check for an apostrophe in this chunk of the name
arApost = Split(arStr(n), "'")

If UBound(arApost) > 0 Then
For i = 0 To UBound(arApost)

If Len(arApost(i)) > 0 Then

If bolHyphen Then
' figure out where the hyphen is
If intH > intA Then

' The hyphen is after the apostrophe so
' it will be in then second chunk
If i = 0 Then
arApost(i) = UCase(Left(arApost(i), 1)) & LCase(Mid(arApost(i), 2))
strNameA = arApost(i)
Else
' Trim down the string so that the stuff after the Apostrophe won't
' be converted to lower case
intH2 = InStr(1, arApost(i), "-")
intNamePart = Len(strNameH)
strNamePart = Mid(arApost(i), 1, intH2 - 1)
strRemainder = Mid(strNameH, intH, intNamePart)
strNamePart = UCase(Left(strNamePart, 1)) & LCase(Mid(strNamePart, 2))
strNameA = strNameA & "'" & strNamePart & strRemainder

End If
Else

' The hyphen is before the apostrophe

If i = 0 Then
' Trim down the string so that the stuff after the Apostrophe won't
' ' be converted to lower case
strRemainder = Mid(strNameH, 1, intH + 1)
strNameA = strRemainder
Else
arApost(i) = UCase(Left(arApost(i), 1)) & LCase(Mid(arApost(i), 2))
strNameA = strNameA & "'" & arApost(i)
End If

End If
Else

If i = 0 Then
arApost(i) = UCase(Left(arApost(i), 1)) & LCase(Mid(arApost(i), 2))
strNameA = arApost(i)
bolApost = 1
Else
arApost(i) = UCase(Left(arApost(i), 1)) & LCase(Mid(arApost(i), 2))
strNameA = strNameA & "'" & arApost(i)
End If

End If
End If
Next

Else
bolApost = False

End If
End If

If Not bolHyphen And Not bolApost And Not bolMc Then

arStr(n) = UCase(Left(arStr(n), 1)) & LCase(Mid(arStr(n), 2))

End If

If strNameH = "" And strNameA = "" Then

strName = strName & arStr(n)

Else

If strNameH = "" Or strNameA = "" Then

strName = strName & strNameH & strNameA

Else
' If there was both a hyphen and an apostrophe then
' the name will be in strNameA
strName = strName & strNameA

End If

End If

strNameA = ""
strNameH = ""

End If
Next

NameCase = strName

End Function

Enjoy!

No comments:

Post a Comment