2

This may be really basic, but I can't think of how to write a SQL query that would find strings that have the same characters but different capitalization.

The context I'm working on is a Rails 3.2 app. I have a simple Tag model with a Name attribute. I've inherited data for this model that did not store values case-insensitively, so some users input things like "Tree" while others input "tree" and now we have two tags that really should be one.

So, I'd like to do a query to find all these pairs so that I can go about merging them.

The only thing I can think of so far is to write a rake task that loops through them all and checks for matching values... something like:

pairs = []
Tag.all.each do |t|
  other = Tag.where( 'name LIKE ?', t.name )
  pairs << [t, other] if other
end

However, I'm not sure the above would work, or that it makes sense performance-wise. Is there a better way to write a SQL query that would find these matching pairs?

Andrew
  • 40,445
  • 46
  • 172
  • 276

3 Answers3

2

There is a question similar to this here

What you can do is take that answer a create a method in your model to do a case insensitive search. From what i've experience however is that ActiveRecord already does case insensitive search but just in case:

def self.insensitive_find_by_tag_name(name)
    Tag.where("lower(name) = ? ", name.downcase)
end

and then to remove duplicate entries, you can do something like this

Tag.transaction! do
    tags = Tag.insensitive_find_by_tag_name(name)

    tags.last(tags.length() - 1).each do |tag|
        tag.destroy        
    end
end

Call transaction just in case anything fails so the database will rollback. Grab all tags with the same name, then delete any extra entries. If you want the remaining tag entry to be lower case then you can do

tag = tags.first
tag.name = tag.name.downcase
tag.save!
Community
  • 1
  • 1
skukx
  • 598
  • 3
  • 15
0

I'm not super good at SQL, but I researched this a bit and found out that using the COLLATE clause can be used to make string operations case sensitive in SQL. (typically select distinct operations are case insensitive.)

so maybe you could try:

select distinct (name) COLLATE sql_latin1_general_cp1_cs_as
    FROM (
    ... blah blah blah

Here is some documentation on collate: http://dev.mysql.com/doc/refman/5.0/en/charset-collate.html

(assuming you're using mysql I guess)

Alternatively you could also reconfigure your database to be case sensitive via collate also. Then your current query might work unaltered

(assuming you have administrative permissions and ability to reconfigure)

diego
  • 123
  • 2
  • 13
0

You should use upper() or lower() functions to convert the names all to lower or upper case.

SELECT DISTINCT upper(name)

Or:

SELECT DISTINCT lower(name)

Source: http://www.postgresql.org/docs/9.1/static/functions-string.html

Another option (better for maintainability of code) is to use the CITEXT type, but to do this you have to modify your table structure: http://www.postgresql.org/docs/9.1/static/citext.html

mastazi
  • 1,279
  • 21
  • 34