
I have a table with one column whose character set is cp1252 and whose collation is latin1_swedish_ci, and I need to change it to utf8_general_ci.

I'd like to check whether the conversion will leave weird characters in any of the rows.

This column stores domain names, and I'm unsure whether or not I have Swedish characters in any of the rows.

I've been researching this, but I haven't been able to find a way to check the data's integrity before changing the collation.

My best guess so far is to write a script that flags any row containing characters outside the English alphabet, but I'm pretty sure there's a better way to do this.
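One way to run that kind of check directly in SQL, as a sketch assuming a hypothetical `domains` table with an `id` primary key and a `name` column, is to flag every row containing anything outside the characters normally allowed in a domain name:

```sql
-- Hypothetical table/column names; adjust to your schema.
-- Flags rows whose value contains anything beyond unaccented
-- letters, digits, dots, and hyphens.
SELECT id, name
FROM domains
WHERE name NOT RLIKE '^[a-zA-Z0-9.-]+$';
```

The rows returned here are the ones worth inspecting by hand before changing the collation.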

Any help would be great!

UPDATE

I've found multiple rows with garbage like this:

ÜZìp;ìê+ØeÞ{/e¼ðP;

Is there a way to get rid of that junk without examining it row by row?
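If the goal is just to find (or remove) values containing non-ASCII bytes in bulk, one possible approach is a round trip through the `ascii` character set: unconvertible characters are replaced with `?`, so mismatching rows stand out. Table and column names below are placeholders:

```sql
-- List rows that contain non-ASCII characters (placeholder names).
SELECT id, name
FROM domains
WHERE name <> CONVERT(name USING ascii);

-- After reviewing the list, the same predicate can delete them:
-- DELETE FROM domains WHERE name <> CONVERT(name USING ascii);
```

Review the SELECT output before running the DELETE, since any legitimate non-ASCII value would be caught by the same predicate.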

ILikeTacos

1 Answer


The canonical way for this is to try it out:

  • Use SHOW CREATE TABLE to create an identically-structured testing table
  • Use INSERT INTO .. SELECT .. to populate the testing table with the primary key and relevant column(s) of the original
  • Try the conversion, noting the steps needed to fix any problems
  • Rinse and repeat
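A sketch of the workflow above, using placeholder table and column names (and an assumed `VARCHAR(255)` column type — restate your actual type in the `MODIFY`):

```sql
-- 1. Inspect the original definition (shows charset/collation).
SHOW CREATE TABLE domains;

-- 2. Create an identically-structured scratch table.
CREATE TABLE domains_test LIKE domains;

-- 3. Copy the primary key and the relevant column
--    (assumes the remaining columns have defaults).
INSERT INTO domains_test (id, name)
SELECT id, name FROM domains;

-- 4. Try the conversion on the scratch copy.
ALTER TABLE domains_test
  MODIFY name VARCHAR(255)
  CHARACTER SET utf8 COLLATE utf8_general_ci;

-- 5. Compare results against the original, fix, and repeat.
```

Working on the scratch table means a failed or lossy conversion costs nothing; only once the output looks right do you run the same `ALTER` on the real table.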
Eugen Rieck
  • Awesome, this is what I did in the beginning, but since I have more than 50K domain names, I wanted to make sure that I wasn't missing something. – ILikeTacos Jun 14 '13 at 13:44
  • I just edited my question, would you mind taking a look at it again? – ILikeTacos Jun 14 '13 at 14:26
  • If these are domain names, you might want to use a regex (see http://stackoverflow.com/questions/10306690/domain-name-validation-with-regex) and run `SELECT * FROM table WHERE domain NOT RLIKE regex` – Eugen Rieck Jun 14 '13 at 14:54
  • That's what I ultimately did, basically flagging all the domains that did not match my regex. – ILikeTacos Jun 14 '13 at 15:23