-
What is this “collation” stuff anyway?
As documented under Character Sets and Collations in General:
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let’s make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters: “A”, “B”, “a”, “b”. We give each letter a number: “A” = 0, “B” = 1, “a” = 2, “b” = 3. The letter “A” is a symbol, the number 0 is the encoding for “A”, and the combination of all four letters and their encodings is a character set.
Suppose that we want to compare two string values, “A” and “B”. The simplest way to do this is to look at the encodings: 0 for “A” and 1 for “B”. Because 0 is less than 1, we say “A” is less than “B”. What we’ve just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters “a” and “b” as equivalent to “A” and “B”; (2) then compare the encodings. We call this a case-insensitive collation. It is a little more complex than a binary collation.
In real life, most character sets have many characters: not just “A” and “B” but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German “Ö”), and for multiple-character mappings (such as the rule that “Ö” = “OE” in one of the two German collations).
Further examples are given under Examples of the Effect of Collation.
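To make the distinction concrete in MySQL itself, here is a minimal sketch of the same pair of strings compared under three collations. It assumes only the `_latin1` character set introducer and the stock `latin1_bin`, `latin1_general_ci` and `latin1_general_cs` collations; the column aliases are my own.

```sql
-- Compare 'a' with 'A' under three different latin1 collations.
-- The explicit COLLATE clause on the left operand decides the collation used.
SELECT
  _latin1'a' COLLATE latin1_bin        = _latin1'A' AS binary_match,           -- 0: the encodings differ
  _latin1'a' COLLATE latin1_general_ci = _latin1'A' AS case_insensitive_match, -- 1: lettercase is ignored
  _latin1'a' COLLATE latin1_general_cs = _latin1'A' AS case_sensitive_match;   -- 0: lettercase matters
```

The strings never change; only the rules used to compare them do.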
-
Okay, but how does MySQL decide which collation to use for a given expression?
As documented under Collation of Expressions:
In the great majority of statements, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases, it should be clear that the collation is the collation of column x:
    SELECT x FROM T ORDER BY x;
    SELECT x FROM T WHERE x = x;
    SELECT DISTINCT x FROM T;
However, with multiple operands, there can be ambiguity. For example:
    SELECT x FROM T WHERE x = 'Y';
Should the comparison use the collation of the column x, or of the string literal 'Y'? Both x and 'Y' have collations, so which collation takes precedence?
Standard SQL resolves such questions using what used to be called “coercibility” rules.
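One way to see what each operand brings to such a comparison is to ask MySQL directly: the built-in `COLLATION()` and `COERCIBILITY()` functions report the collation and the coercibility value of any expression. A minimal sketch, assuming a hypothetical table `T` whose column definition is my own invention:

```sql
-- Hypothetical table matching the manual's example; the column type,
-- character set and collation are assumptions made for illustration.
CREATE TABLE T (x VARCHAR(10) CHARACTER SET latin1 COLLATE latin1_general_cs);
INSERT INTO T VALUES ('Y');

-- Inspect both operands of the ambiguous comparison x = 'Y'.
SELECT COLLATION(x)      AS column_collation,     -- from the column metadata
       COERCIBILITY(x)   AS column_coercibility,
       COLLATION('Y')    AS literal_collation,    -- from collation_connection
       COERCIBILITY('Y') AS literal_coercibility
FROM T;
```

Comparing the two coercibility values shows which operand's collation MySQL will prefer.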
[ deletia ]
MySQL uses coercibility values with the following rules to resolve ambiguities:
-
Use the collation with the lowest coercibility value.
-
If both sides have the same coercibility, then:
-
If both sides are Unicode, or both sides are not Unicode, it is an error.
-
If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement does not return an error:
    SELECT CONCAT(utf8_column, latin1_column) FROM t1;
It returns a result that has a character set of utf8 and the same collation as utf8_column. Values of latin1_column are automatically converted to utf8 before concatenating.
-
For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, except that it is for collations rather than data types.
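The two non-error cases in the quotation can be tried out with a sketch along these lines. The table `t1` follows the manual's example, but its column types are assumptions, and `t2` with its columns is entirely hypothetical. Recent MySQL versions treat `utf8` as an alias for `utf8mb3` and may report that name instead.

```sql
-- Equal coercibility, one Unicode side and one non-Unicode side.
CREATE TABLE t1 (
  utf8_column   VARCHAR(20) CHARACTER SET utf8,
  latin1_column VARCHAR(20) CHARACTER SET latin1
);
INSERT INTO t1 VALUES ('abc', 'def');

-- No error: the result takes the character set and collation of the Unicode side.
SELECT CONCAT(utf8_column, latin1_column)            AS joined,
       CHARSET(CONCAT(utf8_column, latin1_column))   AS result_charset,
       COLLATION(CONCAT(utf8_column, latin1_column)) AS result_collation
FROM t1;

-- Same character set, but one _bin collation and one _ci collation.
CREATE TABLE t2 (
  bin_col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_bin,
  ci_col  VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_ci
);
INSERT INTO t2 VALUES ('a', 'A');

-- No error: the _bin collation applies, so 'a' and 'A' are compared by encoding.
SELECT bin_col = ci_col AS same_under_bin FROM t2;
```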
-
So what is an “illegal mix of collations”?
An “illegal mix of collations” occurs when an expression compares two strings of different collations but of equal coercibility and the coercibility rules cannot help to resolve the conflict. It is the situation described under the third bullet-point in the above quotation.
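A minimal sketch that provokes the conflict, assuming a hypothetical table `collation_demo` with two `latin1` columns that differ only in collation:

```sql
-- Hypothetical table: same character set, different collations.
CREATE TABLE collation_demo (
  cs_col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_cs,
  ci_col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_ci
);
INSERT INTO collation_demo VALUES ('abc', 'ABC');

-- Both operands are columns, so they have equal coercibility, and neither is
-- Unicode. MySQL cannot choose between the collations and rejects the statement
-- with an "Illegal mix of collations" error.
SELECT * FROM collation_demo WHERE cs_col = ci_col;
```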
The particular error given in the question, Illegal mix of collations (latin1_general_cs,IMPLICIT) and (latin1_general_ci,IMPLICIT) for operation '=', tells us that there was an equality comparison between two non-Unicode strings of equal coercibility. It furthermore tells us that the collations were not given explicitly in the statement but rather were implied from the strings’ sources (such as column metadata).
-
That’s all very well, but how does one resolve such errors?
As the manual extracts quoted above suggest, this problem can be resolved in a number of ways, of which two are sensible and to be recommended:
-
Change the collation of one (or both) of the strings so that they match and there is no longer any ambiguity.
How this can be done depends on where the string has come from: literal expressions take the collation specified in the collation_connection system variable; values from tables take the collation specified in their column metadata.
-
Force one string to not be coercible.
I omitted the following quote from the above:
MySQL assigns coercibility values as follows:
-
An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)
-
The concatenation of two strings with different collations has a coercibility of 1.
-
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
-
A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
-
The collation of a literal has a coercibility of 4.
-
NULL or an expression that is derived from NULL has a coercibility of 5.
Thus simply adding a COLLATE clause to one of the strings used in the comparison will force use of that collation.
-
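Both recommended approaches can be sketched as follows. The table `collation_fix_demo` and its columns are hypothetical, and the choice of `latin1_general_ci` as the target collation is only an assumption; pick whichever collation actually suits the data.

```sql
-- Hypothetical table with two latin1 columns whose collations conflict.
CREATE TABLE collation_fix_demo (
  cs_col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_cs,
  ci_col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_ci
);
INSERT INTO collation_fix_demo VALUES ('abc', 'ABC');

-- Force one string to not be coercible: the explicit COLLATE clause has
-- coercibility 0, so its collation is used for this comparison only.
SELECT * FROM collation_fix_demo
WHERE cs_col COLLATE latin1_general_ci = ci_col;

-- Change the collation at its source so that both strings match.
-- For a column, alter the metadata. MODIFY restates the whole column
-- definition, so repeat any NULL/DEFAULT attributes the real column has.
ALTER TABLE collation_fix_demo
  MODIFY cs_col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_ci;
SELECT * FROM collation_fix_demo WHERE cs_col = ci_col;

-- For literals, change the collation of the connection instead.
SET collation_connection = 'latin1_general_ci';
```

The `COLLATE` clause is the less invasive of the two, since it leaves stored data and other queries untouched; altering the column is the better choice when the mismatch was never intended in the first place.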
Whilst the others would be terribly bad practice if they were deployed merely to resolve this error:
-
Force one (or both) of the strings to have some other coercibility value so that one takes precedence.
Use of CONCAT() or CONCAT_WS() would result in a string with a coercibility of 1; and (if in a stored routine) use of parameters/local variables would result in strings with a coercibility of 2.
-
Change the encodings of one (or both) of the strings so that one is Unicode and the other is not.
This could be done via transcoding with CONVERT(expr USING transcoding_name); or via changing the underlying character set of the data (e.g. modifying the column, changing character_set_connection for literal values, or sending them from the client in a different encoding and changing character_set_client / adding a character set introducer). Note that changing encoding will lead to other problems if some desired characters cannot be encoded in the new character set.
-
Change the encodings of one (or both) of the strings so that they are both the same and change one string to use the relevant _bin collation.
Methods for changing encodings and collations have been detailed above. This approach would be of little use if one actually needs to apply more advanced collation rules than are offered by the _bin collation.
-
Note that “illegal mix of collations” can also arise when there is no ambiguity over which collation should be used, but the string that is to be coerced must be transcoded to an encoding in which some of its characters cannot be represented. I have discussed this case in a previous answer. – eggyal Jan 11 ’14 at 11:06
-
Great answer. This one should be further up, because it dives into what developers should really know; not just how to fix it, but really understand why things are happening the way they're happening. – mark Apr 9 ’14 at 9:26
-
Thanks dude, you taught me something today. – briankip Jan 6 ’15 at 15:30