GLib Reference Manual
This section describes a number of functions for dealing with Unicode characters and strings. There are analogues of the traditional ctype.h character classification and case conversion functions, UTF-8 analogues of some string utility functions, functions to perform normalization, case conversion and collation on UTF-8 strings and finally functions to convert between the UTF-8, UTF-16 and UCS-4 encodings of Unicode.
gboolean g_unichar_validate (gunichar ch); |
Checks whether ch is a valid Unicode character. Some possible integer values of ch will not be valid. 0 is considered a valid character, though it's normally a string terminator.
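As a rough illustration of why some integer values are not valid characters, the following sketch rejects values beyond the Unicode code space and the UTF-16 surrogate range. The helper name sketch_unichar_is_valid() is hypothetical and is not part of GLib; the exact rules applied by g_unichar_validate() may be stricter or may differ between GLib versions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of GLib. Assumption: a value counts as
 * valid when it lies in the Unicode code space (0 .. 0x10FFFF) and is
 * not a UTF-16 surrogate (0xD800 .. 0xDFFF). */
static bool sketch_unichar_is_valid(uint32_t ch)
{
    if (ch > 0x10FFFF)
        return false;   /* beyond the Unicode code space */
    if (ch >= 0xD800 && ch <= 0xDFFF)
        return false;   /* surrogates are not characters on their own */
    return true;        /* note that 0 is accepted */
}
```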
gboolean g_unichar_isalnum (gunichar c); |
Determines whether a character is alphanumeric. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_isalpha (gunichar c); |
Determines whether a character is alphabetic (i.e. a letter). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_iscntrl (gunichar c); |
Determines whether a character is a control character. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_isdigit (gunichar c); |
Determines whether a character is numeric (i.e. a digit). This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_isgraph (gunichar c); |
Determines whether a character is printable and not a space (returns FALSE for control characters, format characters, and spaces). g_unichar_isprint() is similar, but returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_islower (gunichar c); |
Determines whether a character is a lowercase letter. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_isprint (gunichar c); |
Determines whether a character is printable. Unlike g_unichar_isgraph(), returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_ispunct (gunichar c); |
Determines whether a character is punctuation or a symbol. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
gboolean g_unichar_isspace (gunichar c); |
Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
(Note: don't use this for word breaking. The algorithm is fairly complex, so you have to use Pango or an equivalent to get word breaking right.)
gboolean g_unichar_isupper (gunichar c); |
Determines if a character is uppercase.
gboolean g_unichar_isxdigit (gunichar c); |
Determines if a character is a hexadecimal digit.
gboolean g_unichar_istitle (gunichar c); |
Determines if a character is titlecase. Some characters in Unicode which are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
gboolean g_unichar_isdefined (gunichar c); |
Determines if a given character is assigned in the Unicode standard.
gboolean g_unichar_iswide (gunichar c); |
Determines if a character is typically rendered in a double-width cell.
gint g_unichar_digit_value (gunichar c); |
Determines the numeric value of a character as a decimal digit.
c : | a Unicode character |
Returns : | If c is a decimal digit (according to g_unichar_isdigit()), its numeric value. Otherwise, -1. |
gint g_unichar_xdigit_value (gunichar c); |
Determines the numeric value of a character as a hexadecimal digit.
c : | a Unicode character |
Returns : | If c is a hex digit (according to g_unichar_isxdigit()), its numeric value. Otherwise, -1. |
typedef enum
{
  G_UNICODE_CONTROL,
  G_UNICODE_FORMAT,
  G_UNICODE_UNASSIGNED,
  G_UNICODE_PRIVATE_USE,
  G_UNICODE_SURROGATE,
  G_UNICODE_LOWERCASE_LETTER,
  G_UNICODE_MODIFIER_LETTER,
  G_UNICODE_OTHER_LETTER,
  G_UNICODE_TITLECASE_LETTER,
  G_UNICODE_UPPERCASE_LETTER,
  G_UNICODE_COMBINING_MARK,
  G_UNICODE_ENCLOSING_MARK,
  G_UNICODE_NON_SPACING_MARK,
  G_UNICODE_DECIMAL_NUMBER,
  G_UNICODE_LETTER_NUMBER,
  G_UNICODE_OTHER_NUMBER,
  G_UNICODE_CONNECT_PUNCTUATION,
  G_UNICODE_DASH_PUNCTUATION,
  G_UNICODE_CLOSE_PUNCTUATION,
  G_UNICODE_FINAL_PUNCTUATION,
  G_UNICODE_INITIAL_PUNCTUATION,
  G_UNICODE_OTHER_PUNCTUATION,
  G_UNICODE_OPEN_PUNCTUATION,
  G_UNICODE_CURRENCY_SYMBOL,
  G_UNICODE_MODIFIER_SYMBOL,
  G_UNICODE_MATH_SYMBOL,
  G_UNICODE_OTHER_SYMBOL,
  G_UNICODE_LINE_SEPARATOR,
  G_UNICODE_PARAGRAPH_SEPARATOR,
  G_UNICODE_SPACE_SEPARATOR
} GUnicodeType;
These are the possible character classifications. See http://www.unicode.org/Public/UNIDATA/UnicodeData.html.
typedef enum
{
  G_UNICODE_BREAK_MANDATORY,
  G_UNICODE_BREAK_CARRIAGE_RETURN,
  G_UNICODE_BREAK_LINE_FEED,
  G_UNICODE_BREAK_COMBINING_MARK,
  G_UNICODE_BREAK_SURROGATE,
  G_UNICODE_BREAK_ZERO_WIDTH_SPACE,
  G_UNICODE_BREAK_INSEPARABLE,
  G_UNICODE_BREAK_NON_BREAKING_GLUE,
  G_UNICODE_BREAK_CONTINGENT,
  G_UNICODE_BREAK_SPACE,
  G_UNICODE_BREAK_AFTER,
  G_UNICODE_BREAK_BEFORE,
  G_UNICODE_BREAK_BEFORE_AND_AFTER,
  G_UNICODE_BREAK_HYPHEN,
  G_UNICODE_BREAK_NON_STARTER,
  G_UNICODE_BREAK_OPEN_PUNCTUATION,
  G_UNICODE_BREAK_CLOSE_PUNCTUATION,
  G_UNICODE_BREAK_QUOTATION,
  G_UNICODE_BREAK_EXCLAMATION,
  G_UNICODE_BREAK_IDEOGRAPHIC,
  G_UNICODE_BREAK_NUMERIC,
  G_UNICODE_BREAK_INFIX_SEPARATOR,
  G_UNICODE_BREAK_SYMBOL,
  G_UNICODE_BREAK_ALPHABETIC,
  G_UNICODE_BREAK_PREFIX,
  G_UNICODE_BREAK_POSTFIX,
  G_UNICODE_BREAK_COMPLEX_CONTEXT,
  G_UNICODE_BREAK_AMBIGUOUS,
  G_UNICODE_BREAK_UNKNOWN
} GUnicodeBreakType;
These are the possible line break classifications. See http://www.unicode.org/unicode/reports/tr14/.
GUnicodeBreakType g_unichar_break_type (gunichar c); |
Determines the break type of c. c should be a Unicode character (to derive a character from UTF-8 encoded text, use g_utf8_get_char()). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, so normally you would use a function such as pango_break() instead of handling break types yourself.
void g_unicode_canonical_ordering (gunichar *string, gsize len); |
Computes the canonical ordering of a string in-place. This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information.
gunichar* g_unicode_canonical_decomposition (gunichar ch, gsize *result_len); |
Computes the canonical decomposition of a Unicode character.
#define g_utf8_next_char(p) |
Skips to the next character in a UTF-8 string. The string must be valid; this macro is as fast as possible, and has no error-checking. You would use this macro to iterate over a string character by character. The macro returns the start of the next UTF-8 character. Before using this macro, use g_utf8_validate() to validate strings that may contain invalid UTF-8.
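The idea behind skipping to the next character can be sketched in plain C: the lead byte of a UTF-8 sequence determines how many bytes the character occupies. The helpers below (sketch_utf8_seq_len(), sketch_utf8_next(), sketch_count_chars()) are hypothetical names used for illustration only; they are not how g_utf8_next_char() is necessarily implemented, and like the macro they assume the input is already valid.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with lead byte b. Assumes valid UTF-8 of at most 4 bytes. */
static size_t sketch_utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;  /* not a lead byte: advance one byte, no error reported */
}

/* Returns the start of the character following the one at p. */
static const char *sketch_utf8_next(const char *p)
{
    return p + sketch_utf8_seq_len((unsigned char)*p);
}

/* Iterates over a nul-terminated string character by character. */
static size_t sketch_count_chars(const char *s)
{
    size_t n = 0;
    for (const char *p = s; *p != '\0'; p = sketch_utf8_next(p))
        n++;
    return n;
}
```

Because nothing is checked, a truncated sequence at the end of the string would make this loop step past the terminator, which is why validating untrusted input first matters.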
gunichar g_utf8_get_char (const gchar *p); |
Converts a sequence of bytes encoded as UTF-8 to a Unicode character. If p does not point to a valid UTF-8 encoded character, results are undefined. If you are not sure that the bytes are complete valid Unicode characters, you should use g_utf8_get_char_validated() instead.
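To make the conversion concrete, here is a minimal sketch of decoding one UTF-8 sequence into a code point. sketch_utf8_decode() is a hypothetical helper, not GLib's implementation. Like g_utf8_get_char(), it assumes the bytes at p form a complete, valid character and does no checking.

```c
#include <stdint.h>

/* Hypothetical helper: decodes the UTF-8 sequence starting at p.
 * Assumes the sequence is valid and complete (1 to 4 bytes). */
static uint32_t sketch_utf8_decode(const char *p)
{
    const unsigned char *s = (const unsigned char *)p;

    if (s[0] < 0x80)                       /* 1 byte: 7 payload bits */
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)             /* 2 bytes: 5 + 6 bits */
        return ((uint32_t)(s[0] & 0x1F) << 6)
             |  (uint32_t)(s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)             /* 3 bytes: 4 + 6 + 6 bits */
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);
    /* 4 bytes: 3 + 6 + 6 + 6 bits */
    return ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         |  (uint32_t)(s[3] & 0x3F);
}
```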
gunichar g_utf8_get_char_validated (const gchar *p, gssize max_len); |
Convert a sequence of bytes encoded as UTF-8 to a Unicode character. This function checks for incomplete characters, for invalid characters such as characters that are out of the range of Unicode, and for overlong encodings of valid characters.
Return value: the resulting character. If p points to a partial sequence at the end of a string that could begin a valid character,
gchar* g_utf8_offset_to_pointer (const gchar *str, glong offset); |
Converts from an integer character offset to a pointer to a position within the string.
glong g_utf8_pointer_to_offset (const gchar *str, const gchar *pos); |
Converts from a pointer to a position within a string to an integer character offset.
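The relationship between character offsets and byte pointers can be sketched as follows. The helpers are hypothetical and only illustrate the idea that a character starts at every byte that is not a continuation byte (10xxxxxx). The sketch assumes valid UTF-8 and handles only non-negative offsets; it is not GLib's implementation.

```c
#include <stddef.h>

/* A byte starts a character unless it is a continuation byte. */
static int sketch_is_char_start(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Hypothetical helper: walks forward `offset` characters from str,
 * stopping early at the terminating nul. */
static const char *sketch_offset_to_pointer(const char *str, long offset)
{
    const char *p = str;
    while (offset > 0 && *p != '\0') {
        p++;                                  /* step past the lead byte */
        while (*p != '\0' && !sketch_is_char_start((unsigned char)*p))
            p++;                              /* skip continuation bytes */
        offset--;
    }
    return p;
}

/* Hypothetical helper: counts the characters that start in [str, pos). */
static long sketch_pointer_to_offset(const char *str, const char *pos)
{
    long n = 0;
    for (const char *p = str; p < pos; p++)
        if (sketch_is_char_start((unsigned char)*p))
            n++;
    return n;
}
```

Both directions take time proportional to the distance walked, so converting back and forth inside a loop over a long string is best avoided.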
gchar* g_utf8_prev_char (const gchar *p); |
Finds the previous UTF-8 character in the string before p.
p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte. If p might be the first character of the string, you must use g_utf8_find_prev_char() instead.
gchar* g_utf8_find_next_char (const gchar *p, const gchar *end); |
Finds the start of the next UTF-8 character in the string after p.
p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte.
gchar* g_utf8_find_prev_char (const gchar *str, const gchar *p); |
Given a position p within a UTF-8 encoded string str, finds the start of the previous UTF-8 character starting before p. Returns NULL if no UTF-8 characters are present in str before p.
p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte.
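The backward search described for g_utf8_find_prev_char() can be sketched as stepping back one byte at a time until a byte that is not a continuation byte is found, never moving before str. sketch_find_prev_char() is a hypothetical helper for illustration, not GLib's code. As in the description above, the only check is that the byte found looks like the start of a character.

```c
#include <stddef.h>

/* Hypothetical helper: returns the start of the character that begins
 * before p, or NULL if no character starts in [str, p). */
static const char *sketch_find_prev_char(const char *str, const char *p)
{
    while (p > str) {
        p--;
        if (((unsigned char)*p & 0xC0) != 0x80)
            return p;   /* ASCII byte or lead byte of a sequence */
    }
    return NULL;        /* reached the start of the string */
}
```

The bound on str is what makes this safe to call when p may be at the start of the string.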
glong g_utf8_strlen (const gchar *p, gssize max); |
Returns the length of the string in characters.
gchar* g_utf8_strncpy (gchar *dest, const gchar *src, gsize n); |
Like the standard C strncpy() function, but copies a given number of characters instead of a given number of bytes. The src string must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)
gchar* g_utf8_strchr (const gchar *p, gssize len, gunichar c); |
Finds the leftmost occurrence of the given ISO10646 character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
gchar* g_utf8_strrchr (const gchar *p, gssize len, gunichar c); |
Finds the rightmost occurrence of the given ISO10646 character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
gboolean g_utf8_validate (const gchar *str, gssize max_len, const gchar **end); |
Validates UTF-8 encoded text. str is the text to validate; if str is nul-terminated, then max_len can be -1, otherwise max_len should be the number of bytes to validate. If end is non-NULL, then the end of the valid range will be stored there (i.e. the address of the first invalid byte if some bytes were invalid, or the end of the text being validated otherwise).
Returns TRUE if all of str was valid. Many GLib and GTK+ routines require valid UTF-8 as input, so data read from a file or the network should be checked with g_utf8_validate() before doing anything else with it.
gchar* g_utf8_strup (const gchar *str, gssize len); |
Converts all Unicode characters in the string that have a case to uppercase. The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)
gchar* g_utf8_strdown (const gchar *str, gssize len); |
Converts all Unicode characters in the string that have a case to lowercase. The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string changing.
gchar* g_utf8_casefold (const gchar *str, gssize len); |
Converts a string into a form that is independent of case. The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g_utf8_casefold() on other strings.
Note that calling g_utf8_casefold() followed by g_utf8_collate() is only an approximation to the correct linguistic case insensitive ordering, though it is a fairly good one. Getting this exactly right would require a more sophisticated collation function that takes case sensitivity into account. GLib does not currently provide such a function.
gchar* g_utf8_normalize (const gchar *str, gssize len, GNormalizeMode mode); |
Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. You should generally call g_utf8_normalize() before comparing two Unicode strings.
The normalization mode G_NORMALIZE_DEFAULT only standardizes differences that do not affect the text content, such as the above-mentioned accent representation. G_NORMALIZE_ALL also standardizes the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. For example, g_utf8_collate() normalizes with G_NORMALIZE_ALL as its first step.
G_NORMALIZE_DEFAULT_COMPOSE and G_NORMALIZE_ALL_COMPOSE are like G_NORMALIZE_DEFAULT and G_NORMALIZE_ALL, but return a result with composed forms rather than a maximally decomposed form. This is often useful if you intend to convert the string to a legacy encoding or pass it to a system with less capable Unicode handling.
typedef enum
{
  G_NORMALIZE_DEFAULT,
  G_NORMALIZE_NFD = G_NORMALIZE_DEFAULT,
  G_NORMALIZE_DEFAULT_COMPOSE,
  G_NORMALIZE_NFC = G_NORMALIZE_DEFAULT_COMPOSE,
  G_NORMALIZE_ALL,
  G_NORMALIZE_NFKD = G_NORMALIZE_ALL,
  G_NORMALIZE_ALL_COMPOSE,
  G_NORMALIZE_NFKC = G_NORMALIZE_ALL_COMPOSE
} GNormalizeMode;
Defines how a Unicode string is transformed into a canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. Unicode strings should generally be normalized before comparing them.
G_NORMALIZE_DEFAULT | standardize differences that do not affect the text content, such as the above-mentioned accent representation. |
G_NORMALIZE_NFD | another name for G_NORMALIZE_DEFAULT. |
G_NORMALIZE_DEFAULT_COMPOSE | like G_NORMALIZE_DEFAULT, but with composed forms rather than a maximally decomposed form. |
G_NORMALIZE_NFC | another name for G_NORMALIZE_DEFAULT_COMPOSE. |
G_NORMALIZE_ALL | beyond G_NORMALIZE_DEFAULT also standardize the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. |
G_NORMALIZE_NFKD | another name for G_NORMALIZE_ALL. |
G_NORMALIZE_ALL_COMPOSE | like G_NORMALIZE_ALL, but with composed forms rather than a maximally decomposed form. |
G_NORMALIZE_NFKC | another name for G_NORMALIZE_ALL_COMPOSE. |
gint g_utf8_collate (const gchar *str1, const gchar *str2); |
Compares two strings for ordering using the linguistically correct rules for the current locale. When sorting a large number of strings, it will be significantly faster to obtain collation keys with g_utf8_collate_key() and compare the keys with strcmp() than to compare the original strings with g_utf8_collate().
gchar* g_utf8_collate_key (const gchar *str, gssize len); |
Converts a string into a collation key that can be compared with other collation keys using strcmp(). The results of comparing the collation keys of two strings with strcmp() will always be the same as comparing the two original strings with g_utf8_collate().
str : | a UTF-8 encoded string. |
len : | length of str, in bytes, or -1 if str is nul-terminated. |
Returns : | a newly allocated string. This string should be freed with g_free() when you are done with it. |
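The "compute a key once, then compare keys with strcmp()" pattern can be sketched with the standard C analogue strxfrm(), which plays a role similar to g_utf8_collate_key() for the C library's locale collation. This is an illustration under that assumption, not GLib code. The names sketch_entry, sketch_make_sort_key() and sketch_sort_by_key() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record pairing a string with its precomputed key. */
struct sketch_entry {
    const char *text;
    char       *key;
};

/* Returns a newly allocated strxfrm() key, or NULL if allocation fails. */
static char *sketch_make_sort_key(const char *s)
{
    size_t need = strxfrm(NULL, s, 0) + 1;   /* key length plus nul */
    char *key = malloc(need);
    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}

/* qsort() callback: keys are compared bytewise, with no locale work. */
static int sketch_compare_entries(const void *a, const void *b)
{
    const struct sketch_entry *ea = a;
    const struct sketch_entry *eb = b;
    return strcmp(ea->key, eb->key);
}

/* Sorts n entries by collation key. Returns 0 on success, or -1 if a
 * key could not be allocated (the order is then left unchanged). */
static int sketch_sort_by_key(struct sketch_entry *entries, size_t n)
{
    int status = 0;
    size_t i;

    for (i = 0; i < n; i++)
        entries[i].key = NULL;
    for (i = 0; i < n; i++) {
        entries[i].key = sketch_make_sort_key(entries[i].text);
        if (entries[i].key == NULL) {
            status = -1;
            break;
        }
    }
    if (status == 0)
        qsort(entries, n, sizeof entries[0], sketch_compare_entries);
    for (i = 0; i < n; i++) {       /* keys are only needed while sorting */
        free(entries[i].key);
        entries[i].key = NULL;
    }
    return status;
}
```

The expensive locale-aware transformation runs once per string, while the comparison function, which a sort calls many more times, does only a plain byte comparison.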
gunichar2* g_utf8_to_utf16 (const gchar *str, glong len, glong *items_read, glong *items_written, GError **error); |
Convert a string from UTF-8 to UTF-16. A 0 word will be added to the result after the converted text.
str : | a UTF-8 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is nul-terminated. |
items_read : | location to store number of bytes read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here. |
items_written : | location to store number of words written, or NULL. The value stored here does not include the trailing 0 word. |
error : | location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. |
Returns : | a pointer to a newly allocated UTF-16 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. |
gunichar* g_utf8_to_ucs4 (const gchar *str, glong len, glong *items_read, glong *items_written, GError **error); |
Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4. A trailing 0 will be added to the string after the converted text.
str : | a UTF-8 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is nul-terminated. |
items_read : | location to store number of bytes read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here. |
items_written : | location to store number of characters written, or NULL. The value stored here does not include the trailing 0 character. |
error : | location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. |
Returns : | a pointer to a newly allocated UCS-4 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. |
gunichar* g_utf8_to_ucs4_fast (const gchar *str, glong len, glong *items_written); |
Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input. This function is roughly twice as fast as g_utf8_to_ucs4() but does no error checking on the input.
str : | a UTF-8 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is nul-terminated. |
items_written : | location to store the number of characters in the result, or NULL. |
Returns : | a pointer to a newly allocated UCS-4 string. This value must be freed with g_free(). |
gunichar* g_utf16_to_ucs4 (const gunichar2 *str, glong len, glong *items_read, glong *items_written, GError **error); |
Convert a string from UTF-16 to UCS-4. The result will be terminated with a 0 character.
str : | a UTF-16 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is terminated with a 0 character. |
items_read : | location to store number of words read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here. |
items_written : | location to store number of characters written, or NULL. The value stored here does not include the trailing 0 character. |
error : | location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. |
Returns : | a pointer to a newly allocated UCS-4 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. |
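The core step of a UTF-16 to UCS-4 conversion is combining a surrogate pair into one code point. The sketch below decodes a single character and shows where a "trailing partial character" comes from: a high surrogate with nothing after it. sketch_utf16_decode() and its return convention are hypothetical and do not reflect GLib's implementation or its error reporting.

```c
#include <stdint.h>

/* Hypothetical helper: decodes one character from s[0 .. len).
 * Returns the number of 16-bit words consumed (1 or 2), or 0 if the
 * input is empty, is a lone low surrogate, or is a high surrogate that
 * is not followed by a low surrogate within len words. */
static int sketch_utf16_decode(const uint16_t *s, long len, uint32_t *out)
{
    if (len < 1)
        return 0;

    if (s[0] < 0xD800 || s[0] > 0xDFFF) {
        *out = s[0];                 /* ordinary character, one word */
        return 1;
    }

    if (s[0] <= 0xDBFF && len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* high surrogate carries the top 10 bits, low the bottom 10 */
        *out = 0x10000u
             + (((uint32_t)s[0] - 0xD800u) << 10)
             +  ((uint32_t)s[1] - 0xDC00u);
        return 2;
    }

    return 0;                        /* partial or invalid sequence */
}
```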
gchar* g_utf16_to_utf8 (const gunichar2 *str, glong len, glong *items_read, glong *items_written, GError **error); |
Convert a string from UTF-16 to UTF-8. The result will be terminated with a 0 byte.
str : | a UTF-16 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is terminated with a 0 character. |
items_read : | location to store number of words read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here. |
items_written : | location to store number of bytes written, or NULL. The value stored here does not include the trailing 0 byte. |
error : | location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. |
Returns : | a pointer to a newly allocated UTF-8 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. |
gunichar2* g_ucs4_to_utf16 (const gunichar *str, glong len, glong *items_read, glong *items_written, GError **error); |
Convert a string from UCS-4 to UTF-16. A 0 word will be added to the result after the converted text.
str : | a UCS-4 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is terminated with a 0 character. |
items_read : | location to store number of characters read, or NULL. If an error occurs then the index of the invalid input is stored here. |
items_written : | location to store number of words written, or NULL. The value stored here does not include the trailing 0 word. |
error : | location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. |
Returns : | a pointer to a newly allocated UTF-16 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. |
gchar* g_ucs4_to_utf8 (const gunichar *str, glong len, glong *items_read, glong *items_written, GError **error); |
Convert a string from a 32-bit fixed width representation as UCS-4 to UTF-8. The result will be terminated with a 0 byte.
str : | a UCS-4 encoded string |
len : | the maximum length of str to use. If len < 0, then the string is terminated with a 0 character. |
items_read : | location to store number of characters read, or NULL. |
items_written : | location to store number of bytes written, or NULL. The value stored here does not include the trailing 0 byte. |
error : | location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. |
Returns : | a pointer to a newly allocated UTF-8 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. |
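The per-character step of a UCS-4 to UTF-8 conversion can be sketched as follows: the code point's magnitude selects a sequence length of one to four bytes, and its bits are distributed across a lead byte and continuation bytes. sketch_utf8_encode() is a hypothetical helper for illustration, not GLib's implementation. It assumes the caller supplies a buffer of at least four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of c to out (which
 * must have room for 4 bytes). Returns the number of bytes written,
 * or 0 if c is a surrogate or lies beyond 0x10FFFF. */
static size_t sketch_utf8_encode(uint32_t c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                    /* surrogates cannot be encoded */
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                        /* outside the Unicode code space */
}
```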
Convenience functions for converting between UTF-8 and the locale encoding.