Strings and Runes in Go
January 27, 2021
Reading:
- https://blog.golang.org/slices
- https://blog.golang.org/strings
- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Of course nothing that seems simple is actually so.
In Go, a string is effectively a read-only slice of bytes ([]byte).
The bytes in that slice could be UTF-8 or something else entirely. There's no way to know for sure unless the string came from a Go string literal - then it's UTF-8, because all Go source code is encoded as UTF-8. This is where the remark "Go strings are UTF-8" comes from, which isn't strictly accurate.
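A quick sketch of that distinction, using utf8.ValidString from the standard unicode/utf8 package (the byte values below are just an arbitrary example of non-UTF-8 content):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    literal := "日本語"                             // comes from Go source, so guaranteed UTF-8
    arbitrary := string([]byte{0xff, 0xfe, 0xfd}) // a string can hold any bytes at all

    fmt.Println(utf8.ValidString(literal))   // true
    fmt.Println(utf8.ValidString(arbitrary)) // false
}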
There's a lot of ambiguity around the term "character" and its Unicode counterpart, "code point". So Go introduces its own term, "rune". Best I can tell, "rune" was judged a less confusing word than "code point", so that's what the Go authors went with.
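For what it's worth, a rune is just an alias for int32 holding the code point's number, which a one-line program makes visible:

package main

import "fmt"

func main() {
    r := '語'                          // a rune literal
    fmt.Printf("%T %d %c\n", r, r, r) // prints: int32 35486 語
}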
What's interesting is that the len() and range operators in Go treat the string differently unless you explicitly convert it to a specific slice type:
package main

import "fmt"

func main() {
    const nihongo = "日本語"
    // range decodes the string one rune at a time
    for index, runeValue := range nihongo {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }
    fmt.Printf("Len is: %d", len(nihongo))
}
Prints out:
U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
Len is: 9
This is cute: range walks the string rune by rune, while len counts bytes…
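Indexing the string directly lines up with len: nihongo[i] yields raw bytes, not runes. A minimal sketch of that byte-level view (same nihongo constant as above):

package main

import "fmt"

func main() {
    const nihongo = "日本語"
    // indexing yields individual bytes: the 9 bytes of the UTF-8 encoding
    for i := 0; i < len(nihongo); i++ {
        fmt.Printf("%x ", nihongo[i])
    }
    fmt.Println() // prints: e6 97 a5 e6 9c ac e8 aa 9e
}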
To get the "expected" result you need to convert your string to a []rune:
package main

import "fmt"

func main() {
    const nihongo = "日本語"
    for index, runeValue := range nihongo {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }
    fmt.Printf("len(nihongo) is: %d\n", len(nihongo))
    fmt.Printf("len([]rune(nihongo)) is: %d\n", len([]rune(nihongo)))
}
U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
len(nihongo) is: 9
len([]rune(nihongo)) is: 3
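As a side note, the standard library's unicode/utf8 package has utf8.RuneCountInString, which counts runes without building an intermediate []rune slice; a minimal sketch:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    const nihongo = "日本語"
    fmt.Println(utf8.RuneCountInString(nihongo)) // 3
}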
An aside on character encodings: ASCII, ANSI, and UTF-8.
ASCII - 7 bits, characters 0-127, the set our English-speaking Unix overlords intended.
ANSI - a loose formalization of clever things done with 128-255 (the extra space left in a single byte beyond 7-bit ASCII) to support characters in other languages.
Unicode - assigns every character a number known as a "code point". It was originally conceived as a 16-bit (2-byte) code, but it has long since grown beyond 65,536 code points.
UTF-8 is a scheme for storing Unicode code points, and it has a clever compatibility story with ASCII.
UTF-8 encodes a code point in one to four bytes (the original design allowed up to six). What's clever is that the single-byte encodings are exactly the ASCII characters, so plain ASCII text is already valid UTF-8 - ASCII is forward compatible with UTF-8.
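A small sketch of those variable widths, using utf8.RuneLen from the standard library to report how many bytes each code point needs (the sample characters are arbitrary):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // ASCII fits in 1 byte; higher code points need 2, 3, or 4
    for _, r := range []rune{'a', 'é', '語', '🎉'} {
        fmt.Printf("%#U -> %d byte(s)\n", r, utf8.RuneLen(r))
    }
}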