[Unicode 16.0.0] Determine which Unicode "General Categories" a string belongs to

Unicode::Categories

Returns a list of the General Categories that the characters of a Unicode string belong to.

Unicode version: 16.0.0 (September 2024)

Supported Rubies: 3.3, 3.2, 3.1, 3.0

Old Rubies that might still work: 2.7, 2.6, 2.5, 2.4, 2.3, 2.X

Gemfile

gem "unicode-categories"

Usage

require "unicode/categories"

# All general categories of a string
Unicode::Categories.categories("A 2") # => ["Lu", "Nd", "Zs"]
Unicode::Categories.categories("A 2", format: :long)
# => ["Decimal_Number", "Space_Separator", "Uppercase_Letter"]

# Also aliased as .of
Unicode::Categories.of("\u{10c50}") # => ["Cn"]

# Single character
Unicode::Categories.category("☼", format: :long) # => "Other_Symbol"

The list of categories is always sorted alphabetically.

Hints

Regex Matching

If you have a string and want to match a substring or character from a specific General Category, you do not need this gem. Instead, you can use the Regexp Unicode property syntax \p{}:

"Find decimal numbers (like 2 or 3) within a string".scan(/\p{Nd}+/) # => ["2", "3"]

See Idiosyncratic Ruby: Proper Unicoding for more info.
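The same \p{} syntax can also approximate a category lookup for a whole string without the gem. The following is only a minimal sketch: `SHORT_NAMES`, `CATEGORY_PATTERNS`, and `categories_via_regexp` are hypothetical names introduced for illustration and are not part of unicode-categories. It relies on the Unicode data bundled with the running Ruby, which may lag behind the Unicode version the gem ships, so results for recently assigned codepoints can differ.

```ruby
# Hypothetical helper for illustration; not part of the unicode-categories API.
# Two-letter General Category names, kept in alphabetical order.
SHORT_NAMES = %w[
  Cc Cf Cn Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No
  Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs
].freeze

# One precompiled pattern per category, e.g. "Lu" => /\p{Lu}/
CATEGORY_PATTERNS = SHORT_NAMES.to_h { |name|
  [name, Regexp.new("\\p{#{name}}")]
}.freeze

# Returns the short names of all categories that match at least one character.
def categories_via_regexp(string)
  CATEGORY_PATTERNS.select { |_name, pattern| string.match?(pattern) }.keys
end

categories_via_regexp("A 2") # => ["Lu", "Nd", "Zs"]
```

Because the patterns are checked in the order of `SHORT_NAMES`, the result comes out alphabetically sorted without an extra sort step.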

List of General Categories

You can retrieve a list of all General Categories like this:

require "unicode/categories"
puts \
  "Short | Long\n" +
  "------|-----\n" +
  Unicode::Categories.names(format: :table).to_a.map{ |r| "   %s | %s" % r }.join("\n")
Short | Long
------|-----
Cc | Control
Cf | Format
Cn | Unassigned
Co | Private_Use
Cs | Surrogate
LC | Cased_Letter
Ll | Lowercase_Letter
Lm | Modifier_Letter
Lo | Other_Letter
Lt | Titlecase_Letter
Lu | Uppercase_Letter
Mc | Spacing_Mark
Me | Enclosing_Mark
Mn | Nonspacing_Mark
Nd | Decimal_Number
Nl | Letter_Number
No | Other_Number
Pc | Connector_Punctuation
Pd | Dash_Punctuation
Pe | Close_Punctuation
Pf | Final_Punctuation
Pi | Initial_Punctuation
Po | Other_Punctuation
Ps | Open_Punctuation
Sc | Currency_Symbol
Sk | Modifier_Symbol
Sm | Math_Symbol
So | Other_Symbol
Zl | Line_Separator
Zp | Paragraph_Separator
Zs | Space_Separator
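
The first letter of each short name identifies the major class the category belongs to (for example, every L* category is a kind of Letter). As a rough sketch of how short-format results could be reduced to major classes, assuming plain Ruby and no gem support: `MAJOR_CLASSES` and `major_classes` are hypothetical names for illustration and are not provided by unicode-categories.

```ruby
# Hypothetical helper for illustration; not part of the unicode-categories API.
# Maps the first letter of a short category name to its major class.
MAJOR_CLASSES = {
  "C" => "Other",
  "L" => "Letter",
  "M" => "Mark",
  "N" => "Number",
  "P" => "Punctuation",
  "S" => "Symbol",
  "Z" => "Separator",
}.freeze

# Takes short names such as ["Lu", "Nd"] and returns the sorted,
# de-duplicated list of major classes they fall into.
def major_classes(short_names)
  short_names.map { |name| MAJOR_CLASSES.fetch(name[0]) }.uniq.sort
end

major_classes(%w[Lu Nd Zs]) # => ["Letter", "Number", "Separator"]
major_classes(%w[Ll Lu LC]) # => ["Letter"]
```

`fetch` is used instead of `[]` so that an unexpected short name raises a `KeyError` rather than silently producing `nil`.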

See unicode-x for more Unicode-related micro libraries.

MIT License