Introduction ======== Rseg is a Chinese Word Segmentation(中文分词) routine in pure Ruby. The algorithm is based on this article: http://xiecc.blog.163.com/blog/static/14032200671110224190/ Usage ======== Rseg now support two modes: inline and C/S mode. 1. Inline mode > require 'rubygems' > require 'rseg' > Rseg.segment("需要分词的文章") ['需要', '分词', '的', '文章'] The first call to Rseg#segment will need about 30 seconds to load the dictionary, the second call will be very fast, you can also call Rseg#load to load dictionaries manually. 2. C/S mode $ rseg_server == Sinatra/0.9.4 has taken the stage on 4100 This will start rseg server on http://localhost:4100 You can visit it via your browser or the rseg command. $ rseg '需要分词的文章' 需要 分词 的 文章 You can also access server with the Rseg#remote_segment $ irb > require 'rubygems' > require 'rseg' > Rseg.remote_segment("需要分词的文章") # This will be very fast ['需要', '分词', '的', '文章'] Performance ======== About 5M character/s on my Macbook (Intel Core 2 Duo 2GHz/4G mem). License ======== Rseg includes two built-in dictionaries: * CC-CEDICT (http://cc-cedict.org/wiki/) with Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/) * Wikipedia Chinese article title list (http://download.wikimedia.org/zhwiki/) with Creative Commons Attribution-Share Alike 3.0 License(http://creativecommons.org/licenses/by-sa/3.0/) The codes and others in Rseg are licensed under MIT license. Feedback ======== All feedback are welcome, Yuanyi Zhang(zhangyuanyi#gmail.com)
Project
rseg_ggharry
A Chinese Word Segmentation(中文分词) routine in pure Ruby
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
Pull Requests
Development
Project Readme