Termium Ruby gem
Purpose
The Termium Ruby gem parses the export data formats of TERMIUM Plus, the terminology database service of the Government of Canada.
WARNING - Termium XML output requires manual correction
The default Termium XML output is invalid wherever a term domain is enclosed in angle brackets, because the closing "greater than" sign is not escaped:
<textualSupport order="1" type="DEF">
  <value>&lt;artificial intelligence> operation that allows the firing of a rule, or the
  invocation of a program or a subprogram</value>
  <sourceRef order="1" />
</textualSupport>
The remedy is to escape the "greater than" sign manually, using either find/replace or a regular expression:
string.gsub(/&lt;([^>]+)>/, '&lt;\1&gt;')
This results in:
<textualSupport order="1" type="DEF">
  <value>&lt;artificial intelligence&gt; operation that allows the firing of a rule, or the
  invocation of a program or a subprogram</value>
  <sourceRef order="1" />
</textualSupport>
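As a sketch of how this correction might be scripted before parsing, the snippet below wraps the substitution in a small helper. It assumes the raw export writes the opening bracket of a domain label as `&lt;` and leaves the closing bracket as a bare `>`. The helper name `escape_domain_brackets` is hypothetical and not part of the gem, and the pattern is deliberately stricter than the one-liner above so that it stays within a single label.

```ruby
# Hypothetical helper (not part of the termium gem).
# Assumption: a domain label opens with an escaped "&lt;" and closes with a bare ">".
def escape_domain_brackets(xml)
  # [^<>&]+ keeps the match inside one label, so real tags are never swallowed.
  # It also makes the helper idempotent: once the label ends in "&gt;",
  # the pattern no longer matches it.
  xml.gsub(/&lt;([^<>&]+)>/, '&lt;\1&gt;')
end

raw = '<value>&lt;artificial intelligence> operation that allows the firing of a rule</value>'
puts escape_domain_brackets(raw)
```

Running the helper twice on the same text leaves it unchanged, which may be convenient when it is unclear whether a file has already been corrected.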
Commands
termium convert - Convert a TERMIUM Plus export XML file to a Paneron Glossarist dataset.
termium convert
Purpose
This command converts a TERMIUM Plus export XML (<ns2:termium_extract>) file to a Paneron Glossarist dataset.
The resulting dataset will look like this:
{OUTPUT_PATH}/
├── concepts/
│   ├── {CONCEPT_ID}.yaml
│   └── ...
└── localized_concepts/
    ├── {LOCALIZED_CONCEPT_ID}.yaml
    └── ...
Usage
$ termium convert -i INPUT_XML_FILE [-o OUTPUT_PATH] [-o DATE_ACCEPTED]
Options
Flag | Description |
---|---|
-i INPUT_XML_FILE | Source path to TERMIUM Plus XML export file. The file needs to start with the … |
-o OUTPUT_PATH | Destination path to Glossarist dataset directory. If the directory doesn’t exist it will be created. If not provided, defaults to the basename of the input file, e.g. … |
DATE_ACCEPTED | Date of acceptance for the dataset. This fills in the … |
Examples
The data structures of these files can be seen in the following examples.
{CONCEPT_ID}.yaml
This is 88a7dd87-6199-3516-9cec-f4cd79ff09c6.yaml.
---
data:
  identifier: '2120638'
  localized_concepts:
    eng: e114ee44-e601-5623-9099-48cfc2be2224
    fre: 9a7b88cb-4ee6-5d59-89bb-230425a3c96a
related: []
date_accepted: 2015-05-01
status: valid
id: 88a7dd87-6199-3516-9cec-f4cd79ff09c6
{LOCALIZED_CONCEPT_ID}.yaml
This is e114ee44-e601-5623-9099-48cfc2be2224.yaml.
---
data:
  dates: []
  definition:
  - content: layer whose nodes directly communicate with external systems
  examples: []
  id: '2120638'
  notes:
  - content: 'visible layer: term and definition standardized by ISO/IEC [ISO/IEC
      2382-34:1999].'
  - content: 34.02.09 (2382)
  sources:
  - origin:
      ref: ISO/IEC 2382-34:1999
    type: lineage
    status: identical
  - origin:
      ref: Ranger, Natalie * 2006 * Bureau de la traduction / Translation Bureau *
        Services linguistiques / Linguistic Services * Bur. dir. Centre de traduction
        et de terminologie / Dir's Office Translation and Terminology Centre * Div.
        Citoyenneté et Protection civile / Citizen. & Emergency preparedness Div.
        * Normalisation terminologique / Terminology Standardization
    type: lineage
    status: identical
  terms:
  - type: expression
    normative_status: preferred
    designation: visible layer
    grammar_info:
    - preposition: false
      participle: false
      adj: false
      verb: false
      adverb: false
      noun: false
      gender: []
      number:
      - singular
  language_code: eng
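To illustrate how the two file types link together, the following sketch loads a concept file and follows its `localized_concepts` entries to the matching localized concept files. It uses only Ruby's standard library. The helper names are hypothetical, and the sketch assumes that `localized_concepts` is nested under `data` and that each file is named after its UUID inside `concepts/` and `localized_concepts/`.

```ruby
require "yaml"
require "date"

# Hypothetical helper: load one dataset YAML file.
# Date is permitted because date_accepted is written as a bare YAML date.
def load_dataset_yaml(path)
  YAML.safe_load(File.read(path), permitted_classes: [Date])
end

# Hypothetical helper: map each language code of a concept
# (e.g. "eng", "fre") to the parsed content of its localized concept file.
# Assumption: data.localized_concepts maps language codes to UUIDs.
def localized_concepts_for(dataset_dir, concept_uuid)
  concept_path = File.join(dataset_dir, "concepts", "#{concept_uuid}.yaml")
  concept = load_dataset_yaml(concept_path)

  concept.fetch("data").fetch("localized_concepts").transform_values do |uuid|
    load_dataset_yaml(File.join(dataset_dir, "localized_concepts", "#{uuid}.yaml"))
  end
end

# Example call (the dataset path is a placeholder):
# localized = localized_concepts_for("output", "88a7dd87-6199-3516-9cec-f4cd79ff09c6")
# puts localized["eng"]["data"]["terms"].first["designation"]
```

`YAML.safe_load` is used instead of `YAML.load` so that only plain data and dates are deserialized. A missing localized concept file raises `Errno::ENOENT`, which makes broken links in a dataset easy to spot.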
Library
Usage
This gem makes heavy use of the lutaml-model
classes for XML serialization.
The following code converts the Termium extract into a Glossarist dataset.
require "fileutils"
require "termium"

termium_extract = Termium::Extract.from_xml(File.read(termium_extract_file))
glossarist_col = termium_extract.to_concept # Glossarist concept collection
FileUtils.mkdir_p(glossarist_output_file)   # the output path is a directory
glossarist_col.save_to_files(glossarist_output_file)
Credits
This gem is developed, maintained and funded by Ribose Inc.
License
The gem is available as open source under the terms of the 2-Clause BSD License.