No commit activity in last 3 years
No release in over 3 years
Wikipedia vandalism detection with JRuby.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

Runtime

 Project Readme

Wikipedia Vandalism Detection

Vandalism detection on the Wikipedia history with JRuby v9.1.0.0+.

The Wikipedia Vandalism Detection Gem uses the Weka Machine-Learning Library via the weka gem.

Gem Version Build Status

What You can do with it

  • parsing Wikipedia history pages to get edits and revisions

  • creating training and test ARFF files from the WVC-PAN-10 and the WVC-PAN-11 (See also http://pan.webis.de under category Wikipedia Vandalism Detection: CLEF 2010 & CLEF 2011)

  • calculating vandalism features for a Wikipedia page (XML) from the history dump

  • creating and evaluating a classifier with the created training ARFF file

  • classifing new instances of Wikipedia edits as 'regular' or 'vandalism'

Installation

Add this line to your application's Gemfile:

gem 'wikipedia-vandalism_detection'

And then execute:

$ bundle

Or install it yourself as:

$ gem install wikipedia-vandalism_detection

Usage

require 'wikipedia/vandalism_detection'

Configuration

To configure the system put a wikipedia-vandalism-detection.yml file in the config/ or lib/config/ directory.

You can configure:

A) the training and test corpora directories and essential input and output files

corpora:
  base_directory: /home/user/corpora

  training:
    base_directory: training
    annotations_file: annotations.csv
    edits_file: edits.csv
    revisions_directory: revisions

  test:
    base_directory: test
    edits_file: edits.csv
    revisions_directory: revisons

output:
  base_directory: /home/user/output_path
  training:
    arff_file: training.arff
    index_file: training_index.yml
  test:
    arff_file: test.arff
    index_file: test_index.yml

Evaluation outputs are saved under the output base directory path.

B) the features used by the feature calculator

features:
  - anonymity
  - biased frequency
  - character sequence
  - ...

C) the classifier type and its options and the number of cross validation splits for the classifier evaluation

classifier:
  type: Trees::RandomForest         # Weka classifier class
  options: -I 10 -K 0.5             # same as for Weka, for further classifier options see Weka-dev documentation
  cross-validation-fold: 5          # default is 10
  training-data-options: balanced   # default is unbalanced

training-data-options is used to resample the training dataset:

  • unbalanced is the default value and uses the original dataset
  • balanced uses random undersampling of the majority class
  • oversampled uses SMOTE oversampling (with percentage -p) and random undersampling (with minority/majority class balance -u)

Examples:

# 200% SMOTE oversampling with 300% random undersampling
training-data-options: oversampled -p 200 -u true 300

# default 100% SMOTE oversampling with 300% random undersampling
training-data-options: oversampled -u true 300

# 200% SMOTE oversampling with default full (100% minority/majority class balance)
# random undersampling
training-data-options: oversampled -p 200

# default 100% SMOTE oversampling without undersampling
training-data-options: oversampled -u false

Instead of the true option you can also use t, y and yes as well as their upper case pendants.

Examples

Create training and test ARFF file from configured corpus:

training_dataset = Wikipedia::VandalismDetection::TrainingDataset.build
test_dataset = Wikipedia::VandalismDetection::TestDataset.build

While creating the training and test datasets, for each a corpus file index is created into the configured index_file directory. To run the corpus file index creation manually use:

Wikipedia::VandalismDetection::TrainingDataset.create_file_index!
Wikipedia::VandalismDetection::TestDataset.create_file_index!

Parse a Wikipedia page content:

At the moment no namespaces are supported while parsing a page. So, the <page>...</page> tags should not be included in a namespaced xml tag!

xml = File.read(wikipedia_page.xml)
parser = Wikipedia::VandalismDetection::PageParser.new
page = parser.parse(xml)

# Work with revisions and edits from the page
page.revisions.each do |revision|
  puts revison.id
  puts revison.parent_id
end

page.edits.each do |edit|
  puts edit.new_revision.id
  puts edit.old_revision.id
end

Use a classifier of configured type:

Create the classifier:

classifier = Wikipedia::VandalismDetection::Classifier.new

Evaluation of the classifier against the configured training corpus:

# classifier.classifier_instance returns the weka classifier instance
evaluation = classifier.classifier_instance.cross_validate(folds: 10)
puts evaluation.class_details

Classify a new edit:

# Classification of a Wikipedia Edit or a feature set
# 'edit' is a Wikipedia::VandalismDetection::Edit, this can be built manually or by
# parsing a Wikipedia page content and getting its edits
# The returned confidence is a value between 0.0 and 1.0 were 0.0 means 'regular' and 1.0 means 'vandalism'
confidence = classifier.classify(edit)

feature_calculator = Wikipedia::VandalismDetection::FeatureCalculator.new
features = feature_calculator.calculate_features_for(edit)
confidence = classifier.classify(features)

Evaluate test corpus classification:

evaluator = classifier.evaluator
# or create a new evaluator
evaluator = Wikipedia::VandalismDetection::Evaluator.new(classifier)

performance_data = evaluator.evaluate_testcorpus_classification #default sample_count = 100
performance_data = evaluator.evaluate_testcorpus_classification(sample_count: 200)

# following attributes can be used for further computations
recall_values = performance_data[:recalls]           # recall values for e.g. x-values of PRC or y-values of ROC
precision_values = performance_data[:precisions]     # precision values for e.g. y-values of PRC
fp_rate_values = performance_data[:fp_rates]         # false positive rate values for e.g. x-values of ROC
area_under_curve_pr = performance_data[:pr_auc]      # computed from the precision and recall values
area_under_curve_ro = performance_data[:roc_auc]     # computed from the recall and fp-rate values
total_recall = performance_data[:total_recall]       # precison and recall values with maximum area (rectangle area)
total_precision = performance_data[:total_precision]

Get each features predictive value for analysis:

evaluator = classifier.evaluator
# or create a new evaluator
evaluator = Wikipedia::VandalismDetection::Evaluator.new(classifier)

analysis_data = evaluator.feature_analysis #default sample_count = 100
analysis_data = evaluator.feature_analysis(sample_count: 1000)

This returns a hash comprising all feature names as configured as keys and the threshold hashes as values.

{
  feature_name_1:
    {
      0.0 => {fp: , fn: , tp: , tn: },
      ,
      1.0 => {fp: , fn: , tp: , tn: }
    },
  ,
  feature_name_n:
    {
      0.0 => {fp: , fn: , tp: , tn: },
      ,
      1.0 => {fp: , fn: , tp: , tn: }
    },
}

Creating new Features:

You can define your own new Feature classes and use them by configuration in the config.yml.

Make sure to define the Feature class inside of the Wikipedia::VandalismDetection::Features module and to implement the calculate method (also refer to the Wikipedia::VandalismDetection::Features::Base class definition).

module Wikipedia
  module VandalismDetection
    module Features
      class MyNewFeature < Base
        def calculate(edit)
          super # ensures raising an error if 'edit' is not an Edit.

          # ...your implementation
        end
      end
    end
  end
end

While creating new Feature classes you should be aware of the following naming convention: The feature's name in the config.yml is the downcased name with spaces or dashes of the feature class name

E.g.:

    features:
      - my new feature
      - my-new-feature

both search for a Feature class with the name MyNewFeature.

Contributing

  1. Fork it ( http://github.com/paulgoetze/wikipedia-vandalism_detection/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request