Project

ramparts

No commit activity in last 3 years
No release in over 3 years
Parses blocks of text to find phone numbers (including phonetic numbers), emails, and bad URLs. Useful for finding scammers who tend to post their phone numbers in messages.
 Project Readme

Ramparts - Spam Detection

Parses blocks of text to find phone numbers (including phonetic numbers), emails, and spam URLs.

Example

Find obfuscated phone numbers

>> message = "Contact me directly ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE  ). Hope you cracked that number code."
>> Ramparts.find_phone_numbers(message)
[{start_offset: 22, end_offset: 71, type: :phone, value: 'FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE'}]

Find obfuscated emails.

>> message = "Looking for honest worker .. contact ashley73299 AT yahoo dot com for more info"
>> Ramparts.find_emails(message)
[{start_offset: 37, end_offset: 65, type: :email, value: 'ashley73299 AT yahoo dot com'}]
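
The offsets in the result above can be used to slice the matched text back out of the original message. The sketch below does not call the gem; it hard-codes the hash from the example and assumes `end_offset` is exclusive, which is what the numbers in the example suggest.

```ruby
# Result hash copied from the example above; nothing here calls Ramparts itself.
message = 'Looking for honest worker .. contact ashley73299 AT yahoo dot com for more info'
match = { start_offset: 37, end_offset: 65, type: :email, value: 'ashley73299 AT yahoo dot com' }

# Assumption: end_offset is exclusive, so a three-dot range recovers the value.
slice = message[match[:start_offset]...match[:end_offset]]
puts slice == match[:value]
```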

Find both obfuscated emails and phone numbers.

>> message = "Looking for honest worker .. contact ashley73299 AT yahoo dot com or FOUR FIVE ONE 456 8900 for more info"
>> Ramparts.find_phone_numbers_and_emails(message)
[{start_offset: 37, end_offset: 65, type: :email, value: 'ashley73299 AT yahoo dot com'}, {start_offset: 70, end_offset: 92, type: :phone, value: 'FOUR FIVE ONE 456 8900'}]

Count the occurrences of well known spam URLs and keywords

>> message = "cialis vs viagra spam guestbook.php?action=http://cialiswalmart.shop"
>> Ramparts.count_urls(message)
3

Installation

In the root directory of your project

gem install ramparts

Remember to require ramparts as necessary

require 'ramparts'

API

count_phone_numbers(text, options = {})

  • Description: Returns the number of phone numbers in the text. Currently uses a map reduce paradigm, which incurs information loss but is cleaner to implement, achieves better results, and is ~2x faster than find_phone_numbers.
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. When set to true, input such as FivE 4 3 F0r On3 67 NiN3 is caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces within the numbers. When set to true, input such as F i v E 4 3 F 0 r O n 3 67 N i N 3 is caught.
  • Output:
    • number of occurrences of phone numbers [Integer]
  • Example
    • Input:
      • text → "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
    • Output: 1
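
One way to picture the normalize-then-count approach described above is the sketch below. It is a hypothetical illustration, not the gem's implementation: the word table, the helper name `sketch_count_phone_numbers`, and the ten-digit threshold are all assumptions. Because the text is rewritten before counting, the original offsets are lost, which is the kind of information loss the description mentions.

```ruby
# Hypothetical sketch of a normalize-then-count approach; not the gem's code.
SKETCH_NUMBER_WORDS = {
  'ZERO' => '0', 'ONE' => '1', 'TWO' => '2', 'THREE' => '3', 'FOUR' => '4',
  'FIVE' => '5', 'SIX' => '6', 'SEVEN' => '7', 'EIGHT' => '8', 'NINE' => '9'
}.freeze

def sketch_count_phone_numbers(text)
  # Map step 1: join letters spelled out one at a time ("E I G H T" -> "EIGHT").
  normalized = text.upcase.gsub(/\b(?:[A-Z] )+[A-Z]\b/) { |run| run.delete(' ') }

  # Map step 2: turn number words into digits.
  SKETCH_NUMBER_WORDS.each do |word, digit|
    normalized = normalized.gsub(/\b#{word}\b/, digit)
  end

  # Reduce: count runs of at least ten digits (assumed threshold),
  # allowing a single space or dash between digits.
  normalized.scan(/\d(?:[ -]?\d){9,}/).size
end

text = "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
puts sketch_count_phone_numbers(text)
```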

find_phone_numbers(text, options = {})

  • Description: Finds all occurrences of phone numbers within a block of text, even when l33t speak, phonetics, and space variations are used.
  • Input:
    • text [String]
    • options [Hash]
      • To Be Implemented
  • Output:
    • [Array]
      • match [Hash]
        • start_offset: [Integer]
        • end_offset: [Integer]
        • type: [Symbol]
        • value: [String]
  • Example
    • Input:
      • text → "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
    • Output: [{start_offset: 84, end_offset: 133, type: :phone, value: 'FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE'}]

replace_phone_numbers(text, options = {}, &block)

  • Description: Replaces all occurrences of phone numbers within the text with the value returned by the block. Returns the redacted text.
  • Input:
    • text [String]
    • insertable [String]
    • options [Hash]
      • To Be Implemented
  • Output:
    • updated text [String]
  • Example
    • Usage: altered_text = replace_phone_numbers(...) do "CENSORED" end
    • Input:
      • text → "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
    • Output: "If you're interested in this position, do contact me directly on my phone number ( CENSORED ). Hope you cracked that number code."
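
The block-based replacement described above could work roughly like the sketch below. `sketch_replace_matches` is a hypothetical helper, not part of the gem: it takes match hashes shaped like the find_phone_numbers output (with `end_offset` assumed exclusive) and substitutes whatever the block returns. The match is located with `String#index` here rather than by calling Ramparts.

```ruby
# Hypothetical helper: applies a block's return value to each matched span.
def sketch_replace_matches(text, matches)
  result = text.dup
  # Replace from the last match backwards so earlier offsets remain valid.
  matches.sort_by { |match| -match[:start_offset] }.each do |match|
    result[match[:start_offset]...match[:end_offset]] = yield(match).to_s
  end
  result
end

text = "If you're interested in this position, do contact me directly on my phone number ( FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE ). Hope you cracked that number code."
value = 'FOUR ONE FIVE E I G H T 9 FOUR TWO EIGHT SIX FIVE'

# Build a match hash by locating the value directly (stand-in for a real find call).
start = text.index(value)
matches = [{ start_offset: start, end_offset: start + value.length, type: :phone, value: value }]

redacted = sketch_replace_matches(text, matches) { |_match| 'CENSORED' }
puts redacted
```

Working backwards through the matches keeps the remaining offsets valid even when the replacement is shorter or longer than the original span.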

count_emails(text, options = {})

  • Description: Returns the number of emails in the text. Currently uses a map reduce paradigm, which incurs information loss but is cleaner to implement, achieves better results, and is ~2x faster than find_emails.
  • Input:
    • text [String]
    • options [Hash]
      • aggressive [Boolean] [Default → False]
        • Doesn't require a . or dot followed by a TLD at the end; instead compares the last word against a list of well-known email domains (e.g. contact ashley @ yandex for more info would be caught)
  • Output:
    • number of occurrences of emails [Integer]
  • Example
    • Input:
      • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, I repeat ashley73299 @ yahoo . com ?.. Ashley"
    • Output: 2
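
To illustrate the idea behind the aggressive option, here is a minimal sketch that accepts an address with no TLD as long as the word after the @ is on a known-domain list. The domain list, constant names, and helper are assumptions made for illustration; the gem's real list and matching rules are not shown in this README.

```ruby
# Hypothetical domain list; the gem's real list is not shown in this README.
SKETCH_KNOWN_DOMAINS = %w[gmail yahoo hotmail outlook yandex].freeze

# Matches "<local part> @ <known domain>" without requiring a trailing TLD.
SKETCH_AGGRESSIVE_PATTERN =
  /[a-z0-9._%+-]+\s*@\s*(?:#{Regexp.union(SKETCH_KNOWN_DOMAINS).source})\b/i

def sketch_count_aggressive_emails(text)
  text.scan(SKETCH_AGGRESSIVE_PATTERN).size
end

puts sketch_count_aggressive_emails('contact ashley @ yandex for more info')
puts sketch_count_aggressive_emails('contact ashley @ home for more info')
```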

find_emails(text, options = {})

  • Description: Finds all occurrences of emails within a block of text, even when l33t speak or phonetics are used.
  • Input:
    • text [String]
    • options [Hash]
      • aggressive [Boolean] [Default → False]
        • Doesn't require a . or dot followed by a TLD at the end; instead compares the last word against a list of well-known email domains (e.g. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • Treats the word 'at' as '@'. This can currently make the algorithm overly greedy, because 'at' is such a common word
  • Output:
    • [Array]
      • match [Hash]
        • start_offset: [Integer]
        • end_offset: [Integer]
        • type: [Symbol]
        • value: [String]
  • Example
    • Input:
      • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, I repeat ashley73299 @ yahoo . com ?.. Ashley"
    • Output: [{start_offset: 78, end_offset: 106, type: :email, value: 'ashley73299 AT yahoo dot com'}, {start_offset: 118, end_offset: 143, type: :email, value: 'ashley73299 @ yahoo . com'}]
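
A rough idea of how obfuscated emails can be located with their offsets is sketched below. The pattern and the helper `sketch_find_emails` are hypothetical and much simpler than whatever the gem does. The sketch always treats the word "at" as "@", which is exactly the overly greedy behaviour the check_for_at option warns about, so it should be read as an illustration only.

```ruby
# Hypothetical pattern; not the gem's implementation.
SKETCH_EMAIL_PATTERN = /
  [a-z0-9._%+-]+            # local part, e.g. ashley73299
  (?:\s*@\s*|\s+at\s+)      # "@" with optional spaces, or the word "at"
  [a-z0-9-]+                # domain name, e.g. yahoo
  (?:\s*\.\s*|\s+dot\s+)    # "." with optional spaces, or the word "dot"
  [a-z]{2,}                 # top-level domain, e.g. com
/xi

def sketch_find_emails(text)
  matches = []
  text.scan(SKETCH_EMAIL_PATTERN) do
    m = Regexp.last_match
    # end_offset is exclusive here, so text[start_offset...end_offset] == value.
    matches << { start_offset: m.begin(0), end_offset: m.end(0), type: :email, value: m[0] }
  end
  matches
end

text = 'Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, I repeat ashley73299 @ yahoo . com ?.. Ashley'
found = sketch_find_emails(text)
found.each { |match| puts match[:value] }
```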

replace_emails(text, options = {}, &block)

  • Description: Replaces all occurrences of emails within the text with the value returned by the block. Returns the redacted text.
  • Input:
    • text [String]
    • options [Hash]
      • aggressive [Boolean] [Default → False]
        • Doesn't require a . or dot followed by a TLD at the end; instead compares the last word against a list of well-known email domains (e.g. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • Treats the word 'at' as '@'. This can currently make the algorithm overly greedy, because 'at' is such a common word
  • Output:
    • updated text [String]
  • Example
    • Usage: altered_text = replace_emails(...) do "CENSORED" end
    • Input:
      • text → "My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( jbush042@gmail.com ) She will be waiting to hear from you kindly send her an email now!"
    • Output: "My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( CENSORED ) She will be waiting to hear from you kindly send her an email now!"

count_phone_numbers_and_emails(text, options = {})

  • Description: Returns the number of phone numbers and emails in the text. Currently uses a map reduce paradigm, which incurs information loss but is cleaner to implement, achieves better results, and is ~2x faster than find_phone_numbers_and_emails.
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. When set to true, input such as FivE 4 3 F0r On3 67 NiN3 is caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces within the numbers. When set to true, input such as F i v E 4 3 F 0 r O n 3 67 N i N 3 is caught.
      • aggressive [Boolean] [Default → False]
        • Doesn't require a . or dot followed by a TLD at the end; instead compares the last word against a list of well-known email domains (e.g. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • Treats the word 'at' as '@'. This can currently make the algorithm overly greedy, because 'at' is such a common word
  • Output:
    • number of occurrences of phone numbers and emails [Integer]
  • Example
    • Input:
      • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, phone 416 090 78 NINE 5 ?.. Ashley"
    • Output: 2

find_phone_numbers_and_emails(text, options = {})

  • Description: Finds all occurrences of phone numbers and emails within a block of text.
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. When set to true, input such as FivE 4 3 F0r On3 67 NiN3 is caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces within the numbers. When set to true, input such as F i v E 4 3 F 0 r O n 3 67 N i N 3 is caught.
      • aggressive [Boolean] [Default → False]
        • Doesn't require a . or dot followed by a TLD at the end; instead compares the last word against a list of well-known email domains (e.g. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • Treats the word 'at' as '@'. This can currently make the algorithm overly greedy, because 'at' is such a common word
  • Output:
    • [Array]
      • match [Hash]
        • start_offset: [Integer]
        • end_offset: [Integer]
        • type: [Symbol]
        • value: [String]
  • Example
    • Input:
      • text → "Hi, Are you seriously interested ..Looking for honest worker .. My e-mail is ashley73299 AT yahoo dot com, phone 416 090 78 NINE 5 ?.. Ashley"
    • Output: [{start_offset: 78, end_offset: 106, type: :email, value: 'ashley73299 AT yahoo dot com'}, {start_offset: 115, end_offset: 132, type: :phone, value: '416 090 78 NINE 5'}]
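
The combined result lists emails and phone numbers together in order of position. A minimal sketch of that merge step, assuming two separate match lists already exist, is shown below. `sketch_merge_matches` is a hypothetical helper, and the hashes are copied from the combined example near the top of this README rather than produced by the gem.

```ruby
# Match hashes copied from the README's combined example; no gem call is made.
phones = [{ start_offset: 70, end_offset: 92, type: :phone, value: 'FOUR FIVE ONE 456 8900' }]
emails = [{ start_offset: 37, end_offset: 65, type: :email, value: 'ashley73299 AT yahoo dot com' }]

# Hypothetical helper: combine any number of match lists, ordered by position.
def sketch_merge_matches(*lists)
  lists.flatten.sort_by { |match| match[:start_offset] }
end

merged = sketch_merge_matches(phones, emails)
puts merged.map { |match| match[:type] }.inspect
```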

replace_phone_numbers_and_emails(text, options = {}, &block)

  • Description: Replaces all occurrences of phone numbers and emails within the text with the value returned by the block. Returns the redacted text.
  • Input:
    • text [String]
    • options [Hash]
      • parse_leet [Boolean][Default → True]
        • Parses phone numbers that contain l33t syntax. When set to true, input such as FivE 4 3 F0r On3 67 NiN3 is caught.
      • remove_spaces [Boolean][Default → True]
        • Parses phone numbers that contain spaces within the numbers. When set to true, input such as F i v E 4 3 F 0 r O n 3 67 N i N 3 is caught.
      • aggressive [Boolean] [Default → False]
        • Doesn't require a . or dot followed by a TLD at the end; instead compares the last word against a list of well-known email domains (e.g. contact ashley @ yandex for more info would be caught)
      • check_for_at [Boolean] [Default → False]
        • Treats the word 'at' as '@'. This can currently make the algorithm overly greedy, because 'at' is such a common word
  • Output:
    • updated text [String]
  • Example
    • Usage: altered_text = replace_phone_numbers_and_emails(...) do "CENSORED" end
    • Input:
      • text → "My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( jbush042@gmail.com or FOUR FIVE ONE 789 4568 ) She will be waiting to hear from you kindly send her an email now!"
    • Output: "My name is Cynthia, a friend of mine needs a nanny to watch her baby in your area, her contact is ( CENSORED or CENSORED ) She will be waiting to hear from you kindly send her an email now!"

count_urls(text, options = {})

  • Description: Simple union regex that checks whether the text contains bad URLs or keywords, e.g. viagra/cialis. Returns the number of occurrences in the text.
  • Input:
    • text [String]
    • options [Hash]
      • To Be Implemented
  • Output:
    • number of occurrences of matches [Integer]
  • Example
    • Input:
      • text → "cialis vs cialis spam guestbook.php?action=http://cialiswalmart.shop"
    • Output: 3
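
The "union regex" idea can be sketched with Ruby's `Regexp.union`. The term list below is an assumption that contains only the two keywords named in the description; the gem's actual list is not shown in this README and is likely longer, so counts from this sketch will not necessarily match the gem's.

```ruby
# Hypothetical term list; only the two keywords named in the description.
SKETCH_SPAM_TERMS = %w[viagra cialis].freeze

# One case-insensitive pattern that matches any of the terms.
SKETCH_SPAM_PATTERN = Regexp.union(SKETCH_SPAM_TERMS.map { |term| /#{Regexp.escape(term)}/i })

def sketch_count_spam_terms(text)
  text.scan(SKETCH_SPAM_PATTERN).size
end

puts sketch_count_spam_terms('cialis vs viagra spam guestbook.php?action=http://cialiswalmart.shop')
```

Because the pattern has no word boundaries, a term embedded in a hostname such as cialiswalmart.shop is counted as well.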