Spam Filtering

This system provides Spam Filtering by integrating the scikit-learn framework: http://scikit-learn.org.

It provides a pluggable filter for any Django model that is subject to Text Spam.

An example of implementing a Spam Filter into a project can be found in Spam Filtering with SVM (Example 3).

Spam Filter

This is the main object for the Spam Filtering System.

class systems.spam_filtering.models.SpamFilter(*args, **kwargs)[source]

Main object for the Spam Filtering System.

All the configuration can be done through the admin of Spam Filters - or more specifically, through the change form.

Front-End

General

General fields (like Name) and Miscellaneous are documented in the Statistical Model API.

Because the implementation uses scikit-learn as its Engine, there is no need to set Engine Meta Iterations to more than 1.

Spammable Model

A Spammable Model is a Django model which inherits from the IsSpammable Abstract Model (discussed below), which makes it convenient to incorporate the model into all the functionality of the Spam Filtering cycle.

SpamFilter.spam_model_is_enabled Use a Spammable Model?

Whether to use a Spammable Model as a data source

SpamFilter.spam_model_model Spammable Django Model

“IsSpammable-Django Model” to be used with the Spam Filter (in the “app_label.model” format, e.g. “examples.CommentOfMySite”)

If you choose not to use a Spammable Model, you can specify where the data is held (Spammable Content and Labels) via the Data Columns and Labels Column sections.

Classifier

The Classifier model to be used for discerning the Spam.

Any implementation of a Supervised Learning Technique using a scikit-learn classifier will work.

SpamFilter.classifier Classifier to be used in the System

Classifier to be used in the System, in the “app_label.model|name” format, e.g. “supervised_learning.SVC|My SVM”
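
The “app_label.model|name” format reads as a model reference followed by the name of the stored object. The following is a minimal sketch of how such a string decomposes; parse_classifier_reference is a hypothetical helper written for illustration and is not part of the system's API.

```python
def parse_classifier_reference(reference):
    """Split "app_label.model|name" into its three parts (hypothetical helper)."""
    # Everything before the first "|" is the model path, the rest is the object name.
    model_path, _, object_name = reference.partition("|")
    # The model path itself is "app_label.model".
    app_label, _, model_name = model_path.partition(".")
    return app_label, model_name, object_name


parts = parse_classifier_reference("supervised_learning.SVC|My SVM")
print(parts)
```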

Cross Validation

Cross Validation (CV) is used to estimate the performance of the Spam Filter. The reported estimate is the mean of the metric evaluated in each CV fold, together with an interval of 2 standard deviations around it.

CV is done with the scikit-learn engine; see the scikit-learn documentation for general information on cross-validation and for details of the available metrics.
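
As an illustration of the reported estimate, the sketch below computes the mean and the 2 standard deviations interval from a list of per-fold scores. The scores are made-up values, summarize_cv_scores is a hypothetical helper, and the use of the population standard deviation is an assumption; the system may compute it differently.

```python
from statistics import mean, pstdev


def summarize_cv_scores(scores):
    """Return the mean score and the (mean - 2 sd, mean + 2 sd) interval."""
    centre = mean(scores)
    # Assumption: population standard deviation (divides by the number of folds).
    half_width = 2 * pstdev(scores)
    return centre, (centre - half_width, centre + half_width)


# Hypothetical accuracies from a 5-fold run.
fold_scores = [0.90, 0.92, 0.88, 0.91, 0.89]
centre, (low, high) = summarize_cv_scores(fold_scores)
print(f"{centre:.3f} (+/- {high - centre:.3f})")
```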

SpamFilter.cv_is_enabled Enable Cross Validation?

Whether to enable Cross Validation

SpamFilter.cv_folds Cross Validation Folds

Number of folds to be used in Cross Validation

SpamFilter.cv_metric Cross Validation Metric

Metric to be evaluated in Cross Validation

Pre-Training

Pre-training refers to providing the model with “initial” data, as a way of “initializing” the model. See Spam Filter Pre-Training for more details.

SpamFilter.pretraining

Pre-training model to be used with the Spam Filter (a subclass of SpamFilterPreTraining)

Bag of Words Representation

The Bag of Words representation (BoW) is a suitable representation for many Natural Language Processing problems - such as text classification.

If it is not enabled, the Spam Filter will use the UTF-8 code point representation for the corpus: each character position is an axis and its value is the code of the character at that position. For example, “Hola!HOLA!” will be represented as (72, 111, 108, 97, 33, 72, 79, 76, 65, 33). The input dimensionality will be the maximum length of the texts in the corpus.
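
The representation can be sketched in a few lines of Python. This sketch assumes that the code of a character is its Unicode code point (what ord() returns, which matches the example above) and that shorter texts are padded with zeros up to the maximum length; the padding value and the helper names are assumptions, not taken from the system.

```python
def to_code_points(text):
    """Map each character of the text to its code point."""
    return [ord(character) for character in text]


def encode_corpus(corpus, pad_value=0):
    """Encode every text and pad to the length of the longest one.

    pad_value is an assumption made for this sketch.
    """
    width = max(len(text) for text in corpus)
    return [
        to_code_points(text) + [pad_value] * (width - len(text))
        for text in corpus
    ]


rows = encode_corpus(["Hola!HOLA!", "Hi"])
for row in rows:
    print(row)
```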

For more information on the transformation, see the Spam Filtering with SVM (Example 3) and the Engine documentation.

SpamFilter.bow_is_enabled Enable Bag of Words representation?

Enable Bag of Words transformation

SpamFilter.bow_use_tf_idf (BoW) Use TF-IDF transformation?

Use the TF-IDF transformation?

SpamFilter.bow_analyzer (BoW) Analyzer

Whether the features should be made of word or character n-grams. Option ‘Chars in W-B’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

SpamFilter.bow_ngram_range_min (BoW) n-gram Range - Min

The lower boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

SpamFilter.bow_ngram_range_max (BoW) n-gram Range - Max

The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
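
To make the effect of the two boundaries concrete, the sketch below extracts word n-grams for every n between min_n and max_n. It is a simplified illustration with whitespace tokenization; word_ngrams is a hypothetical helper and not the Engine's implementation.

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


features = word_ngrams("buy cheap pills now", min_n=1, max_n=2)
print(features)
```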

SpamFilter.bow_max_df (BoW) Maximum Document Frequency

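When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.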

SpamFilter.bow_min_df (BoW) Minimum Document Frequency

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
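
The difference between a float and an integer threshold can be sketched as follows. The corpus is made up and both helpers are hypothetical; the sketch only illustrates the cut-off rule described above, using whitespace tokenization.

```python
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each term appears."""
    frequencies = Counter()
    for doc in docs:
        # A term counts once per document, however often it repeats.
        frequencies.update(set(doc.split()))
    return frequencies


def filter_by_min_df(docs, min_df):
    """Keep terms whose document frequency is not below the threshold."""
    frequencies = document_frequencies(docs)
    # A float is a proportion of documents, an integer an absolute count.
    threshold = min_df * len(docs) if isinstance(min_df, float) else min_df
    return sorted(term for term, count in frequencies.items() if count >= threshold)


docs = ["free money now", "free offer", "meeting at noon", "free lunch at noon"]
print(filter_by_min_df(docs, 2))
print(filter_by_min_df(docs, 0.75))
```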

SpamFilter.bow_max_features (BoW) Maximum Features

If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
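
A minimal sketch of this selection, assuming whitespace tokenization and a made-up corpus; top_terms is a hypothetical helper, and how ties between equally frequent terms are resolved is not specified here.

```python
from collections import Counter


def top_terms(docs, max_features=None):
    """Return the vocabulary, limited to the most frequent terms if requested."""
    term_counts = Counter(token for doc in docs for token in doc.split())
    if max_features is None:
        return sorted(term_counts)
    # most_common() orders terms by their frequency across the whole corpus.
    return sorted(term for term, _ in term_counts.most_common(max_features))


docs = ["free free money", "free money now"]
print(top_terms(docs))
print(top_terms(docs, max_features=2))
```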

Bag of Words Transformation - Miscellaneous

SpamFilter.bow_binary (BoW) Use Binary representation?

If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
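
The effect of the binary option on a single document can be sketched as below. The vocabulary and text are made up and bag_of_words is a hypothetical helper; it is an illustration of the rule, not the Engine's code.

```python
from collections import Counter


def bag_of_words(text, vocabulary, binary=False):
    """Return one count per vocabulary term, or 0/1 flags if binary is set."""
    counts = Counter(text.split())
    row = [counts[term] for term in vocabulary]
    if binary:
        # All non-zero counts are set to 1.
        row = [1 if count > 0 else 0 for count in row]
    return row


vocabulary = ["free", "money", "now"]
print(bag_of_words("free free money", vocabulary))
print(bag_of_words("free free money", vocabulary, binary=True))
```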

SpamFilter.bow_enconding (BoW) Encoding

Encoding to be used to decode the corpus

SpamFilter.bow_decode_error (BoW) Decode Error

Instruction on what to do if a byte sequence given to analyze contains characters that are not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
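
The three values correspond to the standard error handlers of Python's bytes.decode(). The sketch below shows their behaviour on a byte sequence that is not valid UTF-8; decode_with is a hypothetical helper used only to make the comparison compact.

```python
# "café" encoded as Latin-1; the final byte is not valid UTF-8.
raw = b"caf\xe9"


def decode_with(raw_bytes, errors):
    """Decode as UTF-8 with the given handler; return None if decoding fails."""
    try:
        return raw_bytes.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        return None


strict_result = decode_with(raw, "strict")    # the decode raises, so None
ignore_result = decode_with(raw, "ignore")    # the offending byte is dropped
replace_result = decode_with(raw, "replace")  # U+FFFD takes its place
print(repr(strict_result), repr(ignore_result), repr(replace_result))
```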

SpamFilter.bow_strip_accents (BoW) Strip Accents

Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any character. None (default) does nothing.
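
The difference between the two methods can be sketched with the standard unicodedata module. The functions below are simplified illustrations written for this document, similar in spirit to the two options but not necessarily identical to what the Engine does.

```python
import unicodedata


def strip_accents_unicode(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_accents_ascii(text):
    """Decompose characters and drop everything without an ASCII mapping."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents_unicode("oferta única, señor"))
print(strip_accents_ascii("oferta única, señor"))
# A character with no ASCII mapping survives only the 'unicode' method.
print(strip_accents_unicode("straße"), strip_accents_ascii("straße"))
```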

SpamFilter.bow_stop_words (BoW) Stop Words

If ‘english’, a built-in stop word list for English is used. If a comma-separated string, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

SpamFilter.bow_vocabulary (BoW) Vocabulary

A Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
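
The two constraints on the indices (no repeats, no gaps) can be checked with a short function. is_valid_vocabulary is a hypothetical helper for illustration; the system may validate the field differently or not at all.

```python
def is_valid_vocabulary(vocabulary):
    """Check that the indices are exactly 0, 1, ..., len(vocabulary) - 1."""
    indices = sorted(vocabulary.values())
    # Any repeated or missing index makes the sorted list differ from the range.
    return indices == list(range(len(indices)))


print(is_valid_vocabulary({"free": 0, "money": 1, "now": 2}))
print(is_valid_vocabulary({"free": 0, "money": 2}))   # gap at index 1
print(is_valid_vocabulary({"free": 0, "money": 0}))   # repeated index
```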

API

SpamFilter extends the Supervised Learning Technique in several ways.

IsSpammable

IsSpammable is a Django Abstract Model (AM) that makes it convenient to take part in the Spam Filtering cycle.

The AM provides the fields, options and .save() method needed to attach the model to a Spam Filter.

Once attached to a Spam Filter, the data held in the Django model will be used for training the Filter, and the Filter will be used to classify new data created in the model (on .save(), if the Spam Filter is inferred).

class systems.spam_filtering.models.IsSpammable(*args, **kwargs)[source]

This Abstract Model (AM) is meant to be used in Django models which may receive Spam.

Usage:
  • Make your model inherit from this AM.

  • Set the SPAM_FILTER constant to the name of the Spam Filter object you would like to use.

  • Set the SPAMMABLE_FIELD to the name of the field which stores the content.

  • Example:

    class CommentsOfMySite(IsSpammable):
        SPAM_FILTER = "Comment Spam Filter"
        SPAMMABLE_FIELD = "comment"
        ... # The rest of your code
    

Fields and Settings

IsSpammable.SPAMMABLE_FIELD = None

Name of the field which stores the Spammable Content

IsSpammable.SPAM_LABEL_FIELD = 'is_spam'

Name of the field which stores the Spam labels

IsSpammable.SPAM_FILTER = None

Name of the Spam Filter object to be used

IsSpammable.is_spam Is Spam?

If the object is Spam - Label of the Object

IsSpammable.is_misclassified Is Misclassified?

If the object has been misclassified by the Spam Filter - useful for some algorithms and for understanding the filter

IsSpammable.is_revised Is Revised?

If the object classification has been revised by a Human - needed for proper training and automation

Usage

  • Make your model inherit from this AM.
  • Choose the Spam Filter to be attached by setting the SPAM_FILTER constant to the name of the Spam Filter object you would like to use.
  • Set the SPAMMABLE_FIELD constant to the name of the field which stores the content.
  • Make and run migrations.

Example

class CommentsOfMySite(IsSpammable):
    SPAM_FILTER = "Comment Spam Filter"
    SPAMMABLE_FIELD = "comment"
    ... # The rest of your code

Other Considerations

Technically, what makes a Django model “pluggable” into a Spam Filter as a source of data for training is:

  • A SPAMMABLE_FIELD constant which defines the field where the content is stored.
  • A SPAM_LABEL_FIELD constant which defines the field where the label is stored - it defaults to is_spam.
  • A NullBooleanField to store the labels of the objects.

If you do not want to inherit from the AM, any model with these three defined will work as a Spammable Model in the Spam Filter setup. The only thing left to complete the system is automating the classification of new objects.

Spam Filter Pre-Training

Pre-training refers to providing the model with other data, “external” data, as an initialization. That data is incorporated into the training dataset of the model.

SpamFilterPreTraining is a Django Abstract Model (AM) that makes it convenient to pre-train the Spam Filter.

class systems.spam_filtering.models.SpamFilterPreTraining(*args, **kwargs)[source]

Abstract Model for pre-training Spam Filters. Subclass this Model for incorporating datasets into the training of a Spam Filter (the subclass must be set in the Spam Filter’s pretraining field).

Usage

  • Create a Django Model that inherits from SpamFilterPreTraining
  • Make and run migrations
  • Import data to the Django Model
  • Set the Spam Filter pre-training field to use the pre-training model

Example

class SFPTEnron(SpamFilterPreTraining):

    class Meta:
        verbose_name = "Spam Filter Pre-Training: Enron Email Data"
        verbose_name_plural = "Spam Filter Pre-Training: Enron Emails Data"

examples.migrations.0015_sfptenron_sfptyoutube.download_and_process_pretrain_data_files(apps, schema_editor)[source]

Forward Operation: Downloads the sample data if necessary and populates the Pre-Train Models.

Other Considerations

Technically, what makes a Django model “pluggable” into a Spam Filter as a source of pre-training is either the content and is_spam fields, or the SPAMMABLE_FIELD and SPAM_LABEL_FIELD constants defined in the class, pointing to a Text or Char field and a Boolean field respectively.

If you do not want to inherit, define either or both in your Django Model and it will be “pluggable” as a pre-training dataset.