Spam Filtering¶
This system provides Spam Filtering by integrating the scikit-learn framework: http://scikit-learn.org.
It provides a pluggable filter for any Django model that is subject to Text Spam.
An example of implementing a Spam Filter into a project can be found in Spam Filtering with SVM (Example 3).
Spam Filter¶
This is the main object for the Spam Filtering System.
class systems.spam_filtering.models.SpamFilter(*args, **kwargs)¶
Main object for the Spam Filtering System.
All the configuration can be done through the admin of Spam Filters - or more specifically, through the change form.
Front-End¶
General¶
General fields (like Name) and Miscellaneous are documented in the Statistical Model API.
Because the implementation uses scikit-learn as its Engine, there is no need to set Engine Meta Iterations to more than 1.
Spammable Model¶
A Spammable Model is a Django model which inherits from the IsSpammable Abstract Model (discussed below), which makes it convenient to incorporate the model into all the functionality of the Spam Filtering cycle.
SpamFilter.spam_model_is_enabled
Use a Spammable Model?¶ Whether to use a Spammable Model as a data source.
SpamFilter.spam_model_model
Spammable Django Model¶ “IsSpammable-Django Model” to be used with the Spam Filter (in the “app_label.model” format, e.g. “examples.CommentOfMySite”).
If you choose not to use a Spammable Model, you can specify where the data is held (Spammable Content and Labels) via the Data Columns and Labels Column sections.
Classifier¶
The Classifier model to be used for discerning the Spam.
Any implementation of a Supervised Learning Technique using a scikit-learn classifier will work.
SpamFilter.classifier
Classifier to be used in the System¶ Classifier to be used in the System, in the “app_label.model|name” format, e.g. “supervised_learning.SVC|My SVM”.
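To make the “app_label.model|name” format concrete, here is a minimal sketch of how such a reference could be split into its parts. The `parse_classifier_ref` helper and the `ClassifierRef` type are hypothetical names introduced for illustration; they are not part of the Spam Filtering API.

```python
from typing import NamedTuple


class ClassifierRef(NamedTuple):
    """Hypothetical container for the three parts of a classifier reference."""
    app_label: str
    model: str
    name: str


def parse_classifier_ref(ref: str) -> ClassifierRef:
    """Split an "app_label.model|name" string into its three parts."""
    # Split on the first "|" only, so the object name may itself contain "|".
    model_path, sep, name = ref.partition("|")
    # Split on the first "." to separate the app label from the model name.
    app_label, dot, model = model_path.partition(".")
    if not (sep and dot and app_label and model and name):
        raise ValueError(f"expected 'app_label.model|name', got {ref!r}")
    return ClassifierRef(app_label, model, name)


ref = parse_classifier_ref("supervised_learning.SVC|My SVM")
print(ref.app_label, ref.model, ref.name)
```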
Cross Validation¶
Cross Validation (CV) is used to estimate the performance of the Spam Filter. The reported estimate is the mean and the 2-standard-deviation interval of the metric evaluated in each CV fold.
CV is done with the scikit-learn engine; the scikit-learn documentation gives more general information on cross validation and details the available metrics.
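As an illustration of the reported estimate, the following sketch computes the mean and the 2-standard-deviation interval from a list of per-fold scores. The fold scores are made up, the `summarize_cv` helper is hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption about how the interval is computed.

```python
from statistics import mean, pstdev


def summarize_cv(scores):
    """Return (mean, lower, upper) for the mean +/- 2 standard deviations."""
    m = mean(scores)
    # Population standard deviation: an assumption, matching NumPy's default.
    half_width = 2 * pstdev(scores)
    return m, m - half_width, m + half_width


# Hypothetical accuracy scores from a 5-fold Cross Validation.
fold_scores = [0.90, 0.92, 0.88, 0.91, 0.89]
m, low, high = summarize_cv(fold_scores)
print(f"{m:.3f} (+/- {high - m:.3f})")
```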
SpamFilter.cv_is_enabled
Enable Cross Validation?¶ Whether to enable Cross Validation.
SpamFilter.cv_folds
Cross Validation Folds¶ Number of folds to be used in Cross Validation.
SpamFilter.cv_metric
Cross Validation Metric¶ Metric to be evaluated in Cross Validation.
Pre-Training¶
Pre-training refers to providing the model with “initial” data, as a way of initializing the model. See Spam Filter Pre-Training for more details.
SpamFilter.pretraining
¶ The SpamFilterPreTraining subclass whose data is used to pre-train the Spam Filter.
Bag of Words Representation¶
The Bag of Words representation (BoW) is a suitable representation for many Natural Language Processing problems - such as text classification.
If it is not enabled, the Spam Filter will use the UTF-8 code point representation for the corpus: each character is represented on an axis and its value is its UTF-8 code, e.g. Hola!HOLA! will be represented as (72, 111, 108, 97, 33, 72, 79, 76, 65, 33), and the input dimensionality will be the maximum length of the texts in the corpus.
For more information on the transformation, see the Spam Filtering with SVM (Example 3) and the Engine documentation.
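The code point representation can be sketched in a few lines. This is an illustration rather than the system's implementation: it uses Python's `ord()`, which returns the Unicode code point (identical to the values above for ASCII text), and the padding value used for texts shorter than the longest one is an assumption, since the source does not state it.

```python
def to_code_points(text):
    """Represent each character by its Unicode code point."""
    return [ord(ch) for ch in text]


def encode_corpus(texts, pad_value=0):
    """Encode every text and right-pad to the length of the longest one.

    The padding value is an assumption made for this sketch.
    """
    width = max((len(t) for t in texts), default=0)
    return [to_code_points(t) + [pad_value] * (width - len(t)) for t in texts]


print(to_code_points("Hola!HOLA!"))
# [72, 111, 108, 97, 33, 72, 79, 76, 65, 33]
print(encode_corpus(["Hola!", "Hi"]))
# [[72, 111, 108, 97, 33], [72, 105, 0, 0, 0]]
```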
SpamFilter.bow_is_enabled
Enable Bag of Words representation?¶ Enable Bag of Words transformation.
SpamFilter.bow_use_tf_idf
(BoW) Use TF-IDF transformation?¶ Use the TF-IDF transformation?
SpamFilter.bow_analyzer
(BoW) Analyzer¶ Whether the feature should be made of word or character n-grams. Option ‘Chars in W-B’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
SpamFilter.bow_ngram_range_min
(BoW) n-gram Range - Min¶ The lower boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
SpamFilter.bow_ngram_range_max
(BoW) n-gram Range - Max¶ The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
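To show what the n-gram range means in practice, the sketch below extracts all word n-grams with min_n <= n <= max_n from a text. It is a simplified illustration: tokens are obtained by splitting on whitespace, which is an assumption and not how scikit-learn tokenizes, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    # Whitespace tokenization is a simplification made for this sketch.
    tokens = text.lower().split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("buy cheap pills", 1, 2))
# ['buy', 'cheap', 'pills', 'buy cheap', 'cheap pills']
```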
SpamFilter.bow_max_df
(BoW) Maximum Document Frequency¶ When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
SpamFilter.bow_min_df
(BoW) Minimum Document Frequency¶ When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
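The effect of the document frequency thresholds can be illustrated with a small sketch. It assumes whitespace tokenization and a made-up corpus, and `filter_by_document_frequency` is a hypothetical helper, not the scikit-learn implementation. A float threshold is read as a proportion of documents and an integer as an absolute count, as described above.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    A float threshold is a proportion of documents; an int is an absolute count.
    """
    n_docs = len(docs)
    # Count each term at most once per document.
    df = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = ["free pills now", "free offer now", "meeting notes now", "free lunch now"]
# "now" appears in all 4 documents (above 75%); most other terms appear only once.
print(filter_by_document_frequency(docs, min_df=2, max_df=0.75))
# ['free']
```

Here a high max_df cut-off removes a term that behaves like a corpus-specific stop word, while min_df removes rare terms.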
SpamFilter.bow_max_features
(BoW) Maximum Features¶ If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
Bag of Words Transformation - Miscellaneous¶
SpamFilter.bow_binary
(BoW) Use Binary representation?¶ If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
SpamFilter.bow_enconding
(BoW) Encoding¶ Encoding to be used to decode the corpus.
SpamFilter.bow_decode_error
(BoW) Decode Error¶ Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
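The three decode error modes correspond to the `errors` argument of Python's `bytes.decode`. The sketch below shows their behaviour on a byte string that is not valid UTF-8; the `decode` wrapper and its parameter names are hypothetical and only mirror the two fields above.

```python
# "café" encoded as Latin-1 gives b"caf\xe9", which is not valid UTF-8.
raw = "caf\u00e9".encode("latin-1")


def decode(data, encoding="utf-8", decode_error="strict"):
    """Decode bytes, handling invalid sequences according to decode_error."""
    return data.decode(encoding, errors=decode_error)


# 'ignore' silently drops the invalid byte.
print(ascii(decode(raw, decode_error="ignore")))
# 'replace' substitutes the Unicode replacement character U+FFFD.
print(ascii(decode(raw, decode_error="replace")))
# 'strict' (the default) raises UnicodeDecodeError.
try:
    decode(raw)
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)
```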
SpamFilter.bow_strip_accents
(BoW) Strip Accents¶ Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
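The difference between the two accent-stripping methods can be sketched with the standard library. These functions approximate the behaviour described above and are not copied from scikit-learn: both decompose characters with NFKD normalization, then the ‘unicode’ variant drops combining marks while the ‘ascii’ variant drops every non-ASCII character.

```python
import unicodedata


def strip_accents_unicode(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_accents_ascii(text):
    """Decompose characters (NFKD) and drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents_unicode("canci\u00f3n"))  # accented "o" becomes plain "o"
# A character with no ASCII mapping survives 'unicode' but is lost with 'ascii'.
print(ascii(strip_accents_unicode("stra\u00dfe")), ascii(strip_accents_ascii("stra\u00dfe")))
```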
SpamFilter.bow_stop_words
(BoW) Stop Words¶ If ‘english’, a built-in stop word list for English is used. If a comma-separated string, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
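As a rough sketch of the comma-separated case, the following code turns such a string into a set of stop words and removes them from a list of tokens. Both helpers are hypothetical, the lowercasing is an assumption, and the special ‘english’ value is not handled here.

```python
def parse_stop_words(value):
    """Turn a comma-separated string into a set of stop words (None -> empty set)."""
    if not value:
        return set()
    return {word.strip().lower() for word in value.split(",") if word.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


stop_words = parse_stop_words("the, a, of")
print(remove_stop_words("the price of a pill".split(), stop_words))
# ['price', 'pill']
```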
SpamFilter.bow_vocabulary
(BoW) Vocabulary¶ A Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
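The two constraints on the vocabulary indices (no repeats, no gaps between 0 and the largest index) can be checked before saving the field. The `is_valid_vocabulary` helper below is a hypothetical sketch of such a check and is not part of the system; treating an empty mapping as invalid is an assumption.

```python
def is_valid_vocabulary(vocabulary):
    """Check that indices are exactly 0..n-1, each used once."""
    indices = sorted(vocabulary.values())
    # A repeated index or a gap both make the sorted list differ from 0..n-1.
    return bool(indices) and indices == list(range(len(indices)))


print(is_valid_vocabulary({"free": 0, "pills": 1, "offer": 2}))  # True
print(is_valid_vocabulary({"free": 0, "pills": 2}))              # False (gap at 1)
print(is_valid_vocabulary({"free": 0, "pills": 0}))              # False (repeated 0)
```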
API¶
SpamFilter extends the Supervised Learning Technique in several ways.
IsSpammable¶
IsSpammable is a Django Abstract Model (AM) meant to make working with the Spam Filtering cycle convenient.
The AM provides the fields, options and .save() method to attach the model to a Spam Filter.
Once attached to a Spam Filter, the data held in the Django model will be used for training the Filter, and the Filter will be used to classify new data created in the model (on .save(), if the Spam Filter is inferred).
class systems.spam_filtering.models.IsSpammable(*args, **kwargs)¶
This Abstract Model (AM) is meant to be used in Django models which may receive Spam.
Usage:
- Make your model inherit from this AM.
- Set the SPAM_FILTER constant to the name of the Spam Filter object you would like to use.
- Set the SPAMMABLE_FIELD constant to the name of the field which stores the content.
Example:
    class CommentsOfMySite(IsSpammable):
        SPAM_FILTER = "Comment Spam Filter"
        SPAMMABLE_FIELD = "comment"
        ... # The rest of your code
Fields and Settings¶
IsSpammable.SPAMMABLE_FIELD = None¶
Name of the field which stores the Spammable Content.
IsSpammable.SPAM_LABEL_FIELD = 'is_spam'¶
Name of the field which stores the Spam labels.
IsSpammable.SPAM_FILTER = None¶
Name of the Spam Filter object to be used.
IsSpammable.is_spam
Is Spam?¶ Whether the object is Spam (the label of the object).
IsSpammable.is_misclassified
Is Misclassified?¶ Whether the object has been misclassified by the Spam Filter. Useful for some algorithms and for understanding the filter.
IsSpammable.is_revised
Is Revised?¶ Whether the object's classification has been revised by a human. Needed for proper training and automation.
Usage¶
- Make your model inherit from this AM.
- Choose the Spam Filter to be attached by setting the SPAM_FILTER constant to the name of the Spam Filter object you would like to use.
- Set the SPAMMABLE_FIELD constant to the name of the field which stores the content.
- Make and run migrations.
Example¶
class CommentsOfMySite(IsSpammable):
    SPAM_FILTER = "Comment Spam Filter"
    SPAMMABLE_FIELD = "comment"
    ... # The rest of your code
Other Considerations¶
Technically, what makes a Django model “pluggable” into a Spam Filter as a source of data for training is:
- The SPAMMABLE_FIELD constant, which defines the field where the content is stored.
- The SPAM_LABEL_FIELD constant, which defines the field where the label is stored (defaults to is_spam).
- A NullBooleanField to store the labels of the objects.
If you do not want to inherit from the AM, any model with these three defined will work as a Spammable Model in the Spam Filter setup. The only thing left to complete the system is automating the classification of new objects.
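The “pluggable” requirement can be sketched as a simple duck-typing check on the two constants. This is an illustration only: `missing_spammable_settings` is a hypothetical helper, plain classes stand in for Django models so the sketch runs without a Django project, and the presence of the NullBooleanField itself is not verified.

```python
def missing_spammable_settings(model_cls):
    """List which of the two constants a class lacks (empty list means none)."""
    required = ("SPAMMABLE_FIELD", "SPAM_LABEL_FIELD")
    return [name for name in required if not getattr(model_cls, name, None)]


class Comment:
    """Stand-in for a Django model that defines both constants."""
    SPAMMABLE_FIELD = "comment"
    SPAM_LABEL_FIELD = "is_spam"


class Review:
    """Stand-in for a Django model that lacks the label constant."""
    SPAMMABLE_FIELD = "body"


print(missing_spammable_settings(Comment))  # []
print(missing_spammable_settings(Review))   # ['SPAM_LABEL_FIELD']
```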
Spam Filter Pre-Training¶
Pre-training refers to providing the model with other data, “external” data, as an initialization. That data is incorporated into the training dataset of the model.
SpamFilterPreTraining is a Django Abstract Model (AM) meant to make pre-training the Spam Filter convenient.
class systems.spam_filtering.models.SpamFilterPreTraining(*args, **kwargs)¶
Abstract Model for pre-training Spam Filters. Subclass this Model to incorporate datasets into the training of a Spam Filter (the subclass must be set in the Spam Filter’s pretraining field).
Usage¶
- Create a Django Model that inherits from SpamFilterPreTraining
- Make and run migrations
- Import data to the Django Model
- Set the Spam Filter pre-training field to use the pre-training model
Example¶
class SFPTEnron(SpamFilterPreTraining):
    class Meta:
        verbose_name = "Spam Filter Pre-Training: Enron Email Data"
        verbose_name_plural = "Spam Filter Pre-Training: Enron Emails Data"
Other Considerations¶
Technically, what makes a Django model “pluggable” into a Spam Filter as a source of pre-training is either the content and is_spam fields, or the SPAMMABLE_FIELD and SPAM_LABEL_FIELD constants defined in the class, pointing to a Text or Char field and a Boolean field, respectively.
If you do not want to inherit, define either or both in your Django Model and it will be “pluggable” as a pre-training dataset.
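One way to read the rule above is as a lookup with defaults: use the constants when they are defined, otherwise fall back to the content and is_spam field names. The sketch below illustrates that reading with plain classes standing in for Django models; `resolve_pretraining_fields` is a hypothetical helper and the fallback order is an assumption.

```python
def resolve_pretraining_fields(model_cls):
    """Return (content_field, label_field), falling back to the default names."""
    content = getattr(model_cls, "SPAMMABLE_FIELD", None) or "content"
    label = getattr(model_cls, "SPAM_LABEL_FIELD", None) or "is_spam"
    return content, label


class EnronEmail:
    """Stand-in for a model that relies on the default field names."""


class ForumPost:
    """Stand-in for a model that points the constants at its own fields."""
    SPAMMABLE_FIELD = "body"
    SPAM_LABEL_FIELD = "flagged"


print(resolve_pretraining_fields(EnronEmail))  # ('content', 'is_spam')
print(resolve_pretraining_fields(ForumPost))   # ('body', 'flagged')
```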