b4msa

b4msa is a multilingual framework that can serve as a baseline for sentiment analysis classifiers, as well as a starting point for building new sentiment analysis systems.

b4msa extends our work on creating a text classifier (see microTC) by incorporating different language-dependent techniques, such as:

  • Stemming

  • Stopwords

  • Negations
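
To make two of these techniques concrete, the following is a minimal plain-Python sketch of stopword handling and negation marking. It does not use b4msa and is not its implementation: the word lists, the `_sw` placeholder, and the helper names `handle_stopwords` and `mark_negations` are hypothetical and chosen only for illustration. The three stopword policies mirror the `none | group | delete` options of the `stopwords` parameter; what `group` does inside b4msa is assumed here to be a replacement by a common placeholder.

```python
import re

# Hypothetical, tiny resources for illustration only; b4msa relies on its
# own language-specific lists and rules.
STOPWORDS = {"the", "a", "is", "of"}
NEGATION_WORDS = {"not", "no", "never"}


def handle_stopwords(tokens, option="none"):
    """Apply one of three stopword policies: none, group, or delete."""
    if option == "none":
        return list(tokens)
    if option == "delete":
        return [t for t in tokens if t not in STOPWORDS]
    if option == "group":
        # Assumption: every stopword collapses into one placeholder token.
        return ["_sw" if t in STOPWORDS else t for t in tokens]
    raise ValueError("unknown option: %r" % option)


def mark_negations(tokens):
    """Join a negation word with the token that follows it."""
    output = []
    pending = None  # negation word waiting for the next token
    for token in tokens:
        if pending is None and token in NEGATION_WORDS:
            pending = token
        elif pending is not None:
            output.append(pending + "_" + token)
            pending = None
        else:
            output.append(token)
    if pending is not None:
        # A trailing negation word has nothing to attach to.
        output.append(pending)
    return output


tokens = re.findall(r"\w+", "The movie is not good".lower())
print(handle_stopwords(tokens, "delete"))  # ['movie', 'not', 'good']
print(handle_stopwords(tokens, "group"))   # ['_sw', 'movie', '_sw', 'not', 'good']
print(mark_negations(tokens))              # ['the', 'movie', 'is', 'not_good']
```

Joining the negation word with its neighbour lets a bag-of-words classifier distinguish "good" from "not_good" without any syntactic analysis.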

b4msa is described in A Simple Approach to Multilingual Polarity Classification in Twitter. Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, Daniela Moctezuma, Ranyart R. Suárez, Oscar S. Siordia. Pattern Recognition Letters.

Citing

If you find b4msa useful for any academic/scientific purpose, we would appreciate citations to the following reference:

  @article{b4msa,
title = {A {Simple} {Approach} to {Multilingual} {Polarity} {Classification} in {Twitter}},
issn = {0167-8655},
url = {http://www.sciencedirect.com/science/article/pii/S0167865517301721},
doi = {10.1016/j.patrec.2017.05.024},
abstract = {Recently, sentiment analysis has received a lot of attention due to the interest in mining opinions of social media users. Sentiment analysis consists in determining the polarity of a given text, i.e., its degree of positiveness or negativeness. Traditionally, Sentiment Analysis algorithms have been tailored to a specific language given the complexity of having a number of lexical variations and errors introduced by the people generating content. In this contribution, our aim is to provide a simple to implement and easy to use multilingual framework, that can serve as a baseline for sentiment analysis contests, and as a starting point to build new sentiment analysis systems. We compare our approach in eight different languages, three of them correspond to important international contests, namely, SemEval (English), TASS (Spanish), and SENTIPOLC (Italian). Within the competitions, our approach reaches from medium to high positions in the rankings; whereas in the remaining languages our approach outperforms the reported results.},
urldate = {2017-05-24},
journal = {Pattern Recognition Letters},
author = {Tellez, Eric S. and Miranda-Jiménez, Sabino and Graff, Mario and Moctezuma, Daniela and Suárez, Ranyart R. and Siordia, Oscar S.},
keywords = {Error-robust text representations, Multilingual sentiment analysis, Opinion mining},
year = {2017}
}

Installing b4msa

b4msa can be easily installed using Anaconda:

conda install -c ingeotec b4msa

It can also be installed using pip. It depends on numpy, scipy and scikit-learn, which are installed first in the commands below, together with microtc and nltk.

pip install numpy
pip install scipy
pip install scikit-learn
pip install microtc
pip install nltk
pip install b4msa
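
After installing, one way to confirm that the packages are visible to the current interpreter is to look them up without importing them. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the import names are assumed to match the pip package names except for scikit-learn, which is imported as `sklearn`.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the packages listed in the pip commands.
required = ["numpy", "scipy", "sklearn", "microtc", "nltk", "b4msa"]
print(missing_packages(required))
```

An empty list means every package was located; any name printed still needs to be installed.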

Text Model

b4msa extends our work on creating a text classifier (specifically microtc.textmodel.TextModel) by incorporating different language-dependent techniques.

class b4msa.textmodel.TextModel(docs=None, threshold=0, lang=None, negation=None, stemming=None, stopwords=None, **kwargs)
Parameters
  • docs (list) – Corpus

  • threshold (float) – Threshold to remove those tokens less than 1 - entropy

  • lang (str) – Language (spanish | english | italian | german)

  • negation (bool) – Negation

  • stemming (bool) – Stemming

  • stopwords (str) – Stopwords (none | group | delete)

Usage:

>>> from b4msa.textmodel import TextModel
>>> corpus = ['buenos dias', 'catedras conacyt', 'categorizacion de texto ingeotec']
>>> textmodel = TextModel().fit(corpus)

Represent a text as a vector

>>> vector = textmodel['cat']

Train a classifier

>>> from sklearn.svm import LinearSVC
>>> y = [1, 0, 0]
>>> textmodel = TextModel().fit(corpus)
>>> m = LinearSVC().fit(textmodel.transform(corpus), y)
>>> m.predict(textmodel.transform(corpus))
array([1, 0, 0])

classmethod default_parameters(lang=None)

Default parameters per language

>>> from b4msa.textmodel import TextModel
>>> TextModel.default_parameters()
{'token_list': [-2, -1, 2, 3, 4]}
>>> _ = TextModel.default_parameters(lang='arabic')
>>> k = list(_.keys())
>>> k.sort()
>>> [(i, _[i]) for i in k]
[('del_punc', True), ('ent_option', 'delete'), ('negation', False), ('stemming', False), ('stopwords', 'delete'), ('token_list', [-1, 2, 3, 4])]
>>> _ = TextModel.default_parameters(lang='english')
>>> k = list(_.keys())
>>> k.sort()
>>> [(i, _[i]) for i in k]
[('del_diac', False), ('negation', False), ('num_option', 'delete'), ('stemming', False), ('stopwords', 'none'), ('token_list', [[3, 1], -2, -1, 3, 4])]
>>> _ = TextModel.default_parameters(lang='spanish')
>>> k = list(_.keys())
>>> k.sort()
>>> [(i, _[i]) for i in k]
[('negation', False), ('stemming', False), ('stopwords', 'none'), ('token_list', [[2, 1], -1, 2, 3, 4, 5, 6])]
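
The `token_list` values above mix negative integers, positive integers and pairs. As we understand the convention inherited from microtc (stated here as an assumption, not taken from this document), negative values select word n-grams, positive values select character q-grams, and a pair `[n, k]` selects skip-grams of `n` words with `k` words skipped between them. The sketch below illustrates that reading in plain Python; the function names are hypothetical, and the text normalization that microtc applies before tokenizing is omitted.

```python
def word_ngrams(words, n):
    """Contiguous sequences of n words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    """Contiguous sequences of q characters, spaces included."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def skip_grams(words, n, skip):
    """Sequences of n words, leaving out `skip` words between each pair."""
    step = skip + 1
    span = (n - 1) * step + 1  # number of positions one skip-gram covers
    return [" ".join(words[i:i + span:step])
            for i in range(len(words) - span + 1)]


def tokenize(text, token_list):
    """Expand a text into tokens following the assumed token_list convention."""
    words = text.split()
    tokens = []
    for item in token_list:
        if isinstance(item, (list, tuple)):
            n, skip = item
            tokens.extend(skip_grams(words, n, skip))
        elif item < 0:
            tokens.extend(word_ngrams(words, -item))
        else:
            tokens.extend(char_qgrams(text, item))
    return tokens


print(tokenize("buenos dias amigos", [[2, 1], -2]))
# ['buenos amigos', 'buenos dias', 'dias amigos']
```

Character q-grams are what make the representation tolerant to misspellings, since a single wrong letter only changes a few of the q-grams of a word.
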

fit(X)

Train the model

Parameters

X (list) – Corpus

Return type

instance

classmethod params()

Parameters

>>> from b4msa.textmodel import TextModel
>>> TextModel.params()
['docs', 'threshold', 'lang', 'negation', 'stemming', 'stopwords', 'kwargs', 'docs', 'text', 'num_option', 'usr_option', 'url_option', 'emo_option', 'hashtag_option', 'ent_option', 'lc', 'del_dup', 'del_punc', 'del_diac', 'token_list', 'token_min_filter', 'token_max_filter', 'select_ent', 'select_suff', 'select_conn', 'weighting']

text_transformations(text)

Language dependent transformations

Parameters

text (str) – text

Return type

str