Text Categorization by Machine Learning
Keila Barbosa Costa
keilabarbosa@laccan.ufal.br
Instituto de Computação
Programa de Pós-Graduação em Modelagem Computacional de Conhecimento
Laboratório de Computação Científica e Análise Numérica
Universidade Federal de Alagoas
Advisor: Alejandro C. Frery
Maceió-AL, July 2019

Outline
Introduction
Problem
Contributions
Objectives
Methodology
Results
Conclusion

Scoping the Field
Artificial intelligence (AI) is the science that studies the modeling of
human-like intelligence exhibited by machines or software.

Supervised Learning
Supervised learning is a machine learning task that consists of
mapping an input to an output based on example input-output pairs.

Text Classification: Definition
Text classification is a classic topic in natural language processing,
in which predefined categories must be assigned to free-text
documents (ZHANG; ZHAO; LECUN, 2015).
Input:
A document d
A fixed set of classes C = {c1, c2, ..., cJ}
Output:
A predicted class c ∈ C

Classification Methods: Supervised Machine Learning
Input:
A document d
A fixed set of classes C = {c1, c2, ..., cJ}
A training set of m labeled documents (d1, c1), ..., (dm, cm)
Output:
A learned classifier γ : d → c
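The mapping from a labeled training set to a learned classifier γ can be illustrated with a minimal sketch. It assumes a multinomial Naïve Bayes model with add-one smoothing, chosen here only because it is short; the function name `train_naive_bayes` and the toy titles are hypothetical and are not taken from the study's data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Learn a classifier gamma: d -> c from (document, class) pairs."""
    class_counts = Counter()             # number of training documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    total_docs = sum(class_counts.values())

    def gamma(text):
        """Return the class with the highest log-posterior for a new document."""
        best_label, best_score = None, -math.inf
        for label, n_docs in class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(word_counts[label].values())
            for token in text.lower().split():
                if token not in vocabulary:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities
                count = word_counts[label][token] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return gamma


# Toy training set: hypothetical titles, labels borrowed from the GRSL categories
training_set = [
    ("snow cover and glacier melt", "Cryosphere"),
    ("sea ice thickness from altimetry", "Cryosphere"),
    ("rain and aerosol retrieval", "Atmosphere"),
    ("ionospheric irregularities and precipitation", "Atmosphere"),
]
gamma = train_naive_bayes(training_set)
predicted = gamma("glacier snow melt onset")
print(predicted)
```

The training function returns γ as a closure, so the learned counts stay attached to the classifier and new documents can be classified with a single call.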

How does text classification work?
[Figure: illustration of text classification; category labels shown: Atmosphere, SAR Data, Cryosphere.]

Its most common applications:
Document indexing for Information Retrieval Systems (IRS) (LEWIS, 1992);
Categorization of messages and news and of publication abstracts, and text filtering and summarization (GEUTNER; BODENHAUSEN; WAIBEL, 1993; ELBERRICHI; RAHMOUN; BENTAALAH, 2008; HAYES et al., 1990; HAYES; WEINSTEIN, 1991; MASAND; LINOFF; WALTZ, 1992);
Spam detection (SILVA; YAMAKAMI; ALMEIDA, 2012; WU et al., 2017, 2017);
Community identification (BARROS et al., 2018);
Language identification (MALMASI; DRAS, 2015; RANGEL et al., 2017);
Sentiment analysis (PANG; LEE; VAITHYANATHAN, 2002; GO; BHAYANI; HUANG, 2009);
...

The Problem
How can the index of IEEE Geoscience and Remote Sensing Letters be
built automatically with the least loss of accuracy?
How accurate are machine learning algorithms when classifying
documents such as those treated in this work?
What problems arise in multi-class classification on an imbalanced
data set?

Contributions
IEEE Geoscience and Remote Sensing Letters (GRSL) is a monthly
publication of short papers that addresses new ideas and formative
concepts in remote sensing.
Its index is compiled manually by the Editor-in-Chief.
With the proposed method, it will therefore be possible to:
Reduce the time spent on manual work.
Speed up the process, benefiting the IEEE GRSS (the society
responsible for publishing IEEE GRSL) and the academic community.

Contributions from a Scientific Point of View
The effectiveness of automated classifiers is not flawless;
A search of the Web of Science database found 197 articles published
from 2016 to 2018 whose titles were relevant to the topic;
Document classification with a large number of classes is an active
and relevant area.

Objective
General
The objective of this work is to compare computational approaches for
automatically classifying text documents into a predefined category
using machine learning, with emphasis on deep learning.

Objectives
[Figure: table of contents of IEEE Geoscience and Remote Sensing Letters, August 2018, Volume 15, Number 8 (ISSN 1545-598X), with papers grouped under category headings such as Atmosphere, Oceans and Water, Vegetation and Land Surface, Surface and Subsurface Properties, and Radar Data.]
Specific
Automatically build the index of IEEE GRSL;
Classify the articles based on the Title and the Abstract;
Observe the performance of different models;
Evaluate the ability of the algorithms to categorize;
Compare the performance of classical machine learning techniques with that of deep learning techniques;
Explore the deep network models with LIME/TensorBoard.

CRISP-DM
The reference model for this research is the Cross Industry Standard Process
for Data Mining (CRISP-DM), used in the text mining stage
(WIRTH; HIPP, 2000).
Figure 1: CRISP-DM diagram
Phases of the process:
1 Literature review and planning how to reach the objectives;
2 Understanding, collecting, exploring, and verifying the quality of the data;
3 Data pre-processing;
4 Application of the machine learning models;
5 Analysis of results and tests;
6 Validation.

Data Collection
Collected directly from the journal's website.
Data from 2004 to August 2018 were collected as downloads in
BibTeX format.
2830 articles and 17 categories.
Figure 2: Distribution of the data set by category.

Tools

Data Pre-processing
The texts underwent Natural Language Processing (NLP) using the
R packages tm and RTextTools.
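As a rough illustration of what such pre-processing usually involves, the sketch below lowercases a title, strips punctuation and digits, and removes stop words. The slides do not list which steps were actually applied, and the work used the R packages named above; this Python version, the `preprocess` function, and the short `STOP_WORDS` list are assumptions made for illustration only.

```python
import re

# Small illustrative stop-word list (hypothetical; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


title = ("Millimeter-Wave Ultrahigh Resolution SAR Image Classification "
         "Based on a New Feature Set")
tokens = preprocess(title)
print(tokens)
```

Replacing punctuation with a space, rather than deleting it, keeps hyphenated compounds such as "Millimeter-Wave" as two separate tokens.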

TF-IDF
For automatic text classification with the classical methods, the
TF-IDF (Term Frequency-Inverse Document Frequency) indexing method
was applied, in which the term frequency is weighted by the inverse
document frequency (IDF).
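The weighting can be made concrete with a small sketch. It assumes one common TF-IDF variant (relative term frequency multiplied by the natural log of N over the document frequency); the exact variant used in the study is not stated, and the `tf_idf` function and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        if not tokens:
            weights.append({})  # an empty document has no weighted terms
            continue
        counts = Counter(tokens)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / len(tokens)                 # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # inverse document frequency
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights


docs = [
    ["sar", "image", "classification"],
    ["lidar", "point", "cloud", "classification"],
    ["sar", "interferometry"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first toy document, "image" occurs in no other document and therefore receives a higher weight than "sar" or "classification", which each appear in two of the three documents.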

TF-IDF by category for the Abstract
[Figure: one panel per category, each listing the words with the highest TF-IDF in that category's abstracts; axes: Words and TF-IDF. Terms shown include sar, polsar, lidar, hyperspectral, unmixing, snow, glacier, and moisture.]

What is the subject of this article?
[Figure: first page of an article from IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 8, August 2018, p. 1204.]
Title: Millimeter-Wave Ultrahigh Resolution SAR Image Classification Based on a New Feature Set
Authors: Wenjin Wu, Xinwu Li, Huadong Guo, and Lei Liang
Abstract: Aiming at the problems and prospects in millimeter-wave ultrahigh resolution synthetic aperture radar applications, we have developed a method with a new feature set for sophisticated classification of large images. It includes innovative parameters derived from different kinds of spectral and characteristic signatures, such as the correlation signature, radial spectrum, and angular spectrum. These features can mine repetitive information from the fragmented patterns and enhance the texture description in different aspects. In the experiment, the proposed feature set achieves 89% overall accuracy which is 25% higher compared with the gray-level co-occurrence matrix feature set. The four new features contribute to over 50% of the accuracy improvement with a significant increase of the accuracy for vehicles and show a fair performance for all the categories.
Index Terms: Feature extraction, image classification, ultrahigh resolution (UHR) synthetic aperture radar (SAR).
Digital Object Identifier 10.1109/LGRS.2018.2830794

Categories
1- Atmosphere;
2- Cryosphere;
3- Hyperspectral Data Processing;
4- Image Processing and Analysis;
5- Image Processing, Analysis and Classification;
6- Lidar Data;
7- Lidar Systems;
8- Microwave Radiometry;
9- Miscellaneous Applications;
10- Oceans and Water;
11- Optical Data;
12- Radar Data;
13- Radar Systems;
14- SAR Data;
15- Surface and Subsurface Properties;
16- Synthetic Aperture Radar;
17- Vegetation and Land Surface.

Text Classification Algorithms
Some of the most popular machine learning algorithms for building
text classification models include:
Maximum Entropy (Maximum Entropy Modeling, MaxEnt) (JURKA, 2012);
Support Vector Machine (SVM) (DIMITRIADOU et al., 2008);
Bootstrap Aggregating (Bagging) (PETERS; HOTHORN; LAUSEN, 2002);
Boosting (TUSZYNSKI, 2012);
Neural Networks (NNET) (VENABLES; RIPLEY, 2002);
Random Forest (RF) (LIAW; WIENER et al., 2002);
Scaled Linear Discriminant Analysis (SLDA) (PETERS; HOTHORN; LAUSEN, 2002);
Decision Trees (TREE) (VENABLES; RIPLEY, 2002);
Naïve Bayes (NB) (BAYES, 1763).

Validation Strategies for Supervised Algorithms
In our experiments, we used cross-validation, a technique for
assessing how well the model generalizes.
Figure 5: K-fold cross-validation.
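The splitting step of k-fold cross-validation can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function name `k_fold_indices` is hypothetical, the value of k is arbitrary, and no shuffling or stratification is performed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each document is used for testing exactly once and for training in the remaining folds. With imbalanced classes, shuffling and stratifying the folds by category would be a reasonable refinement.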

Evaluation Metrics
In the field of information retrieval, precision is the fraction of
retrieved documents that are relevant to the query, and recall is the
fraction of relevant documents that are successfully retrieved. The F1
score is the harmonic mean of the two.
\[
\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \tag{1}
\]
\[
\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \tag{2}
\]
\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{3}
\]
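For a classifier, the "relevant" documents of a class are those that truly belong to it and the "retrieved" documents are those predicted as that class. Under that reading, equations (1) to (3) can be computed per class as in the sketch below; the function name and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated as the 'relevant' set."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["SAR Data", "SAR Data", "Cryosphere", "Atmosphere", "Cryosphere"]
y_pred = ["SAR Data", "Cryosphere", "Cryosphere", "Atmosphere", "Cryosphere"]
p, r, f1 = precision_recall_f1(y_true, y_pred, "Cryosphere")
print(p, r, f1)
```

In this toy example, three documents are predicted as Cryosphere and two of them are correct, so precision is 2/3; both true Cryosphere documents are found, so recall is 1. Averaging the per-class values over all categories gives a single multi-class score.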

Deep Learning Models
Models using deep networks:
Recurrent Neural Networks (RNN), including Long Short-Term Memory (LSTM): speech recognition, language translation, stock prediction, and image recognition for describing the content of images;
Convolutional Neural Networks (CNN, ConvNets) (ZHANG; ZHAO; LECUN, 2015);
Word embeddings (Word2Vec/GloVe) (MIKOLOV et al., 2013).

Deep Learning Neural Network
Figure 6: Architecture of a simple neural network and of a deep
learning neural network.

What are word embeddings?
Word embedding (Word2Vec/GloVe) is a method for mapping the words of
a vocabulary to dense vectors of real numbers, in which semantically
similar words are mapped to nearby points (PENNINGTON; SOCHER;
MANNING, 2014).
Figure 7: Output of a word embedding model. Source:
(MIKOLOV et al., 2013)
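"Nearby points" is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are not trained Word2Vec or GloVe embeddings, and real embeddings typically have tens to hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made vectors, for illustration only (not trained embeddings)
embeddings = {
    "snow": [0.9, 0.1, 0.0],
    "ice": [0.8, 0.2, 0.1],
    "radar": [0.1, 0.9, 0.3],
}
sim_snow_ice = cosine_similarity(embeddings["snow"], embeddings["ice"])
sim_snow_radar = cosine_similarity(embeddings["snow"], embeddings["radar"])
print(sim_snow_ice, sim_snow_radar)
```

With these toy vectors, "snow" and "ice" point in almost the same direction, while "snow" and "radar" do not, which is the behavior a trained embedding is expected to show for semantically related and unrelated words.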

Recurrent Neural Networks (RNN)
Definition
RNNs are neural networks that are good at modeling sequence data for
prediction, but they suffer from short-term memory.
This short-term memory problem does not mean that RNNs should be
discarded altogether: more evolved variants such as LSTMs or GRUs
can be used instead.
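The short-term memory effect can be seen in a toy recurrence. The sketch assumes a one-unit vanilla RNN with fixed, hypothetical weights (nothing is learned here) and compares how much the first input still influences the hidden state after 2 steps and after 20 steps.

```python
import math


def rnn_final_state(inputs, w_x=1.0, w_h=0.5):
    """Run a one-unit vanilla RNN over a sequence and return the last hidden state."""
    h = 0.0
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h)
    return h


# Two sequences that differ only in their first element
seq_a = [1.0] + [0.0] * 19
seq_b = [0.0] + [0.0] * 19
gap_short = abs(rnn_final_state(seq_a[:2]) - rnn_final_state(seq_b[:2]))
gap_long = abs(rnn_final_state(seq_a) - rnn_final_state(seq_b))
print(gap_short, gap_long)
```

After two steps the difference between the two hidden states is still clearly visible; after twenty steps it has almost vanished, because each step shrinks the contribution of earlier inputs. Gated variants such as LSTM and GRU add mechanisms that let the network preserve such information over longer spans.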

Long Short-Term Memory (LSTM)
Figure 8: Architecture of the Long Short-Term Memory (LSTM) model
applied to language modeling for text classification.

Understanding Convolutional Neural Networks for NLP
(ConvNets/CNN)
Definition
A CNN is a class of deep, feed-forward artificial neural networks
(in which the connections between nodes do not form a cycle) that
uses a variation of multilayer perceptrons designed to require
minimal pre-processing.
CNNs are essentially several layers of convolutions with nonlinear
activation functions, such as ReLU or tanh, applied to the results.
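A single convolution followed by a nonlinear activation can be sketched in a few lines. The example assumes one scalar feature per word position and a hand-picked kernel, both hypothetical; the max-over-time pooling at the end is a step commonly used in sentence classification and is an addition that the definition above does not mention.

```python
def relu(x):
    """Rectified linear unit: negative activations become zero."""
    return max(0.0, x)


def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide a kernel over a 1-D sequence and apply ReLU to each window."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        outputs.append(relu(activation))
    return outputs


# One scalar feature per word position (hypothetical values)
sequence = [0.0, 1.0, 2.0, -1.0, 0.5]
kernel = [1.0, -1.0]          # responds to a drop between neighbouring positions
feature_map = conv1d_relu(sequence, kernel)
pooled = max(feature_map)     # max-over-time pooling keeps the strongest response
print(feature_map, pooled)
```

In a real text model each position holds a word embedding vector rather than a scalar, the kernel spans several words across all embedding dimensions, and the kernel weights are learned during training.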

Convolutional Neural Networks for NLP (ConvNets/CNN)
Figure 10: Illustration of a convolutional neural network (CNN)
architecture for sentence classification. Source: (ZHANG; WALLACE, 2015)

Model Performance
Algorithm     PRECISION  RECALL  F-SCORE
SVM           0.90       0.92    0.91
SLDA          0.89       0.87    0.87
LOGITBOOST    0.98       0.95    0.96
BAGGING       0.96       0.96    0.96
FORESTS       0.93       0.88    0.89
TREE          0.80       0.83    0.81
NNETWORK      0.16       0.18    0.15
MAXENTROPY    0.97       0.91    0.93
Naïve Bayes   0.52 (only one value reported)
Table 1: Model performance (precision, recall, and F-score) for the
Title variable.

Model Performance
Algorithm     PRECISION  RECALL  F-SCORE
SVM           0.55       0.54    0.53
LOGITBOOST    0.91       0.86    0.87
FORESTS       0.76       0.63    0.64
TREE          0.67       0.64    0.64
MAXENTROPY    0.66       0.58    0.59
Naïve Bayes   0.50 (only one value reported)
Table 2: Model performance (precision, recall, and F-score) for the
Abstract variable.

Probability Distribution
Distribution of the Boosting probabilities for correct versus
incorrect predictions.
Figure 11: Title variable. Figure 12: Abstract variable.

Figure 13: Distribution of the Boosting probabilities for correct versus
incorrect predictions, by class (Title variable).

Figure 14: Distribution of the Boosting probabilities for correct versus
incorrect predictions, by class (Abstract variable).

LIME
Boosting method with both variables (Title and Abstract)
Accuracy of 98% for the Cryosphere class.
Figure 15: Distribution of the Boosting probabilities

LIME
Let us look at the explanations:
Figure 16: Distribution of the Boosting probabilities

LIME/Shiny
Figure 18: Distribution of the Boosting probabilities for the most
relevant words of the Cryosphere category.

Ensemble Agreement Coverage and Recall
An ensemble combines several classifiers into a single method that
draws on the individual strengths of each classifier.
Using multiple classifiers is a widely used strategy for improving
the performance of pattern recognition systems.

Ensemble Agreement Coverage and Recall
          COVERAGE  RECALL
n >= 1    1.00      0.98
n >= 2    1.00      0.98
n >= 3    1.00      0.98
n >= 4    1.00      0.98
n >= 5    0.98      0.99
n >= 6    0.96      1.00
Table 3: Ensemble agreement coverage and recall for the Title variable.

          COVERAGE  RECALL
n >= 1    1.00      0.87
n >= 2    1.00      0.88
n >= 3    0.95      0.90
n >= 4    0.79      0.96
n >= 5    0.60      0.97
Table 4: Ensemble agreement coverage and recall for the Abstract variable.
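The slides do not define the two columns, so the sketch below rests on an assumed reading: coverage is the fraction of documents on which at least n classifiers agree on a label, and recall is the fraction of those covered documents whose agreed label is correct. The function name and the toy votes are hypothetical.

```python
from collections import Counter


def ensemble_agreement(predictions, y_true, n):
    """Coverage and recall when at least n classifiers agree on a label.

    predictions: one list of labels per document (one label per classifier).
    """
    covered = 0
    correct = 0
    for votes, truth in zip(predictions, y_true):
        label, count = Counter(votes).most_common(1)[0]
        if count >= n:
            covered += 1
            if label == truth:
                correct += 1
    coverage = covered / len(y_true)
    recall = correct / covered if covered else 0.0
    return coverage, recall


# Three hypothetical classifiers voting on four documents
predictions = [
    ["SAR Data", "SAR Data", "SAR Data"],
    ["Cryosphere", "Cryosphere", "Atmosphere"],
    ["Atmosphere", "Lidar Data", "Optical Data"],
    ["Lidar Data", "Lidar Data", "Radar Data"],
]
y_true = ["SAR Data", "Cryosphere", "Atmosphere", "Radar Data"]
for n in (1, 2, 3):
    print(n, ensemble_agreement(predictions, y_true, n))
```

Under this reading, raising n trades coverage for recall, which matches the pattern in the tables above: fewer documents reach the agreement threshold, but those that do are classified more reliably.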

Architecture of the Recurrent Neural Network Model
Figure 19: Architecture of the RNN model.

Data Flow of the Recurrent Neural Network (RNN) Model

RNN Evaluation Metrics

Architecture of the Recurrent Convolutional Network Model

Data Flow of the Recurrent Convolutional Network (RCN) Model

RCN Evaluation Metrics

Conclusion
The proof of concept carried out here demonstrated the feasibility of
some applications of this solution, showing that the index can be
built in a semi-automated way.
Limitation: the deep learning models (RNN and RCN) need to be
evaluated with other performance metrics.

Future Work
As future work, we intend to:
Enlarge the data set and evaluate the performance of the RCN
against the classical methods.
Evaluate the RNN and RCN with other evaluation metrics, and use
LIME to explain the models.
Evaluate the performance of the following models:
CapsNet or Capsule Networks (SABOUR; FROSST; HINTON,
2017; SUTSKEVER; MARTENS; HINTON, 2011).
Convolutional networks (CNN): LeNet, AlexNet, ZFNet, GoogLeNet,
VGGNet, ResNet.

References
BARROS, P. et al. Identifying communities in social media with deep learning. In: SPRINGER. International Conference on Social Computing and Social Media. [S.l.], 2018. p. 171–182.
BAYES, T. LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London, The Royal Society London, n. 53, p. 370–418, 1763.
DIMITRIADOU, E. et al. Misc functions of the Department of Statistics (e1071), TU Wien. R package, v. 1, p. 5–24, 2008.
ELBERRICHI, Z.; RAHMOUN, A.; BENTAALAH, M. A. Using WordNet for text categorization. International Arab Journal of Information Technology (IAJIT), v. 5, n. 1, 2008.

References II
GEUTNER, P.; BODENHAUSEN, U.; WAIBEL, A. Flexibility
through incremental learning: Neural networks for text
categorization. In: Proceedings of WCNN-93, World Congress
on Neural Networks. [S.l.: s.n.], 1993. p. 24–27.
GO, A.; BHAYANI, R.; HUANG, L. Twitter sentiment
classification using distant supervision. CS224N Project Report,
Stanford, v. 1, n. 12, p. 2009, 2009.
HAYES, P. J. et al. Tcs: a shell for content-based text
categorization. In: IEEE. Artificial Intelligence Applications,
1990., Sixth Conference on. [S.l.], 1990. p. 320–326.
HAYES, P. J.; WEINSTEIN, S. P. Adding value to financial
news by computer. In: IEEE. Proceedings First International
Conference on Artificial Intelligence Applications on Wall Street.
[S.l.], 1991. p. 2–8.

References III
JURKA, T. P. maxent: an R package for low-memory
multinomial logistic regression with support for semi-automated
text classification. The R Journal, v. 4, n. 1, p. 56–59, 2012.
LEWIS, D. D. Representation and learning in information
retrieval. PhD thesis, University of Massachusetts at
Amherst, 1992.
LIAW, A.; WIENER, M. et al. Classification and regression by
randomForest. R News, v. 2, n. 3, p. 18–22, 2002.
MALMASI, S.; DRAS, M. Language identification using
classifier ensembles. In: Proceedings of the Joint Workshop on
Language Technology for Closely Related Languages, Varieties
and Dialects. [S.l.: s.n.], 2015. p. 35–43.

References IV
MASAND, B.; LINOFF, G.; WALTZ, D. Classifying news
stories using memory based reasoning. In: ACM. Proceedings
of the 15th annual international ACM SIGIR conference on
Research and development in information retrieval. [S.l.], 1992.
p. 59–65.
MIKOLOV, T. et al. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781,
2013.
PANG, B.; LEE, L.; VAITHYANATHAN, S. Thumbs up?:
sentiment classification using machine learning techniques.
In: ASSOCIATION FOR COMPUTATIONAL LINGUISTICS.
Proceedings of the ACL-02 conference on Empirical methods in
natural language processing-Volume 10. [S.l.], 2002. p. 79–86.

References V
PENNINGTON, J.; SOCHER, R.; MANNING, C. GloVe:
Global vectors for word representation. In: Proceedings of the
2014 conference on empirical methods in natural language
processing (EMNLP). [S.l.: s.n.], 2014. p. 1532–1543.
PETERS, A.; HOTHORN, T.; LAUSEN, B. ipred:
Improved predictors. R News, v. 2, n. 2, p. 33–36,
June 2002. ISSN 1609–3631. Available at:
<http://CRAN.R-project.org/doc/Rnews/>.
RANGEL, F. et al. Overview of the 5th author profiling task
at pan 2017: Gender and language variety identification in
twitter. Working Notes Papers of the CLEF, 2017.
SABOUR, S.; FROSST, N.; HINTON, G. E. Dynamic routing
between capsules. In: Advances in neural information processing
systems. [S.l.: s.n.], 2017. p. 3856–3866.

References VI
SILVA, R. M.; YAMAKAMI, A.; ALMEIDA, T. A. An analysis
of machine learning methods for spam host detection. In: IEEE.
2012 11th International Conference on Machine Learning and
Applications. [S.l.], 2012. v. 2, p. 227–232.
SUTSKEVER, I.; MARTENS, J.; HINTON, G. E. Generating
text with recurrent neural networks. In: Proceedings of the 28th
International Conference on Machine Learning (ICML-11). [S.l.:
s.n.], 2011. p. 1017–1024.
TUSZYNSKI, J. caTools: Tools: moving window statistics,
GIF, Base64, ROC AUC, etc. R package version 1.17.1. URL
http://CRAN.R-project.org/package=caTools [accessed 01
April 2014], 2012.
VENABLES, W.; RIPLEY, B. Modern applied statistics with
S. New York: Springer-Verlag, 2002.

References VII
WIRTH, R.; HIPP, J. Crisp-dm: Towards a standard process
model for data mining. In: CITESEER. Proceedings of the
4th international conference on the practical applications of
knowledge discovery and data mining. [S.l.], 2000. p. 29–39.
WU, T. et al. Twitter spam detection based on deep learning.
In: ACM. Proceedings of the Australasian Computer Science
Week Multiconference. [S.l.], 2017. p. 3.
ZHANG, X.; ZHAO, J.; LECUN, Y. Character-level
convolutional networks for text classification. In: Advances in
neural information processing systems. [S.l.: s.n.], 2015. p.
649–657.
ZHANG, Y.; WALLACE, B. A sensitivity analysis of (and
practitioners’ guide to) convolutional neural networks for
sentence classification. arXiv preprint arXiv:1510.03820, 2015.

"This work of yours seems eternal; does it never end?"
(Berenice)
