SlideShare uma empresa Scribd logo
1 de 7
Baixar para ler offline
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 388
DIGEST PODCAST
Mitran R1, Kamleshwar T2, Vignesh N3, Sri Harsha Arigapudi4, Nitin S5
1,2,3,4,5UG Student, Department of Computer Science and Engineering, PSG College of Technology, Coimbatore.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - This is the age of instant gratification. Browsing
the entire Publication is dawdle , hence we have proposed an
application that summarizes all your reading’s in a snap of
time using AI technologies. The system is composed of Optical
Character Recognition(OCR) engine to convert the image to
text and transformer to summarise the text, dispensing
recurrent networks followed by feature prediction network
that maps character embeddings to mel-spectrogram and a
GAN based vocoder to convert spectrogram to time based
waveforms. Through extensive experiments we demonstrate
digest podcast ability to recognize, summarize, speech
synthesis for summarized audio generation. Our code can be
found at https://github.com/mitran27/Digest_Cassette.
Index terms: Optical Character Recognition, segmentation,
summarisation, mel-spectrogram, GAN, speech synthesis.
1.INTRODUCTION
Reading and interpreting a protracted segment of
text is a challenging task. The above process in natural scene
text is generally divided into three major tasks like Optical
Character Recognition, Summarization(interpretation) and
synthesizing the audio.
Fig-1: block diagram of digest podcast system architecture
The promise of deep learning architectures to
produce probability distributions for applications such as
natural images, audio containing speech, natural language
corpora. There are very few applications over the decade to
combine different techniques to dominate the field. Due to
the wide range of vocabulary in the language and speech
generated by the model was not promising. We propose a
new podcast application which overcome the above
drawbacks.
In the proposed application, thearchitectureisbuilt
with the state of the art networks after a bunch of research
studies to find better models with their hyperparameters.
The OCR model is built using the scene text
detection and text recognition using the computer vision
architectures which are proved to give promising results.
The summarization model is built using the transformer
eschewing recurrent networks and relying entirely on self-
attention mechanism, which allows more parallelization to
reach the new state of the art model. The choice of
abstractive or extractive summarization is given to the user.
The podcast model which generates the natural speechfrom
the summarized text is built using two independent
networks for feature prediction and speech synthesis. The
layers in the model are chosen carefully to avoid artifacts in
the sound. Hyperparameters are chosen properly to
generate synthetic speech more natural comparedtohuman
speech.
2. RELATED WORK
Recognizing tasks/problems have been efficiently
addressed by Hidden Markov Model, Recurrent Neural
Networks,LongShort-Term Memory(LSTM)network.These
methods use artificial neural networks or stochastic models
to find the probability distribution of the output. Although
these methods are efficient, they have a drawback of limited
variation of inputs. So, we use Transformers[1], which uses
attention mechanisms to find the output. Transformersisan
architecture where each word in the text will paya weighted
attention to the other words in the text, which makes the
model work with large sentences. Transformer model
exhibits a high generalization capacity oftextwhichmakesit
state of the art in the NLP domain.
Text Rank algorithm[2] which was inspired by the
PageRank algorithm created by Larry Page. In place of web
pages, we use sentences. Similarity between any two
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 389
sentences is used as an equivalent to the webpagetransition
probability. The similarity scores are stored in a square
matrix, similar to the matrix M usedfor PageRank.TextRank
is an extractive and unsupervised text summarization
technique.
Segmentation results can more accurately describe
scene text of various shapes such as curve text.However,the
post-processing of binarizationisessential forsegmentation.
DBnet[3] suggests trainable dynamic binarization had a
greater impact on predicting the coordinates of the text.
Recognition involves extracting the text from the
input image using the boundingboxesobtainedfromthetext
detection model and the text is predicted using a model
containing a stack of VGG layers and the output is aligned
and translated using CTC[5] loss function. Baek suggested
that using real scene text datasets[4] had significant impact
than synthetically generated text dataset.
Connectionist Temporal Classification[5] suggests
that training a sequence-to-sequence language model
requires computation of loss to fine tune the model, vanilla
models utilize cross entropy forfindinglossderivatives.Text
recognition is sequence to sequence modelsbutsuffersfrom
a drawback that they do not have fixed output length and
they are not aligned with the input. So, in order to Label
Unsegmented Sequence Data, we use CTC which computes
loss by finding all the best possible path of the target using
conditional probability from the probability distribution of
classes available from all time steps.
A sequence to sequence model with attention and
location awareness[6] is used to convert the summarized
text to spectrograms, which is an intermediate feature for
audio using an encoder-decoder GRU[9].
Generative Adversarial Network[7] is an
unsupervised learningtask inmachinelearningthatinvolves
automatically discovering and learning the regularities or
patterns in input data in such a way that the model can be
used to generate or output new examples that plausibly
could have been drawn from the original dataset.
High-Fidelity Audio Generation andRepresentation
Learning With Guided Adversarial Autoencoder[8]thatHiFi-
GAN consists of one generator and two discriminators:
multi-scale and multi-period discriminators. The generator
and discriminators are trained adversarial, along with two
additional losses for improving training stability and model
performance.
MelGAN[10]hasgeneratorwhichconsistsofstack of
up-sampling layers with dilated convolutionstoimprove the
receptive field and a multiscale discriminator which
examines audio at different scales since audio has structure
at different levels.
3. MODEL ARCHITECTURE
Our proposed system consists of six components,
shown in Figure 1
(1) Semantic segmentation model with differentiable
binarization
(2) Temporal recognition network with CTC
(3) Abstractive summarization model using attention
mechanism
(4) Extractive summarization algorithm using
PageRank algorithm with word2vec embeddings
(5) Feature prediction network with attention and
locationawareness whichpredictsmel-spectrogram
(6) Generative model generates time domain
waveforms conditioned on mel-spectrogram
3.1. DIFFERENTIABLE BINARIZATION
SEGMENTATION NETWORK
DBnet uses an differentiable binarization map to
adaptively set threshold for segmentationmaptopredictthe
result. Spatial dynamic thresholdsfoundtobemoreeffective
than static threshold.
Static binarization
if(Pi,j>threshold)
Bi,j=1
else
Bi,j=0
Dynamic binarization
Bi,j = div(1,1+exp(-K*(Pi,j-Ti,j)))
where B is the final binary map,
P is the prediction map,
T is the threshold map,
K is a random value empirically set to 50.
The static binarization is not differentiable. Thus it
cannot be trained along with the segmentation network.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 390
Fig-2: Feature Pyramid Network(FPN)
The segmentation architecture of our method is
shown in Figure 2. Initially the image is passed through
feature pyramid network and the output features are
concatenated and the final prediction is passed through
prediction and threshold networks which consists of couple
of transposed convolution layers to up sample to its original
size.
Our model works well with light weight backbone,
with the backbone of RESNET 18. Features were taken at
three stages at the scale of (8x,16x,32x) of the original size.
The outputs of the FPN produced features of dimension
dmodel = 256.
The loss is computed using DICE loss for prediction
map and binary map and L1 loss for threshold map. Losses
are compared to the ground truth map of the original image.
3.2. RECOGNITION
Temporal image recognition is performed in word
images to recognize the words in the text format. This is
achieved by constructing a model which consists of VGG
network for feature prediction, GRCNN can be used in cases
where high accuracy is needed neglecting the speed of the
model. The features are transposed to temporal dimension
and passed to two layered bi-directional LSTM model with
1024 units to predict the output temporally(Figure 3). The
output from the model is not aligned with the input image.
So ConnectionistTemporal Classification(CTC)model isused
because CTC algorithm is alignment free, CTC works by
summing the probability of all possible alignments.
Fig-3: Prediction from the LSTM model
CTC is a neural network which uses associated
scoring functions to train recurrent networks such as LSTM,
GRU to tackle sequence applications where time is variable.
CTC is differentiable with output probabilities from the
recurrent network, since it sums the score of each
alignment(Figure 4), gradient is computed for the loss
function with respect to the output and back propagation is
done to train the model.
Fig-4: CTC of the output probabilities
3.3. ABSTRACTIVE SUMMARIZATION
The abstractive summarization isachieved byusing
sequence to sequence models recurrent networks, LSTM,
GRU are firmly established in text summarization. Theusage
of these models restricts their prediction in short range. In
contrast, transformer networks allows to capture longer
range. We used self-attention network asa buildingblock for
transformers with embedding layer for word embeddings
and positional encoding for positional embeddings since
positional embeddings works for variable sentences.
Fig-5: Scaled Dot-Product Attention
The word embedding along with its position are
passed to the encoder where attention mechanism is used
(Figure 5) to learn by mapping a query and a setofkey-value
pairs to an output vector. The basic idea is that, if the model
is aware of the abstract, then it provides high level
interpretation vectors.
A decoder with similar architecture mentioned in
encoder is used to obtain the output prediction using the
interpreted vectors.
3.4. EXTRACTIVE SUMMARIZATION
The extractive approach involves picking up the
significant phrases from the given document, which is
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 391
achieved by creating a sentence graph and measuring the
importance of each sentence within the graph, based on
incoming relationships. The value of the incoming
relationships are measured using cosine distance between
sentence embeddings. The created graph passed into the
Page Rank algorithm to rank the sentences in the document.
The dimensions for the word embeddings are
empirically set to 100.
Fig-6: Sentence graph
3.5. ALIGNER – INTERMEDIATE FEATURE
REPRESENTATION
Converting text(200 tokens) to audio
waveform(200K tokens) is not possible. So we chose low
level intermediaterepresentation:mel-spectrogramtoactas
a bridge between the aligner and the vocoder. Mel-
spectrograms scale down the audio to 256x times. Mel-
spectrograms are created by applying short-time Fourier
transform (STFT) with a frame size 50ms and hop size
12.5ms and STFT is transformed to mel-scale using 80
channel mel-filterbank followed by log dynamic range
compression, because humans respond to signals
logarithmically.
Since the dimension of spectrograms are morethan
the dimension of text, sequence to sequence model is used.
The model consists of encoder and decoder with attention
and locational awareness. Attention mechanism is used to
handle long range sequences. Experiments without
locational awareness produce madethedecodertorepeat or
ignore some sequences(Figure 7).
Fig-7: Output without locational awareness
The encoder converts the character sequence to
character embedding with dimension 512 which are passed
through bi-directional GRUlayercontaining512units(256in
each direction) to generate the encoded features. The
encoded features are consumed bythedecodermodel which
contains two layer of auto regressive GRU of 1024 units and
an attention network with dimension 128. The attention
network is represented by the below equation.
a(st-1, hi) = vT tanh(W1 hi + W2 St-1 + W3(conv( a(st-2, h)))
where a represents the attention scores,
s represents the decoder hidden states,
h represents the encoder hidden states,
W1,W2,W3 represents the linear network,
conv represents the convolution network to extract
features from the attention scores.
The convolution network for locational awarenessis
built with kernel size 31 and dimension 32.
A pre-net is built using feedforward network to
teacher-force the original previous timestamp to the
decoder. Experimentswithdimension128and256madethe
decoder alter the input rather than attending the encoded
features. Dimension 64 with high dropouts(0.5) forced the
decoder to attend the encoded features because there is no
full access to the teacher-forced inputs. This makes the
decoder predict correctly even if there is a slip inthecurrent
time stamp input(previous time stamp output).
The output from the decoder is passed through
feature enhancer network to predict a residual to enhance
the overall construction of spectrogram which consists of
four layers of residual networks of dimension 256 with tanh
activation and dropouts are added at each layer to avoid
overfitting. We minimize the mean squared error (MSE)
from before and after the enhancer to aid convergence.
3.6. VOCODER – GANS
Converting spectrograms to time domain
waveforms auto-regressively is a tedious and time
consuming process.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 392
3.6.1. GENERATOR
A spectrogram is approximately 256x smaller than
the audio. A Generative model is built using (8x,8x,2x,2x)
up-sampling. Experiments were carried to find the best
scaling factors to attain 256x. Firstly, a feature extractor
network is used to extract features with a kernel size 7 and
dimension 512 to feed the up-sampling layers. Up-sampling
layer network consists of transposed convolution with its
correspondingscalefactorandsomeconvolutional networks
to increase the receptive field. Experiments are made for
increasing the receptive field, after transposed convolution
the receptive field would be low. Receptive field of a stack of
dilated convolution increases exponentially with number of
layers, while vanilla convolutions increases linearly.
The stack dilation convolution consists of three
dilated RESNETs with kernel size 5 and dilations 1,5,25. The
receptive field of the network should look like a fully
balanced symmetric tree with kernel size as the branch
factor. Finally, convolutional layer is used to predictthefinal
audio(dimension of size 1) and an tanh activation function.
3.6.2. DISCRIMINATOR
Discriminators are used to classify whether the
given sample is real or fake. We adopt a multiscale
discriminator to operate on differentscalesofaudio.Feature
Pyramid Networks(Figure 2) are used to extract features at
different stages on scale 2x,4x,8x with kernel size 15 and
dimension 64. The features fromeachstagearepassedto the
discriminator block.
Each discriminator block learnsfeaturesofdifferent
frequency range of the audio by stack ofconvolutional layers
with kernel size 41 and stride 4 and a final layer to predict
the binary class(real/fake).
4. EXPERIMENTS AND RESULTS
4.1. TRAINING SETUP
I. Segmentationmodel wastrainedonRobustReading
Competition challenges datasetslikeDocVQA2020-
21, SROIE 2019, COCO-Text 2017 which contained
document image dataset along with their
coordinates of the words.
II. Recognition network was trained using two major
synthetic datasets. MJSynth (MJ) whichcontains9M
word boxes. Each word is generated from a 90K
English lexicon and over 1,400 Google Fonts.
SynthText is generated for scene text detection
containing 7M word boxes. The texts are rendered
onto scene are cropped and used for training.
III. Summarization model was trained using two
different datasets. The inshorts dataset which
contains news articles along with their headings.
CNN/DailyMail non-anonymized summarization
dataset. It contains two features. One is the article
which contains text of news articles and the
highlights which contains the joined text of
highlights which is the target summary.
IV. Our training process involved training the feature
prediction network followed by training the
vocoder independently. This was predominantly
done using the LJ speech dataset consisting of
13,100 short audio clips of a single speaker reading
passages from 7 non-fiction books which also
includes a transcription for each clip. The length of
the clips vary from 1 to 10 seconds andtheyhavean
approximate length of 24 hours in total.
4.2. HARDWARE AND SCHEDULE
We trained our models on NVIDIA P100 GPUs using
hyper-parameters described throughout the paper. We
trained the segmentation, recognition, summarization
models for 12-24 hours(approx.) each. Feature prediction
and vocoder were trained for 1-2 weeks(approx.) each.
4.3. OPTIMIZERS AND LOSS
Model Optimizer Loss
DBnet Adam Dice
Recognition Adam CTC
Transformer Adam Cross Entropy
Aligner Adam with
weight decay
Mean Squared
Error
Vocoder Adam for
discriminator
and generator
Hinge loss
Table-1: Optimizers and loss used to train different
models
4.4. EVALUATION
We evaluated the OCR model(segmentation +
recognition) with real time scene text images. We compared
the model with tesseract and Easyocr engines and achieved
better results than them. Sample results of this model is
shown in Figure 8.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 393
Fig-8: Result of OCR
During the evaluation of transformers, after
converting the source text into encoded vectors and
modelling the summarized text, during inference mode, the
ground truth targets are not known. Therefore, the outputs
from the previous step is passed as an input. In contrast to
the teacher-forcing method used while training. We
randomly selected 50 samples from the test dataset and
evaluated the model. The context between the each
evaluated result and the ground truth were similar. Sample
results of this model is shown below.
Sample Result-1:
Source: summstart ride hailing startup uber s main rival in
southeast asia, grab has announced plans to open a research
and development centre in bengaluru. the startup is looking
to hire around 200 engineers in india to focus on developing
its payments service grabpay. however, grab s engineering
vp arul kumaravel said the company has no plans of
expanding its on demand cab services to india. summend
Model output : summstart uber to open research centre in
bengaluru summend
Ground truth : summstart uber rival grab to open research
centre in bengaluru summend
Sample Result-2:
Source: summstart wrestlers geeta and babita phogat along
with their father mahavir attended aamir khan s 52nd
birthday celebrations at his mumbai residence. dangal is
based on mahavir and his daughters. the film s director
nitesh tiwari and actors aparshakti khurrana, fatima shaikh
and sakshi tanwar were also spotted at the party. shah rukh
khan and jackie shroff were among the other guests.
summend
Model output: summstart geeta babita phogat attend
birthday party s laga aamir summend
Ground truth: summstart geeta, babita, phogatfamilyattend
aamir s birthday party summend
Generating speech during inference to predict the
next frame, the output from previousstepistakenasinputin
contrast to teacher-forcing, we generate random text
sequences evaluated on the model. The spectrograms
predicted from the model is converted to audiousinggriffin-
lim algorithm to check themappingofcharacterembeddings
to the phonons, ignoring the quality of the speech.
Generative vocoder is used to generate high quality
speech from the spectrogram. Thesynthesizedaudioisrated
in respect to their clarity and natural sounding speech. A
sample evaluation of this model is shown below.
Sample input:
For although the Chinese took impressions from
wood blocks engraved in relief for centuries before the
woodcutters of the Netherlands, by a similar process.
Sample output:
5. CONCLUSIONS
These models segmentation, text recognition,
extractive and abstractive summarisation, spectrogram and
speech synthesis are targeted at summarising the content in
the image given and then converting itintoaudio. Withthese
models, the user will have over 90% accuracy in converting
the image with text into audio.
To handle a wide range of words, we can add the
entire English vocabulary which requires high performance
GPU’s. We have trained the models only with English
language. Further, the models can be trained with different
languages for wider usage. As of now, the OCR model deals
with a specific professional style of font. This could be
further improved by training it with fonts of different styles.
The audio output could be made more human-like by
training the model more vigorously.
ACKNOWLEDGEMENT
We extend our gratitude to the Department of
Computer Science and Engineering, PSG College of
Technology for supporting us throughouttheresearchwork.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 394
REFERENCES
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob,
Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin, “Attention is all you
need”, 2017, In Advances in Neural Information
Processing Systems, pages 6000–6010.
[2] K. U. Manjari, "Extractive Summarization of Telugu
Documents using TextRank Algorithm," 2020Fourth
International Conference on I-SMAC (IoT in Social,
Mobile, Analytics and Cloud) (I-SMAC), 2020, pp.
678-683,doi:10.1109/I-SMAC49090.2020.9243568.
[3] Liao, Minghui and Wan, Zhaoyi and Yao, Cong and
Chen, Kai and Bai, Xiang, “Real-time Scene Text
Detection with Differentiable Binarization”, 2020,
Proc. AAAI.
[4] Baek, Jeonghun and Matsui, Yusuke and Aizawa,
Kiyoharu, “What If We Only Use Real Datasets for
Scene Text Recognition? Toward Scene Text
Recognition With Fewer Labels”, IEEE/CVF
Conference on Computer Vision and Pattern
Recognition (CVPR), 2021.
[5] E. Variani, T. Bagby, K. Lahouel, E. McDermott andM.
Bacchiani, "Sampled Connectionist Temporal
Classification," 2018 IEEE International Conference
on Acoustics, Speech andSignal Processing(ICASSP),
2018, pp. 4959-4963, doi:
10.1109/ICASSP.2018.8461929.
[6] J. Shen et al., "Natural TTS Synthesis by Conditioning
Wavenet on MEL Spectrogram Predictions," 2018
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2018, pp. 4779-
4783, doi: 10.1109/ICASSP.2018.8461368.
[7] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran,
B. Sengupta and A. A. Bharath, "Generative
Adversarial Networks: An Overview," in IEEE Signal
Processing Magazine, vol. 35, no. 1, pp. 53-65, Jan.
2018, doi: 10.1109/MSP.2017.2765202.
[8] K. N. Haque, R. Rana and B. W. Schuller, "High-
Fidelity Audio Generation and Representation
Learning With Guided Adversarial Autoencoder," in
IEEE Access, vol. 8, pp. 223509-223528, 2020, doi:
10.1109/ACCESS.2020.3040797.
[9] R. Dey and F. M. Salem, "Gate-variants of Gated
Recurrent Unit (GRU) neural networks," 2017 IEEE
60th International Midwest Symposium on Circuits
and Systems (MWSCAS), 2017, pp. 1597-1600, doi:
10.1109/MWSCAS.2017.8053243.
[10] K. Kumar, “MelGAN: Generative Adversarial
Networks for Conditional Waveform Synthesis”.

Mais conteúdo relacionado

Semelhante a Digest Podcast Summarizes Research Papers

IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET Journal
 
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET Journal
 
IRJET- Semantic Segmentation using Deep Learning
IRJET- Semantic Segmentation using Deep LearningIRJET- Semantic Segmentation using Deep Learning
IRJET- Semantic Segmentation using Deep LearningIRJET Journal
 
IRJET- Automatic Data Collection from Forms using Optical Character Recognition
IRJET- Automatic Data Collection from Forms using Optical Character RecognitionIRJET- Automatic Data Collection from Forms using Optical Character Recognition
IRJET- Automatic Data Collection from Forms using Optical Character RecognitionIRJET Journal
 
IRJET- Implementation of Gender Detection with Notice Board using Raspberry Pi
IRJET- Implementation of Gender Detection with Notice Board using Raspberry PiIRJET- Implementation of Gender Detection with Notice Board using Raspberry Pi
IRJET- Implementation of Gender Detection with Notice Board using Raspberry PiIRJET Journal
 
Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...
Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...
Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...IRJET Journal
 
IRJET- Rice QA using Deep Learning
IRJET- Rice QA using Deep LearningIRJET- Rice QA using Deep Learning
IRJET- Rice QA using Deep LearningIRJET Journal
 
Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...IJECEIAES
 
FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...
FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...
FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...TELKOMNIKA JOURNAL
 
Comparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of GlaucomaComparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of GlaucomaIRJET Journal
 
IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...
IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...
IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...IRJET Journal
 
Medical Herb Identification and It’s Benefits
Medical Herb Identification and It’s BenefitsMedical Herb Identification and It’s Benefits
Medical Herb Identification and It’s BenefitsIRJET Journal
 
REVIEW ON OBJECT DETECTION WITH CNN
REVIEW ON OBJECT DETECTION WITH CNNREVIEW ON OBJECT DETECTION WITH CNN
REVIEW ON OBJECT DETECTION WITH CNNIRJET Journal
 
IRJET- Intelligent Character Recognition of Handwritten Characters using ...
IRJET-  	  Intelligent Character Recognition of Handwritten Characters using ...IRJET-  	  Intelligent Character Recognition of Handwritten Characters using ...
IRJET- Intelligent Character Recognition of Handwritten Characters using ...IRJET Journal
 
Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...
Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...
Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...IJECEIAES
 
IRJET- Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...
IRJET-  	  Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...IRJET-  	  Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...
IRJET- Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...IRJET Journal
 
IRJET- Mango Classification using Convolutional Neural Networks
IRJET- Mango Classification using Convolutional Neural NetworksIRJET- Mango Classification using Convolutional Neural Networks
IRJET- Mango Classification using Convolutional Neural NetworksIRJET Journal
 
Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...
Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...
Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...SBGC
 
Performance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO Objects
Performance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO ObjectsPerformance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO Objects
Performance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO ObjectsIJECEIAES
 
Traffic Sign Recognition Model
Traffic Sign Recognition ModelTraffic Sign Recognition Model
Traffic Sign Recognition ModelIRJET Journal
 

Semelhante a Digest Podcast Summarizes Research Papers (20)

IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...
 
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural Network
 
IRJET- Semantic Segmentation using Deep Learning
IRJET- Semantic Segmentation using Deep LearningIRJET- Semantic Segmentation using Deep Learning
IRJET- Semantic Segmentation using Deep Learning
 
IRJET- Automatic Data Collection from Forms using Optical Character Recognition
IRJET- Automatic Data Collection from Forms using Optical Character RecognitionIRJET- Automatic Data Collection from Forms using Optical Character Recognition
IRJET- Automatic Data Collection from Forms using Optical Character Recognition
 
IRJET- Implementation of Gender Detection with Notice Board using Raspberry Pi
IRJET- Implementation of Gender Detection with Notice Board using Raspberry PiIRJET- Implementation of Gender Detection with Notice Board using Raspberry Pi
IRJET- Implementation of Gender Detection with Notice Board using Raspberry Pi
 
Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...
Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...
Text Summarization of Food Reviews using AbstractiveSummarization and Recurre...
 
IRJET- Rice QA using Deep Learning
IRJET- Rice QA using Deep LearningIRJET- Rice QA using Deep Learning
IRJET- Rice QA using Deep Learning
 
Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...
 
FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...
FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...
FPGA Based Pattern Generation and Synchonization for High Speed Structured Li...
 
Comparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of GlaucomaComparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
 
IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...
IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...
IMPROVEMENT IN IMAGE DENOISING OF HANDWRITTEN DIGITS USING AUTOENCODERS IN DE...
 
Medical Herb Identification and It’s Benefits
Medical Herb Identification and It’s BenefitsMedical Herb Identification and It’s Benefits
Medical Herb Identification and It’s Benefits
 
REVIEW ON OBJECT DETECTION WITH CNN
REVIEW ON OBJECT DETECTION WITH CNNREVIEW ON OBJECT DETECTION WITH CNN
REVIEW ON OBJECT DETECTION WITH CNN
 
IRJET- Intelligent Character Recognition of Handwritten Characters using ...
IRJET-  	  Intelligent Character Recognition of Handwritten Characters using ...IRJET-  	  Intelligent Character Recognition of Handwritten Characters using ...
IRJET- Intelligent Character Recognition of Handwritten Characters using ...
 
Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...
Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...
Black Box Model based Self Healing Solution for Stuck at Faults in Digital Ci...
 
IRJET- Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...
IRJET-  	  Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...IRJET-  	  Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...
IRJET- Jeevn-Net: Brain Tumor Segmentation using Cascaded U-Net & Overall...
 
IRJET- Mango Classification using Convolutional Neural Networks
IRJET- Mango Classification using Convolutional Neural NetworksIRJET- Mango Classification using Convolutional Neural Networks
IRJET- Mango Classification using Convolutional Neural Networks
 
Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...
Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...
Java networking 2012 ieee projects @ Seabirds ( Chennai, Bangalore, Hyderabad...
 
Performance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO Objects
Performance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO ObjectsPerformance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO Objects
Performance Evaluation of Fine-tuned Faster R-CNN on specific MS COCO Objects
 
Traffic Sign Recognition Model
Traffic Sign Recognition ModelTraffic Sign Recognition Model
Traffic Sign Recognition Model
 

Mais de IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTUREIRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsIRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASIRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProIRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemIRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesIRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web applicationIRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignIRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...IRJET Journal
 

Mais de IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Último

Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...KrishnaveniKrishnara1
 
Machine Learning 5G Federated Learning.pdf
Machine Learning 5G Federated Learning.pdfMachine Learning 5G Federated Learning.pdf
Machine Learning 5G Federated Learning.pdfadeyimikaipaye
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
10 AsymmetricKey Cryptography students.pptx
10 AsymmetricKey Cryptography students.pptx10 AsymmetricKey Cryptography students.pptx
10 AsymmetricKey Cryptography students.pptxAdityaGoogle
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTSneha Padhiar
 
ADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studyADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studydhruvamdhruvil123
 
Detection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and trackingDetection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and trackinghadarpinhas1
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Introduction to Machine Learning Part1.pptx
Introduction to Machine Learning Part1.pptxIntroduction to Machine Learning Part1.pptx
Introduction to Machine Learning Part1.pptxPavan Mohan Neelamraju
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHSneha Padhiar
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxLina Kadam
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organizationchnrketan
 
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliStructural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliNimot Muili
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...arifengg7
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 

Último (20)

Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
 
Machine Learning 5G Federated Learning.pdf
Machine Learning 5G Federated Learning.pdfMachine Learning 5G Federated Learning.pdf
Machine Learning 5G Federated Learning.pdf
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
10 AsymmetricKey Cryptography students.pptx
10 AsymmetricKey Cryptography students.pptx10 AsymmetricKey Cryptography students.pptx
10 AsymmetricKey Cryptography students.pptx
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
ADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studyADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain study
 
Detection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and trackingDetection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and tracking
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
 
Introduction to Machine Learning Part1.pptx
Introduction to Machine Learning Part1.pptxIntroduction to Machine Learning Part1.pptx
Introduction to Machine Learning Part1.pptx
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptx
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organization
 
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliStructural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 

Digest Podcast Summarizes Research Papers

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 388 DIGEST PODCAST Mitran R1, Kamleshwar T2, Vignesh N3, Sri Harsha Arigapudi4, Nitin S5 1,2,3,4,5UG Student, Department of Computer Science and Engineering, PSG College of Technology, Coimbatore. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - This is the age of instant gratification. Browsing the entire Publication is dawdle , hence we have proposed an application that summarizes all your reading’s in a snap of time using AI technologies. The system is composed of Optical Character Recognition(OCR) engine to convert the image to text and transformer to summarise the text, dispensing recurrent networks followed by feature prediction network that maps character embeddings to mel-spectrogram and a GAN based vocoder to convert spectrogram to time based waveforms. Through extensive experiments we demonstrate digest podcast ability to recognize, summarize, speech synthesis for summarized audio generation. Our code can be found at https://github.com/mitran27/Digest_Cassette. Index terms: Optical Character Recognition, segmentation, summarisation, mel-spectrogram, GAN, speech synthesis. 1.INTRODUCTION Reading and interpreting a protracted segment of text is a challenging task. The above process in natural scene text is generally divided into three major tasks like Optical Character Recognition, Summarization(interpretation) and synthesizing the audio. Fig-1: block diagram of digest podcast system architecture The promise of deep learning architectures to produce probability distributions for applications such as natural images, audio containing speech, natural language corpora. There are very few applications over the decade to combine different techniques to dominate the field. Due to the wide range of vocabulary in the language and speech generated by the model was not promising. We propose a new podcast application which overcome the above drawbacks. In the proposed application, thearchitectureisbuilt with the state of the art networks after a bunch of research studies to find better models with their hyperparameters. The OCR model is built using the scene text detection and text recognition using the computer vision architectures which are proved to give promising results. The summarization model is built using the transformer eschewing recurrent networks and relying entirely on self- attention mechanism, which allows more parallelization to reach the new state of the art model. The choice of abstractive or extractive summarization is given to the user. The podcast model which generates the natural speechfrom the summarized text is built using two independent networks for feature prediction and speech synthesis. The layers in the model are chosen carefully to avoid artifacts in the sound. Hyperparameters are chosen properly to generate synthetic speech more natural comparedtohuman speech. 2. RELATED WORK Recognizing tasks/problems have been efficiently addressed by Hidden Markov Model, Recurrent Neural Networks,LongShort-Term Memory(LSTM)network.These methods use artificial neural networks or stochastic models to find the probability distribution of the output. Although these methods are efficient, they have a drawback of limited variation of inputs. So, we use Transformers[1], which uses attention mechanisms to find the output. Transformersisan architecture where each word in the text will paya weighted attention to the other words in the text, which makes the model work with large sentences. Transformer model exhibits a high generalization capacity oftextwhichmakesit state of the art in the NLP domain. Text Rank algorithm[2] which was inspired by the PageRank algorithm created by Larry Page. In place of web pages, we use sentences. Similarity between any two
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 389 sentences is used as an equivalent to the webpagetransition probability. The similarity scores are stored in a square matrix, similar to the matrix M usedfor PageRank.TextRank is an extractive and unsupervised text summarization technique. Segmentation results can more accurately describe scene text of various shapes such as curve text.However,the post-processing of binarizationisessential forsegmentation. DBnet[3] suggests trainable dynamic binarization had a greater impact on predicting the coordinates of the text. Recognition involves extracting the text from the input image using the boundingboxesobtainedfromthetext detection model and the text is predicted using a model containing a stack of VGG layers and the output is aligned and translated using CTC[5] loss function. Baek suggested that using real scene text datasets[4] had significant impact than synthetically generated text dataset. Connectionist Temporal Classification[5] suggests that training a sequence-to-sequence language model requires computation of loss to fine tune the model, vanilla models utilize cross entropy forfindinglossderivatives.Text recognition is sequence to sequence modelsbutsuffersfrom a drawback that they do not have fixed output length and they are not aligned with the input. So, in order to Label Unsegmented Sequence Data, we use CTC which computes loss by finding all the best possible path of the target using conditional probability from the probability distribution of classes available from all time steps. A sequence to sequence model with attention and location awareness[6] is used to convert the summarized text to spectrograms, which is an intermediate feature for audio using an encoder-decoder GRU[9]. Generative Adversarial Network[7] is an unsupervised learningtask inmachinelearningthatinvolves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset. High-Fidelity Audio Generation andRepresentation Learning With Guided Adversarial Autoencoder[8]thatHiFi- GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. The generator and discriminators are trained adversarial, along with two additional losses for improving training stability and model performance. MelGAN[10]hasgeneratorwhichconsistsofstack of up-sampling layers with dilated convolutionstoimprove the receptive field and a multiscale discriminator which examines audio at different scales since audio has structure at different levels. 3. MODEL ARCHITECTURE Our proposed system consists of six components, shown in Figure 1 (1) Semantic segmentation model with differentiable binarization (2) Temporal recognition network with CTC (3) Abstractive summarization model using attention mechanism (4) Extractive summarization algorithm using PageRank algorithm with word2vec embeddings (5) Feature prediction network with attention and locationawareness whichpredictsmel-spectrogram (6) Generative model generates time domain waveforms conditioned on mel-spectrogram 3.1. DIFFERENTIABLE BINARIZATION SEGMENTATION NETWORK DBnet uses an differentiable binarization map to adaptively set threshold for segmentationmaptopredictthe result. Spatial dynamic thresholdsfoundtobemoreeffective than static threshold. Static binarization if(Pi,j>threshold) Bi,j=1 else Bi,j=0 Dynamic binarization Bi,j = div(1,1+exp(-K*(Pi,j-Ti,j))) where B is the final binary map, P is the prediction map, T is the threshold map, K is a random value empirically set to 50. The static binarization is not differentiable. Thus it cannot be trained along with the segmentation network.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 390 Fig-2: Feature Pyramid Network(FPN) The segmentation architecture of our method is shown in Figure 2. Initially the image is passed through feature pyramid network and the output features are concatenated and the final prediction is passed through prediction and threshold networks which consists of couple of transposed convolution layers to up sample to its original size. Our model works well with light weight backbone, with the backbone of RESNET 18. Features were taken at three stages at the scale of (8x,16x,32x) of the original size. The outputs of the FPN produced features of dimension dmodel = 256. The loss is computed using DICE loss for prediction map and binary map and L1 loss for threshold map. Losses are compared to the ground truth map of the original image. 3.2. RECOGNITION Temporal image recognition is performed in word images to recognize the words in the text format. This is achieved by constructing a model which consists of VGG network for feature prediction, GRCNN can be used in cases where high accuracy is needed neglecting the speed of the model. The features are transposed to temporal dimension and passed to two layered bi-directional LSTM model with 1024 units to predict the output temporally(Figure 3). The output from the model is not aligned with the input image. So ConnectionistTemporal Classification(CTC)model isused because CTC algorithm is alignment free, CTC works by summing the probability of all possible alignments. Fig-3: Prediction from the LSTM model CTC is a neural network which uses associated scoring functions to train recurrent networks such as LSTM, GRU to tackle sequence applications where time is variable. CTC is differentiable with output probabilities from the recurrent network, since it sums the score of each alignment(Figure 4), gradient is computed for the loss function with respect to the output and back propagation is done to train the model. Fig-4: CTC of the output probabilities 3.3. ABSTRACTIVE SUMMARIZATION The abstractive summarization isachieved byusing sequence to sequence models recurrent networks, LSTM, GRU are firmly established in text summarization. Theusage of these models restricts their prediction in short range. In contrast, transformer networks allows to capture longer range. We used self-attention network asa buildingblock for transformers with embedding layer for word embeddings and positional encoding for positional embeddings since positional embeddings works for variable sentences. Fig-5: Scaled Dot-Product Attention The word embedding along with its position are passed to the encoder where attention mechanism is used (Figure 5) to learn by mapping a query and a setofkey-value pairs to an output vector. The basic idea is that, if the model is aware of the abstract, then it provides high level interpretation vectors. A decoder with similar architecture mentioned in encoder is used to obtain the output prediction using the interpreted vectors. 3.4. EXTRACTIVE SUMMARIZATION The extractive approach involves picking up the significant phrases from the given document, which is
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 391 achieved by creating a sentence graph and measuring the importance of each sentence within the graph, based on incoming relationships. The value of the incoming relationships are measured using cosine distance between sentence embeddings. The created graph passed into the Page Rank algorithm to rank the sentences in the document. The dimensions for the word embeddings are empirically set to 100. Fig-6: Sentence graph 3.5. ALIGNER – INTERMEDIATE FEATURE REPRESENTATION Converting text(200 tokens) to audio waveform(200K tokens) is not possible. So we chose low level intermediaterepresentation:mel-spectrogramtoactas a bridge between the aligner and the vocoder. Mel- spectrograms scale down the audio to 256x times. Mel- spectrograms are created by applying short-time Fourier transform (STFT) with a frame size 50ms and hop size 12.5ms and STFT is transformed to mel-scale using 80 channel mel-filterbank followed by log dynamic range compression, because humans respond to signals logarithmically. Since the dimension of spectrograms are morethan the dimension of text, sequence to sequence model is used. The model consists of encoder and decoder with attention and locational awareness. Attention mechanism is used to handle long range sequences. Experiments without locational awareness produce madethedecodertorepeat or ignore some sequences(Figure 7). Fig-7: Output without locational awareness The encoder converts the character sequence to character embedding with dimension 512 which are passed through bi-directional GRUlayercontaining512units(256in each direction) to generate the encoded features. The encoded features are consumed bythedecodermodel which contains two layer of auto regressive GRU of 1024 units and an attention network with dimension 128. The attention network is represented by the below equation. a(st-1, hi) = vT tanh(W1 hi + W2 St-1 + W3(conv( a(st-2, h))) where a represents the attention scores, s represents the decoder hidden states, h represents the encoder hidden states, W1,W2,W3 represents the linear network, conv represents the convolution network to extract features from the attention scores. The convolution network for locational awarenessis built with kernel size 31 and dimension 32. A pre-net is built using feedforward network to teacher-force the original previous timestamp to the decoder. Experimentswithdimension128and256madethe decoder alter the input rather than attending the encoded features. Dimension 64 with high dropouts(0.5) forced the decoder to attend the encoded features because there is no full access to the teacher-forced inputs. This makes the decoder predict correctly even if there is a slip inthecurrent time stamp input(previous time stamp output). The output from the decoder is passed through feature enhancer network to predict a residual to enhance the overall construction of spectrogram which consists of four layers of residual networks of dimension 256 with tanh activation and dropouts are added at each layer to avoid overfitting. We minimize the mean squared error (MSE) from before and after the enhancer to aid convergence. 3.6. VOCODER – GANS Converting spectrograms to time domain waveforms auto-regressively is a tedious and time consuming process.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 392 3.6.1. GENERATOR A spectrogram is approximately 256x smaller than the audio. A Generative model is built using (8x,8x,2x,2x) up-sampling. Experiments were carried to find the best scaling factors to attain 256x. Firstly, a feature extractor network is used to extract features with a kernel size 7 and dimension 512 to feed the up-sampling layers. Up-sampling layer network consists of transposed convolution with its correspondingscalefactorandsomeconvolutional networks to increase the receptive field. Experiments are made for increasing the receptive field, after transposed convolution the receptive field would be low. Receptive field of a stack of dilated convolution increases exponentially with number of layers, while vanilla convolutions increases linearly. The stack dilation convolution consists of three dilated RESNETs with kernel size 5 and dilations 1,5,25. The receptive field of the network should look like a fully balanced symmetric tree with kernel size as the branch factor. Finally, convolutional layer is used to predictthefinal audio(dimension of size 1) and an tanh activation function. 3.6.2. DISCRIMINATOR Discriminators are used to classify whether the given sample is real or fake. We adopt a multiscale discriminator to operate on differentscalesofaudio.Feature Pyramid Networks(Figure 2) are used to extract features at different stages on scale 2x,4x,8x with kernel size 15 and dimension 64. The features fromeachstagearepassedto the discriminator block. Each discriminator block learnsfeaturesofdifferent frequency range of the audio by stack ofconvolutional layers with kernel size 41 and stride 4 and a final layer to predict the binary class(real/fake). 4. EXPERIMENTS AND RESULTS 4.1. TRAINING SETUP I. Segmentationmodel wastrainedonRobustReading Competition challenges datasetslikeDocVQA2020- 21, SROIE 2019, COCO-Text 2017 which contained document image dataset along with their coordinates of the words. II. Recognition network was trained using two major synthetic datasets. MJSynth (MJ) whichcontains9M word boxes. Each word is generated from a 90K English lexicon and over 1,400 Google Fonts. SynthText is generated for scene text detection containing 7M word boxes. The texts are rendered onto scene are cropped and used for training. III. Summarization model was trained using two different datasets. The inshorts dataset which contains news articles along with their headings. CNN/DailyMail non-anonymized summarization dataset. It contains two features. One is the article which contains text of news articles and the highlights which contains the joined text of highlights which is the target summary. IV. Our training process involved training the feature prediction network followed by training the vocoder independently. This was predominantly done using the LJ speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books which also includes a transcription for each clip. The length of the clips vary from 1 to 10 seconds andtheyhavean approximate length of 24 hours in total. 4.2. HARDWARE AND SCHEDULE We trained our models on NVIDIA P100 GPUs using hyper-parameters described throughout the paper. We trained the segmentation, recognition, summarization models for 12-24 hours(approx.) each. Feature prediction and vocoder were trained for 1-2 weeks(approx.) each. 4.3. OPTIMIZERS AND LOSS Model Optimizer Loss DBnet Adam Dice Recognition Adam CTC Transformer Adam Cross Entropy Aligner Adam with weight decay Mean Squared Error Vocoder Adam for discriminator and generator Hinge loss Table-1: Optimizers and loss used to train different models 4.4. EVALUATION We evaluated the OCR model(segmentation + recognition) with real time scene text images. We compared the model with tesseract and Easyocr engines and achieved better results than them. Sample results of this model is shown in Figure 8.
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 393 Fig-8: Result of OCR During the evaluation of transformers, after converting the source text into encoded vectors and modelling the summarized text, during inference mode, the ground truth targets are not known. Therefore, the outputs from the previous step is passed as an input. In contrast to the teacher-forcing method used while training. We randomly selected 50 samples from the test dataset and evaluated the model. The context between the each evaluated result and the ground truth were similar. Sample results of this model is shown below. Sample Result-1: Source: summstart ride hailing startup uber s main rival in southeast asia, grab has announced plans to open a research and development centre in bengaluru. the startup is looking to hire around 200 engineers in india to focus on developing its payments service grabpay. however, grab s engineering vp arul kumaravel said the company has no plans of expanding its on demand cab services to india. summend Model output : summstart uber to open research centre in bengaluru summend Ground truth : summstart uber rival grab to open research centre in bengaluru summend Sample Result-2: Source: summstart wrestlers geeta and babita phogat along with their father mahavir attended aamir khan s 52nd birthday celebrations at his mumbai residence. dangal is based on mahavir and his daughters. the film s director nitesh tiwari and actors aparshakti khurrana, fatima shaikh and sakshi tanwar were also spotted at the party. shah rukh khan and jackie shroff were among the other guests. summend Model output: summstart geeta babita phogat attend birthday party s laga aamir summend Ground truth: summstart geeta, babita, phogatfamilyattend aamir s birthday party summend Generating speech during inference to predict the next frame, the output from previousstepistakenasinputin contrast to teacher-forcing, we generate random text sequences evaluated on the model. The spectrograms predicted from the model is converted to audiousinggriffin- lim algorithm to check themappingofcharacterembeddings to the phonons, ignoring the quality of the speech. Generative vocoder is used to generate high quality speech from the spectrogram. Thesynthesizedaudioisrated in respect to their clarity and natural sounding speech. A sample evaluation of this model is shown below. Sample input: For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process. Sample output: 5. CONCLUSIONS These models segmentation, text recognition, extractive and abstractive summarisation, spectrogram and speech synthesis are targeted at summarising the content in the image given and then converting itintoaudio. Withthese models, the user will have over 90% accuracy in converting the image with text into audio. To handle a wide range of words, we can add the entire English vocabulary which requires high performance GPU’s. We have trained the models only with English language. Further, the models can be trained with different languages for wider usage. As of now, the OCR model deals with a specific professional style of font. This could be further improved by training it with fonts of different styles. The audio output could be made more human-like by training the model more vigorously. ACKNOWLEDGEMENT We extend our gratitude to the Department of Computer Science and Engineering, PSG College of Technology for supporting us throughouttheresearchwork.
  • 7. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 394 REFERENCES [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob, Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need”, 2017, In Advances in Neural Information Processing Systems, pages 6000–6010. [2] K. U. Manjari, "Extractive Summarization of Telugu Documents using TextRank Algorithm," 2020Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2020, pp. 678-683,doi:10.1109/I-SMAC49090.2020.9243568. [3] Liao, Minghui and Wan, Zhaoyi and Yao, Cong and Chen, Kai and Bai, Xiang, “Real-time Scene Text Detection with Differentiable Binarization”, 2020, Proc. AAAI. [4] Baek, Jeonghun and Matsui, Yusuke and Aizawa, Kiyoharu, “What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels”, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. [5] E. Variani, T. Bagby, K. Lahouel, E. McDermott andM. Bacchiani, "Sampled Connectionist Temporal Classification," 2018 IEEE International Conference on Acoustics, Speech andSignal Processing(ICASSP), 2018, pp. 4959-4963, doi: 10.1109/ICASSP.2018.8461929. [6] J. Shen et al., "Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779- 4783, doi: 10.1109/ICASSP.2018.8461368. [7] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta and A. A. Bharath, "Generative Adversarial Networks: An Overview," in IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53-65, Jan. 2018, doi: 10.1109/MSP.2017.2765202. [8] K. N. Haque, R. Rana and B. W. Schuller, "High- Fidelity Audio Generation and Representation Learning With Guided Adversarial Autoencoder," in IEEE Access, vol. 8, pp. 223509-223528, 2020, doi: 10.1109/ACCESS.2020.3040797. [9] R. Dey and F. M. Salem, "Gate-variants of Gated Recurrent Unit (GRU) neural networks," 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), 2017, pp. 1597-1600, doi: 10.1109/MWSCAS.2017.8053243. [10] K. Kumar, “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis”.