Digital Humanities 101 - 2013/2014 - Course 7
Digital Humanities Laboratory
Andrea Mazzei and Frédéric Kaplan
andrea.mazzei,frederic.kaplan@epfl.ch

A Job offer
• Running an OCR transcription of 320 pages
• about 60 hours of work
• 25 CHF / hour.

Digital Humanities 101 - 2013/2014 - Course 7 | 2013

Results of the peer grading process

New projects

Venetian opera staging and machinery
• A project that seeks ways to better understand and visualize opera staging,
based on evidence found in historical sources (treatises, music prints, etc.)
• Rosand, E. 1990. Opera in Seventeenth-Century Venice : The Creation of a Genre.
Berkeley : University of California Press.
• Bjurström, P. 1962. Giacomo Torelli and Baroque Stage Design. Stockholm :
Almqvist and Wiksell.
• Leclerc, H. 1987. Venise et l'avènement de l'opéra public à l'âge baroque. Paris :
A. Colin.
• Larson, O. K. 1980. Giacomo Torelli, Sir Philip Skippon, and Stage Machinery for
the Venetian Opera, Theatre Journal, Vol. 32, No. 4, pp. 448-457.
www.jstor.org/stable/3207407

Venetian storytelling in the Middle Ages
• Marin Sanudo was a historical writer. Unlike other writers of his
time, he kept a diary recording all the events that happened in Venice.
His is of course not the only diary written in Venice. Imagine how this
personal information could be used.

Looking at music printing typefaces
• A project that looks at the different music typefaces used in Venetian
prints. Typical questions concern the size of the typefaces, when they were
used, for what repertoire, which printers used them, etc.
• Agee, R. 1998. The Gardano Music Printing Firms, 1569-1611.
Rochester, University of Rochester Press.
• Bernstein, J. 1998. Music Printing in Renaissance Venice. The Scotto
Press (1539-1572). Oxford, Oxford University Press.
• Bernstein, J. 2001. Print Culture and Music in Sixteenth-Century Venice.
Oxford, Oxford University Press.

Music at San Marco
• A project that looks at how the cappella di San Marco evolved over
time : how many musicians there were, where they played in the Basilica,
what they played, etc.
• Selfridge-Field, E. 1994. Venetian instrumental music from Gabrieli to
Vivaldi. New York : Dover.
• Moretti, L. 2004. Jacopo Sansovino and Adrian Willaert at St Mark’s,
Early Music History, Vol. 23, pp. 153-184.

Venetian music prints in libraries today
• A project that looks at the production of music prints in Venice and
where they are held today in libraries and archives around the world
• The Répertoire International des Sources Musicales (RISM), Series A/I on
music prints. http://www.rism.info [will be made available digitally for the
project]

Semester 1 : Content of each course
• (1) 19.09 Introduction to the course / Live Tweeting and Collective note
taking
• (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment
• (3) 2.10 Introduction to the Venice Time Machine project / Zotero
• 9.10 No course
• (4) 16.10 Digitization techniques / Deadline first assignment
• (5) 23.10 Datafication / Presentation of projects
• (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first
assignment
• (7) 6.11 Pattern recognition / OCR / Semantic disambiguation
• (8) 13.11 Historical Geographical Information Systems, Procedural modelling
/ City Engine / Deadline Project selection
• (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap
• (10) 27.11 Cultural heritage interfaces and visualisation / Museographic
experiences
• 4.12 Group work on the projects
• 11.12 Oral exam / Presentation of projects / Deadline Project blog
• 18.12 Oral exam / Presentation of projects

Today's course
• Printed Text Recognition
• Handwriting Recognition
• Ornament Recognition
• Text Mining and semantic disambiguation : Extracting named entities
(people, places, etc.) in a text using Wikipedia

Part I : Printed Text Recognition

OCR : Optical Character Recognition
A system that recognizes all the printed characters on a document
simply by scanning it.

Mori et al. (1992). Historical review of OCR R&D
• 1940 : The first version of OCR
• 1950 : The first OCR machines appear
• 1960 - 1965 : First generation OCR : NOF, Farrington 360, IBM 1418.
They all used a special font
• 1965 - 1975 : Second generation OCR : IBM 1287, NEC, Toshiba. They
could also recognize constrained hand-printed alpha-numerals.
• 1975 - 1985 : Third generation OCR : IBM 1975. Handles poor print quality
or handwritten characters. 275 fonts. Handwriting recognition.
• 1986 - Today : OCR to the people
Eikvil, L. (1993). Optical Character Recognition

OCR capabilities
The recognition performance depends on the type and number of fonts
recognized.
• Fixed font : the system can recognize only one font
• Multi font : the system can recognize multiple fonts
• Omni font : the system can recognize most nonstylized fonts without
having to maintain huge databases of specific font information

Omni-font OCR : Overview of Processing

Preprocessing : Text Lines Straightening

Zhang, Z., & Tan, C. L. (2002, June). Straightening warped text lines using polynomial regression. In Proceedings of the
2002 International Conference on Image Processing (Vol. 3, pp. 977-980). IEEE.
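
The idea named in the title can be sketched as follows: fit a polynomial to points sampled along a warped text line, then shift each column so the fitted baseline becomes flat. This is a minimal illustration and not the method of the cited paper; the function name, the synthetic points and the choice of the mean as target height are assumptions.

```python
import numpy as np

def straighten_offsets(xs, ys, degree=3):
    """Fit a polynomial baseline y = p(x) and return one vertical shift per point."""
    coeffs = np.polyfit(xs, ys, degree)   # least-squares polynomial regression
    baseline = np.polyval(coeffs, xs)     # fitted (warped) baseline
    target = baseline.mean()              # flat line to map onto (assumption)
    return target - baseline              # shift to apply at each x

# Synthetic warped baseline: a shallow parabola (hypothetical data).
xs = np.arange(0, 100, 10, dtype=float)
ys = 50.0 + 0.002 * (xs - 50.0) ** 2
shifts = straighten_offsets(xs, ys, degree=2)
corrected = ys + shifts                   # baseline points after straightening
```

After the shift, all baseline points sit at the same height, so the line can be read as a straight row of characters.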

Preprocessing : Character Detection
• Image binarization using local adaptive thresholding

• Character detection using region growing-based methods. PROBLEM !

Eikvil, L. (1993). Optical Character Recognition
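
The two steps above can be sketched with NumPy only: a pixel is marked as ink when it is darker than the mean of its neighbourhood, and ink pixels are then grouped by growing regions from seed pixels. The window size, the offset and the 4-connectivity are assumptions chosen for the toy page below. The sketch also shows where the PROBLEM comes from: two characters that touch would grow into a single region.

```python
import numpy as np
from collections import deque

def adaptive_binarize(gray, window=3, offset=0.0):
    """Mark a pixel as ink when it is darker than its local mean minus offset."""
    pad = window // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    h, w = gray.shape
    local_mean = np.zeros((h, w))
    for dy in range(window):              # sum the window by shifting the image
        for dx in range(window):
            local_mean += padded[dy:dy + h, dx:dx + w]
    local_mean /= window * window
    return gray < local_mean - offset

def grow_regions(ink):
    """Label 4-connected ink regions by breadth-first region growing."""
    labels = np.zeros(ink.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(ink)):
        if labels[seed]:
            continue                      # already part of a grown region
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < ink.shape[0] and 0 <= nx < ink.shape[1]
                if inside and ink[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count

# Toy page: light background with two dark vertical strokes.
page = np.full((7, 9), 200.0)
page[1:6, 2] = 20.0
page[1:6, 6] = 20.0
ink = adaptive_binarize(page, window=5, offset=10.0)
labels, n_regions = grow_regions(ink)
```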

Segmentation Problems : Touching and fragmented characters
• Joints will occur if the document is a dark photocopy or if it is scanned
at a low threshold.
• Joints are common if the fonts are serifed.
• The characters may be split if the document stems from a light
photocopy or is scanned at a high threshold

Segmentation Problems : Distinguishing noise from text
Dots and accents may be mistaken for noise, and vice versa.

Segmentation Problems : Mistaking graphics for text
This leads to non-text being sent to recognition, or to text not being sent.

Feature Extraction
From each character several features can be extracted :
• Rasterized pixels
• Geometric moment invariants
• Morphological features

Feature Extraction : Zoning
The image of the character is divided into M x N zones, and the average
gray level of each zone is used as a feature.

Trier, Ø. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods
for character recognition - a survey. Pattern Recognition, 29(4), 641-662
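
A minimal sketch of zoning, assuming the character is given as a 2-D gray-level array. The function name and the default grid size are illustrative choices, not taken from the survey.

```python
import numpy as np

def zoning_features(char_img, m=4, n=4):
    """Split the image into m x n zones and return the mean gray level of each."""
    rows = np.array_split(np.arange(char_img.shape[0]), m)
    cols = np.array_split(np.arange(char_img.shape[1]), n)
    zones = [[char_img[np.ix_(r, c)].mean() for c in cols] for r in rows]
    return np.array(zones).ravel()

# Toy 8x8 "character": only the top-left quadrant is bright.
toy_char = np.zeros((8, 8))
toy_char[:4, :4] = 255
zone_feats = zoning_features(toy_char, 2, 2)
```

With a 2 x 2 grid the toy image yields four features, one per quadrant, and only the first is non-zero.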

Feature Extraction : Projection Profile
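
As a minimal sketch, assuming a binarized character where 1 means ink: a projection profile counts the ink pixels in each row (horizontal profile) and in each column (vertical profile). The function name and the toy glyph are illustrative.

```python
import numpy as np

def projection_profiles(ink):
    """Return (ink count per row, ink count per column) of a binary image."""
    ink = np.asarray(ink, dtype=int)
    return ink.sum(axis=1), ink.sum(axis=0)

# Toy glyph shaped like an inverted T.
glyph = np.array([[0, 1, 0],
                  [0, 1, 0],
                  [1, 1, 1]])
row_profile, col_profile = projection_profiles(glyph)
```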

Feature Extraction : Structural Analysis
Strokes, bays, end-points, intersections between lines and loops.
High tolerance to noise and style variations.

Classification
The principal approaches to decision-theoretic recognition are minimum
distance classifiers, statistical classifiers and neural networks.
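
The simplest of these approaches, a minimum distance classifier, can be sketched as follows. Each class is represented by one prototype feature vector (here a class mean), and an unknown symbol gets the label of the closest prototype. The two-dimensional feature values are made up for illustration.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Return the label whose prototype is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

# Hypothetical prototypes in a 2-D feature space.
centroids = {"I": np.array([0.1, 0.9]), "O": np.array([0.8, 0.8])}
predicted = nearest_centroid(np.array([0.2, 1.0]), centroids)
```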

Matching

Optimum statistical classifiers.
• Bayesian classifier. Given an unknown symbol described by its feature
vector, the probability that the symbol belongs to the class c is computed
for all classes c = 1...N. The symbol is then assigned the class which
gives the maximum probability.
• ...
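
The Bayesian rule above can be sketched as follows. The sketch assumes that each feature follows an independent Gaussian distribution per class, which is a simplification introduced here and not stated on the slide. The class parameters and the feature value are made up.

```python
import math

def log_gaussian(x, mean, std):
    """Log of the Gaussian density, used to avoid numerical underflow."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def bayes_classify(features, classes):
    """classes maps label -> (prior, [(mean, std) for each feature])."""
    log_scores = {}
    for label, (prior, params) in classes.items():
        log_scores[label] = math.log(prior) + sum(
            log_gaussian(x, mean, std) for x, (mean, std) in zip(features, params))
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    posteriors = {label: w / total for label, w in weights.items()}
    # Assign the class with the maximum posterior probability.
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical one-feature model for the symbols "c" and "e".
symbol_classes = {"c": (0.5, [(0.30, 0.10)]), "e": (0.5, [(0.60, 0.10)])}
best_class, posteriors = bayes_classify([0.35], symbol_classes)
```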

Post Processing : Grouping
From symbols to strings, using the proximity of symbols
Eikvil, L. (1993). Optical Character Recognition

Post Processing : Error Detection and Correction
• Use of rules defining the syntax of the word. Example : in English, k
never appears after h.
• Use of dictionaries. If the word is not in the dictionary, an error has been
detected, and may be corrected by changing the word into the most
similar word.
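
Both checks can be sketched with the Python standard library. The tiny lexicon, the single forbidden pattern and the similarity cutoff are assumptions for illustration; `difflib.get_close_matches` stands in for whatever similarity measure a real OCR engine uses.

```python
import difflib
import re

LEXICON = ["optical", "character", "recognition", "cognition", "recreation"]
FORBIDDEN = re.compile(r"hk")   # syntax rule from the slide: no k after h

def check_and_correct(word, lexicon=LEXICON, cutoff=0.6):
    """Return the word if it is valid, else the most similar lexicon entry."""
    if word in lexicon and not FORBIDDEN.search(word):
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word   # leave unknown words untouched

corrected_word = check_and_correct("recogniti0n")
```

Here the OCR output "recogniti0n" is not in the lexicon, so an error is detected and the closest entry, "recognition", is proposed.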

Self-learning
Modern OCR systems enlarge their database of characters when new fonts
are encountered. Character recognition relies on a previously built
database, which contains the important features of the characters that
are already known. This database must be able to expand by itself as new
characters are met, in order to increase the recognition ability.

Handwriting Recognition (HWR)

Offline HWR : Many difficult problems
• Stroke ordering

• Broken lines

• Merged blobs

From Offline to Simulated Online

Reconstructing the pen movement from a static image is not reliable :
• What order were the strokes written in ?
• Doubled-up line segments ?
• Ink blobs ?
• Spurious joins between letters ?
• Missing joins ?

Segmentation : Strokes Extraction

Segmentation : Segments Fitting
Robustly cut letters into segments
Match multiple segments to detect letters
Easier than matching whole letter

Hutchison, L. Handwriting Recognition for Genealogical Records

Analytical Approach
It treats a word as a collection of simpler sub-units such as characters
• Segmentation of the word into these units
• Identification of the units
• Word-level interpretation using a predefined lexicon

Problems with the Analytical Approach
• segmentation ambiguity : deciding where to segment the word image

• variability of segment shape : determining the identity of each segment

Holistic Matching
Treats the word as a single, indivisible entity and attempts to recognize it
using features of the word as a whole.

Advantages of Holistic Matching
• Coarticulation effect : the appearance of a character changes as a
function of the shapes of neighboring characters.
• Orthogonality of holistic features : holistic features carry information
about the word that is clearly orthogonal to the knowledge of the
characters in it, and it stands to reason that introducing this knowledge
should improve recognition.
• Evidence from psychological studies : psychological studies of reading
point towards the fact that humans do not, in general, read words letter
by letter.

Dynamic Global Search
Assemble word spelling from possible letter readings
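
A much simplified sketch of this idea: each letter position comes with several possible readings and a confidence, and the word is chosen from a lexicon so that the combined confidence is maximal. It assumes a fixed segmentation into positions, which a real search would also have to explore. The readings, the lexicon and the floor value for unseen letters are made up.

```python
import math

def best_word(readings, lexicon, floor=1e-6):
    """Pick the lexicon word that best agrees with the per-position readings."""
    def score(word):
        # Sum of log-confidences; unseen letters get a small floor value.
        return sum(math.log(pos.get(ch, floor)) for ch, pos in zip(word, readings))
    candidates = [w for w in lexicon if len(w) == len(readings)]
    return max(candidates, key=score) if candidates else None

# Hypothetical letter readings for a three-letter word image.
letter_readings = [{"c": 0.6, "e": 0.4}, {"a": 0.6, "o": 0.4}, {"t": 0.7, "l": 0.3}]
word_lexicon = ["cat", "eat", "cot", "col", "dog", "cart"]
spelled = best_word(letter_readings, word_lexicon)
```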

Result 1

Result 2

Result 3

ABBYY FineReader : A Case Study

Scanned Document

Image Rotation Adjustment

First Extraction

Synthesizing the Table

Second Extraction

Retrieval of the ornaments from the Hand-Press Period

Problem Statement
For millions of intact books and tens of millions of loose pages, the
provenance of the manuscripts may be in doubt or completely unknown

Manual Solution
Human experts can recover the provenance by examining
linguistic, cultural and/or stylistic clues.
However, such experts are rare and the investigation is a time-consuming
process.

Automatic Solution
By comparing the initial letters in the manuscript to annotated initial
letters whose origin is known, the provenance can be determined.
This process can be automated.

What are the Challenges ?

Ornament Segmentation
Ornament(s) detection and localization with respect to the page reference system.

Baudrier, E., Busson, S., Corsini, S., Delalandre, M., Landré, J., &
Morain-Nicolier, F. (2009, July). Retrieval of the ornaments from the hand-press period

A Compression Based Distance Measure for Texture
The distance between a window and an annotated initial letter is
denoted as :
$$\mathrm{dist}_{CK1}(W, IL) = \frac{\mathrm{mpegSize}(W, IL) + \mathrm{mpegSize}(IL, W)}{\mathrm{mpegSize}(W, W) + \mathrm{mpegSize}(IL, IL)} - 1$$
The first image supplied to mpegSize is assigned as an I frame
and the second becomes a P frame.
Campana, B. J., & Keogh, E. J. (2010). A compression-based
distance measure for texture. Statistical Analysis and Data
Mining, 3(6), 381-398
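
The structure of the formula can be sketched as follows. Important assumption: the real measure relies on MPEG video encoding, which is not available in the standard library, so `pair_size` below uses zlib on the concatenated pixel bytes as a stand-in for mpegSize. The values it produces are therefore not CK1 distances; only the shape of the computation is illustrated.

```python
import zlib
import numpy as np

def pair_size(first, second):
    """Stand-in for mpegSize(first, second): zlib size of both images together."""
    return len(zlib.compress(first.tobytes() + second.tobytes(), 9))

def dist_ck1(w, il, size=pair_size):
    """Compression-based distance between a window w and an initial letter il."""
    return (size(w, il) + size(il, w)) / (size(w, w) + size(il, il)) - 1.0

# Two hypothetical 32x32 gray-level textures.
rng = np.random.default_rng(0)
window = rng.integers(0, 256, (32, 32), dtype=np.uint8)
initial = rng.integers(0, 256, (32, 32), dtype=np.uint8)
same = dist_ck1(window, window)
different = dist_ck1(window, initial)
```

An image compared with itself gives a distance of zero, because numerator and denominator are then identical; dissimilar images compress poorly together and give a larger value. A real MPEG-based size function can be passed through the `size` parameter.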

Properties of CK1 Distance Measure
Efficient, robust and parameter-free texture similarity measure.
Rotation, Colour and Illumination Invariant.

Gabor Filters

Images are convolved with each filter.
The standard deviation and mean of each filter response are collected
into a vector of length 48.
Vectors are compared using the Euclidean distance.
Wang, X., Ding, X., & Liu, C. (2005). Gabor filters-based feature extraction for
character recognition. Pattern recognition, 38(3), 369-379
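
A sketch of this feature extraction with NumPy only. The slide gives the vector length (48) but not the filter bank; the 24 filters used here (4 wavelengths x 6 orientations), the kernel size and the sigma are assumptions chosen so that mean and standard deviation yield 48 values. Convolution is done through the FFT and is circular at the borders, a simplification.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor filter: a Gaussian envelope times a cosine wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def convolve_same(image, kernel):
    """Circular convolution via the FFT, with the kernel centred on each pixel."""
    padded = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def gabor_features(image, wavelengths=(2, 4, 8, 16), n_orientations=6):
    """Mean and standard deviation of each filter response, concatenated."""
    feats = []
    for wl in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            response = convolve_same(image, gabor_kernel(15, wl, theta, 0.5 * wl))
            feats += [response.mean(), response.std()]
    return np.array(feats)

# Hypothetical 32x32 texture patches.
rng = np.random.default_rng(0)
patch_a = rng.random((32, 32))
patch_b = rng.random((32, 32))
vec_a, vec_b = gabor_features(patch_a), gabor_features(patch_b)
texture_distance = np.linalg.norm(vec_a - vec_b)   # Euclidean distance
```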

Data Sets

Experimental Results

Part II : Text mining and semantic disambiguation

Case study : Extracting named entities (people, places,
etc.) in a text using Wikipedia

Using Wikipedia
• A Unique ID : A Wikipedia article is identified by a unique name, which is
the article title itself. The respective URL of a Wikipedia article can be
created by concatenating the words in the article title and appending it
to the URL root of Wikipedia.
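
A minimal sketch of this construction. Wikipedia joins the words of a title with underscores; the English-language root used below is an assumption, since the slide does not say which language edition is meant.

```python
from urllib.parse import quote

WIKI_ROOT = "https://en.wikipedia.org/wiki/"   # assumed language edition

def article_url(title, root=WIKI_ROOT):
    """Join the words of the title with underscores and append to the root."""
    return root + quote(title.strip().replace(" ", "_"), safe="_(),:")

ucla_url = article_url("University of California, Los Angeles")
```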

• Redirections : Some entities can have multiple names. In order to address
this issue, Wikipedia has some article titles that do not have a
substantive article and are only redirected to a different Wikipedia article
with another title. This mechanism is called redirection. Redirections are
also used for other purposes such as spelling resolution (e.g. the article title
Oranges is redirected to Orange) and abbreviation resolution (e.g. the
article title UCLA is redirected to University of California, Los Angeles).

• Disambiguation pages : A disambiguation page is created for ambiguous
entity names and it enumerates all the possible articles for that name. For
example, the disambiguation page for Paris enumerates 25 places called
Paris (in America, Canada and Europe), 33 people having Paris as name
or surname, 10 television series and films, whose title contains the word
Paris, etc.

• Outgoing links : In the body text of the Wikipedia article there are
references (links) to other articles. The references are within pairs of
double square brackets.
• Infobox : An infobox is a fixed-format table designed to be added to the
top right-hand corner of articles to consistently present a summary of
some unifying aspect that the articles share and sometimes to improve
navigation to other interrelated articles.
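
Outgoing links can be collected from the wiki markup with a regular expression, since they sit between double square brackets. The sketch below keeps only the target title and drops an optional section anchor (after #) and an optional display label (after |). The sample markup is made up.

```python
import re

# [[Target]], [[Target|label]] or [[Target#Section]]
LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def outgoing_links(wikitext):
    """Return the titles of the articles linked from the markup."""
    return [m.group(1).strip() for m in LINK.finditer(wikitext)]

sample_markup = ("[[Venice]] was ruled by the [[Doge of Venice|Doge]]; "
                 "see [[Marin Sanudo#Diaries]].")
links = outgoing_links(sample_markup)
```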

3 steps
• Data extraction : A (sequence of) word(s) is extracted from a "Le
Temps" article (e.g. "Le Paris"). The right boundaries are then set in the
extracted data (e.g. "Paris" is retrieved from "Le Paris").
• Disambiguation : Retrieve all the Wikipedia articles whose title contains
the word ”Paris” (e.g. Paris (France), Paris (Texas), Paris Hilton, Paris
(mythology), etc). Find the Wikipedia article that maximizes the
agreement between the content extracted from Wikipedia and the
context of the ”Le Temps” article.
• Entity classification : Classify the entity as place, person, company, etc,
based on the chosen Wikipedia article

Disambiguation strategy

(1) Data extraction
• The first step is the extraction of possible named entities. This step is
based on the fact that the named entities consist of capitalized words.
The rules that we apply for the extraction of possible named mentions in
the text are the following :
• Retrieve all the capitalized words (e.g. England)
• Retrieve recursively terms T0 of the form T1 Particle T2, where Particle is one of a possessive
pronoun, and the terms T1 and T2 are capitalized words or sequences of capitalized words
(e.g. University of Edinburgh, European Society of Athletic Therapy and Training)
• In French, some entities can contain non-capitalized words, after some specific words.
Therefore, we retrieve non-capitalized words if they are preceded by a word that is contained
in a predefined set of words (e.g. Union, Bibliothèque, etc). For example, Union
soviétique is considered as an entity.
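
The first two rules, together with the boundary setting mentioned earlier ("Le Paris" gives "Paris"), can be sketched with a regular expression. The particle list and the determiner list are assumptions made for this example; the French rule for non-capitalized words is left out to keep the sketch short.

```python
import re

CAP = r"[A-ZÀ-ÖØ-Ý][\w'’-]*"                 # one capitalized word
SEQ = rf"{CAP}(?:\s+{CAP})*"                  # sequence of capitalized words
PARTICLE = r"(?:of|de|du|des)"                # assumed particle list
MENTION = re.compile(rf"{SEQ}(?:\s+{PARTICLE}\s+{SEQ})*")
DETERMINERS = {"The", "Le", "La", "Les"}      # assumed words to trim

def trim(mention):
    """Drop leading determiners so that 'Le Paris' becomes 'Paris'."""
    words = mention.split()
    while words and words[0] in DETERMINERS:
        words = words[1:]
    return " ".join(words)

def extract_mentions(text):
    """Return candidate named entities found in the text."""
    trimmed = (trim(m.group(0)) for m in MENTION.finditer(text))
    return [t for t in trimmed if t]

mentions = extract_mentions(
    "The University of Edinburgh wrote to England. Le Paris replied.")
```

Capitalized words at the start of a sentence are also picked up, which is one reason why the boundary setting and the later disambiguation step are needed.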

(2) Disambiguation
• The disambiguation process employs a vector space model, in which a
vectorial representation of the processed article is compared with the
vectorial representations of the Wikipedia entities.
• The vectorial representation of the processed article (article vector) is a
vector having all the possible entities of the specific article obtained
during the previous step, while the vectorial representation of a Wikipedia
article (Wikipedia vector) is a vector having all the outgoing links in the
body text of the article.
• Once a Wikipedia article is identified as the most similar to the processed
article, the article vector is updated by adopting the features of the
chosen Wikipedia vector.
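
A sketch of this comparison, assuming binary vectors (an entity is either present or absent) and cosine similarity as the measure of agreement; the slide does not name the measure. The link sets given for each candidate are toy data, not the real content of these Wikipedia articles.

```python
import math

def cosine(a, b):
    """Cosine similarity between two binary vectors given as sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def disambiguate(article_vector, candidates):
    """Return the candidate title whose outgoing links best match the article."""
    return max(candidates, key=lambda title: cosine(article_vector, candidates[title]))

# Possible entities found in the processed article (toy data).
article_vector = {"Paris", "Seine", "Louvre", "France"}
# Outgoing links of each candidate Wikipedia article (toy data).
candidates = {
    "Paris (France)": {"France", "Seine", "Louvre", "Eiffel Tower"},
    "Paris (Texas)": {"Texas", "Lamar County", "United States"},
    "Paris (mythology)": {"Troy", "Helen", "Priam"},
}
chosen = disambiguate(article_vector, candidates)
article_vector |= candidates[chosen]   # adopt the features of the chosen vector
```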

(3) Entity classification
• The last step is to classify the entities into persons, places, companies,
etc.
• Ex : Is the entity a place ? If the Wikipedia article contains an infobox,
then we retrieve it and we search for specific tags in it that can classify
the entity as a place.
• If the Wikipedia article does not have an infobox, then we use the first
sentence of the body text.
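
A sketch of this two-stage decision. The tag names and the cue words are hypothetical; the slide does not list the tags that were actually used.

```python
import re

PLACE_TAGS = {"coordinates", "population_total", "country"}    # hypothetical
PERSON_TAGS = {"birth_date", "birth_place", "occupation"}      # hypothetical
CUES = {"place": {"city", "town", "commune"},                  # hypothetical
        "person": {"historian", "writer", "born"}}

def infobox_keys(wikitext):
    """Collect the parameter names of lines of the form '| key = value'."""
    return set(re.findall(r"^\s*\|\s*([A-Za-z_ ]+?)\s*=", wikitext, flags=re.MULTILINE))

def classify(wikitext, first_sentence=""):
    """Use infobox tags when available, else cue words in the first sentence."""
    keys = infobox_keys(wikitext) if "{{Infobox" in wikitext else set()
    if keys & PLACE_TAGS:
        return "place"
    if keys & PERSON_TAGS:
        return "person"
    words = set(re.findall(r"\w+", first_sentence.lower()))
    for label, cues in CUES.items():
        if words & cues:
            return label
    return "unknown"

paris_markup = """{{Infobox settlement
| name = Paris
| coordinates = {{coord|48.85|2.35}}
}}"""
with_infobox = classify(paris_markup)
without_infobox = classify("", "Marin Sanudo was a Venetian historian and diarist.")
```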

Partial results
• We have implemented the algorithm and tested it on a subset of the
database
• Our current estimate is that 85 % of the entities are retrieved
• Main issue : Some entities are not in Wikipedia.

From Wikipedia to Wikipast
• The first principle of Wikipedia is that it is an encyclopedia. Not all
entities are allowed. Sourcing is important but secondary.
• Ongoing discussion with Wikimedia to create an alternative to
Wikipedia, allowing a page on any person, place, etc. from the past as long
as it is clearly sourced.

 
Digital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the fieldDigital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the field
 
Libs 602 portfolio presentation
Libs 602 portfolio presentationLibs 602 portfolio presentation
Libs 602 portfolio presentation
 
B.A in cinematography. Third year. Editing, Post-Production and Sound. 2016...
 B.A in cinematography. Third year.  Editing, Post-Production and Sound. 2016... B.A in cinematography. Third year.  Editing, Post-Production and Sound. 2016...
B.A in cinematography. Third year. Editing, Post-Production and Sound. 2016...
 
Program 2nd . DEGREE IN CINEMATOGRAPHY 2016/2017
Program 2nd . DEGREE IN CINEMATOGRAPHY 2016/2017Program 2nd . DEGREE IN CINEMATOGRAPHY 2016/2017
Program 2nd . DEGREE IN CINEMATOGRAPHY 2016/2017
 
Program 2nd year B.A IN CINEMATOGRAPHY 2016/2017
Program 2nd year B.A IN CINEMATOGRAPHY 2016/2017Program 2nd year B.A IN CINEMATOGRAPHY 2016/2017
Program 2nd year B.A IN CINEMATOGRAPHY 2016/2017
 
CEN standards - Cinema Expert Group, Brussels 15 October 2010
CEN standards - Cinema Expert Group, Brussels 15 October 2010CEN standards - Cinema Expert Group, Brussels 15 October 2010
CEN standards - Cinema Expert Group, Brussels 15 October 2010
 
Ijetcas14 371
Ijetcas14 371Ijetcas14 371
Ijetcas14 371
 
Program HIGHER EDUCATION IN CINEMATOGRAPHY. Second Year 2016 /2017
Program HIGHER EDUCATION IN CINEMATOGRAPHY. Second Year 2016 /2017 Program HIGHER EDUCATION IN CINEMATOGRAPHY. Second Year 2016 /2017
Program HIGHER EDUCATION IN CINEMATOGRAPHY. Second Year 2016 /2017
 
QR Codes online event 28.2.2018
QR Codes online event 28.2.2018QR Codes online event 28.2.2018
QR Codes online event 28.2.2018
 
2012 a rebeloijmir
2012 a rebeloijmir2012 a rebeloijmir
2012 a rebeloijmir
 

Mais de Frederic Kaplan

Les technologies absorbantes
Les technologies absorbantesLes technologies absorbantes
Les technologies absorbantesFrederic Kaplan
 
Transformer 4 millions d'articles de presse en un système d'information
Transformer 4 millions d'articles de presse en un système d'informationTransformer 4 millions d'articles de presse en un système d'information
Transformer 4 millions d'articles de presse en un système d'informationFrederic Kaplan
 
L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...
L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...
L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...Frederic Kaplan
 
3d scanning for digital heritage
3d scanning for digital heritage3d scanning for digital heritage
3d scanning for digital heritageFrederic Kaplan
 
Franziska Frey 2 / DHV13
Franziska Frey 2 / DHV13Franziska Frey 2 / DHV13
Franziska Frey 2 / DHV13Frederic Kaplan
 
Franziska Frey 1 / DHV13
Franziska Frey 1 / DHV13Franziska Frey 1 / DHV13
Franziska Frey 1 / DHV13Frederic Kaplan
 
Color and appearance information in 3d models
Color and appearance information in 3d modelsColor and appearance information in 3d models
Color and appearance information in 3d modelsFrederic Kaplan
 
Digital Humanities Venice Fall School: Introduction
Digital Humanities Venice Fall School: IntroductionDigital Humanities Venice Fall School: Introduction
Digital Humanities Venice Fall School: IntroductionFrederic Kaplan
 
La question de la langue à l'époque de Google
La question de la langue à l'époque de GoogleLa question de la langue à l'époque de Google
La question de la langue à l'époque de GoogleFrederic Kaplan
 
Edition numérique de Jean-Jacques Rousseau
Edition numérique de Jean-Jacques RousseauEdition numérique de Jean-Jacques Rousseau
Edition numérique de Jean-Jacques RousseauFrederic Kaplan
 
Les métamorphoses de la valeur
Les métamorphoses de la valeurLes métamorphoses de la valeur
Les métamorphoses de la valeurFrederic Kaplan
 
Développer la lecture sociale en bibliothèque
Développer la lecture sociale en bibliothèqueDévelopper la lecture sociale en bibliothèque
Développer la lecture sociale en bibliothèqueFrederic Kaplan
 
Introduction au capitalisme linguistique
Introduction au capitalisme linguistiqueIntroduction au capitalisme linguistique
Introduction au capitalisme linguistiqueFrederic Kaplan
 

Mais de Frederic Kaplan (19)

Les technologies absorbantes
Les technologies absorbantesLes technologies absorbantes
Les technologies absorbantes
 
La langue comme capital
La langue comme capitalLa langue comme capital
La langue comme capital
 
Transformer 4 millions d'articles de presse en un système d'information
Transformer 4 millions d'articles de presse en un système d'informationTransformer 4 millions d'articles de presse en un système d'information
Transformer 4 millions d'articles de presse en un système d'information
 
L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...
L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...
L'historien et l'algorithme : Présentation aux Entretiens du Nouveau Monde In...
 
3d scanning for digital heritage
3d scanning for digital heritage3d scanning for digital heritage
3d scanning for digital heritage
 
3d scanning pipeline
3d scanning pipeline3d scanning pipeline
3d scanning pipeline
 
Franziska Frey 2 / DHV13
Franziska Frey 2 / DHV13Franziska Frey 2 / DHV13
Franziska Frey 2 / DHV13
 
Franziska Frey 1 / DHV13
Franziska Frey 1 / DHV13Franziska Frey 1 / DHV13
Franziska Frey 1 / DHV13
 
3d scanning techniques
3d scanning techniques3d scanning techniques
3d scanning techniques
 
Color and appearance information in 3d models
Color and appearance information in 3d modelsColor and appearance information in 3d models
Color and appearance information in 3d models
 
3d from images
3d from images3d from images
3d from images
 
Pellegrini small
Pellegrini smallPellegrini small
Pellegrini small
 
Digital Humanities Venice Fall School: Introduction
Digital Humanities Venice Fall School: IntroductionDigital Humanities Venice Fall School: Introduction
Digital Humanities Venice Fall School: Introduction
 
La question de la langue à l'époque de Google
La question de la langue à l'époque de GoogleLa question de la langue à l'époque de Google
La question de la langue à l'époque de Google
 
Edition numérique de Jean-Jacques Rousseau
Edition numérique de Jean-Jacques RousseauEdition numérique de Jean-Jacques Rousseau
Edition numérique de Jean-Jacques Rousseau
 
QB1 : The story
QB1 : The storyQB1 : The story
QB1 : The story
 
Les métamorphoses de la valeur
Les métamorphoses de la valeurLes métamorphoses de la valeur
Les métamorphoses de la valeur
 
Développer la lecture sociale en bibliothèque
Développer la lecture sociale en bibliothèqueDévelopper la lecture sociale en bibliothèque
Développer la lecture sociale en bibliothèque
 
Introduction au capitalisme linguistique
Introduction au capitalisme linguistiqueIntroduction au capitalisme linguistique
Introduction au capitalisme linguistique
 

Último

ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsManeerUddin
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 

Último (20)

ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 

DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation

  • 1. Digital Humanities 101 - 2013/2014 - Course 7. Digital Humanities Laboratory. Andrea Mazzei and Frédéric Kaplan, andrea.mazzei,frederic.kaplan@epfl.ch
  • 2. A Job offer • Running an OCR transcription of 320 pages • About 60 hours of work • 25 CHF / hour
  • 3. Results of the peer grading process
  • 4. Results of the peer grading process
  • 5. Results of the peer grading process
  • 6. Results of the peer grading process
  • 7. Results of the peer grading process
  • 8. New projects
  • 9. Venetian opera staging and machinery • A project that seeks ways to better understand and visualize opera staging, based on evidence found in historical sources (treatises, music prints, etc.) • Rosand, E. 1990. Opera in Seventeenth-Century Venice: The Creation of a Genre. Berkeley: University of California Press. • Bjurström, P. 1962. Giacomo Torelli and Baroque Stage Design. Stockholm: Almqvist and Wiksell. • Leclerc, H. 1987. Venise et l'avènement de l'opéra public à l'âge baroque. Paris: A. Colin. • Larson, O. K. 1980. Giacomo Torelli, Sir Philip Skippon, and Stage Machinery for the Venetian Opera, Theatre Journal, Vol. 32, No. 4, pp. 448-457. www.jstor.org/stable/3207407
  • 10. Venetian storytelling in the Middle Ages • Marin Sanudo was a historical writer. In contrast to other writers of his time, he kept a diary noting all the events that happened in Venice. Of course, it is not the only diary written in Venice. Imagine how to use this personal information.
  • 11. Looking at music printing typefaces • A project that looks at the different music typefaces used in Venetian prints. Typical questions are: the size of the typefaces, when they were used, for what repertoire, which printers used them, etc. • Agee, R. 1998. The Gardano Music Printing Firms, 1569-1611. Rochester, University of Rochester Press. • Bernstein, J. 1998. Music Printing in Renaissance Venice. The Scotto Press (1539-1572). Oxford, Oxford University Press. • Bernstein, J. 2001. Print Culture and Music in Sixteenth-Century Venice. Oxford, Oxford University Press.
  • 12. Music at San Marco • A project that looks at how the cappella di San Marco evolved over time: how many musicians there were, where they played in the Basilica, what they played, etc. • Selfridge-Field, E. 1994. Venetian instrumental music from Gabrieli to Vivaldi. New York: Dover. • Moretti, L. 2004. Jacopo Sansovino and Adrian Willaert at St Mark's, Early Music History, Vol. 23, pp. 153-184.
  • 13. Venetian music prints in libraries today • A project that looks at the production of music prints in Venice and where they are held today in libraries and archives around the world • The Répertoire International des Sources Musicales, Series A/I on music prints. http://www.rism.info [will be made available digitally for the project]
  • 14. Semester 1: Content of each course • (1) 19.09 Introduction to the course / Live Tweeting and Collective note taking • (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment • (3) 2.10 Introduction to the Venice Time Machine project / Zotero • 9.10 No course • (4) 16.10 Digitization techniques / Deadline first assignment • (5) 23.10 Datafication / Presentation of projects • (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first assignment
  • 15. Semester 1: Content of each course • (7) 6.11 Pattern recognition / OCR / Semantic disambiguation • (8) 13.11 Historical Geographical Information Systems, Procedural modelling / City Engine / Deadline Project selection • (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap • (10) 27.11 Cultural heritage interfaces and visualisation / Museographic experiences • 4.12 Group work on the projects • 11.12 Oral exam / Presentation of projects / Deadline Project blog • 18.12 Oral exam / Presentation of projects
  • 16. Today's course • Printed Text Recognition • Handwriting Recognition • Ornament Recognition • Text Mining and semantic disambiguation: Extracting named entities (people, places, etc.) in a text using Wikipedia
  • 17. Part I: Printed Text Recognition
  • 18. OCR: Optical Character Recognition • A system that recognizes all the printed characters on a document simply by scanning the medium that carries them.
  • 19. Mori et al. (1992). Historical review of OCR R&D • 1940: The first version of OCR • 1950: The first OCR machines appear • 1960 - 1965: First generation OCR: NOF, Farrington 360, IBM 1418. They all used a special font. • 1965 - 1975: Second generation OCR: IBM 1287, NEC, Toshiba. They could also recognize constrained hand-printed alphanumerics. • 1975 - 1985: Third generation OCR: IBM 1975. Poor print quality or handwritten characters. 275 fonts. Handwriting recognition. • 1986 - Today: OCR to the people • Eikvil, L. (1993). Optical Character Recognition
  • 20. OCR capabilities • The recognition performance depends on the type and number of fonts recognized. • Fixed font: the system can recognize only one font • Multi font: the system can recognize multiple fonts • Omni font: the system can recognize most non-stylized fonts without having to maintain huge databases of specific font information
  • 21. Omni-font OCR • Overview of Processing
  • 22. Preprocessing: Text Lines Straightening • Zhang, Z., & Tan, C. L. (2002, June). Straightening warped text lines using polynomial regression. In Proceedings of the 2002 International Conference on Image Processing (Vol. 3, pp. 977-980). IEEE.
  • 23. Preprocessing: Character Detection • Image binarization using local adaptive thresholding • Character detection using region growing-based methods. PROBLEM! • Eikvil, L. (1993). Optical Character Recognition
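The binarization step named on slide 23 can be sketched as follows. This is a minimal illustration, assuming a mean-based local threshold; the function name `adaptive_threshold` and the `radius` and `offset` parameters are hypothetical and do not come from the slides, which do not say which local thresholding variant was used.

```python
def adaptive_threshold(image, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    A pixel is marked as ink (1) when it is darker than the mean of its
    local window minus `offset`; otherwise it is background (0).
    The mean-based rule is an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            binary[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return binary
```

Because the threshold follows the local mean, a page whose background darkens gradually from one side to the other can still be separated from the ink, which a single global threshold may fail to do.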
  • 24. Segmentation Problems: Touching and fragmented characters • Joints will occur if the document is a dark photocopy or if it is scanned at a low threshold. • Joints are common if the fonts are serifed. • The characters may be split if the document stems from a light photocopy or is scanned at a high threshold.
  • 25. Segmentation Problems: Distinguishing noise from text • Dots and accents may be mistaken for noise, and vice versa.
  • 26. Segmentation Problems: Mistaking graphics for text • This leads to non-text being sent to recognition, or to text not being sent.
  • 27. Feature Extraction • From each character several features can be extracted: • Rasterized pixels • Geometric moment invariants • Morphological features
  • 28. Feature Extraction: Zoning • The character image is divided into M×N zones, and the average gray level of each zone is used as a feature. • Due Trier, O., Jain, A. K., & Taxt, T. (1996). Feature extraction methods for character recognition - a survey. Pattern Recognition, 29(4), 641-662
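The zoning feature from slide 28 can be sketched in a few lines. The function name `zoning_features` and the row-major ordering of the output are assumptions made for this illustration; the sketch also assumes the glyph is at least m pixels high and n pixels wide, so that no zone is empty.

```python
def zoning_features(glyph, m, n):
    """Return the average gray level of each of the m x n zones of a glyph.

    `glyph` is a list of rows of gray levels. Zones are visited row by row
    (row-major order), which is a choice made for this sketch.
    """
    height, width = len(glyph), len(glyph[0])
    features = []
    for zone_row in range(m):
        top = zone_row * height // m
        bottom = (zone_row + 1) * height // m
        for zone_col in range(n):
            left = zone_col * width // n
            right = (zone_col + 1) * width // n
            pixels = [glyph[y][x] for y in range(top, bottom) for x in range(left, right)]
            features.append(sum(pixels) / len(pixels))
    return features
```

The resulting vector has M×N entries regardless of the glyph's pixel size, which is what allows characters scanned at different resolutions to be compared.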
  • 29. Feature Extraction: Projection Profile
  • 30. Feature Extraction: Structural Analysis • Strokes, bays, end-points, intersections between lines and loops. • High tolerance to noise and style variations.
  • 31. Classification • The principal approaches to decision-theoretic recognition are minimum distance classifiers, statistical classifiers and neural networks.
  • 32. Matching
  • 33. Optimum statistical classifiers • Bayesian classifier: given an unknown symbol described by its feature vector, the probability that the symbol belongs to class c is computed for all classes c = 1...N. The symbol is then assigned the class that gives the maximum probability. • ...
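The decision rule on slide 33 can be illustrated with a small Gaussian naive Bayes classifier. The Gaussian, independent-feature model, the variance smoothing constant, and the names `train`, `log_posterior` and `classify` are all assumptions made for this sketch; the slide only states the maximum-probability rule, not how the class probabilities are modelled.

```python
import math
from collections import defaultdict


def train(samples):
    """Estimate a prior, per-feature means and variances for each class.

    `samples` is a list of (feature_vector, label) pairs.
    """
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        count = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / count for col in columns]
        # A small constant keeps variances non-zero (assumption of this sketch).
        variances = [
            sum((v - mu) ** 2 for v in col) / count + 1e-6
            for col, mu in zip(columns, means)
        ]
        model[label] = (count / len(samples), means, variances)
    return model


def log_posterior(features, prior, means, variances):
    """Unnormalized log-probability of a class given the feature vector."""
    score = math.log(prior)
    for x, mu, var in zip(features, means, variances):
        score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return score


def classify(model, features):
    """Assign the class with the maximum posterior probability."""
    return max(model, key=lambda label: log_posterior(features, *model[label]))
```

Working with log-probabilities avoids numerical underflow when many features are multiplied together; the class that maximizes the log-posterior is the same one that maximizes the posterior itself.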
  • 34. Post Processing: Grouping • From symbols to strings, using the proximity of symbols • Eikvil, L. (1993). Optical Character Recognition
  • 35. Post Processing: Error Detection and Correction • Use of rules defining the syntax of the word. Ex.: in English, the k never appears after the h. • Use of dictionaries. If the word is not in the dictionary, an error has been detected, and may be corrected by changing the word into the most similar word.
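The dictionary-based correction on slide 35 can be sketched as below. The slide does not say how "most similar" is measured; this sketch assumes Levenshtein edit distance, and the names `edit_distance` and `correct` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def correct(word, dictionary):
    """Return the word unchanged if known, else the closest dictionary entry.

    `dictionary` is a list; on ties the earliest entry wins.
    """
    if word in dictionary:
        return word
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))
```

For example, an OCR output such as `0pera` (digit zero instead of the letter o) is one substitution away from `opera` and would be corrected to it, provided `opera` is in the dictionary.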
  • 36. Self-learning • Modern OCR systems enlarge their database of characters when new fonts are encountered. • Character recognition relies on a previously built database containing the important features of the characters already known. This database must be able to expand itself as new characters are encountered, in order to increase the recognition ability.
  • 37. Handwriting Recognition (HWR)
  • 38. Offline HWR: Many difficult problems • Stroke ordering • Broken lines • Merged blobs
  • 39. From Offline to Simulated Online • It is not reliable: • In what order were the strokes written? • Doubled-up line segments? • Ink blobs? • Spurious joins between letters? • Missing joins?
  • 40. Segmentation: Strokes Extraction
  • 41. Segmentation: Segments Fitting • Robustly cut letters into segments • Match multiple segments to detect letters • Easier than matching whole letters • Hutchison, L. Handwriting Recognition for Genealogical Records
  • 42. Analytical Approach • It treats a word as a collection of simpler sub-units such as characters: • Segmentation of the word into these units • Identification of the units • Word-level interpretation using a predefined lexicon
  • 43. Problems with the Analytical Approach • Segmentation ambiguity: deciding where to segment the word image • Variability of segment shape: determining the identity of each segment
  • 44. Holistic Matching • Treats the word as a single, indivisible entity and attempts to recognize it using features of the word as a whole.
  • 45. Advantages of Holistic Matching • Coarticulation effect, i.e., the changes in the appearance of a character as a function of the shapes of neighboring characters
  • 46. Advantages of Holistic Matching • Orthogonality of holistic features: they carry information about the word that is clearly orthogonal to the knowledge of the characters in it, so it stands to reason that introducing this knowledge should improve recognition.
  • 47. Advantages of Holistic Matching • Evidence from psychological studies: psychological studies of reading point towards the fact that humans do not, in general, read words letter by letter.
  • 48. Dynamic Global Search • Assemble word spelling from possible letter readings
  • 49. Result 1
  • 50. Result 2
  • 51. Result 3
  • 52. ABBYY Fine Reader: A Case Study
  • 53. Scanned Document
  • 54. Image Rotation Adjustment
  • 55. Image Rotation Adjustment
  • 56. First Extraction
  • 57. Synthesizing the Table
  • 58. Second Extraction
  • 59. Retrieval of the ornaments from the Hand-Press Period
  • 60. Problem Statement. For millions of intact books and tens of millions of loose pages, the provenance of the manuscripts may be in doubt or completely unknown.
  • 61. Manual Solution. Human experts can recover the provenance by examining linguistic, cultural and/or stylistic clues. However, such experts are rare and the investigation is time-consuming.
  • 62. Automatic Solution. By comparing the initial letters in the manuscript with annotated initial letters whose origin is known, the provenance can be determined. This process can be automated.
  • 63. What are the challenges?
  • 64. Ornament Segmentation. Detection and localization of ornament(s) with respect to the page reference system. Baudrier, E., Busson, S., Corsini, S., Delalandre, M., Landré, J., & Morain-Nicolier, F. (2009, July). Retrieval of the ornaments from the hand-press ...
  • 65. A Compression-Based Distance Measure for Texture. The distance between a window $W$ and an annotated initial letter $IL$ is defined as
      $$\mathrm{distCK1}(W, IL) = \frac{\mathrm{mpegSize}(W, IL) + \mathrm{mpegSize}(IL, W)}{\mathrm{mpegSize}(W, W) + \mathrm{mpegSize}(IL, IL)} - 1$$
      The first image supplied to mpegSize is encoded as an I-frame and the second as a P-frame. Campana, B. J., & Keogh, E. J. (2010). A compression-based distance measure for texture. Statistical Analysis and Data Mining, 3(6), 381-398.
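The CK1 formula on slide 65 can be tried out without a video encoder. The sketch below keeps the structure of the formula but swaps mpegSize for a general-purpose compressor (`zlib`), which is an assumption made purely for illustration: the original measure relies on MPEG I-frame/P-frame encoding of images, and `compressed_size`, `ck1_distance` and the toy byte strings are hypothetical names and data.

```python
import random
import zlib


def compressed_size(first: bytes, second: bytes) -> int:
    """Stand-in for mpegSize (assumption): size of both inputs compressed together."""
    return len(zlib.compress(first + second, 9))


def ck1_distance(w: bytes, il: bytes) -> float:
    """CK1-style distance: cross-compression cost over self-compression cost, minus 1."""
    cross = compressed_size(w, il) + compressed_size(il, w)
    own = compressed_size(w, w) + compressed_size(il, il)
    return cross / own - 1.0


# Toy "textures" as raw bytes: a regular stripe pattern and pseudo-random noise.
rng = random.Random(0)
stripes = bytes([0, 255] * 1000)
noise = bytes(rng.getrandbits(8) for _ in range(2000))

print(ck1_distance(stripes, stripes))  # identical inputs give a distance of zero
print(ck1_distance(stripes, noise))    # dissimilar inputs give a larger distance
```

The measure is symmetric by construction, since the numerator sums both encoding orders. A byte-level compressor does not reproduce the image-specific properties listed for CK1; only the arithmetic carries over.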
  • 66. Properties of the CK1 Distance Measure. Efficient, robust and parameter-free texture similarity measure. Invariant to rotation, colour and illumination.
  • 67. Gabor Filters. Images are convolved with each filter. The mean and standard deviation of each filter response are collected into a feature vector of length 48, and feature vectors are compared using the Euclidean distance. Wang, X., Ding, X., & Liu, C. (2005). Gabor filters-based feature extraction for character recognition. Pattern Recognition, 38(3), 369-379.
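The Gabor feature extraction on slide 67 can be sketched in plain Python. The slide only gives the vector length (48); the split into 4 wavelengths and 6 orientations (24 filters, two statistics each), the kernel size, the choice of sigma and all function names below are assumptions chosen to reach that length, not the parameters used in the lecture.

```python
import math


def gabor_kernel(wavelength, theta, sigma, size=7):
    """Real part of a Gabor filter as a size x size grid (list of rows)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_response(image, kernel):
    """Slide the kernel over every fully covered position of the image.

    The kernel is point-symmetric, so correlation and convolution coincide.
    """
    k = len(kernel)
    height, width = len(image), len(image[0])
    return [
        sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
        for i in range(height - k + 1)
        for j in range(width - k + 1)
    ]


def gabor_features(image, wavelengths=(2.0, 3.0, 4.0, 6.0), n_orientations=6):
    """Mean and standard deviation of each response: 4 x 6 x 2 = 48 values."""
    features = []
    for wavelength in wavelengths:
        for k in range(n_orientations):
            theta = k * math.pi / n_orientations
            kernel = gabor_kernel(wavelength, theta, sigma=0.5 * wavelength)
            response = filter_response(image, kernel)
            mean = sum(response) / len(response)
            variance = sum((r - mean) ** 2 for r in response) / len(response)
            features += [mean, math.sqrt(variance)]
    return features


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two toy 12 x 12 textures: horizontal and vertical stripes.
horizontal = [[float(i % 2)] * 12 for i in range(12)]
vertical = [[float(j % 2) for j in range(12)] for _ in range(12)]

f_h, f_v = gabor_features(horizontal), gabor_features(vertical)
print(len(f_h), euclidean(f_h, f_v))
```

Unlike CK1, this representation depends on explicit parameters (filter bank, kernel size), which is the contrast the two slides draw.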
  • 68. Data Sets
  • 69. Experimental Results
  • 70. Part II: Text mining and semantic disambiguation
  • 71. Case study: Extracting named entities (people, places, etc.) in a text using Wikipedia
  • 72. Using Wikipedia
      • A unique ID: a Wikipedia article is identified by a unique name, which is the article title itself. The URL of an article is formed by joining the words of its title (with underscores in place of spaces) and appending the result to the Wikipedia URL root.
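The title-to-URL rule on slide 72 can be sketched as follows. The function name `wikipedia_url` and the default English-language root are assumptions; a French corpus such as "Le Temps" would more plausibly use a different language edition, so the root is left as a parameter.

```python
from urllib.parse import quote


def wikipedia_url(title: str, root: str = "https://en.wikipedia.org/wiki/") -> str:
    """Join the words of the title with underscores and append them to the root."""
    slug = title.strip().replace(" ", "_")
    # Percent-encode non-ASCII characters; keep common title punctuation readable.
    return root + quote(slug, safe="(),:'")


print(wikipedia_url("Paris (mythology)"))
print(wikipedia_url("University of California, Los Angeles"))
```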
  • 73. Using Wikipedia
      • Redirections: some entities have multiple names. To address this, Wikipedia has article titles that carry no substantive article and simply redirect to a Wikipedia article with another title. This mechanism is called redirection. Redirections are also used for other purposes, such as spelling resolution (e.g. the article title Oranges redirects to Orange) and abbreviation resolution (e.g. the article title UCLA redirects to University of California, Los Angeles).
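Once a redirect table has been extracted, resolving a name is a simple lookup that follows the chain. A minimal sketch, assuming the redirects are available as a dictionary; the `REDIRECTS` table below contains only the two examples from slide 73, and `resolve_redirect` is a hypothetical helper name.

```python
# Toy redirect table with the two examples from the slide.
REDIRECTS = {
    "Oranges": "Orange",
    "UCLA": "University of California, Los Angeles",
}


def resolve_redirect(title, redirects=REDIRECTS):
    """Follow redirections until a title with a substantive article is reached."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)  # guard against redirect loops
        title = redirects[title]
    return title


print(resolve_redirect("UCLA"))
print(resolve_redirect("Orange"))  # not a redirect, returned unchanged
```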
  • 74. Using Wikipedia
      • Disambiguation pages: a disambiguation page is created for an ambiguous entity name and enumerates all the possible articles for that name. For example, the disambiguation page for Paris enumerates 25 places called Paris (in America, Canada and Europe), 33 people having Paris as name or surname, 10 television series and films whose title contains the word Paris, etc.
  • 75. Using Wikipedia
      • Outgoing links: the body text of a Wikipedia article contains references (links) to other articles. The references are enclosed in pairs of double square brackets.
      • Infobox: an infobox is a fixed-format table designed to be added to the top right-hand corner of articles to consistently present a summary of some unifying aspect that the articles share, and sometimes to improve navigation to other interrelated articles.
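The double-square-bracket convention from slide 75 makes outgoing links easy to pull out of an article's wikitext. The sketch below is a simplified reading of that syntax: it handles piped links (`[[Target|label]]`) and section anchors (`[[Target#Section]]`) but not nested constructs such as image captions. `outgoing_links` and the sample text are hypothetical.

```python
import re

# Link target, then an optional "#section" and an optional "|label", inside [[ ]].
LINK_PATTERN = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def outgoing_links(wikitext: str) -> list:
    """Return the titles of the articles linked from the given wikitext."""
    return [target.strip() for target in LINK_PATTERN.findall(wikitext)]


sample = "[[Paris]] is the capital of [[France|the French Republic]], on the [[Seine#Course|Seine]]."
print(outgoing_links(sample))
```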
  • 76. 3 steps
      • Data extraction: a word or sequence of words is extracted from a "Le Temps" article (e.g. Le Paris). The right boundaries are then set in the extracted data (e.g. "Paris" is retrieved from "Le Paris").
      • Disambiguation: retrieve all the Wikipedia articles whose title contains the word "Paris" (e.g. Paris (France), Paris (Texas), Paris Hilton, Paris (mythology), etc.). Find the Wikipedia article that maximizes the agreement between the content extracted from Wikipedia and the context of the "Le Temps" article.
      • Entity classification: classify the entity as place, person, company, etc., based on the chosen Wikipedia article.
  • 77. Disambiguation strategy
  • 78. (1) Data extraction
      • The first step is the extraction of possible named entities. It relies on the fact that named entities consist of capitalized words. The rules applied to extract possible named mentions from the text are the following:
      • Retrieve all capitalized words (e.g. England).
      • Recursively retrieve terms T0 of the form T1 Particle T2, where Particle is a connecting word (e.g. of, and) and T1 and T2 are capitalized words or sequences of capitalized words (e.g. University of Edinburgh, European Society of Athletic Therapy and Training).
      • In French, some entities contain non-capitalized words after certain specific words. Therefore, we also retrieve a non-capitalized word if it is preceded by a word from a predefined set (e.g. Union, Bibliothèque). For example, Union soviétique is considered an entity.
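The extraction rules on slide 78 can be approximated with a single pass over the tokens. This is a sketch under stated assumptions: the particle list `PARTICLES` is a guess based on the slide's examples, `TRIGGERS` contains only the two words the slide names, and `extract_mentions` is a hypothetical function, not the lecture's implementation. Sentence-initial words (e.g. "She") are kept as candidates, in line with the rule that all capitalized words are retrieved first.

```python
import re

PARTICLES = {"of", "and", "de", "du", "des"}  # assumed list of connecting words
TRIGGERS = {"Union", "Bibliothèque"}          # words that may take a lowercase follower


def is_capitalized(token: str) -> bool:
    return token.isalpha() and token[0].isupper()


def extract_mentions(text: str) -> list:
    """Group capitalized words, joined by particles, into candidate entities."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # punctuation ends a mention
    mentions, current = [], []
    for i, token in enumerate(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if is_capitalized(token):
            current.append(token)
        elif current and token.lower() in PARTICLES and is_capitalized(following):
            current.append(token)  # T1 Particle T2
        elif current and current[-1] in TRIGGERS and token.isalpha() and token.islower():
            current.append(token)  # e.g. "Union soviétique"
        elif current:
            mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


print(extract_mentions("She studied at the University of Edinburgh, then moved to England."))
print(extract_mentions("Un article sur l'Union soviétique."))
```

Processing the tokens left to right gives the same result as the recursive definition for chains such as European Society of Athletic Therapy and Training, because each particle is accepted only when a capitalized word follows it.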
  • 79. (2) Disambiguation
      • The disambiguation process employs a vector space model, in which a vector representation of the processed article is compared with the vector representations of the candidate Wikipedia entities.
      • The vector representation of the processed article (article vector) contains all the possible entities of that article obtained during the previous step, while the vector representation of a Wikipedia article (Wikipedia vector) contains all the outgoing links in the body text of the article.
      • Once a Wikipedia article is identified as the most similar to the processed article, the article vector is updated by adopting the features of the chosen Wikipedia vector.
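The comparison on slide 79 can be sketched with sets standing in for binary vectors. The slide does not name the similarity measure, so cosine similarity is an assumption here, as are the function names and the toy candidate data (the link sets are invented for illustration and are not real Wikipedia extracts).

```python
import math


def cosine(a: set, b: set) -> float:
    """Cosine similarity between two binary vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def disambiguate(article_vector: set, candidates: dict):
    """Pick the candidate whose outgoing links best match the article vector.

    Returns the chosen title and the article vector updated with its links.
    """
    best = max(candidates, key=lambda title: cosine(article_vector, candidates[title]))
    return best, article_vector | candidates[best]


# Toy data: entities found in the processed article, and outgoing links per candidate.
article_vector = {"Paris", "France", "Seine"}
candidates = {
    "Paris (France)": {"France", "Seine", "Louvre"},
    "Paris (Texas)": {"Texas", "United States"},
    "Paris Hilton": {"Hilton Hotels", "New York City"},
}

best, updated_vector = disambiguate(article_vector, candidates)
print(best, sorted(updated_vector))
```

Updating the article vector after each decision means that entities resolved early add context for the ones resolved later.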
  • 80. (3) Entity classification
      • The last step is to classify the entities into persons, places, companies, etc.
      • Example: is the entity a place? If the Wikipedia article contains an infobox, we retrieve it and search it for specific tags that can classify the entity as a place.
      • If the Wikipedia article does not have an infobox, we use the first sentence of the body text.
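The two-stage rule on slide 80 (infobox first, first sentence as fallback) can be sketched as below. The field names and cue words are illustrative assumptions, not the tag lists used in the lecture, and `classify_entity` is a hypothetical helper; the infobox is assumed to be already parsed into a dictionary, or `None` when the article has none.

```python
import re

# Illustrative infobox fields and first-sentence cue words (assumptions).
PLACE_FIELDS = {"coordinates", "population_total", "area_total_km2"}
PERSON_FIELDS = {"birth_date", "birth_place", "occupation"}
PLACE_CUES = {"city", "town", "capital", "village", "river"}
PERSON_CUES = {"born"}


def classify_entity(infobox, first_sentence: str) -> str:
    """Classify an entity from its infobox, else from its first sentence."""
    if infobox:
        fields = set(infobox)
        if fields & PLACE_FIELDS:
            return "place"
        if fields & PERSON_FIELDS:
            return "person"
    # No infobox, or an inconclusive one: fall back to the first sentence.
    words = set(re.findall(r"\w+", first_sentence.lower()))
    if words & PLACE_CUES:
        return "place"
    if words & PERSON_CUES:
        return "person"
    return "unknown"


print(classify_entity({"coordinates": "48.85, 2.35"}, ""))
print(classify_entity(None, "Paris is the capital and most populous city of France."))
```

Falling back to the first sentence when the infobox is inconclusive, and not only when it is missing, is a design choice of this sketch; the slide only describes the missing-infobox case.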
  • 81. Partial results
      • We have implemented the algorithm and tested it on a subset of the database.
      • Our current estimate is that 85% of the entities are retrieved.
      • Main issue: some entities are not in Wikipedia.
  • 82. From Wikipedia to Wikipast
      • The first principle of Wikipedia is that it is an encyclopedia. Not all entities are allowed. Sourcing is important but secondary.
      • There is an ongoing discussion with Wikimedia to create an alternative to Wikipedia, allowing a page on any person, place, etc. from the past as long as it is clearly sourced.