Exploration of Call Transcripts with MapReduce and Zipf’s Law

Tom Donoghue
School of Computing
National College of Ireland
Dublin, Ireland
Email: x16103491@student.ncirl.ie

Abstract: The pre-processing of text for downstream textual analysis may be well served by the Hadoop MapReduce paradigm. Text pre-processing steps comprise a simple workflow which could benefit from the parallelisation offered by MapReduce. This study implements a proof of concept pipeline to capture web based call transcripts and produce a word frequency dataset ready for textual analysis. The text corpus is created by extracting call transcripts from an investment aggregation website. A Zipfian distribution of the word frequencies in the question and answer section of the call transcripts is created. The distribution is used to explore areas which may contain words that could have a significant meaning to executives: words that may elevate or damage an executive's presence when uttered on a call. Identifying such words could allow executives to be better prepared ahead of future calls.

I. INTRODUCTION

The production, capture and storage of unstructured data continues to grow [1]. Corpora of unstructured textual data are gaining interest beyond the traditional domains of psychology and linguistics, and into the areas of data science and business [2]. This is aided by the ability to use cloud infrastructure, which brings higher levels of compute and storage. Businesses may find new ideas and knowledge via the analysis of textual data produced, consumed and communicated by their organisation. This may be of interest for both research and business, as the text could contain informational patterns which are not obvious due to its volume. The analysis of earnings calls is an area which attracts research by way of assessing linguistic impact on market reaction [3] [4]. This study is a proof of concept which explores the creation of a pipeline producing part-processed data subsets.
The concept may demonstrate that such a pipeline could scale and benefit from the parallel processing capabilities which the MapReduce architecture is designed to support. These subsets may be further processed for specific textual analysis requirements. The business question for this study is: How might a Zipfian distribution of unigrams in the question and answer section of earnings calls assist executives to be mindful of the words they use?

Earnings call transcripts are collected by web scraping the question and answer portion of transcripts published on seekingalpha.com (https://seekingalpha.com/, accessed April 27, 2017). The source text is stored in a MongoDB collection, then exported and uploaded to the Hadoop Distributed File System (HDFS). MapReduce aggregation jobs process the text and the results are stored in an HBase database. A MapReduce extract job outputs extracts back to HDFS, which are downloaded for linguistic exploration and the production of the Zipfian distribution.

The paper comprises the following sections: Section II reviews the related work, Section III describes the methodology, activities and challenges involved, Section IV presents the results, and Section V gives the conclusion and possible areas for further research.

II. RELATED WORK

There are many areas concerning the processing and analysis of bodies of text, many beyond the scope of this exploration. The literature pertaining to this study is reviewed below: web scraping of text, database usage, the application of MapReduce and the Zipfian distribution of word frequency.

A. Extracting Data from the Web

Obtaining data for data analytics through the extraction of bodies of text from websites is an alternative when access to proprietary or other data is unavailable [5]. The author refers to web scraping as the method of traversing one or more webpages and selecting the desired data based on the page's HTML. Further reference is made to the number of tools available to assist the capture of website data (e.g.
Scrapy (https://scrapy.org/, accessed April 27, 2017), which also provides the ability to create a spider to crawl websites). Website data capture is used in [6], where earnings call transcripts are extracted from seekingalpha.com to conduct an analysis of tone dispersion.

B. Databases

The arrival of new methods of deploying infrastructure and database management systems at scale, as [1] find, enables the move to adopt a 'schema on read' approach. This means that Hadoop and NoSQL databases are well positioned to ingest large volumes of unstructured data. As [7] note, HBase and MongoDB offer benefits which include a flexible schema, access to distributed compute and storage, and a lower cost of ownership when compared to relational databases. The ability to store unstructured data at scale in NoSQL databases also presents the issue of immature or nonstandard querying of the data [8]. Initial storage of source data is a fundamental requirement, and the fact that these database types are designed for
the ingestion of unstructured data appears to indicate their fitness for purpose.

C. MapReduce

Once stored, the data must be transformed and cleansed for onward processing. The adoption of MapReduce to tackle this set of activities, as [9] suggest, appears to be a valid approach due to the ability to apply succinct tasks to local copies of the data. As the authors find, this data locality and the parallelisation of tasks over distributed hardware infrastructure enable the processing to scale and cater for increasing data demands. Addressing the major benefits of the MapReduce paradigm, [10] describe it as offering: freedom for programmers to focus on the programming task rather than needing to understand the underlying infrastructure; the ability to ratchet the amount of compute and storage up or down; reduced reliance on multiple database loads; and resilience to failures, as processing nodes are replicated. The authors concur with [8] regarding the missing SQL-like ease of querying the data and the subsequent reliance on proficient programmers to implement the MapReduce jobs. This lack of abstraction in MapReduce is addressed through the introduction of Pig and Hive [11]. A shift to meet the requirement of real time analysis is provided by the arrival of Spark, which uses memory over disk to execute jobs [1]. Spark may accelerate the departure from MapReduce through the increase in processing speed and the greater task flexibility that it brings [12].

D. Word Frequency and Zipf Distributions

The analysis of text often examines word frequency and ranks words using Zipf's law [13]. This assists in the identification of unigrams which may need to be (and generally are) excluded from downstream analysis. The reasoning is that the most frequent and the lowest ranked words contribute little to the meaning of the document; the former are known as stop words.
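The rank and frequency relationship described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function name `rank_table` and the toy counts are hypothetical, and the expected value uses only the simplest form of Zipf's law, in which frequency is proportional to 1 / rank.

```python
def rank_table(word_counts):
    """Rank words by descending frequency; ties are broken alphabetically."""
    ordered = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    top_frequency = ordered[0][1]
    table = []
    for rank, (word, frequency) in enumerate(ordered, start=1):
        # Simplest form of Zipf's law: frequency is proportional to 1 / rank,
        # so the expected frequency at a given rank is f(1) / rank.
        expected = top_frequency / rank
        table.append((rank, word, frequency, expected))
    return table


# Toy counts, invented purely for illustration (not taken from the corpus).
toy_counts = {"the": 12, "and": 6, "volume": 4, "margin": 3, "slice": 1, "slip": 1}
table = rank_table(toy_counts)
for rank, word, frequency, expected in table:
    print(rank, word, frequency, round(expected, 1))
```

Comparing the observed and expected columns gives a quick, informal view of where a corpus departs from the idealised curve; it is not a substitute for a formal goodness of fit test.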
In the commercial arena, custom word lexicons are used to select words of interest and hence the removal of stop words is unwarranted [14]. The authors conclude that the Zipf distribution indicates which words could confound word classification if their sheer prevalence in the text corpus is ignored. The area of the distribution of most interest is the middle range which, as [13] suggests, is populated with words that are associated with or descriptive of the subject matter.

III. METHODOLOGY

The methodology applied in this study follows the big data workflow approach that [15] discuss, illustrated in Figure 1. The process comprises four steps: Extract, Transform and Load (ETL), processing, outcome and feedback. This iterative approach enables continuous refinement of the process with the objective of achieving the required results. The ETL step captures data, transforms it into the required format and stores it in a database or file system. The processing step conducts the compute effort, which could comprise the execution of MapReduce and machine learning algorithms. The outcome step is where the results are produced using visualisation.

Fig. 1. Data process flow, based on [15]

Scrapy is a web scraping package for Python in which a spider is created to crawl the webpage(s) it is directed at. The installation and configuration of Scrapy is straightforward. The challenge comes in understanding the HTML structure of the page(s) to crawl. Seeking out the exact data to scrape from the page requires the ability to use XPath and CSS selectors, and the many combinations of them which yield the required results. A page may offer a schema which can be followed, with identifiable way signs to the desired item. Other pages may offer little in the way of signage or of hierarchies that are easy to traverse. The earnings pages have a single <div> which pinpoints the start of the call transcript, but from there each item is at the same level.
As each call transcript has any number of pages and each page has any number of items, traversal requires planning an exact route. The question and answer section comprises items (HTML <p> tags) at the same level:

<p> - Question : Analyst
<p> - “I have a couple of questions please...?”
<p> - “And then on...?”
<p> - Answer : Exec 1
<p> - “Shall I start with...”
<p> - Answer : Exec 2
<p> - “As Mike says...”

The process pipeline for producing processed word based data subsets from unstructured text is illustrated in Figure 2 and described below.

A. Call Transcripts

Earnings call transcripts are published on seekingalpha.com, a web based investment research aggregator combining investor and industry knowledge. The calls selected for this proof of concept are taken from the drinks industry sector (e.g. Diageo, Heineken NV, Molson Coors and Anheuser-Busch InBev SA/NV). The call transcripts are available for viewing by registered users. The site lists the earnings links, and each call extends across multiple webpages depending on the length of the call. To capture the data, the web scraping
package Scrapy is used to extract the question and answer section of call transcripts from seekingalpha.com.

Fig. 2. Earnings calls transcript data process flow

Acquiring the correct XPath values to obtain all the questions and answers, together with who asks and who responds, from multiple pages rendered in this flat way takes a significant amount of trial and error. Lightly structured pages may take longer to code than well structured ones, depending on the user's experience with Scrapy. Once the spider is created it is a brittle artefact, relying on the premise that the webpage HTML structure remains unchanged for the course of the extraction.

The Scrapy spider is based on the transcript_spider.py script from the Scraping Alpha GitHub repository (https://github.com/Rumperuu/Scraping-Alpha, accessed April 27, 2017). The new basic spider takes a single call transcript URL and parses every page of the call transcript (the original author's spider parses only the first page). The complete question and answer section of the transcript (each individual analyst and executive and their associated question and answer couplet) is scraped and saved to a JSON file with the following attributes: title, company, executives, analysts, questions. The executives, analysts and questions are held in arrays. The basic spider is run for each URL, which is manually selected from the list to ensure it is a drinks industry related earnings call. The URL is copied and stored in the script. In future the list could be compiled in advance and stored in a file to be read by the spider.

Once the Scrapy spider has created the output file, it is imported to a MongoDB collection. MongoDB is selected as the initial storage component as it offers ease of ingestion of unstructured text via its import facility, which takes the Scrapy output JSON file.
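The difficulty of the flat page layout can be illustrated with a minimal Python sketch that groups a flat sequence of paragraph texts into speaker turns. It is not the spider used in the study and deliberately leaves Scrapy out: it assumes the paragraph texts have already been extracted, that speaker labels look like the schematic "Question : Analyst" and "Answer : Exec 1" lines shown earlier (real pages will differ), and the names `SPEAKER` and `group_turns` are hypothetical.

```python
import re

# Assumption: a speaker line looks like "Question : Analyst" or "Answer : Exec 1",
# mirroring the schematic layout in the text. Real transcript pages will differ.
SPEAKER = re.compile(r"^(Question|Answer)\s*:\s*(.+)$")


def group_turns(paragraphs):
    """Group a flat list of paragraph strings into speaker turns."""
    turns = []
    for text in paragraphs:
        text = text.strip()
        match = SPEAKER.match(text)
        if match:
            # A speaker label opens a new turn.
            turns.append({
                "kind": match.group(1).lower(),
                "speaker": match.group(2).strip(),
                "text": [],
            })
        elif turns and text:
            # Any other non-empty paragraph belongs to the most recent speaker.
            turns[-1]["text"].append(text)
    return turns


paragraphs = [
    "Question : Analyst",
    "I have a couple of questions please...?",
    "And then on...?",
    "Answer : Exec 1",
    "Shall I start with...",
    "Answer : Exec 2",
    "As Mike says...",
]
turns = group_turns(paragraphs)
for turn in turns:
    print(turn["kind"], turn["speaker"], len(turn["text"]))
```

Because sibling paragraphs carry no nesting, the grouping has to be rebuilt from document order alone, which is one reason a spider of this kind breaks as soon as the page layout changes.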
The JSON format affords a degree of partitioning of the text, because the Scrapy script manages the partitioning. With a relational database, the data would need to be normalised prior to load, which for large bodies of text is where relational databases may be less suitable [16]. The questions portion of the call transcripts is exported as a CSV file using the MongoDB export command, which caters for the selection of fields from the collection. The questions output file is uploaded to HDFS, ready as input for the MapReduce jobs. In future, the extraction of documents from the MongoDB collection could be customised (using pymongo and Python scripts) to create tailored question and answer subsets (e.g. mapped analyst questions and executive answers, questions and related answers, and company attributes).

B. MapReduce Jobs

The first MapReduce job conducts typical text pre-processing tasks: tokenisation of the text and counting unigram frequency. The word count design pattern is used with a custom tokeniser which converts each word in the string array to lowercase and removes all characters other than a-z. The reducer sums the word counts and inserts each word and count into the HBase 'transcript' table, which has a single column family. The second MapReduce job reads the 'transcript' table, takes the length of each word and outputs it to a new 'transcriptLength' table which contains 'word' and 'length'. The final job queries the 'transcript' table and outputs a word count file to HDFS, which is the input for the text analysis. In future, additional MapReduce tasks could conduct further text pre-processing (e.g. generating additional n-grams, removing stop words, and stemming).

C. Text Analysis

The word frequency file is downloaded from HDFS and copied to the host machine as input for the text analysis. The text analysis is conducted in R using the ZipfR package.
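The word count pattern described in the MapReduce jobs section can be sketched in plain Python, outside Hadoop. This is an illustration under stated assumptions, not the study's code: the functions `tokenise`, `map_phase` and `reduce_phase` are hypothetical names, the sample lines are invented, and the HBase tables are stood in for by ordinary dictionaries.

```python
import re
from collections import defaultdict
from itertools import chain

NON_ALPHA = re.compile(r"[^a-z]")


def tokenise(line):
    """Lowercase each whitespace-separated token and drop characters outside a-z."""
    for raw in line.split():
        word = NON_ALPHA.sub("", raw.lower())
        if word:
            yield word


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every token in the line."""
    return [(word, 1) for word in tokenise(line)]


def reduce_phase(pairs):
    """Reducer: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Invented sample lines, standing in for the exported questions file.
lines = ["Shall I start with the volume?", "The volume grew, as Mike says."]
word_counts = reduce_phase(chain.from_iterable(map_phase(line) for line in lines))

# Stand-in for the second job: record the length of each counted word.
word_lengths = {word: len(word) for word in word_counts}
print(word_counts["the"], word_counts["volume"], word_lengths["volume"])
```

On a cluster the mapper and reducer would run on separate splits of the input and the framework would group the pairs by key between the two phases; the dictionary in `reduce_phase` plays that grouping role here.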
ZipfR (http://zipfr.r-forge.r-project.org/, accessed April 27, 2017) produces many additional vocabulary growth statistics which are beyond the scope of this study. The focus is on examining the initial findings in the data. The word frequency file is read, the column names are changed to 'type' and 'f', and the file is saved as tab delimited to match the input format required by ZipfR. The tab delimited file is reloaded and a word frequency spectrum of the top 15 frequency classes is plotted in Figure 3. The word frequency spectrum shows the count of words in each frequency class. On the large number of rare events, [17] finds
that in a word frequency distribution there is a high number of words that occur with a very low frequency. Words that occur only once are called hapax legomena. The earnings call Zipf distribution is plotted on a log-log scale to illustrate word rank against frequency, as shown in Figure 4.

IV. RESULTS

The question and answer section of the earnings calls is transformed into a word frequency dataset of unigrams. The word frequency spectrum in Figure 3 shows the top 15 frequency classes; a summary is provided in Table I. For example, class 1, the hapax legomena class, contains 2086 words which appear only once in the corpus (e.g. skirmish, slapping, sleeves, slice, slip, slips, slitting, slotting, slowly, sluggishness). Class 2 contains 785 words which appear twice, class 3 contains 440 words which appear three times, and so forth.

Fig. 3. Word frequency spectrum

TABLE I
ZIPFR WORD FREQUENCY SPECTRUM

Sample size N = 95229
Vocabulary size V = 5372

Frequency class m   1     2    3    4    5    6    7    8
Class size Vm       2086  785  440  281  204  171  121  113

The Zipfian distribution of the earnings call question and answer section is shown in the log-log plot in Figure 4. The plot shows a line with a slope of -1, which on visual inspection suggests that the earnings call corpus follows Zipf's law. However, the goodness of fit summary (multivariate chi-squared test) produces χ2 = 268.28, df = 13, p < .001, which represents a poor fit and suggests that the word frequency does not adhere to Zipf's law. This departure may be due to the non-linear portions of the top ranked words.

Fig. 4. Zipf distribution log-log plot

The top 10 words from the word frequency data are shown in Table II and are common to the unigrams found in similar studies [13]. As the author finds, the middle linear region of the distribution offers a higher probability of a word having an association with the given topic of interest. From the plot in Figure 4 the middle section appears to be linear and may contain value.
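The two summaries used in these results, the frequency spectrum and the slope of the log-log rank and frequency plot, can be approximated with a short Python sketch. It is offered under clear assumptions: it is not the ZipfR procedure, the slope is an ordinary least-squares estimate rather than the chi-squared goodness of fit test reported above, and the function names and toy counts are hypothetical. The ten frequencies are those listed in Table II.

```python
import math
from collections import Counter


def frequency_spectrum(word_counts):
    """Return {m: V_m}: how many distinct words occur exactly m times."""
    return dict(Counter(word_counts.values()))


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(frequency) for frequency in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Frequencies of the ten top ranked words, as listed in Table II.
top10 = [4919, 2858, 2642, 2531, 2483, 2268, 1870, 1592, 1495, 1295]
print(round(loglog_slope(top10), 2))

# Toy counts, invented for illustration: two words occur once, one twice, one three times.
toy_counts = {"the": 3, "and": 2, "slice": 1, "slip": 1}
print(frequency_spectrum(toy_counts))
```

Fitting the slope separately on the head, middle and tail of the ranked list is one informal way to check the impression that only the middle region is close to linear.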
The top ranking and bottom ranking words, by contrast, appear to provide a weak semantic contribution. Following this approach for the earnings call unigrams, samples of 10 words taken around ranks 300 and 800 are listed in Tables III and IV respectively; these words seem to be of relatively more interest.

TABLE II
TOP 10 WORDS

Rank  Frequency  Word
1     4919       the
2     2858       and
3     2642       to
4     2531       in
5     2483       of
6     2268       that
7     1870       we
8     1592       you
9     1495       is
10    1295       i

TABLE III
RANK 300 TO 310 WORDS

Rank  Frequency  Word
301   43         behind
302   43         carlos
303   43         example
304   43         further
305   43         momentum
306   43         robert
307   43         trade
308   43         world
309   42         audio
310   42         clarke

Taking this small sample of 10 words, an executive with domain knowledge may be able to identify words which carry significantly greater meaning for them. The package appears to rank the words alphabetically within each frequency. Hence, all words for a given frequency should be viewed by the domain expert when conducting a review. Taking different portions of the middle region and asking the executive(s) to further rank the words in order of meaningfulness might add a level
of context. Moving beyond unigrams to bigrams and trigrams could provide an elevated level of context.

TABLE IV
RANK 800 TO 810 WORDS

Rank  Frequency  Word
801   14         gone
802   14         guinness
803   14         happy
804   14         imports
805   14         inaccuracies
806   14         increased
807   14         increasing
808   14         internet
809   14         job
810   14         journalists

V. CONCLUSION

This study explores the use of a proof of concept to create a text processing pipeline for textual analysis. Text data is sourced from earnings call transcripts extracted from seekingalpha.com using Scrapy. The text corpus is processed using Hadoop MapReduce design patterns. Two database management systems and HDFS are used as intermediate storage. The resulting unigram word frequency is analysed using a Zipfian distribution and frequency spectrum analysis. The complete corpus fails to show adherence to Zipf's law, but the middle section appears to hold a linear relationship and should be further investigated to test its goodness of fit. Middle ranking words from the linear section appear to offer meaning, containing more nouns and adjectives, and hence may be of interest to the executive.

The study is limited by the small sample of earnings calls, and could be improved by further MapReduce tasks which stem and lemmatise the unigrams. Processing bigrams and trigrams and inspecting the hapax legomena may also be areas of interest for further research. Finally, a domain expert could provide input and review the business question, indicating its validity and how it might need to be adjusted to obtain more meaningful results. The expert may also provide additional information to create rules and advise on specific word dictionaries and their use to filter and classify words.

REFERENCES

[1] D. Cheng, X. Zhou, P. Lama, J. Wu, and C. Jiang, “Cross-platform Resource Scheduling for Spark and MapReduce on YARN,” IEEE Transactions on Computers, vol. 9340, no. NOVEMBER 2016, pp. 1–14, 2016.
[2] E.-P. Lim, H. Chen, and G.
Chen, “Business intelligence and analytics: Research directions,” ACM Trans. Manage. Inf. Syst., vol. 3, no. 4, pp. 17:1–17:10, 2013.
[3] S. M. Price, J. S. Doran, D. R. Peterson, and B. A. Bliss, “Earnings conference calls and stock returns: The incremental informativeness of textual tone,” Journal of Banking and Finance, vol. 36, no. 4, pp. 992–1011, 2012.
[4] T. Loughran and B. McDonald, “When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks,” Journal of Finance, vol. 66, no. 1, pp. 35–65, 2011.
[5] M. L. Black, “The world wide web as complex data set: Expanding the digital humanities into the twentieth century and beyond through internet research,” International Journal of Humanities Arts Computing: A Journal of Digital Humanities, vol. 10, no. 1, pp. 95–109, 2016.
[6] K. D. Allee and M. D. Deangelis, “The structure of voluntary disclosure narratives: Evidence from tone dispersion,” Journal of Accounting Research, vol. 53, no. 2, pp. 241–274, 2015.
[7] Y.-s. Kang, I.-h. Park, J. Rhee, and Y.-h. Lee, “MongoDB-based Repository Design for IoT-generated RFID / Sensor Big Data,” IEEE Sensors Journal, vol. 16, no. 2, pp. 485–497, 2016.
[8] C. von der Weth and A. Datta, “Multiterm keyword search in NoSQL systems,” IEEE Internet Computing, vol. 16, no. 1, pp. 34–42, 2012.
[9] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[10] F. Li, B. C. Ooi, M. T. Özsu, and S. Wu, “Distributed data management using MapReduce,” ACM Computing Surveys, vol. 46, no. 3, pp. 1–42, 2014.
[11] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. Netto, and R. Buyya, “Big data computing and clouds: Trends and future directions,” Journal of Parallel and Distributed Computing, vol. 79–80, pp. 3–15, 2015, special issue on Scalable Systems for Big Data Management and Analytics.
[12] J. Arias, J. A. Gamez, and J. M.
Puerta, “Learning distributed discrete Bayesian Network Classifiers under MapReduce with Apache Spark,” Knowledge-Based Systems, vol. 117, pp. 16–26, 2016.
[13] H. M. Chang, “Constructing n-gram rules for natural language models through exploring the limitation of the Zipf-Mandelbrot law,” Computing (Vienna/New York), vol. 91, no. 3, pp. 241–264, 2011.
[14] T. Loughran and B. McDonald, “Textual Analysis in Accounting and Finance: A Survey,” Journal of Accounting Research, vol. 54, no. 4, pp. 1187–1230, 2016.
[15] B. Bengfort and J. Kim, Data Analytics with Hadoop, 1st ed. O’Reilly Media, Inc., 2016.
[16] A. Ittoo, L. M. Nguyen, and A. van den Bosch, “Text analytics in industry: Challenges, desiderata and trends,” Computers in Industry, vol. 78, pp. 96–107, 2016.
[17] R. H. Baayen, Word Frequency Distributions. Springer Science & Business Media, 2001, vol. 18.