Predicting the Present with Google Trends




             Hyunyoung Choi
                Hal Varian

                © Google Inc.

           Draft Date April 2, 2009
Contents

1 Methodology
  1.1 Data
  1.2 Model

2 Examples
  2.1 Retail Sales
  2.2 Automotive Sales
  2.3 Home Sales
  2.4 Travel

3 Conclusion

4 Appendix
  4.1 R Code: Automotive sales example used in Section 1

Motivation

Can Google queries help predict economic activity?
       Economists, investors, and journalists avidly follow monthly government data releases on economic
conditions. However, these reports are only available with a lag: the data for a given month are generally
released about halfway through the next month, and are typically revised several months later.
       Google Trends provides daily and weekly reports on the volume of queries related to various industries.
We hypothesize that this query data may be correlated with the current level of economic activity in
given industries and thus may be helpful in predicting the subsequent data releases.
       We are not claiming that Google Trends data help predict the future. Rather, we are claiming that
Google Trends may help in predicting the present. For example, the volume of queries on a particular
brand of automobile during the second week in June may be helpful in predicting the June sales report
for that brand, when it is released in July (see footnote i).
       Our goals in this report are to familiarize readers with Google Trends data, illustrate some simple
forecasting methods that use this data, and encourage readers to undertake their own analyses. Certainly
it is possible to build more sophisticated forecasting models than those we describe here. However, we
believe that the models we describe can serve as baselines to help analysts get started with their own
modelling efforts and can subsequently be refined for specific applications.
       The target audience for this primer is readers with some background in econometrics or statistics.
Our examples use R, a freely available open-source statistics package (see footnote ii); we provide the R
source code for the worked-out example in Section 1.2 in the Appendix.




 i.    It may also be true that June queries help to predict July sales, but we leave that question for future research.
 ii.   http://CRAN.R-project.org


Chapter 1

Methodology

Here we provide an overview of the data and statistical methods we use, along with a worked-out example.


1.1     Data

Google Trends provides an index of the volume of Google queries by geographic location and category.
   Google Trends data does not report the raw level of queries for a given search term. Rather, it reports
a query index. The query index starts with the query share: the total query volume for a search term in
a given geographic region divided by the total number of queries in that region at a point in time. The
query share numbers are then normalized so that they start at 0 on January 1, 2004. Numbers at later
dates indicate the percentage deviation from the query share on January 1, 2004.
   This query index data is available at country and state level for the United States and several other
countries. There are two front ends for Google Trends data, but the most useful for our purposes is
http://www.google.com/insights/search, which allows the user to download the query index data as
a CSV file.
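
   The construction of the query index described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the counts are made up, the variable names (term_queries, total_queries) are hypothetical, and the first observation stands in for the January 1, 2004 base date.

```r
# Hypothetical counts; none of these numbers come from Google Trends.
term_queries  <- c(120, 150, 132, 180)          # queries for the search term
total_queries <- c(10000, 11000, 12000, 12000)  # all queries in the region

# Query share: term volume divided by total volume at each point in time.
query_share <- term_queries / total_queries

# Query index: percentage deviation from the share in the base period,
# so the index is 0 at the base date by construction.
query_index <- 100 * (query_share / query_share[1] - 1)
print(round(query_index, 2))
```

   Because the index is a ratio of shares, it can fall even when the raw number of queries for the term rises, as long as total query volume rises faster.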
   Figure 1.1 depicts an example from Google Trends for the query [coupon]. Note that the search share
for [coupon] increases during the holiday shopping season and the summer vacation season. There has
been a small increase in the query index for [coupon] over time and a significant increase in 2008, which
is likely due to the economic downturn.
   Google classifies search queries into 27 categories at the top level and 241 categories at the second
level using an automated classification engine. Queries are assigned to particular categories using natural
language processing methods. For example, the query [car tire] would be assigned to the category Vehicle
Tires, which is a subcategory of Auto Parts, which is in turn a subcategory of Automotive.








                           Figure 1.1: Google Trends by Keywords - Coupon



1.2      Model

In this section, we will discuss the relevant statistical background and walk through a simple example.
Our statistical model is implemented in R and the code may be found in the Appendix. The example
is based on Ford’s monthly sales from January 2004 to August 2008 as reported by Automotive News.
Google Trends data for the category Automotive/Vehicle Brands/Ford is used for the query index data.
   Denote Ford sales in the $t$-th month as $\{y_t : t = 1, 2, \cdots, T\}$ and the Google Trends index in the $k$-th
week of the $t$-th month as $\{x_t^{(k)} : t = 1, 2, \cdots, T;\ k = 1, \cdots, 4\}$. The first step in our analysis is to plot
the data in order to look for seasonality and structural trends. Figure 1.2 shows a declining trend and
strong seasonality in both Ford Sales and the Ford Query index.
   We start with a simple baseline forecasting model: sales this month are predicted using sales last
month and 12 months ago.

$$\text{Model 0:}\quad \log(y_t) \sim \log(y_{t-1}) + \log(y_{t-12}) + e_t \tag{1.1}$$

The variable $e_t$ is an error term. This type of model is known in the literature as a seasonal autoregressive
model or a seasonal AR model.
   We next add the query index for ‘Ford’ during the first week of each month to this model. Denoting
this variable by $x_t^{(1)}$, we have

$$\text{Model 1:}\quad \log(y_t) \sim \log(y_{t-1}) + \log(y_{t-12}) + x_t^{(1)} + e_t \tag{1.2}$$

The least squares estimates for this model are shown in equation (1.3). The positive coefficient on the
Google Trends variable indicates that the search volume index is positively associated with Ford
sales.
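
   As a rough illustration of how Models 0 and 1 can be fit by least squares, the sketch below uses lm() on simulated data. It is not the code from the Appendix: the series log_y and x1, the data frame d, and all numeric settings are assumptions made up for this example, and the lags are built by hand to keep the block self-contained.

```r
set.seed(1)
n <- 56                                  # January 2004 to August 2008 is 56 months
x1 <- -10 + cumsum(rnorm(n))             # synthetic week-1 query index (assumption)
log_y <- 12.2 + 0.1 * sin(2 * pi * (1:n) / 12) + 0.006 * x1 + rnorm(n, sd = 0.05)

# Build lagged regressors by hand; the first 12 months have no 12-month lag.
d <- data.frame(
  ly   = log_y[13:n],        # log(y_t)
  ly1  = log_y[12:(n - 1)],  # log(y_{t-1})
  ly12 = log_y[1:(n - 12)],  # log(y_{t-12})
  x1   = x1[13:n]            # x_t^{(1)}
)

model0 <- lm(ly ~ ly1 + ly12, data = d)       # seasonal AR baseline, as in (1.1)
model1 <- lm(ly ~ ly1 + ly12 + x1, data = d)  # adds the query index, as in (1.2)
print(coef(model1))
```

   Calling plot(model1) on the fitted object produces R's four standard regression diagnostic plots.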

[Figure 1.2, panels: (a) Ford Monthly Sales, sales against time, 2004 to 2009; (b) Ford Google Trends, percentage change against time, 2004 to 2009]

                                   Figure 1.2: Ford Monthly Sales and Ford Query Index



    Figure 1.3 depicts the four standard regression diagnostic plots from R. Note that observation 18 (July
2005) is an outlier in each plot. We investigated this date and discovered that there was a special promotion
event during July 2005 called an ‘employee pricing promotion.’ We added a dummy variable to control
for this observation and re-estimated the model. The results are shown in (1.4).

$$\log(y_t) = 2.312 + 0.114 \cdot \log(y_{t-1}) + 0.709 \cdot \log(y_{t-12}) + 0.006 \cdot x_t^{(1)} \tag{1.3}$$

$$\log(y_t) = 2.007 + 0.105 \cdot \log(y_{t-1}) + 0.737 \cdot \log(y_{t-12}) + 0.005 \cdot x_t^{(1)} + 0.324 \cdot I(\text{July 2005}). \tag{1.4}$$

    Both models give us consistent results, and the coefficients they have in common are similar. The roughly
32.4% increase in sales in July 2005 (the dummy coefficient of 0.324, read as a log difference) seems to be
due to the employee pricing promotion. The coefficient on the Google Trends variable in (1.4) implies that a
one-point increase in the query index (a 1% rise in search volume relative to the January 1, 2004 baseline)
is associated with roughly a 0.5% increase in sales.
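
    Because the dependent variable is in logs, reading a coefficient directly as a percentage is an approximation that is accurate for small values and looser for large ones. The short sketch below converts the two coefficients from (1.4) into exact percentage effects; the variable names are ours and purely illustrative.

```r
dummy_coef  <- 0.324  # coefficient on I(July 2005) in (1.4)
trends_coef <- 0.005  # coefficient on the week-1 query index in (1.4)

# Exact percentage effect implied by a coefficient b in a log-linear model:
# 100 * (exp(b) - 1)
exact_dummy_effect  <- 100 * (exp(dummy_coef) - 1)
exact_trends_effect <- 100 * (exp(trends_coef) - 1)
print(round(c(dummy = exact_dummy_effect, trends = exact_trends_effect), 2))
```

    For the query index the approximation is essentially exact (about 0.5%), while for the promotion dummy the exact effect is closer to 38% than to 32.4%.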

    Does the Google Trends data help with prediction? To answer this question we make a series of
one-month ahead predictions and compute the prediction error defined in Equation (1.5). The average of
the absolute values of the prediction errors is known as the mean absolute error (MAE). Each forecast
uses only the information available up to the time the forecast is made, which is one week into the month
in question.

[Figure 1.3, panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage (with Cook's distance contours). Observations 18, 30, and 53 are labeled.]

                                                  Figure 1.3: Diagnostic Plots for the regression model




$$PE_t = \log(\hat{y}_t) - \log(y_t) \approx \frac{\hat{y}_t - y_t}{y_t} \tag{1.5}$$

$$MAE = \frac{1}{T} \sum_{t=1}^{T} |PE_t|$$
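
   The one-month-ahead evaluation can be sketched as a rolling refit: for each held-out month, estimate the model on all earlier months, predict, and record the prediction error from (1.5). The code below is a minimal illustration on simulated data, not the Appendix code; the helper rolling_pe, the data frame d, and the choice of 12 held-out months are all assumptions.

```r
set.seed(2)
n <- 56
x1 <- -10 + cumsum(rnorm(n))             # synthetic week-1 query index (assumption)
log_y <- 12.2 + 0.1 * sin(2 * pi * (1:n) / 12) + 0.006 * x1 + rnorm(n, sd = 0.05)
d <- data.frame(ly = log_y[13:n], ly1 = log_y[12:(n - 1)],
                ly12 = log_y[1:(n - 12)], x1 = x1[13:n])

# Hypothetical helper: refit on rows 1..(i-1), then predict row i.
rolling_pe <- function(formula, data, first_test) {
  sapply(first_test:nrow(data), function(i) {
    fit <- lm(formula, data = data[seq_len(i - 1), ])
    # PE_t = log(yhat_t) - log(y_t); the response is already in logs.
    unname(predict(fit, newdata = data[i, ])) - data$ly[i]
  })
}

pe0 <- rolling_pe(ly ~ ly1 + ly12, d, first_test = 33)       # Model 0
pe1 <- rolling_pe(ly ~ ly1 + ly12 + x1, d, first_test = 33)  # Model 1
mae <- c(model0 = mean(abs(pe0)), model1 = mean(abs(pe1)))
print(round(100 * mae, 2))  # MAE in percent
```

   Refitting inside the loop is what ensures that each forecast uses only data available before the month being predicted.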

   Note that the model that includes the Google Trends query index has smaller absolute errors in most
months, and its mean absolute error over the entire forecast period is about 3 percent smaller (Figure
1.4). Since July 2008, both models tend to overpredict sales, but Model 0 tends to overpredict
by more. It appears that the query index helps capture the fact that consumer interest in automotive
purchases has declined during this period.

[Figure 1.4: prediction error in percent by month, October 2007 to September 2008. Model 0 (MAE = 9.49%); Model 1 (MAE = 9.22%).]

                                   Figure 1.4: Prediction Error Plot

Chapter 2

Examples

2.1        Retail Sales

The US Census Bureau releases the Advance Monthly Retail Sales survey one to two weeks after the close of each
month. These figures are based on a mail survey from a number of retail establishments and are thought
to be useful leading indicators of macroeconomic performance. The data are subsequently revised at least
two times; see ‘About the survey’ (footnote i) for the description of the procedures followed in constructing these
numbers.
       The retail sales data are organized according to the NAICS retail trade categories (see footnote ii). The data are
reported in both seasonally adjusted and unadjusted form; for the analysis in this section, we use only
the unadjusted data.

                        NAICS Sectors                                       Google Categories
 ID        Title                                             ID       Title
 441       Motor vehicle and parts dealers                   47       Automotive
 442       Furniture and home furnishings stores             11       Home & Garden
 443       Electronics and appliance stores                  5        Computers & Electronics
 444       Building mat., garden equip. & supplies dealers   12-48    Construction & Maintenance
 445       Food and beverage stores                          71       Food & Drink
 446       Health and personal care stores                   45       Health
 447       Gasoline stations                                 12-233   Energy & Utilities
 448       Clothing and clothing access. stores              18-68    Apparel
 451       Sporting goods, hobby, book, and music stores     20-263   Sporting Goods
 452       General merchandise stores                        18-73    Mass Merchants & Department Stores
 453       Miscellaneous store retailers                     18       Shopping
 454       Nonstore retailers                                18-531   Shopping Portals & Search Engines
 722       Food services and drinking places                 71       Food & Drink


                                  Table 2.1: Sectors in Retail Sales Survey
 i.    http://www.census.gov/marts/www/marts.html
 ii.   http://www.census.gov/epcd/naics02/def/NDEF44.HTM





   As we indicated in the introduction, Google Trends provides a weekly time series of the volume of
Google queries by category. It is straightforward to match these categories to NAICS categories. Table
2.1 presents top level NAICS categories and the associated subcategories in Google Trends. We used
these subcategories to predict the retail sales release one month ahead.


[Figure 2.1, panels: 273: Motorcycles; 467: Auto Insurance; 610: Trucks & SUVs. Each panel plots the Census series against the Week One and Week Two query indexes, 2005 to 2008.]

Figure 2.1: Sales on ‘Motor vehicle and parts dealers’ from Census and corresponding Google Trends
data.



   Under the Automotive category, there are fourteen subcategories of the query index. From these
fourteen categories, the four most relevant subcategories are plotted against Census data for sales on
‘Motor Vehicles and Parts’ (Figure 2.1).
   We fit models from Section 1.2 to the data using 1 and 2 weeks of Google Trends data. The notation
$x_{610,t}^{(1)}$ refers to Google category 610 in week 1 of month $t$. Our estimated models are


$$\text{Model 0:}\quad \log(y_t) = 1.158 + 0.269 \cdot \log(y_{t-1}) + 0.628 \cdot \log(y_{t-12}), \quad e_t \sim N(0, 0.05^2) \tag{2.1}$$

$$\text{Model 1:}\quad \log(y_t) = 1.513 + 0.216 \cdot \log(y_{t-1}) + 0.656 \cdot \log(y_{t-12}) + 0.007 \cdot x_{610,t}^{(1)}, \quad e_t \sim N(0, 0.06^2)$$

$$\text{Model 2:}\quad \log(y_t) = 0.332 + 0.230 \cdot \log(y_{t-1}) + 0.748 \cdot \log(y_{t-12}) - 0.001 \cdot x_{273,t}^{(2)} + 0.002 \cdot x_{467,t}^{(1)} + 0.004 \cdot x_{610,t}^{(1)}, \quad e_t \sim N(0, 0.05^2).$$


Note that the $R^2$ moves from 0.6206 (Model 0) to 0.7852 (Model 1) to 0.7696 (Model 2). The models
show that the query index for ‘Trucks & SUVs’ exhibits positive association with reported sales on ‘Motor
Vehicles and Parts’.




[Figure 2.2: prediction error in percent by month, October 2007 to September 2008. Model 0 (MAE = 7.02%); Model 1 (MAE = 5.96%); Model 2 (MAE = 5.73%).]

                                                                                              Figure 2.2: Prediction Error Plot



   Figure 2.2 illustrates that the mean absolute error of Model 1 is about 15% better than that of Model 0, while
Model 2 is about 18% better in terms of this measure. However, all forecasts have been overly optimistic
since early 2008.
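
   The relative improvements quoted above follow directly from the MAE values reported in Figure 2.2. A minimal check, with variable names of our own choosing:

```r
# MAE in percent, as reported in Figure 2.2
mae <- c(model0 = 7.02, model1 = 5.96, model2 = 5.73)

# Relative reduction in MAE compared with the baseline Model 0, in percent
base <- mae[["model0"]]
improvement <- 100 * (base - mae[c("model1", "model2")]) / base
print(round(improvement, 1))
```

   Note that these are relative reductions in MAE; in absolute terms the gain is roughly one percentage point of prediction error.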

2.2                Automotive Sales

In Section 2.1, we used Google Trends to predict retail sales in ‘Motor vehicle and parts dealers’. While
automotive sales are an important indicator of economic activity, manufacturers are likely more interested
in sales by make.

    In the Google Trends category Automotive/Vehicle Brands there are 31 subcategories which measure
the relative search volume on various car makes. These can be easily matched to the 27 categories
reported in the ‘US car and light-truck sales by make’ tables distributed by Automotive Monthly.
    We first estimated separate forecasting models for each of these 27 makes using essentially the same
method described in Section 1.2. As we saw in that section, it is helpful to have data on sales promotions
when pursuing this approach.
    Since we do not have such data, we tried an alternative fixed-effects modeling approach. That is,
we assume that the short-term and seasonal lags are the same across all makes and that the differences
in sales volume by make can be captured by an additive fixed effect.
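
    One simple way to express this idea in R is to pool all makes into a single regression with common lag coefficients and a separate intercept per make. The sketch below is a simplified illustration on simulated data, not the authors' specification: it uses only one query index regressor, and the make labels, the data frame panel, and all numeric settings are assumptions.

```r
set.seed(3)
makes <- c("A", "B", "C")  # hypothetical makes
n <- 48                    # months per make

# Simulate each make with its own level but shared dynamics, then stack.
panel <- do.call(rbind, lapply(seq_along(makes), function(i) {
  x  <- cumsum(rnorm(n))   # synthetic query index for make i
  ly <- 10 + i + 0.1 * sin(2 * pi * (1:n) / 12) + 0.005 * x + rnorm(n, sd = 0.05)
  data.frame(make = makes[i], ly = ly[13:n], ly1 = ly[12:(n - 1)],
             ly12 = ly[1:(n - 12)], x1 = x[13:n])
}))
panel$make <- factor(panel$make)

# Common slopes on the lags and the query index; the make factor adds
# one additive fixed effect per make (relative to the first level).
fe_fit <- lm(ly ~ ly1 + ly12 + x1 + make, data = panel)
print(coef(fe_fit))
```

    Pooling trades flexibility for precision: each make no longer gets its own lag coefficients, but the shared coefficients are estimated from many more observations.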


[Figure 2.3: Sales and Google Trends for Top 2 Makes, Chevrolet & Toyota. Panel (a) Sales vs. Google Trends: sales and the Google Trends index plotted against time, 2004 to 2008, one plot each for Chevrolet and Toyota. Panel (b) Actual & Fitted Sales: actual and fitted sales plotted against time, 2004 to 2008, one plot each for Chevrolet and Toyota.]

   Denote the automotive sales for the i-th make in the t-th month as $\{y_{i,t} : t = 1, 2, \cdots, T;\ i = 1, \cdots, N\}$ and the corresponding i-th Google Trends index as $\{x_{i,t}^{(k)} : t = 1, 2, \cdots, T;\ i = 1, \cdots, N;\ k = 1, 2, 3\}$. Considering the relatively long research time associated with a car purchase, we used Google Trends from the second-to-last week of the previous month ($x_{i,t-1}^{(3)}$) to the first week of the month in question ($x_{i,t}^{(1)}$) as predictors.
   The estimates from Equation (2.2) indicate that the Google Trends index for a particular make in the last two weeks of the previous month is positively associated with current sales of that make.

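\[
\begin{aligned}
\log(y_{i,t}) ={}& 2.838 + 0.258 \cdot \log(y_{i,t-1}) + 0.448 \cdot \log(y_{i,t-12}) + \delta_i \cdot I(\text{Car Make})_i \\
& + 0.002 \cdot x_{i,t}^{(1)} + 0.003 \cdot x_{i,t}^{(2)} - 0.001 \cdot x_{i,t}^{(3)}, \qquad e_{i,t} \sim N(0, 0.13^2).
\end{aligned}
\tag{2.2}
\]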
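A minimal sketch of how a fixed effects regression with the structure of Equation (2.2) could be estimated by ordinary least squares. This is not the authors' code: the data are simulated, the function names (`simulate_panel`, `build_design`, `fit_fixed_effects`) and all simulation parameters are hypothetical, and the make effects are written as one dummy per make with no common intercept, which is an equivalent way of parameterizing an intercept plus make-specific shifts.

```python
import numpy as np

def simulate_panel(n_makes=4, n_months=120, noise_sd=0.05, seed=0):
    """Hypothetical data generator: log sales follow a fixed-effects recursion."""
    rng = np.random.default_rng(seed)
    # Illustrative slopes: lag 1, lag 12, and three weekly Trends indices.
    slopes = np.array([0.25, 0.45, 0.002, 0.003, -0.001])
    delta = rng.normal(5.0, 0.5, size=n_makes)               # make-specific intercepts
    x = rng.normal(0.0, 10.0, size=(n_makes, n_months, 3))   # simulated Trends indices
    log_y = np.empty((n_makes, n_months))
    # Start every make at its steady-state level to avoid a long transient.
    log_y[:, :12] = (delta / (1.0 - slopes[0] - slopes[1]))[:, None]
    for t in range(12, n_months):
        log_y[:, t] = (delta
                       + slopes[0] * log_y[:, t - 1]
                       + slopes[1] * log_y[:, t - 12]
                       + x[:, t] @ slopes[2:]
                       + rng.normal(0.0, noise_sd, size=n_makes))
    return log_y, x, delta, slopes

def build_design(log_y, x):
    """Stack one row per (make, month): make dummies, two lags, three Trends indices."""
    n_makes, n_months = log_y.shape
    rows, target = [], []
    for i in range(n_makes):
        dummies = np.zeros(n_makes)
        dummies[i] = 1.0                      # fixed effect for make i
        for t in range(12, n_months):         # the first 12 months only serve as lags
            lags = [log_y[i, t - 1], log_y[i, t - 12]]
            rows.append(np.concatenate([dummies, lags, x[i, t]]))
            target.append(log_y[i, t])
    return np.array(rows), np.array(target)

def fit_fixed_effects(log_y, x):
    """Least-squares estimates: n_makes intercepts followed by five slopes."""
    design, target = build_design(log_y, x)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

log_y, x, delta, slopes = simulate_panel()
coef = fit_fixed_effects(log_y, x)
print("estimated slopes:", np.round(coef[len(delta):], 4))
print("simulated slopes:", slopes)
```

With noisy data the estimated slopes only approximate the simulated ones; the sketch is meant to show the layout of the design matrix rather than to reproduce the estimates reported above.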


We can compare the fixed effects model to the separately estimated univariate models for each brand. In each of the separately estimated models we find a positive association with the relevant Google Trends index. Here are two examples.

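\[
\begin{aligned}
\text{Chevrolet}: \quad \log(y_{i,t}) &= 7.367 + 0.439 \cdot \log(y_{i,t-12}) + 0.017 \cdot x_{i,t}^{(2)}, \qquad e_t \sim N(0, 0.114^2) \\
\text{Toyota}: \quad \log(y_{i,t}) &= 4.124 + 0.655 \cdot \log(y_{i,t-12}) + 0.003 \cdot x_{i,t}^{(2)}, \qquad e_t \sim N(0, 0.093^2)
\end{aligned}
\tag{2.3}
\]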
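As a rough reading aid (a standard interpretation of log-linear coefficients, not a statement made by the authors): because the dependent variable is in logs, a one-point increase in the Trends index, holding lagged sales fixed, multiplies predicted sales by $e^{\beta}$, where $\beta$ is the Trends coefficient. For the two estimates in Equation (2.3):

```latex
\[
\begin{aligned}
\text{Chevrolet}: \quad e^{0.017} - 1 &\approx 0.0171 \quad (\text{about } 1.7\%), \\
\text{Toyota}: \quad e^{0.003} - 1 &\approx 0.0030 \quad (\text{about } 0.3\%).
\end{aligned}
\]
```

For coefficients this small, the percentage change is close to $100 \cdot \beta$.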


   Model (1.1) is fitted to each brand and compared to Model (2.2) and Model (2.3). As before, we made rolling one-step-ahead predictions from 2007-10-01 to 2008-09-01. We found that Model (1.1) performed best for Chevrolet while Model (2.3) performed best for Toyota, as shown in Figure 2.4.
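The rolling one-step-ahead procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the baseline is assumed to regress log sales on their 1-month and 12-month lags, the mean absolute error is assumed to be the average of |prediction − actual| / actual in percent, the function name `rolling_one_step_mae` is hypothetical, and the series is simulated.

```python
import numpy as np

def rolling_one_step_mae(y, n_test, extra=None):
    """Rolling one-step-ahead forecasts for the last n_test months.

    y:     1-D array of positive monthly values (e.g. sales).
    extra: optional (T, k) array of same-month predictors (e.g. Trends indices).
    Returns the mean absolute prediction error in percent.
    """
    log_y = np.log(y)
    n_months = len(y)

    def features(t):
        row = [1.0, log_y[t - 1], log_y[t - 12]]   # intercept and two lags
        if extra is not None:
            row.extend(extra[t])                   # predictors for month t
        return row

    errors = []
    for t in range(n_months - n_test, n_months):
        # Refit using only months strictly before t, then predict month t.
        design = np.array([features(s) for s in range(12, t)])
        coef, *_ = np.linalg.lstsq(design, log_y[12:t], rcond=None)
        prediction = np.exp(np.array(features(t)) @ coef)
        errors.append(abs(prediction - y[t]) / y[t] * 100.0)
    return float(np.mean(errors))

# Simulated example in which the extra predictor carries real signal.
rng = np.random.default_rng(1)
trends = rng.normal(0.0, 10.0, size=(72, 1))
sales = np.exp(10.0 + 0.02 * trends[:, 0] + rng.normal(0.0, 0.01, size=72))

mae_base = rolling_one_step_mae(sales, n_test=12)
mae_trends = rolling_one_step_mae(sales, n_test=12, extra=trends)
print(f"baseline MAE: {mae_base:.2f}%   with predictor: {mae_trends:.2f}%")
```

Refitting at every step means each forecast uses only data available before the month being predicted, which is what makes the comparison between models an out-of-sample one.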
   One issue with the fixed effects model is that it imposes the same seasonal effects for each make. This may or may not be accurate. See, for example, the fit for Lexus in Figure 2.4. In December, Lexus has traditionally run an ad campaign suggesting that a new Lexus would be a welcome Christmas present. Hence we observe a strong seasonal spike in December Lexus sales which is not present with other makes. In this case, it makes sense to estimate a separate model for Lexus. Indeed, if we estimate a separate model for Lexus, we get an improved fit, with the mean absolute error falling from X to Y.
[Figure 2.4: Prediction Error Plot by Make - Chevrolet & Toyota. Each panel plots Prediction Error in Percent by month, September 2007 to August 2008. Panel (a) Chevrolet, legend: Model (1.1) (MAE = 7.95%), Model (2.2) (MAE = 11.16%), Model (2.3) (MAE = 9.17%). Panel (b) Toyota, legend: Model (1.1) (MAE = 6.18%), Model (2.2) (MAE = 5.41%), Model (2.3) (MAE = 6.31%).]

2.3 Home Sales

The US Census Bureau and the US Department of Housing and Urban Development release statistics on the housing market at the end of each month [iii]. The data include figures on ‘New House Sold and For Sale’ by price and stage of construction. New House Sales peaked in 2005 and have been declining since then (Figure 2.5(a)). The price index peaked in early 2007 and has declined steadily for several months. Recently the price index fell sharply (Figure 2.5(b)).
       The Google Trends ‘Real Estate’ category has 6 subcategories (Figure 2.6): Real Estate Agencies (Google Category Id: 96), Rental Listings & Referrals (378), Property Management (425), Home Inspections & Appraisal (463), Home Insurance (465), and Home Financing (466). It turns out that the search index for Real Estate Agencies is the best predictor for contemporaneous house sales.
[Figure 2.5: Number and Price of New House Sold. Panel (a) Number of New House Sold: Not Seasonally Adjusted and Seasonally Adjusted series, 2003 to 2008. Panel (b) Prices of New House Sold: Median Sales Price and Average Sales Price, 2003 to 2008.]



       We fit our model to seasonally adjusted sales figures, so we drop the 12-month lag used in our earlier
model, leaving us with Equation (2.4).


\[
\text{Model 0:} \quad \log(y_t) \sim \log(y_{t-1}) + e_t, \tag{2.4}
\]

where $e_t$ is an error term.

iii. http://www.census.gov/const/www/newressalesindex.html

[Figure 2.6: Time Series Plots: New House Sold vs. Subcategories of Google Trends. Six panels, each plotting the Google Trends index against New House Sale, 2004 to 2008: 96: Real Estate Agencies; 378: Rental Listings & Referrals; 425: Property Management; 463: Home Inspections & Appraisal; 465: Home Insurance; 466: Home Financing.]

The model is fitted to the data and Equation (2.5) shows the estimates of the model. The model implies that (1) house sales at (t − 1) are positively related to house sales at t, (2) the search index on ‘Rental Listings & Referrals’ (378) is negatively related to sales, (3) the search index for ‘Real Estate Agencies’ (96) is positively related to sales, (4) the average housing price is negatively associated with sales.

\[
\text{Model 1:} \quad \log(y_t) = 5.795 + 0.871 \cdot \log(y_{t-1}) - 0.005 \cdot x_{378,t}^{(1)} + 0.005 \cdot x_{96,t}^{(2)} - 0.391 \cdot \text{Avg Price}_t \tag{2.5}
\]

The one-step-ahead prediction errors are shown in Figure 2.7(b). The mean absolute error is about 12% lower for the model that includes the Google Trends variables.
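The 12% figure can be checked against the mean absolute errors reported in the legend of Figure 2.7(b), 6.91% for Model 0 and 6.08% for Model 1. The following is a worked check, not a calculation taken from the text:

```latex
\[
\frac{\mathrm{MAE}_{\text{Model 0}} - \mathrm{MAE}_{\text{Model 1}}}{\mathrm{MAE}_{\text{Model 0}}}
= \frac{6.91 - 6.08}{6.91}
= \frac{0.83}{6.91}
\approx 0.120 .
\]
```

The improvement is therefore a relative reduction of about 12%, corresponding to 0.83 percentage points of absolute error.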




[Figure 2.7: New One Family House Sales - Fit and Prediction. Panel (a) Seasonally Adjusted Annual Sales Rate vs. Fitted: Seasonally Adjusted Annual Sales Rate in 1,000 against Time, 2004 to 2008. Panel (b) 1 Step ahead Prediction Error: Prediction Error in Percent by month, August 2007 to July 2008; legend: Model 0 (MAE = 6.91%), Model 1 (MAE = 6.08%).]

2.4 Travel

The internet is commonly used for travel planning, which suggests that Google Trends data about destinations may be useful in predicting visits to those destinations. We illustrate this using data from the Hong Kong Tourism Board [iv].

[Figure: time series panels, 2004 to 2008, with panel titles USA, Canada, Britain, Germany, France, Italy, Australia, Japan, India.]
          60000




                                                                                                                                                                                                                                  100
                                                                                                                   110
                                                      50




                                                                                                                                                                                                                                  80
                                                      40
          50000




                                                                                                                   10 20 30 40 50 60 70 80 90
                                                                90000 100000
                                                      30




                                                                                                                                                                                                                                  60
                                                      20




                                                                                                                                                                                                                                  40
          40000




                                                      10




                                                                                                                                                                                                                                  20
                                                      0

                                                                80000
                                                      −20 −10




                                                                                                                                                                                                                                  0
          30000




                                                                70000




                                                                                                                                                                                                                                  −20




                   2004   2005   2006   2007   2008                            2004   2005    2006   2007   2008                                                                              2004   2005    2006   2007   2008




                                    Figure 2.8: Visitors Statistics and Google Trends by Country

Note: The black line depicts visitor arrival statistics and the red line depicts the Google Trends index by country.


      The Hong Kong Tourism Board publishes monthly visitor arrival statistics, including ‘Monthly visitor
arrival summary’ by country/territory of residence, the mode of transportation, the mode of entry and
other criteria. We use visitor arrival statistics by country from January 2004 to August 2008 for this
analysis.

iv.   http://partnernet.hktourismboard.com

   The foreign exchange rate, defined as HKD per unit of domestic currency, is used as another predictor of
visitor volume. The Beijing Olympics were held from 2008-08-08 to 2008-08-24, and traffic in July 2008 and
August 2008 is expected to be lower than usual, so we use a dummy variable to adjust for the traffic
difference during those months.
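
The two extra predictors described above could be laid out as follows. This is a minimal sketch: the data frame `arrivals`, its column names, and the constant exchange rate are hypothetical placeholders, not the authors' data.

```r
# Hypothetical monthly skeleton for one country (column names are assumptions).
months <- seq(as.Date("2004-01-01"), as.Date("2008-08-01"), by = "month")
arrivals <- data.frame(month = months)

# Dummy variable: 1 for July and August 2008 (the Olympics adjustment months), else 0.
olympic_months <- as.Date(c("2008-07-01", "2008-08-01"))
arrivals$Beijing <- as.integer(arrivals$month %in% olympic_months)

# Exchange rate expressed as HKD per unit of the visitor's domestic currency.
# The value is a made-up constant that only illustrates the column layout.
arrivals$fx <- rep(7.8, nrow(arrivals))

tail(arrivals, 3)
```

In a multi-country panel the same dummy would be repeated for every country, which is what allows it to be interacted with the country indicator.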
   ‘Hong Kong’ is one of the subcategories under Vacation Destinations in Google Trends. The
countries of origin in our analysis are USA, Canada, Great Britain, Germany, France, Italy, Australia,
Japan and India. Visitors from these 9 countries account for around 19% of total visitors to Hong Kong
during the period we examine. The visitor arrival statistics for all countries show seasonality and an
increasing trend over time, but the trend growth rates differ by country (Figure 2.8).
   Here we examine a fixed effects model. In Equation (2.6), ‘Country_i’ is a dummy variable that indicates
each country, and its interaction with log(y_{i,t−12}) captures the different year-to-year growth rates.
‘Beijing’ is another dummy variable that indicates the Beijing Olympics period.

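\begin{align}
\log(y_{i,t}) &= 2.412 + 0.059 \cdot \log(y_{i,t-1}) + \beta_{i,12} \cdot \log(y_{i,t-12}) \times \text{Country}_i \nonumber \\
&\quad + \delta_i \cdot \text{Beijing} \times \text{Country}_i + 0.001 \cdot x^{(2)}_{i,t} + 0.001 \cdot x^{(3)}_{i,t} + e_{i,t}, \qquad e_{i,t} \sim N(0, 0.09^2) \tag{2.6}
\end{align}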

   From Equation (2.6), we learn that (1) arrivals last month are positively related to arrivals this
month, (2) arrivals 12 months ago are positively related to arrivals this month, (3) Google searches on
‘Hong Kong’ are positively related to arrivals, and (4) travel to Hong Kong decreased during the Beijing
Olympics.
   Table 2.2 is an Analysis of Variance table for Model (2.6). It shows that most of the variance is
explained by the lagged arrivals variables and that the contribution from the Google Trends variables is
statistically significant. Figure 2.9 shows the actual arrival statistics and the fitted values. The model
fits remarkably well, with an adjusted R^2 of 0.9875.
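
A table with the same layout as Table 2.2 can be produced with `lm` and `anova`. The sketch below runs on simulated data, so its numbers bear no relation to the published ones. The variable names are assumptions, as is the reading of x^(2) and x^(3) as Trends indices for the second and third week of the month, and the formula is inferred from the row order of Table 2.2 rather than taken from the authors' code.

```r
set.seed(1)
countries <- c("USA", "Canada", "Britain", "Germany", "France",
               "Italy", "Australia", "Japan", "India")
months <- seq(as.Date("2004-01-01"), as.Date("2008-08-01"), by = "month")

# Simulated stand-in for the arrivals panel; every value here is synthetic.
panel <- do.call(rbind, lapply(countries, function(cn) {
  n <- length(months)
  y <- exp(10 + 0.3 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 0.1))
  data.frame(
    Country = cn,
    month   = months,
    y       = y,
    y1      = c(NA, y[-n]),                   # arrivals one month earlier
    y12     = c(rep(NA, 12), y[1:(n - 12)]),  # arrivals twelve months earlier
    x2      = rnorm(n),                       # placeholder Trends index
    x3      = rnorm(n),                       # placeholder Trends index
    Beijing = as.integer(months >= as.Date("2008-07-01"))
  )
}))
panel$Country <- factor(panel$Country)

# Country-specific slopes on log(y12) and on the Beijing dummy (fixed effects).
fit <- lm(log(y) ~ log(y1) + Country * log(y12) + x2 + x3 + Country * Beijing,
          data = panel)

aov_tab <- anova(fit)   # sequential (Type I) sums of squares, one row per term
aov_tab
summary(fit)$adj.r.squared
```

Rows with a missing 12-month lag are dropped by `lm`, leaving 44 months for each of the 9 countries. With 30 estimated coefficients this gives 366 residual degrees of freedom, the same count reported in Table 2.2.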

                     Term                Df    Sum Sq   Mean Sq      F value      Pr(>F)
                     log(y1)              1    234.07    234.07    29,220.86   < 2.2e-16   ***
                     Country              8      5.82      0.73        90.74   < 2.2e-16   ***
                     log(y12)             1      9.02      9.02     1,126.49   < 2.2e-16   ***
                     x^(2)_{i,t}          1      0.44      0.44        54.34    1.13E-12   ***
                     x^(3)_{i,t}          1      0.03      0.03         3.87    0.049813   *
                     Beijing              1      0.41      0.41        51.23    4.53E-12   ***
                     Country:log(y12)     8      0.23      0.03         3.59    0.000504   ***
                     Country:Beijing      8      0.14      0.02         2.12    0.033388   *
                     Residuals          366      2.93      0.01

                                   Table 2.2: Estimates from Model (2.6)
                     Note: Signif. codes: '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.10


[Figure: nine panels (USA, Canada, Britain, Germany, France, Italy, Australia, Japan, India); horizontal axis 2005 to 2008; legend: Actual, Fitted]

                                   Figure 2.9: Visitors Statistics and Fitted by Country

Note: The black line depicts the actual visitor arrival statistics and the red line depicts the fitted visitor
arrival statistics by country.

Chapter 3

Conclusion

We have found that simple seasonal AR models and fixed-effects models that include relevant Google
Trends variables tend to outperform models that exclude these predictors. In some cases, the gain is only
a few percent, but in others it can be quite substantial, as with the 18% improvement in the predictions
for ‘Motor Vehicles and Parts’ and the 12% improvement for ‘New Housing Starts’.
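
One plausible way to quantify such a gain is to compare one-step-ahead forecast errors from a baseline seasonal AR model and from the same model with a Trends predictor added. The sketch below does this on simulated data. The error metric (mean absolute error on the log scale), the length of the test window, and all variable names are assumptions made for illustration, not a description of how the percentages above were computed.

```r
set.seed(42)
n <- 60
trends <- rnorm(n)                                  # simulated Trends index
sales  <- numeric(n)
sales[1:12] <- exp(rnorm(12, mean = 5, sd = 0.1))   # arbitrary first year
for (t in 13:n) {
  sales[t] <- exp(1 + 0.4 * log(sales[t - 1]) + 0.4 * log(sales[t - 12]) +
                  0.05 * trends[t] + rnorm(1, sd = 0.02))
}

d <- data.frame(
  sales  = sales,
  s1     = c(NA, sales[-n]),                   # sales one month earlier
  s12    = c(rep(NA, 12), sales[1:(n - 12)]),  # sales twelve months earlier
  trends = trends
)

# One-step-ahead forecasts: refit on months 1..(k-1), then predict month k.
test_idx <- 37:n
err_base <- err_trends <- numeric(length(test_idx))
for (j in seq_along(test_idx)) {
  k <- test_idx[j]
  train <- d[1:(k - 1), ]
  base  <- lm(log(sales) ~ log(s1) + log(s12), data = train)
  withx <- lm(log(sales) ~ log(s1) + log(s12) + trends, data = train)
  err_base[j]   <- abs(log(d$sales[k]) - predict(base,  newdata = d[k, ]))
  err_trends[j] <- abs(log(d$sales[k]) - predict(withx, newdata = d[k, ]))
}

mae_base    <- mean(err_base)
mae_trends  <- mean(err_trends)
improvement <- 100 * (mae_base - mae_trends) / mae_base   # percent reduction in MAE
improvement
```

Because the simulated sales depend on the simulated index by construction, the model with the Trends term should show the smaller error here. With real data the sign and size of the improvement are empirical questions.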

   One thing that we would like to investigate in future work is whether the Google Trends variables
are helpful in predicting “turning points” in the data. Simple autoregressive models do remarkably well
in extrapolating smooth trends; however, by their very nature, it is difficult for such models to describe
cases where the direction changes. Perhaps Google Trends data can help in such cases.
   Google Trends data is available at a state level for several countries. We have also had success with
forecasting various business metrics using state-level data.
   Currently Google Trends data is computed by a sampling method and varies somewhat from day to
day. This sampling error adds some additional noise to the data. As the product evolves, we expect to
see new features and more accurate estimation of the Trends query share indices.




Chapter 4

Appendix

4.1    R Code: Automotive sales example used in Section 1
##### Import Google Trends Data
## The file is expected to have a 'date' column and a weekly 'trends' column.
google = read.csv('googletrends.csv');
google$date = as.Date(google$date);

##### Sales Data
## The file is expected to have 'month' as its first column and a 'sales' column.
dat = read.csv("FordSales.csv");
dat$month = as.Date(dat$month);
##### Get ready for the forecasting: append an empty row for the month to predict;
dat = rbind(dat, dat[nrow(dat), ]);
dat[nrow(dat), 'month'] = as.Date('2008-09-01');
dat[nrow(dat), -1] = rep(NA, ncol(dat)-1);

##### Define Predictors - Time Lags;
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)]);
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)]);

##### Plot Sales & Google Trends data;
par(mfrow=c(2,1));
plot(sales ~ month, data= dat, lwd=2, type='l', main='Ford Sales',
  ylab='Sales', xlab='Time');
plot(trends ~ date, data= google, lwd=2, type='l', main='Google Trends: Ford',
  ylab='Percentage Change', xlab='Time');

##### Merge Sales Data w/ Google Trends Data
## Map each weekly date to the first day of its month, then join on 'month'.
google$month = as.Date(paste(substr(google$date, 1, 7), '01', sep='-'))
dat = merge(dat, google);

##### Define Predictor - Google Trends
##    t.lag defines the time lag between the research and purchase.
##      t.lag = 0 if you want to include last week of the previous month and
##              1st-2nd week of the corresponding month
##      t.lag = 1 if you want to include 1st-3rd week of the corresponding month
t.lag = 1;
## id marks the last weekly row of each month; id + 1 is the first row of the next month.
id = which(dat$month[-1] != dat$month[-nrow(dat)]);
mdat = dat[id + 1, c('month', 'sales', 's1', 's12')];

mdat$trends1 = dat$trends[id + t.lag];
mdat$trends2 = dat$trends[id + t.lag + 1];
mdat$trends3 = dat$trends[id + t.lag + 2];

##### Divide data by two parts - model fitting & prediction
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]

##### Exploratory Data Analysis
## Testing Autocorrelation & Seasonality
acf(log(dat1$sales));
Box.test(log(dat1$sales), type="Ljung-Box")
## Testing Correlation
plot(y = log(dat1$sales), x = dat1$trends1, main='', pch=19,
  ylab='log(Sales)', xlab= 'Google Trends - 1st week')
abline(lm(log(dat1$sales) ~ dat1$trends1), lwd=2, col=2)
cor.test(y = log(dat1$sales), x = dat1$trends1)
cor.test(y = log(dat1$sales), x = dat1$trends2)
cor.test(y = log(dat1$sales), x = dat1$trends3)

##### Fit Model;
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
summary(fit)

##### Diagnostic Plot
par(mfrow=c(2,2));
plot(fit)

#### Prediction for the next month;
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE);
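
The prediction above is on the log scale. A short sketch of how it might be turned into a sales figure with an interval follows. The data are simulated stand-ins for `dat1` and `dat2`, the 95% level is an arbitrary choice, and this step is not part of the original listing.

```r
# Simulated stand-ins for the fitting data (dat1) and the month to predict (dat2).
set.seed(7)
n <- 48
sim <- data.frame(s1 = exp(rnorm(n, 5, 0.2)),
                  s12 = exp(rnorm(n, 5, 0.2)),
                  trends1 = rnorm(n))
sim$sales <- exp(1 + 0.4 * log(sim$s1) + 0.4 * log(sim$s12) +
                 0.05 * sim$trends1 + rnorm(n, sd = 0.05))
dat1 <- sim[-n, ]
dat2 <- sim[n, c("s1", "s12", "trends1")]

fit <- lm(log(sales) ~ log(s1) + log(s12) + trends1, data = dat1)

# Predict log(sales) with a 95% prediction interval, then exponentiate.
pred_log   <- predict(fit, newdata = dat2, interval = "prediction", level = 0.95)
pred_sales <- exp(pred_log)   # columns fit, lwr, upr on the original sales scale
pred_sales
```

Exponentiating a log-scale prediction gives an estimate of the median rather than the mean of sales. The interval endpoints transform directly because `exp` is monotone.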

 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Último (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Predicting the Present with Google Trends

  • 1. Predicting the Present with Google Trends. Hyunyoung Choi, Hal Varian. © Google Inc. Draft Date: April 2, 2009
  • 2. Contents
    1 Methodology
      1.1 Data
      1.2 Model
    2 Examples
      2.1 Retail Sales
      2.2 Automotive Sales
      2.3 Home Sales
      2.4 Travel
    3 Conclusion
    4 Appendix
      4.1 R Code: Automotive sales example used in Section 1
  • 3. Motivation

    Can Google queries help predict economic activity?

    Economists, investors, and journalists avidly follow monthly government data releases on economic conditions. However, these reports are only available with a lag: the data for a given month are generally released about halfway through the next month, and are typically revised several months later.

    Google Trends provides daily and weekly reports on the volume of queries related to various industries. We hypothesize that this query data may be correlated with the current level of economic activity in given industries and thus may be helpful in predicting the subsequent data releases.

    We are not claiming that Google Trends data help predict the future. Rather, we are claiming that Google Trends may help in predicting the present. For example, the volume of queries on a particular brand of automobile during the second week in June may be helpful in predicting the June sales report for that brand, when it is released in July.[i]

    Our goals in this report are to familiarize readers with Google Trends data, illustrate some simple forecasting methods that use this data, and encourage readers to undertake their own analyses. Certainly it is possible to build more sophisticated forecasting models than those we describe here. However, we believe that the models we describe can serve as baselines that help analysts get started with their own modelling efforts and that can subsequently be refined for specific applications.

    The target audience for this primer is readers with some background in econometrics or statistics. Our examples use R, a freely available open-source statistics package[ii]; we provide the R source code for the worked-out example in Section 1.2 in the Appendix.

    [i] It may also be true that June queries help to predict July sales, but we leave that question for future research.
    [ii] http://CRAN.R-project.org
  • 4. Chapter 1 Methodology

    Here we provide an overview of the data and statistical methods we use, along with a worked-out example.

    1.1 Data

    Google Trends provides an index of the volume of Google queries by geographic location and category. Google Trends data does not report the raw level of queries for a given search term. Rather, it reports a query index. The query index starts with the query share: the total query volume for a search term in a given geographic region divided by the total number of queries in that region at a point in time. The query share numbers are then normalized so that they start at 0 on January 1, 2004. Numbers at later dates indicate the percentage deviation from the query share on January 1, 2004.

    This query index data is available at country and state level for the United States and several other countries. There are two front ends for Google Trends data, but the most useful for our purposes is http://www.google.com/insights/search which allows the user to download the query index data as a CSV file.

    Figure 1.1 depicts an example from Google Trends for the query [coupon]. Note that the search share for [coupon] increases during the holiday shopping season and the summer vacation season. There has been a small increase in the query index for [coupon] over time and a significant increase in 2008, which is likely due to the economic downturn.

    Google classifies search queries into 27 categories at the top level and 241 categories at the second level using an automated classification engine. Queries are assigned to particular categories using natural language processing methods. For example, the query [car tire] would be assigned to the category Vehicle Tires, which is a subcategory of Auto Parts, which is a subcategory of Automotive.
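The normalization described in Section 1.1 can be sketched in a few lines of R. This is only an illustration of the definition given in the text, not Google's actual computation: the weekly counts and the variable names (`term_queries`, `total_queries`) are made up, and the first observation stands in for the January 1, 2004 base period.

```r
# Hypothetical weekly counts: queries for one term and all queries in a region.
term_queries  <- c(120, 150, 90, 180)
total_queries <- c(10000, 10000, 9000, 12000)

# Query share: term volume divided by total volume in the same region and week.
query_share <- term_queries / total_queries

# Query index: percentage deviation from the share in the base period,
# so the base period itself is 0 by construction.
query_index <- 100 * (query_share / query_share[1] - 1)
print(round(query_index, 1))
```

Because the index is a ratio of shares, a rise in overall search traffic does not move it by itself; only a change in the term's share of queries does.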
  • 5. [Figure 1.1: Google Trends by Keywords - Coupon]

    1.2 Model

    In this section, we will discuss the relevant statistical background and walk through a simple example. Our statistical model is implemented in R and the code may be found in the Appendix. The example is based on Ford’s monthly sales from January 2004 to August 2008 as reported by Automotive News. Google Trends data for the category Automotive/Vehicle Brands/Ford is used for the query index data.

    Denote Ford sales in the $t$-th month as $\{y_t : t = 1, 2, \cdots, T\}$ and the Google Trends index in the $k$-th week of the $t$-th month as $\{x_t^{(k)} : t = 1, 2, \cdots, T;\ k = 1, \cdots, 4\}$.

    The first step in our analysis is to plot the data in order to look for seasonality and structural trends. Figure 1.2 shows a declining trend and strong seasonality in both Ford sales and the Ford query index.

    We start with a simple baseline forecasting model: sales this month are predicted using sales last month and 12 months ago.

    Model 0: $\log(y_t) \sim \log(y_{t-1}) + \log(y_{t-12}) + e_t$    (1.1)

    The variable $e_t$ is an error term. This type of model is known in the literature as a seasonal autoregressive model or a seasonal AR model.

    We next add the query index for ‘Ford’ during the first week of each month to this model. Denoting this variable by $x_t^{(1)}$, we have

    Model 1: $\log(y_t) \sim \log(y_{t-1}) + \log(y_{t-12}) + x_t^{(1)} + e_t$    (1.2)

    The least squares estimates for this model are shown in equation (1.3). The positive coefficient on the Google Trends variable indicates that the search volume index is positively associated with Ford sales.
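As a rough illustration of how Model 0 and Model 1 can be fitted with least squares, the sketch below uses simulated data rather than the Ford series. The data-generating numbers and the column names `sales`, `x1`, `s1` and `s12` are assumptions chosen for the example (the lag names mirror those in the Appendix code); only base R is used.

```r
set.seed(1)
n  <- 60                                  # months of simulated data
x1 <- rnorm(n, mean = -20, sd = 8)        # stand-in for the week-1 query index
season    <- rep(0.15 * sin(2 * pi * (1:12) / 12), length.out = n)
log_sales <- 12 + season + 0.006 * x1 + rnorm(n, sd = 0.05)
dat <- data.frame(sales = exp(log_sales), x1 = x1)

# Lagged predictors: sales one month ago and twelve months ago.
dat$s1  <- c(NA, dat$sales[1:(n - 1)])
dat$s12 <- c(rep(NA, 12), dat$sales[1:(n - 12)])

# Model 0 is the seasonal AR baseline; Model 1 adds the query index.
# lm() drops the first 12 rows, where the 12-month lag is missing.
fit0 <- lm(log(sales) ~ log(s1) + log(s12), data = dat)
fit1 <- lm(log(sales) ~ log(s1) + log(s12) + x1, data = dat)
print(coef(fit1))
```

Comparing `summary(fit0)` with `summary(fit1)` shows how much in-sample fit the extra regressor buys; out-of-sample accuracy has to be checked separately.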
  • 6. [Figure 1.2: Ford Monthly Sales and Ford Query Index. (a) Ford Monthly Sales: Sales vs. Time. (b) Ford Google Trends: Percentage Change vs. Time.]

    Figure 1.3 depicts the four standard regression diagnostic plots from R. Note that observation 18 (July 2005) is an outlier in each plot. We investigated this date and discovered that there was a special promotion event during July 2005 called an ‘employee pricing promotion.’ We added a dummy variable to control for this observation and re-estimated the model. The results are shown in (1.4).

    $\log(y_t) = 2.312 + 0.114 \cdot \log(y_{t-1}) + 0.709 \cdot \log(y_{t-12}) + 0.006 \cdot x_t^{(1)}$    (1.3)

    $\log(y_t) = 2.007 + 0.105 \cdot \log(y_{t-1}) + 0.737 \cdot \log(y_{t-12}) + 0.005 \cdot x_t^{(1)} + 0.324 \cdot I(\text{July 2005})$    (1.4)

    Both models give us consistent results and the coefficients in common are similar. The 32.4% increase in sales in July 2005 seems to be due to the employee pricing promotion. The coefficient on the Google Trends variable in (1.4) implies that a 1% increase in search volume is associated with roughly a 0.5% increase in sales.

    Does the Google Trends data help with prediction? To answer this question we make a series of one-month-ahead predictions and compute the prediction error defined in Equation (1.5). The average of the absolute values of the prediction errors is known as the mean absolute error (MAE). Each forecast uses only the information available up to the time the forecast is made, which is one week into the month in question.
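Because the dependent variable in (1.4) is in logs, the coefficients are changes in log sales. The short calculation below converts the two coefficients discussed above into percentage effects. It assumes that a one-unit change in $x_t^{(1)}$ is a one-percentage-point change in the query index; the percentages quoted in the text are the usual log-point approximations.

```latex
\begin{align*}
\text{Query index: } & e^{0.005} - 1 \approx 0.0050 \quad (\text{about } 0.5\%) \\
\text{July 2005 dummy: } & e^{0.324} - 1 \approx 0.383 \quad (\text{about } 38\%,\ \text{versus } 32.4\% \text{ in log points})
\end{align*}
```

For small coefficients the two readings agree closely; for the promotion dummy the exact effect is somewhat larger than the log-point figure.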
  • 7. [Figure 1.3: Diagnostic Plots for the regression model. Panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage.]

    $PE_t = \log(\hat{y}_t) - \log(y_t) \approx \frac{\hat{y}_t - y_t}{y_t}$    (1.5)

    $MAE = \frac{1}{T} \sum_{t=1}^{T} |PE_t|$

    Note that the model that includes the Google Trends query index has smaller absolute errors in most months, and its mean absolute error over the entire forecast period is about 3 percent smaller (Figure 1.4). Since July 2008, both models tend to overpredict sales, but Model 0 tends to overpredict by more. It appears that the query index helps capture the fact that consumer interest in automotive purchases has declined during this period.
  • 8. [Figure 1.4: Prediction Error Plot. Prediction error in percent by month, October 2007 to September 2008. Model 0 (MAE = 9.49%); Model 1 (MAE = 9.22%).]
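The rolling one-month-ahead evaluation behind Equation (1.5) and the MAE can be sketched as follows. This is an illustrative version on simulated data, not the code that produced Figure 1.4: the helper `rolling_pe`, the simulated series and the choice of the last 12 months as the forecast period are all assumptions made for the example.

```r
set.seed(2)
n  <- 72
x1 <- rnorm(n, mean = -20, sd = 8)        # stand-in for the week-1 query index
season <- rep(0.15 * sin(2 * pi * (1:12) / 12), length.out = n)
dat <- data.frame(
  sales = exp(12 + season + 0.006 * x1 + rnorm(n, sd = 0.05)),
  x1    = x1
)
dat$s1  <- c(NA, dat$sales[1:(n - 1)])
dat$s12 <- c(rep(NA, 12), dat$sales[1:(n - 12)])

# Hypothetical helper: for each test month, refit on all earlier months
# and predict that month, so no future information enters the forecast.
rolling_pe <- function(fml, data, test_idx) {
  sapply(test_idx, function(i) {
    fit  <- lm(fml, data = data[1:(i - 1), ])
    pred <- predict(fit, newdata = data[i, ])
    unname(pred - log(data$sales[i]))     # PE_t = log(yhat_t) - log(y_t)
  })
}

test_idx <- (n - 11):n                    # last 12 months as the forecast period
pe0 <- rolling_pe(log(sales) ~ log(s1) + log(s12), dat, test_idx)
pe1 <- rolling_pe(log(sales) ~ log(s1) + log(s12) + x1, dat, test_idx)

# Mean absolute error of each model, reported in percent.
mae <- c(model0 = mean(abs(pe0)), model1 = mean(abs(pe1)))
print(round(100 * mae, 2))
```

Refitting inside the loop is what keeps each forecast honest: the model for a given month never sees that month's sales.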
  • 9. Chapter 2 Examples

    2.1 Retail Sales

    The US Census Bureau releases the Advance Monthly Retail Sales survey 1-2 weeks after the close of each month. These figures are based on a mail survey from a number of retail establishments and are thought to be useful leading indicators of macroeconomic performance. The data are subsequently revised at least two times; see ‘About the survey’[i] for the description of the procedures followed in constructing these numbers.

    The retail sales data is organized according to the NAICS retail trade categories.[ii] The data is reported in both seasonally adjusted and unadjusted form; for the analysis in this section, we use only the unadjusted data.

    NAICS Sectors                                          Google Categories
    ID   Title                                             ID      Title
    441  Motor vehicle and parts dealers                   47      Automotive
    442  Furniture and home furnishings stores             11      Home & Garden
    443  Electronics and appliance stores                  5       Computers & Electronics
    444  Building mat., garden equip. & supplies dealers   12-48   Construction & Maintenance
    445  Food and beverage stores                          71      Food & Drink
    446  Health and personal care stores                   45      Health
    447  Gasoline stations                                 12-233  Energy & Utilities
    448  Clothing and clothing access. stores              18-68   Apparel
    451  Sporting goods, hobby, book, and music stores     20-263  Sporting Goods
    452  General merchandise stores                        18-73   Mass Merchants & Department Stores
    453  Miscellaneous store retailers                     18      Shopping
    454  Nonstore retailers                                18-531  Shopping Portals & Search Engines
    722  Food services and drinking places                 71      Food & Drink

    Table 2.1: Sectors in Retail Sales Survey

    [i] http://www.census.gov/marts/www/marts.html
    [ii] http://www.census.gov/epcd/naics02/def/NDEF44.HTM
  • 10. As we indicated in the introduction, Google Trends provides a weekly time series of the volume of Google queries by category. It is straightforward to match these categories to NAICS categories. Table 2.1 presents top-level NAICS categories and the associated subcategories in Google Trends. We used these subcategories to predict the retail sales release one month ahead.

    [Figure 2.1: Sales on ‘Motor vehicle and parts dealers’ from Census and corresponding Google Trends data. Panels: 273: Motorcycles; 467: Auto Insurance; 610: Trucks & SUVs. Series: Census, Week One, Week Two.]

    Under the Automotive category, there are fourteen subcategories of the query index. From these fourteen categories, the four most relevant subcategories are plotted against Census data for sales on ‘Motor Vehicles and Parts’ (Figure 2.1).

    We fit models from Section 1.2 to the data using 1 and 2 weeks of Google Trends data.
  • 11. The notation $x_{610,t}^{(1)}$ refers to Google category 610 in week 1 of month $t$. Our estimated models are

    Model 0: $\log(y_t) = 1.158 + 0.269 \cdot \log(y_{t-1}) + 0.628 \cdot \log(y_{t-12})$, $e_t \sim N(0, 0.05^2)$    (2.1)

    Model 1: $\log(y_t) = 1.513 + 0.216 \cdot \log(y_{t-1}) + 0.656 \cdot \log(y_{t-12}) + 0.007 \cdot x_{610,t}^{(1)}$, $e_t \sim N(0, 0.06^2)$

    Model 2: $\log(y_t) = 0.332 + 0.230 \cdot \log(y_{t-1}) + 0.748 \cdot \log(y_{t-12}) - 0.001 \cdot x_{273,t}^{(2)} + 0.002 \cdot x_{467,t}^{(1)} + 0.004 \cdot x_{610,t}^{(1)}$, $e_t \sim N(0, 0.05^2)$

    Note that the $R^2$ moves from 0.6206 (Model 0) to 0.7852 (Model 1) to 0.7696 (Model 2). The models show that the query index for ‘Trucks & SUVs’ exhibits positive association with reported sales on ‘Motor Vehicles and Parts’.

    [Figure 2.2: Prediction Error Plot. Prediction error in percent by month, October 2007 to September 2008. Model 0 (MAE = 7.02%); Model 1 (MAE = 5.96%); Model 2 (MAE = 5.73%).]

    Figure 2.2 illustrates that the mean absolute error of Model 1 is about 15% better than that of Model 0, while Model 2 is about 18% better in terms of this measure. However, all forecasts have been overly optimistic since early 2008.
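The 15% and 18% figures are relative reductions in MAE with Model 0 as the baseline. Using the MAE values reported in Figure 2.2, the arithmetic works out as:

```latex
\begin{align*}
\text{Model 1 vs. Model 0: } & \frac{7.02 - 5.96}{7.02} = \frac{1.06}{7.02} \approx 0.151 \quad (\text{about } 15\%) \\
\text{Model 2 vs. Model 0: } & \frac{7.02 - 5.73}{7.02} = \frac{1.29}{7.02} \approx 0.184 \quad (\text{about } 18\%)
\end{align*}
```

In absolute terms the gains are roughly 1.1 and 1.3 percentage points of MAE.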
  • 12. 2.2 Automotive Sales

    In Section 2.1, we used Google Trends to predict retail sales in ‘Motor vehicle and parts dealers’. While automotive sales are an important indicator of economic activity, manufacturers are likely more interested in sales by make.

    In the Google Trends category Automotive/Vehicle Brands there are 31 subcategories which measure the relative search volume on various car makes. These can be easily matched to the 27 categories reported in the ‘US car and light-truck sales by make’ tables distributed by Automotive Monthly.

    We first estimated separate forecasting models for each of these 27 makes using essentially the same method described in Section 1.2. As we saw in that section, it is helpful to have data on sales promotions when pursuing this approach. Since we do not have such data, we tried an alternative fixed-effects modeling approach. That is, we assume that the short-term and seasonal lags are the same across all makes and that the differences in sales volume by make can be captured by an additive fixed effect.

    [Figure 2.3: Sales and Google Trends for Top 2 Makes, Chevrolet & Toyota. (a) Sales vs. Google Trends. (b) Actual & Fitted Sales. Rows: Chevrolet, Toyota; horizontal axis: Time, 2004 to 2008.]
  • 13. Denote the automotive sales from the $i$-th make and $t$-th month as $\{y_{i,t} : t = 1, 2, \cdots, T;\ i = 1, \cdots, N\}$ and the corresponding $i$-th Google Trends index as $\{x_{i,t}^{(k)} : t = 1, 2, \cdots, T;\ i = 1, \cdots, N;\ k = 1, 2, 3\}$. Considering the relatively longer research time associated with a car purchase, we used Google Trends from the second-to-last week of the previous month ($x_{i,t-1}^{(3)}$) to the first week of the month in question ($x_{i,t}^{(1)}$) as predictors.

    The estimates from Equation (2.2) indicate that the Google Trends index for a particular make in the last two weeks of last month is positively associated with current sales of that make.

    $\log(y_{i,t}) = 2.838 + 0.258 \cdot \log(y_{i,t-1}) + 0.448 \cdot \log(y_{i,t-12}) + \delta_i \cdot I(\text{Car Make})_i + 0.002 \cdot x_{i,t}^{(1)} + 0.003 \cdot x_{i,t}^{(2)} - 0.001 \cdot x_{i,t}^{(3)}$, $e_{i,t} \sim N(0, 0.13^2)$    (2.2)

    We can compare the fixed effects model to the separately estimated univariate models for each brand. In each of the separately estimated models we find a positive association with the relevant Google Trends index. Here are two examples.

    Chevrolet: $\log(y_{i,t}) = 7.367 + 0.439 \cdot \log(y_{i,t-12}) + 0.017 \cdot x_{i,t}^{(2)}$, $e_t \sim N(0, 0.114^2)$    (2.3)

    Toyota: $\log(y_{i,t}) = 4.124 + 0.655 \cdot \log(y_{i,t-12}) + 0.003 \cdot x_{i,t}^{(2)}$, $e_t \sim N(0, 0.093^2)$

    Model (1.1) is fitted to each brand and compared to Model (2.2) and Model (2.3). As before, we made rolling one-step-ahead predictions from 2007-10-01 to 2008-09-01. We found that Model (1.1) performed best for Chevrolet while Model (2.3) performed best for Toyota, as shown in Figure 2.4.

    One issue with the fixed effects model is that it imposes the same seasonal effects for each make. This may or may not be accurate. See, for example, the fit for Lexus in Figure 2.4. In December, Lexus has traditionally run an ad campaign suggesting that a new Lexus would be a welcome Christmas present. Hence we observe a strong seasonal spike in December Lexus sales which is not present with other makes.

    In this case, it makes sense to estimate a separate model for Lexus. Indeed, if we estimate a separate model for Lexus, we get an improved fit with the mean absolute error falling from X to Y.
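A fixed-effects specification of the kind in Equation (2.2) can be sketched in base R by pooling all makes and adding a factor for the make. The example below is a simplified illustration on simulated data with a single query-index regressor; the make labels, the simulated series and the column names are assumptions, not the data or code used for the estimates above.

```r
set.seed(3)
makes <- c("A", "B", "C")                 # hypothetical makes
n <- 48                                   # months per make

# Build a long panel: one block of rows per make, with its own sales level.
panel <- do.call(rbind, lapply(seq_along(makes), function(i) {
  x      <- rnorm(n, sd = 10)             # stand-in query index
  season <- rep(0.1 * cos(2 * pi * (1:12) / 12), length.out = n)
  sales  <- exp(10 + i + season + 0.003 * x + rnorm(n, sd = 0.1))
  data.frame(make = makes[i], month = 1:n, sales = sales, x = x,
             s1  = c(NA, sales[1:(n - 1)]),        # lags computed within make
             s12 = c(rep(NA, 12), sales[1:(n - 12)]))
}))

# Pooled model: common lag and query-index coefficients,
# plus one additive intercept shift per make (the fixed effect).
fe_fit <- lm(log(sales) ~ log(s1) + log(s12) + x + factor(make), data = panel)
print(round(coef(fe_fit), 3))
```

Note that the lags are built inside each make's block before the blocks are stacked; computing them on the stacked panel would leak one make's sales into the next make's first rows.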
  • 14. [Figure 2.4: Prediction Error Plot by Make - Chevrolet & Toyota. Panels: (a) Chevrolet; (b) Toyota. Prediction error in percent by month, September 2007 to August 2008. Legend values: Model (1.1) (MAE = 6.18%), Model (2.2) (MAE = 5.41%), Model (2.3) (MAE = 6.31%); Model (1.1) (MAE = 7.95%), Model (2.2) (MAE = 11.16%), Model (2.3) (MAE = 9.17%).]
  • 15. 2.3 Home Sales

    The US Census Bureau and the US Department of Housing and Urban Development release statistics on the housing market at the end of each month.[iii] The data includes figures on ‘New House Sold and For Sale’ by price and stage of construction. New House Sales peaked in 2005 and have been declining since then (Figure 2.5(a)). The price index peaked in early 2007 and has declined steadily for several months. Recently the price index fell sharply (Figure 2.5(b)).

    The Google Trends ‘Real Estate’ category has 6 subcategories (Figure 2.6): Real Estate Agencies (Google Category Id: 96), Rental Listings & Referrals (378), Property Management (425), Home Inspections & Appraisal (463), Home Insurance (465), and Home Financing (466). It turns out that the search index for Real Estate Agencies is the best predictor for contemporaneous house sales.

    [Figure 2.5: Number and Price of New House Sold. (a) Number of New House Sold: Not Seasonally Adjusted, Seasonally Adjusted. (b) Prices of New House Sold: Median Sales Price, Average Sales Price. Horizontal axis: 2003 to 2008.]

    We fit our model to seasonally adjusted sales figures, so we drop the 12-month lag used in our earlier model, leaving us with Equation (2.4).

    Model 0: $\log(y_t) \sim \log(y_{t-1}) + e_t$    (2.4)

    where $e_t$ is an error term. The model is fitted to the data and Equation (2.5) shows the estimates of the model.

    [iii] http://www.census.gov/const/www/newressalesindex.html
  • 16. [Figure 2.6: Time Series Plots: New House Sold vs. Subcategories of Google Trends. Panels: 96: Real Estate Agencies; 378: Rental Listings & Referrals; 425: Property Management; 463: Home Inspections & Appraisal; 465: Home Insurance; 466: Home Financing. Series: Google Trends, New House Sale. Horizontal axis: 2004 to 2008.]

    The model implies that (1) house sales at $(t-1)$ are positively related to house sales at $t$, (2) the search index on ‘Rental Listings & Referrals’ (378) is negatively related to sales, (3) the search index for ‘Real Estate Agencies’ (96) is positively related to sales, and (4) the average housing price is negatively associated with sales.

    Model 1: $\log(y_t) = 5.795 + 0.871 \cdot \log(y_{t-1}) - 0.005 \cdot x_{378,t}^{(1)} + 0.005 \cdot x_{96,t}^{(2)} - 0.391 \cdot \text{Avg Price}_t$    (2.5)

    The one-step-ahead prediction errors are shown in Figure 2.7(b). The mean absolute error is about 12% less for the model that includes the Google Trends variables.
  • 17. [Figure 2.7: New One Family House Sales - Fit and Prediction. (a) Seasonally Adjusted Annual Sales Rate vs. Fitted: Seasonally Adjusted Annual Sales Rate in 1,000 vs. Time, 2004 to 2008. (b) 1 Step ahead Prediction Error: prediction error in percent by month, August 2007 to July 2008. Model 0 (MAE = 6.91%); Model 1 (MAE = 6.08%).]
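The improvement of about 12% quoted for the housing example is the relative reduction in MAE, computed from the two values reported in Figure 2.7(b):

```latex
\[
\frac{6.91 - 6.08}{6.91} = \frac{0.83}{6.91} \approx 0.120 \quad (\text{about } 12\%)
\]
```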
  • 18. 2.4 Travel

    The internet is commonly used for travel planning, which suggests that Google Trends data about destinations may be useful in predicting visits to that destination. We illustrate this using data from the Hong Kong Tourism Board.[iv]

    [Figure 2.8: Visitors Statistics and Google Trends by Country. Panels: USA, Canada, Britain, Germany, France, Italy, Australia, Japan, India. Horizontal axis: 2004 to 2008. Note: The black line depicts visitor arrival statistics and the red line depicts the Google Trends index by country.]

    The Hong Kong Tourism Board publishes monthly visitor arrival statistics, including ‘Monthly visitor arrival summary’ by country/territory of residence, the mode of transportation, the mode of entry and other criteria. We use visitor arrival statistics by country from January 2004 to August 2008 for this analysis.

    [iv] http://partnernet.hktourismboard.com
  • 19. The foreign exchange rate, defined as HKD/Domestic currency, is used as another predictor for visitor volume. The Beijing Olympics were held from 2008-08-08 to 2008-08-24 and the traffic in July 2008 and August 2008 is expected to be lower than usual, so we use a dummy variable to adjust for the traffic difference during those periods.

    ‘Hong Kong’ is one of the subcategories under Vacation Destinations in Google Trends. The countries of origin in our analysis are USA, Canada, Great Britain, Germany, France, Italy, Australia, Japan and India. The visitors from these 9 countries are around 19% of total visitors to Hong Kong during the period we examine. The visitor arrival statistics from all countries show seasonality and an increasing trend over time, but the trend growth rates differ by country (Figure 2.8).

    Here we examine a fixed effects model. In Equation (2.6), ‘Country$_i$’ is a dummy variable to indicate each country, and its interaction with $\log(y_{i,t-12})$ captures the different year-to-year growth rates. ‘Beijing’ is another dummy variable to indicate the Beijing Olympics period.

    $\log(y_{i,t}) = 2.412 + 0.059 \cdot \log(y_{i,t-1}) + \beta_{i,12} \cdot \log(y_{i,t-12}) \times \text{Country}_i + \delta_i \cdot \text{Beijing} \times \text{Country}_i + 0.001 \cdot x_{i,t}^{(2)} + 0.001 \cdot x_{i,t}^{(3)} + e_{i,t}$, $e_{i,t} \sim N(0, 0.09^2)$    (2.6)

    From Equation (2.6), we learn that (1) arrivals last month are positively related to arrivals this month, (2) arrivals 12 months ago are positively related to arrivals this month, (3) Google searches on ‘Hong Kong’ are positively related to arrivals, and (4) during the Beijing Olympics, travel to Hong Kong decreased.

    Table 2.2 is an Analysis of Variance table from Model (2.6). It shows that most of the variance is explained by the lagged arrivals and that the contribution from the Google Trends variables is statistically significant. Figure 2.9 shows the actual arrival statistics and fitted values. The model fits remarkably well, with adjusted $R^2$ equal to 0.9875.
  • 20.
    Term               Df    Sum Sq   Mean Sq   F value     Pr(>F)
    log(y1)            1     234.07   234.07    29,220.86   < 2.2e-16   ***
    Country            8     5.82     0.73      90.74       < 2.2e-16   ***
    log(y12)           1     9.02     9.02      1,126.49    < 2.2e-16   ***
    x_{i,t}^{(2)}      1     0.44     0.44      54.34       1.13E-12    ***
    x_{i,t}^{(3)}      1     0.03     0.03      3.87        0.049813    *
    Beijing            1     0.41     0.41      51.23       4.53E-12    ***
    Country:log(y12)   8     0.23     0.03      3.59        0.000504    ***
    Country:Beijing    8     0.14     0.02      2.12        0.033388    *
    Residuals          366   2.93     0.01

    Table 2.2: Estimates from Model (2.6)
    Note: Signif. codes: ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.10

    [Figure 2.9: Visitors Statistics and Fitted by Country. Panels: USA, Canada, Britain, Germany, France, Italy, Australia, Japan, India. Series: Actual, Fitted. Horizontal axis: 2005 to 2008. Note: The black line depicts the actual visitor arrival statistics and the red line depicts fitted visitor arrival statistics by country.]
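An analysis-of-variance table with the same row structure as Table 2.2 can be produced in base R with `lm()` and `anova()`. The sketch below is illustrative only: it uses three simulated countries instead of nine, a single query-index regressor, and made-up effect sizes, and the column names (`y1`, `y12`, `beijing`) are assumptions.

```r
set.seed(4)
countries <- c("P", "Q", "R")             # hypothetical countries of origin
n <- 56                                   # months per country

visits <- do.call(rbind, lapply(seq_along(countries), function(i) {
  x       <- rnorm(n, sd = 10)            # stand-in query index
  beijing <- as.integer(1:n %in% c(55, 56))   # hypothetical event months
  season  <- rep(0.1 * sin(2 * pi * (1:12) / 12), length.out = n)
  y <- exp(9 + 0.3 * i + season + 0.002 * x - 0.1 * beijing +
             0.002 * i * (1:n) + rnorm(n, sd = 0.08))
  data.frame(country = countries[i], y = y, x = x, beijing = beijing,
             y1  = c(NA, y[1:(n - 1)]),
             y12 = c(rep(NA, 12), y[1:(n - 12)]))
}))

# Country fixed effects, plus country-specific slopes on the 12-month lag
# and country-specific event effects via interaction terms.
fit <- lm(log(y) ~ log(y1) + country + log(y12) + x + beijing +
            country:log(y12) + country:beijing, data = visits)

# Sequential (Type I) ANOVA: each row is the variance explained by a term
# after the terms listed above it have been accounted for.
tab <- anova(fit)
print(tab)
```

Because `anova()` on a single `lm` fit is sequential, the sums of squares depend on the order of terms in the formula; the lagged arrivals entered first will absorb most of the variance, as in Table 2.2.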
  • 21. Chapter 3 Conclusion We have found that simple seasonal AR models and fixed-effects models that include relevant Google Trends variables tend to outperform models that exclude these predictors. In some cases the gain is only a few percent, but in others it can be quite substantial, as with the 18% improvement in the predictions for ‘Motor Vehicles and Parts’ and the 12% improvement for ‘New Housing Starts’. One thing that we would like to investigate in future work is whether the Google Trends variables are helpful in predicting “turning points” in the data. Simple autoregressive models do remarkably well at extrapolating smooth trends; however, by their very nature, it is difficult for such models to describe cases where the direction changes. Perhaps Google Trends data can help in such cases. Google Trends data is available at the state level for several countries, and we have also had success forecasting various business metrics using state-level data. Currently, Google Trends data is computed by a sampling method and varies somewhat from day to day. This sampling error adds some noise to the data. As the product evolves, we expect to see new features and more accurate estimation of the Trends query share indices.
  • 22. Chapter 4 Appendix 4.1 R Code: Automotive sales example used in Section 1

      ##### Import Google Trends Data
      google = read.csv('googletrends.csv');
      google$date = as.Date(google$date);

      ##### Sales Data
      dat = read.csv("FordSales.csv");
      dat$month = as.Date(dat$month);

      ##### get ready for the forecasting;
      dat = rbind(dat, dat[nrow(dat), ]);
      dat[nrow(dat), 'month'] = as.Date('2008-09-01');
      dat[nrow(dat), -1] = rep(NA, ncol(dat)-1);

      ##### Define Predictors - Time Lags;
      dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)]);
      dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)]);

      ##### Plot Sales & Google Trends data;
      par(mfrow=c(2,1));
      plot(sales ~ month, data= dat, lwd=2, type='l', main='Ford Sales',
           ylab='Sales', xlab='Time');
      plot(trends ~ date, data= google, lwd=2, type='l', main='Google Trends: Ford',
           ylab='Percentage Change', xlab='Time');

      ##### Merge Sales Data w/ Google Trends Data
      google$month = as.Date(paste(substr(google$date, 1, 7), '01', sep='-'))
      dat = merge(dat, google);

      ##### Define Predictor - Google Trends
      ## t.lag defines the time lag between the research and purchase.
      ## t.lag = 0 if you want to include last week of the previous month and
      ##           1st-2nd week of the corresponding month
      ## t.lag = 1 if you want to include 1st-3rd week of the corresponding month
      t.lag = 1;
      id = which(dat$month[-1] != dat$month[-nrow(dat)]);
      mdat = dat[id + 1, c('month', 'sales', 's1', 's12')];
  • 23.
      mdat$trends1 = dat$trends[id + t.lag];
      mdat$trends2 = dat$trends[id + t.lag + 1];
      mdat$trends3 = dat$trends[id + t.lag + 2];

      ##### Divide data by two parts - model fitting & prediction
      dat1 = mdat[1:(nrow(mdat)-1), ]
      dat2 = mdat[nrow(mdat), ]

      ##### Exploratory Data Analysis
      ## Testing Autocorrelation & Seasonality
      acf(log(dat1$sales));
      Box.test(log(dat1$sales), type="Ljung-Box")

      ## Testing Correlation
      plot(y = log(dat1$sales), x = dat1$trends1, main='', pch=19,
           ylab='log(Sales)', xlab= 'Google Trends - 1st week')
      abline(lm(log(dat1$sales) ~ dat1$trends1), lwd=2, col=2)
      cor.test(y = log(dat1$sales), x = dat1$trends1)
      cor.test(y = log(dat1$sales), x = dat1$trends2)
      cor.test(y = log(dat1$sales), x = dat1$trends3)

      ##### Fit Model;
      fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
      summary(fit)

      ##### Diagnostic Plot
      par(mfrow=c(2,2));
      plot(fit)

      #### Prediction for the next month;
      predict.fit = predict(fit, newdata=dat2, se.fit=TRUE);
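      The listing ends with a prediction on the log scale. One possible next step, sketched here, is to convert it back to sales units together with a prediction interval. This sketch swaps the `se.fit=TRUE` call for `interval='prediction'` and runs on synthetic data so that it is self-contained; the simulated series, the 95% level and the name `forecast` are assumptions, not part of the original code.

```r
## Illustrative sketch only: synthetic monthly sales, hypothetical values.
set.seed(2)
n = 57
sales = exp(12 + 0.1 * sin(2 * pi * (1:n) / 12) + rnorm(n, 0, 0.05))
mdat = data.frame(sales=sales,
                  s1=c(NA, sales[-n]),                  # sales 1 month ago
                  s12=c(rep(NA, 12), sales[1:(n - 12)]),  # sales 12 months ago
                  trends1=rnorm(n, 0, 5))               # stand-in for Google Trends
mdat$sales[n] = NA                                      # last month is to be predicted

## Same split and model form as in the appendix listing
dat1 = mdat[1:(nrow(mdat) - 1), ]
dat2 = mdat[nrow(mdat), ]
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1)

## Prediction and 95% prediction interval on the log scale,
## returned as a matrix with columns fit, lwr and upr
pred = predict(fit, newdata=dat2, interval='prediction', level=0.95)

## Back-transform to the original sales scale
forecast = exp(pred)
print(forecast)
```

      Exponentiating the log-scale point prediction gives an estimate of the median of sales rather than the mean when the errors are normal on the log scale, so a bias correction may be worth considering if a mean forecast is needed.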