
Forget Me Not: Maximizing Memorability for Automatic Video Summarization

Nicholas Pai
Adviser: Jianxiong Xiao
Princeton University
npai@princeton.edu

Abstract

Rapid growth in video data consumption and production has necessitated the continued development of video summarization techniques. However, no existing video summarization technique accounts for a summary's memorability, a video attribute that is independent of interestingness and aesthetics (the two other commonly emphasized attributes in summarization techniques). I therefore propose the novel method of summarizing a video by maximizing its memorability. My main motivation is that highly memorable summaries are particularly well-suited to certain applications, and existing summarization techniques cannot address these applications because they do not emphasize memorability as a key attribute of their summaries. To calculate a video's memorability, I employ the predictive power of a deep network, MemNet, trained to predict real-image memorability at near-human consistency. Experiments show that I am consistently able to increase a summary's memorability over the original video by using a "buckets"-based algorithm for determining which video frames to extract. I also discuss the merits of alternative extraction algorithms, which are not as successful quantitatively but are better suited to certain use cases.

1. Introduction

In this modern age, two interrelated trends in technology have produced the need for automated video summarization. First, the rapid growth in ownership of mobile phones and other consumer-oriented camera devices has directly contributed to tremendous growth in user-generated video data [2]. Second, the average user is consuming significantly more video than even in the recent past, in large part because of the omnipresence of screens (laptop, mobile phone, smart watch, tablet, etc.) available to the user. For example, YouTube users now upload over 400 hours of video to the site every minute, and more than half of YouTube views come from mobile devices, up 100% year over year [22, 24].

Although modern consumers of video data stand to benefit greatly from the concurrence of these two trends, they also face several key challenges. The growing importance of video summarization is grounded in its ability to address three of these challenges for the user: quick browsing (finding the relevant video), quick retrieval (finding the relevant information within a video), and efficient storage. To better illustrate quick retrieval, imagine viewing home surveillance videos. These videos are usually long and contain few events of action, but those rare events are often the most important for the viewer to see. An effective video summary of this surveillance footage would allow the user to retrieve these important events efficiently.

Video summarization is a family of techniques aimed at helping the user understand videos at a faster pace than normal and at facilitating browsing of a video database [9]. Thus, it intrinsically addresses the aforementioned challenges. The only requirement of video summarization is that it must encapsulate the most informative parts of a video. There is a substantial amount of research on automatic video summarization, but a consistent evaluation framework is seriously missing from the field [23].
This arises in part because, unlike other fields such as object classification, video summarization has no objective ground truth against which the correctness of a summary can be evaluated. Therefore, a wide variety of techniques exists within the field. The two main types of automatic video summaries are static and dynamic, which I discuss further in Section 2.1.1. Furthermore, as a result of the nonexistence of an "optimal" summary, prior approaches have emphasized different video attributes, such as interestingness, objects, web images as priors, or the extraction of low-level features, as a means of preserving the most important parts of the original video [8, 18, 13, 3].
I discuss these different approaches, each of which is well-suited to different applications, in Section 2.1.2.

Figure 1: Plots from [11]. In each plot, each dot corresponds to a single image. Human participants were asked to judge the interestingness, aesthetics, and memorability of 2,222 images.

The goal of my project is to develop a novel methodology for producing dynamic video summaries that attempts to maximize the memorability of each summary. Although I discuss research on defining and predicting memorability further in Section 2.2, I will briefly describe my motivation for applying memorability to automatic summarization. Interestingly, an image's memorability can be represented as a single real-valued output [14]. In fact, there is substantial consistency in our ability to remember and forget the same images [10]. Therefore, in my project I employ image-memorability prediction to predict video memorability. Moreover, image memorability is distinct from an image's aesthetics and interestingness, the two other commonly used subjective image properties (Figure 1) [11, 7].

These two image properties are commonly extracted for video summarization, which led me to hypothesize that a video summary emphasizing memorability can uniquely address applications that previous summary techniques could not. For example, students could study more efficiently from educational video summaries that emphasize memorability: the ability to recall a video is especially important for educational purposes, for which video aesthetics and interestingness are less relevant. Another potential application is the movie industry, where movie trailers are essentially video summaries. A particularly memorable trailer might lead a viewer to share it with others and ultimately drive more demand to see the film. Viewed from another angle, a trailer that is nice to look at is practically useless if it is easily forgotten. The key idea underlying my project is therefore to contribute the unique approach of maximizing memorability to the field of automatic video summarization, for use in applications that other summarization techniques are less suited to serve.

2. Related Work

Within the context of related work on video summarization and memorability, the main contributions of my work are:

• A novel intuition for selecting maximally memorable clips from videos by using a Convolutional Neural Network (CNN) to predict memorability based solely on intrinsic visual features.

• A simple and widely applicable evaluation framework based on comparisons of memorability with both the input video and a baseline summarization.

2.1. Video Summarization

The sheer growth in the volume and accessibility of video data has necessitated mechanisms that give the user certain perspectives on a video document without watching the video in its entirety. The main challenges associated with video summarization (also referred to as "video abstraction") are that the task is difficult to define and evaluate, and that many methods are domain-specific (sports, news, user-generated content, etc.) [21]. Automated techniques for generating video abstracts embody different assumptions about what constitutes an optimal video summary. This project, for instance, assumes that an optimal video summary maximizes its memorability.
The next two sections provide a brief overview of the types of video summaries, techniques for generating them, and existing methods for evaluating their performance. I also contextualize my project within the existing literature.

2.1.1 Types of Summaries

There are two basic forms of video summaries: video indexing and video skimming [8, 23]. Video indexing is a technique that selects keyframes that best summarize a video [13, 18, 3, 5, 16]. The keyframe set R containing n frames is defined as follows:

    R = A_keyframe(V) = {f_1, f_2, ..., f_n}    (1)

where A_keyframe denotes the keyframe extraction procedure. Keyframes are extracted based on change detection [5] or on clustering of low-level features [3] and objects [18]. Other techniques use web priors to find important frames [13, 16]. A slide show is an example of a keyframe-based summary.

Video skimming replaces a video with a shorter compilation of its fragments, preserving motion information, whereas indexing does not [20, 8, 19].
This type of summarization consists of a collection of segments from the original video and is itself a video clip; the movie trailer is an example of a video skim. The video skim K containing n excerpts is defined as follows:

    K = A_skim(V) = {E_1, E_2, ..., E_n}    (2)

where A_skim denotes the skim-generation procedure and E_i ⊂ V is the i-th excerpt to be included in K. Video skim extraction occurs at the semantic level, often taking user annotations (e.g. object bounding boxes, labels) as input [19, 20]. Web priors can also be used to compile skims [13]. The key advantage of video skimming over video indexing is the preservation of motion information, which potentially enhances both the expressiveness and the information content of the summary. Furthermore, humans usually find it more entertaining and interesting to watch a skim than a slide show of frames. For these reasons, I decided to implement a video skimming technique. However, my method does not appeal to external web priors or user-supplied annotations. Instead, it acquires a skim using solely the image properties of the frames contained in the video, giving my algorithm more scalability and flexibility with regard to input videos.

2.1.2 Evaluating Summarization Techniques

Because there is currently no notion of an "optimal" video summary, nearly every work has its own evaluation method and rarely compares its performance with existing techniques. Manual evaluation of summaries, as in [20, 18], might be ideal for very small data sets, but it was outside the scope of this one-semester project; furthermore, even manual evaluations of summaries often disagree. The majority of other existing evaluation methods appeal to expert opinion or crowd-sourcing, but these methodologies do not scale well with the size of video data [13]. More recent efforts attempt to produce automatic frameworks for evaluating any summary, but these have key limitations for my project. Khosla et al. introduced a crowd-sourcing-based automatic evaluation framework, which mitigates the inconsistency between human judges but was designed specifically to evaluate keyframe rather than skimming techniques [13]. The automatic evaluation framework proposed by Gygli et al. is likely the best method existing today because it takes into account multiple "ground truth" summaries. However, these summaries are judged immediately after viewing, whereas the memorability of a summary, by definition, can only be evaluated after a significant amount of time has passed since viewing. The values of these "ground truth" summaries thus do not account for the ability to recall the summary, rendering them inapplicable to my project. There is currently no automatic evaluation framework in which memorability influences the value of the "ground truth" baseline, which represents an unfortunate research gap.

Truong and Venkatesh [23] provide an overview of existing methods for evaluating the performance of video summarization techniques and give practical recommendations toward a consistent evaluation framework. They conclude that the main focus of the evaluation process should be application-dependent, with which I agree. Although it is helpful to understand existing evaluation techniques, they are ultimately not relevant for my project because none of them uses memorability as an evaluation parameter.
Therefore, I will compare the overall memorability score of my summaries against (1) the original video and (2) a baseline summary created from uniform sampling.

2.2. Memorability

Figure 2: Histogram of memorability scores for cooking.mp4. 1289 frames, mean: 0.756698, standard deviation: 0.050337.

Humans are able to remember not only a large number of images at once but also a great amount of detail about them [1, 17]. In fact, computer vision works have been able to reliably predict memorability ranks of new images and faces based purely on intrinsic visual features [11, 15, 4]. At first glance, however, the task of successfully predicting human visual memory might seem out of reach for an artificial system, because, unlike visual classification, images are memorable or forgettable for many different reasons. It seems exceedingly difficult to design an algorithm that can cluster similarly memorable images together based on common visual features. For example, both an image of a bear making a funny face
and an abstract painting might be equally memorable for a viewer, but they are certainly not memorable for the same reasons. Thus, rather than attempting to design an artificial system that predicts memorability through complex algorithms, I build on prior work on a deep network trained to associate memorability scores with real images [14].

Figure 3: Histogram of memorability scores for base jumping.mp4. 4731 frames, mean: 0.644725, standard deviation: 0.067470.

Khosla et al.'s model, MemNet, is a Convolutional Neural Network (CNN) trained on LaMem, a dataset containing 60,000 images with memorability scores from human observers. LaMem represents the diversity of human visual experience, making MemNet highly applicable for predicting the memorability of non-animated videos. MemNet is the first near-human predictor of human visual memory, reaching a rank correlation of 0.64, close to the human consistency of 0.68 (Table 1). I therefore employ the predictive power of MemNet to generate my maximally memorable video summaries. The main benefit of building on MemNet is that I can spend more time developing an application (i.e. video summarization) that employs a deep network rather than imitating or even improving upon MemNet's predictive performance. See Figures 2 and 3 for distributions of memorability scores for sample videos, as predicted by MemNet.

Train set: LaMem | fc6  | fc7  | fc8  | MemNet
no FA            | 0.54 | 0.55 | 0.53 | 0.57
with FA          | 0.61 | 0.61 | 0.60 | 0.64

Table 1: Rank correlation of MemNet trained on the LaMem dataset. The reported performance is averaged over various train/test splits of the data. 'fc6', 'fc7', and 'fc8' refer to different layers of MemNet [14], and 'FA' refers to false alarms.

3. Dataset

I experimented mainly on the SumMe dataset, a set of 25 user videos, partially sourced from YouTube, made available by Gygli et al. [8]. I chose this dataset more out of convenience than for the sake of comparison with Gygli et al.'s summaries. As emphasized in Section 2.1.2, the Gygli et al. automatic evaluation framework compares summaries with "ground truth" summaries that do not take memorability into account. It was therefore helpful to qualitatively compare my summaries with those provided by Gygli et al., but a quantitative comparison would not have yielded useful insight. MemNet [14] is well-suited to predicting the memorability of frames from this dataset of user-generated videos because it was trained exclusively on real images. I hypothesize that my algorithm would also be well-suited to summarizing non-user-generated videos, such as movies or television shows, provided that they are not animated. However, I decided not to experiment explicitly with videos of these types for several reasons, including time and space constraints and accessibility challenges: most such videos cannot be legally or easily acquired free of charge.

4. Implementation

4.1. Technology Stack

The MemNet model architecture is expressed in the Caffe deep learning framework [12]. I experiment with MemNet through its Python interface, Pycaffe, in an IPython notebook.
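As a rough illustration of this stack, the sketch below scores a single extracted frame with MemNet through Pycaffe. The file names, the input blob name 'data', and the preprocessing settings are assumptions chosen to match typical Caffe deploy setups rather than the exact configuration of my notebook.

    import caffe

    # Placeholder paths: the deploy prototxt and trained weights ship with
    # the MemNet release [14].
    net = caffe.Net('memnet_deploy.prototxt', 'memnet.caffemodel', caffe.TEST)

    # Standard Caffe-style preprocessing (assumed): HxWxC -> CxHxW,
    # [0, 1] -> [0, 255], RGB -> BGR.
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))
    transformer.set_raw_scale('data', 255)
    transformer.set_channel_swap('data', (2, 1, 0))

    def frame_memorability(image_path):
        """Predict a scalar memorability score for one extracted frame."""
        image = caffe.io.load_image(image_path)
        net.blobs['data'].data[...] = transformer.preprocess('data', image)
        output = net.forward()
        # MemNet regresses a single real-valued score per image; read it
        # from the network's (single) output blob.
        return float(output[net.outputs[0]].flatten()[0])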
4.2. User parameters

• float compression ratio: summary length / input video length
• String coverage method: 'all' or 'best'
• String video name

I expose three parameters to the user: the compression ratio, the coverage method, and the original video file. The compression ratio dictates to my algorithm a priori the number of frames that the resulting summary should comprise. For instance, if the user inputs .15, the summarized video will be 15% as long as the original. I discuss the coverage method parameter in depth in Section 4.4; essentially, it allows the user to determine how much coverage of the original video is guaranteed to be preserved in the resultant summary. Different coverage methods are suited to different types of videos. The goal of my implementation is to extract a sequence of video clips from some portion of the buckets and then stitch these together into a single video clip, as described in Section 2.1.1.
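The a-priori summary length implied by the compression ratio can be computed directly. The helper below is an illustrative sketch (the function name is mine, not part of the notebook code), shown with the base jumping.mp4 example used later.

    def target_summary_length(total_frames, compression_ratio):
        """Number of frames the summary should contain, fixed a priori by
        the user-supplied compression ratio (fractional frames truncated)."""
        return int(compression_ratio * total_frames)

    # A .15 ratio on the 4732-frame base jumping.mp4 asks for about 709 frames.
    print(target_summary_length(4732, 0.15))   # -> 709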
Figure 4: The input video is converted by ffmpeg into an image sequence; contiguous images are partitioned into buckets, and each bucket is assigned a memorability score using Equation 3.

4.3. Initialization

First, I convert the original video file to an image sequence using the command-line tool ffmpeg [6]. I elected not to pass any options in my call, which keeps the frame size and the number of frames per second consistent from video frame to image; this worked well for my results. An example call to ffmpeg is:

    ffmpeg -i video_file imageSequenceDirectory/image%04d.png

Having produced an image sequence, my next step was to partition it into sequential buckets containing contiguous frames. I then assigned each bucket a memorability score using MemNet [14]. Determining how to score the memorability of a bucket was not a trivial task. Ultimately, I found that the best method was to equate a bucket's overall memorability with the mean of the memorability scores of the n frames it contains:

    mem(bucket) = (1/n) * Σ_{i=1}^{n} mem(frame_i)    (3)

where mem(frame_i) is the memorability score predicted by MemNet for a particular frame. I experimented with accounting for score variation and for the most memorable contiguous sub-sequence, but I found that quantifying the influence of these attributes was too complex and outside the scope of this project. For instance, it is unclear how the variability of memorability scores within a bucket should influence the bucket's memorability and ultimately the summary's memorability. The buckets approach, outlined visually in Figure 4, prepares the sequence well for the next step of determining summary coverage.
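A minimal sketch of this bucketing step is shown below. It assumes the per-frame scores have already been computed (e.g. with a helper like frame_memorability sketched above) and simply groups contiguous frames and averages their scores, as in Equation 3.

    def make_buckets(frame_scores, frames_per_bucket=100):
        """Partition an ordered list of per-frame memorability scores into
        contiguous buckets, each scored by its mean (Equation 3)."""
        buckets = []
        for start in range(0, len(frame_scores), frames_per_bucket):
            chunk = frame_scores[start:start + frames_per_bucket]
            buckets.append({
                'start': start,                         # index of the bucket's first frame
                'scores': chunk,                        # per-frame scores inside the bucket
                'mem': sum(chunk) / float(len(chunk)),  # bucket memorability
            })
        return buckets

    # Toy usage with synthetic scores; real scores come from MemNet.
    toy = [0.60, 0.62, 0.71, 0.68, 0.65, 0.64]
    print([round(b['mem'], 3) for b in make_buckets(toy, frames_per_bucket=3)])
    # -> [0.643, 0.657]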
4.4. Summary Coverage: Selecting which buckets to extract frames from

Given buckets that contain contiguous sequences of frames from the input video, we are left with flexibility in:

• determining which buckets to extract contiguous sub-sequences of frames from (to subsequently convert to video clips), and
• choosing how many frames the sub-sequences taken from selected buckets should contain.

These decisions can be made solely on the basis of each bucket's memorability score, as determined in Section 4.3. Determining which buckets to ultimately extract video clips from is equivalent to specifying the summary's coverage. I define a summary's coverage as the balance of its content with respect to where that content was originally located in the input video. A summary with highly balanced coverage contains content from as many parts of the original video as possible; conversely, a summary with highly imbalanced coverage contains content from a particularly narrow part of the original video. In my experiments, I worked with two coverage methods for determining which buckets to select frames from: 'all' and 'best'. The 'all' method pulls a sub-sequence from every bucket, while the 'best' method pulls a sub-sequence from the top half of buckets as ranked by their memorability scores. Regarding coverage, the 'all' method inherently produces a summary with very balanced coverage because it contains content from all portions of the original video. The 'best' method sometimes produces a summary in which much of the content comes from one particular (highly memorable) section of the original video. For example, I initially partitioned the SumMe video base jumping.mp4 into 48 buckets. All 48 are represented by the 'all' method, while the 24 buckets represented by the 'best' method, shown in Figure 6, are clustered toward the first half of the video. Figure 5 visually demonstrates the coverage differences between the 'all' and 'best' methods.

As substantiated by the statistics for this particular video, the 'all' method produced more memorable summaries overall. In fact, in my experiments the 'all' method always produced a summary that was more memorable than the original video, whereas the 'best' method sometimes produced a summary that was more forgettable than the original. However, it is important to maintain an application-level perspective when comparing these methods, because they apply differently to different types of input videos. For example, footage from a home surveillance camera contains mostly unchanging content, except for rare clusters of highly important information. The 'best' method is hypothetically better suited to this type of footage than the 'all' method because it can constrain its summary content to just the few sections of action, whereas the 'all' method would necessarily produce a summary containing a lot of unchanging content. I therefore exposed the coverage method as a user parameter for flexibility when dealing with different types of videos.

4.5. Determining the number of frames to pull from selected buckets

The number of frames I extract from the selected buckets is determined a priori by the compression ratio parameter. For example, base jumping.mp4 initially converted to 4732 frames, so a .15 compression ratio required a summary length of about 709 frames. The desired summary length dictates how many contiguous frames I can extract from each bucket. Once I have determined which buckets to pull from, I normalize the selected buckets' memorability scores so that they sum to 1.0. The normalized memorability score of a bucket then determines how many frames, nf_i, I take from bucket i:

    nf_i = mem_norm(bucket_i) * compression ratio * original number of frames    (4)

where mem_norm(bucket_i) is bucket i's normalized memorability score. Finally, I extract the most memorable contiguous sub-sequence of the desired length nf_i from the given bucket. Incidentally, the reason I extract a continuous sub-sequence of frames rather than the most memorable individual frames from each bucket is that I want the summary to be aesthetically pleasing in addition to being maximally memorable. A summary of non-continuous frames would hardly resemble a video; it would look like a rapidly sped-up, incoherent slideshow. The last step of my algorithm is to combine the sub-sequences from the selected buckets into a viewable summary.
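The sketch below shows one way to realize this selection and allocation using the bucket representation sketched in Section 4.3: 'best' keeps the top half of buckets by memorability, the frame budget is split across the selected buckets in proportion to their normalized scores (Equation 4), and the most memorable contiguous run of that length is then located within each bucket. The helper names are illustrative.

    def select_buckets(buckets, coverage_method):
        """'all' keeps every bucket; 'best' keeps the top half by memorability."""
        if coverage_method == 'all':
            return list(buckets)
        ranked = sorted(buckets, key=lambda b: b['mem'], reverse=True)
        kept = ranked[:len(buckets) // 2]
        # Restore temporal order so the skim still plays forward.
        return sorted(kept, key=lambda b: b['start'])

    def allocate_frames(selected, total_frames, compression_ratio):
        """Split the overall frame budget across the selected buckets in
        proportion to their normalized memorability scores (Equation 4)."""
        total_mem = sum(b['mem'] for b in selected)
        budget = compression_ratio * total_frames
        return [int(round(budget * b['mem'] / total_mem)) for b in selected]

    def best_subsequence(scores, nf):
        """Start/end indices of the most memorable contiguous run of nf
        frames within one bucket (assumes nf <= len(scores))."""
        windows = [sum(scores[i:i + nf]) for i in range(len(scores) - nf + 1)]
        start = max(range(len(windows)), key=windows.__getitem__)
        return start, start + nf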
4.6. Converting from sub-sequences to summary

Once I have applied the methods described in Section 4.5 and acquired all of the sub-sequences, all that remains is to collect them into a final sequence. As when converting from a video to an image sequence, I use ffmpeg to convert this resultant sequence into a video file.

5. Evaluation

Although no prior video summarization technique accounts for memorability, I still evaluate my summaries quantitatively by comparing them to a baseline summary that I produce for each combination of parameters. Overall, the 'all' method produced summaries that improved memorability over both the original videos and the baseline summaries. The 'best' method had mixed results, only sometimes improving memorability over the original and the baseline.

5.1. Quantitative

Memorability statistics of the two methods for base jumping.mp4 are shown in Table 2. For both coverage methods and any particular compression ratio, I created a baseline summary against which to compare my experimental summaries. My baseline implementation extracts the first b frames from each of the initially partitioned buckets, producing a uniform sampling of frames across all buckets i:

    b_i = (compression ratio * original number of frames_i) / number of buckets    (5)

For example, I initially partition the 4732 frames of base jumping.mp4 into 48 buckets at 100 frames per bucket; the baseline summary then extracts the first b frames from each of the 48 buckets.
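A compact sketch of this baseline, reusing the bucket representation from above and assuming that "original number of frames" in Equation 5 refers to the whole input video; the second helper assumes the overall score of a summary is the mean of its frames' scores, in the spirit of Equation 3.

    def baseline_summary(buckets, total_frames, compression_ratio):
        """Uniform-sampling baseline: take the first b frames of every bucket,
        with b chosen so the baseline roughly matches the target length."""
        b = int(compression_ratio * total_frames / len(buckets))
        frame_indices = []
        for bucket in buckets:
            frame_indices.extend(range(bucket['start'], bucket['start'] + b))
        return frame_indices

    def overall_memorability(frame_scores, frame_indices):
        """Mean MemNet score over the frames included in a summary."""
        return sum(frame_scores[i] for i in frame_indices) / float(len(frame_indices))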
The baseline summary clearly does not attempt to maximize the memorability of any selected bucket, and so it acts as a trivial summary. For base jumping.mp4, I compare my experimental summaries with the baseline summaries and the original video in Table 3. Similar improvement statistics for ten videos are shown in Table 4, using a compression ratio of .15 and 100 frames per bucket.

Figure 5: Visual comparison of 'all' vs. 'best' summary coverage, each comprising 709 frames. Eight frames from each method are shown sequentially, both at intervals of 100 frames (i.e. image0100.png, image0200.png, etc.). The time stamp below each frame is the approximate time at which the frame appears in the original video.

compression ratio | frames per bucket | coverage method | original score | baseline summary score | experimental summary score
0.05 | 100  | all  | 0.644725 | 0.644598 | 0.708159
0.10 | 100  | all  | 0.644725 | 0.644712 | 0.704605
0.10 | 1000 | all  | 0.644725 | 0.640886 | 0.728542
0.15 | 100  | all  | 0.644725 | 0.647021 | 0.698025
0.15 | 1000 | all  | 0.644725 | 0.635472 | 0.709068
0.25 | 100  | all  | 0.644725 | 0.647733 | 0.688528
0.05 | 100  | best | 0.644725 | 0.644598 | 0.651877
0.10 | 100  | best | 0.644725 | 0.644712 | 0.6403
0.10 | 1000 | best | 0.644725 | 0.640886 | 0.646896
0.15 | 100  | best | 0.644725 | 0.647021 | 0.631751
0.15 | 1000 | best | 0.644725 | 0.635472 | 0.637169
0.25 | 100  | best | 0.644725 | 0.647733 | 0.619118

Table 2: Memorability scores for base jumping.mp4, a 3:04-long video (4732 frames), shown for different compression ratios and coverage methods. I also experimented with different numbers of frames per bucket but chose not to expose this to the user, since its effect on performance is not intuitive. The baseline summary is described in Section 5.

My results indicate that the 'all' method performed better than the 'best' method at improving memorability over both the original video and the baseline summaries. The 'all' summaries improved memorability on average by 4.05% over the original videos and 3.99% over the baseline summaries, while the 'best' summaries changed memorability on average by -1.05% relative to the original and +1.38% relative to the baseline. Additionally, the baseline summary did a decent job of trivially summarizing the original video, changing its memorability by only -1.28%. In summary, the 'all' method successfully improved memorability over both the original video and the baseline summary, while the 'best' method did not have a substantial impact on the original memorability. The base jumping.mp4 statistics also suggest that memorability increases as the compression ratio decreases or as the number of frames per bucket increases. However, the number of frames per bucket cannot be increased endlessly, for it directly affects the resultant summary's coverage. Intuitively, as the size of a bucket increases it contains more of the original content, which decreases the amount of original content preserved in the summary.
This is part of the reason why I did not allow the user to customize this number; I left it as the programmer's decision.

Figure 6: The buckets covered by the 'best' method for base jumping.mp4. Notice how the buckets are clustered toward the earlier parts of the original video.

5.2. Qualitative

Although the 'all' summaries compare favorably to the 'best' summaries quantitatively, I allow the user to specify the coverage method because I find the 'best' summaries more aesthetically pleasing than the 'all' summaries. This is because the 'best' summaries combine longer sub-sequences from fewer buckets (Section 4.5), whereas the 'all' summaries can appear slightly jumpy, especially for shorter videos. This analysis is entirely subjective, and the 'all' method remains objectively superior to the 'best' method at attempting to maximize memorability. In general, all of the summaries produced by my algorithm are aesthetically pleasant to view and appear to be a valid substitute for watching the original video.

6. Conclusion, Limitations, and Future Work

Although the 'best' and 'all' methods of generating video summaries did not produce videos with large increases in memorability over the original video, this project made several positive contributions. First, the 'all' method improved memorability over both the original video and the baseline summary for every video, especially as the frames per bucket increased and the compression ratio decreased. The project thus succeeded in creating a unique type of video summary that accounts for memorability. Furthermore, the summaries were aesthetically pleasing to watch, which suggests that this project has practical implications.

6.1. Limitations

One of the main limitations of this project is that it relies entirely on the predictive capabilities of MemNet [14]. Although MemNet is the most precise model of human visual memory to date, further evaluation of its performance, and further progress in training deep networks to predict memorability, are needed. Perhaps multiple memorability prediction techniques could be combined to mitigate errors. Second, further development of basic algorithms such as Equation 3 is needed to increase the memorability improvements. Finally, the memorability of any frame or bucket is scored without any temporal or story context; it is scored entirely from its intrinsic elements. Unlike an image, which exists without context, a video's memorability is almost certainly affected by its plot and by the relationships between its sequences. This is another reason why Equation 3 is likely not the optimal way to score a bucket's or a summary's memorability. Given more time and resources, I would have liked human subjects to evaluate my summaries, to further validate whether they were in fact more memorable and aesthetically adequate.

6.2. Future Work

In conclusion, this project hopefully lays a strong foundation for future research at the new intersection of video summarization and memorability. I believe that video summaries produced with an eye toward memorability are especially important for certain application fields such as education. The interestingness of a summary is ultimately not relevant to a student watching a video in order to learn its content, but memorability is vital for later applying that content in the classroom.
An important next step for the intersection of automatic video summarization and memorability is the development of an automatic evaluation framework similar to that produced by Gygli et al. [8]. I suggest that this future framework incorporate multiple crowd-sourced "ground truth" summaries, but the "ground truth" values must account for memorability.

Acknowledgements

Thank you to Professor Jianxiong Xiao and Fisher Yu for their guidance and support throughout this project.

Honor Code

I pledge my honor that this project represents my own work in accordance with University regulations. /S/ Nicholas Pai.
compression ratio | frames per bucket | coverage method | Experimental improvement over original (%) | Experimental improvement over baseline (%) | Baseline improvement over original (%)
0.05 | 100  | all  | 9.84  | 9.86  | -0.02
0.10 | 100  | all  | 9.29  | 9.29  | 0.00
0.10 | 1000 | all  | 13.00 | 13.68 | -0.60
0.15 | 100  | all  | 8.27  | 7.88  | 0.36
0.15 | 1000 | all  | 9.98  | 11.58 | -1.44
0.25 | 100  | all  | 6.79  | 6.30  | 0.47
0.05 | 100  | best | 1.11  | 1.13  | -0.02
0.10 | 100  | best | -0.69 | -0.68 | 0.00
0.10 | 1000 | best | 0.34  | 0.94  | -0.60
0.15 | 100  | best | -2.01 | -2.36 | 0.36
0.15 | 1000 | best | -1.17 | 0.27  | -1.44
0.25 | 100  | best | -3.97 | -4.42 | 0.47

Table 3: Comparison of experimental and baseline summaries for base jumping.mp4. The parameter combinations are the same as in Table 2.

video         | coverage method | Experimental improvement over baseline (%) | Experimental improvement over original (%)
bike polo     | all  | 7.54  | 8.00
cooking       | all  | 5.04  | 4.90
playing ball  | all  | 3.35  | 3.42
base jumping  | all  | 7.88  | 8.27
valparaiso    | all  | 3.11  | 3.16
bearpark      | all  | 2.58  | 2.65
bus in tunnel | all  | 3.38  | 2.86
waterslide    | all  | 2.10  | 2.40
paintball     | all  | 1.06  | 1.05
eiffel        | all  | 3.91  | 3.78
bike polo     | best | -3.72 | -5.69
cooking       | best | 1.40  | 1.27
playing ball  | best | 1.99  | 0.03
base jumping  | best | 3.54  | -2.01
valparaiso    | best | 1.99  | 0.32
bearpark      | best | 1.57  | -1.00
bus in tunnel | best | 2.24  | -1.31
waterslide    | best | 1.58  | -0.16
paintball     | best | 0.60  | -1.02
eiffel        | best | 2.60  | -0.96

Table 4: Comparison of experimental and baseline summaries for ten videos from the SumMe data set [8]. The compression ratio and frames per bucket are .15 and 100, respectively, for all videos.

References

[1] T. Brady, T. Konkle, G. Alvarez, and A. Oliva. Visual long-term memory has a massive storage capacity for object details. In Proceedings of the National Academy of Sciences of the United States of America, volume 105, 2008.

[2] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world's largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC '07, pages 1–14, New York, NY, USA, 2007. ACM.
[3] S. E. F. de Avila, A. P. B. Lopes, A. da Luz, Jr., and A. de Albuquerque Araújo. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn. Lett., 32(1):56–68, Jan. 2011.

[4] R. Dubey, J. Peterson, A. Khosla, M.-H. Yang, and B. Ghanem. What makes an object memorable? In International Conference on Computer Vision (ICCV), 2015.

[5] N. Ejaz, I. Mehmood, and S. Wook Baik. Efficient visual attention based framework for extracting key frames from videos. Image Commun., 28(1):34–44, Jan. 2013.

[6] FFmpeg. https://ffmpeg.org.

[7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. ICCV, 2013.

[8] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In ECCV, 2014.

[9] M. Huang, A. Mahajan, and D. DeMenthon. Automatic performance evaluation for video summarization. Technical Report, 2007.

[10] P. Isola, D. Parikh, A. Torralba, and A. Oliva. Understanding the intrinsic memorability of images. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2429–2437. Curran Associates, Inc., 2011.

[11] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 145–152, 2011.

[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[13] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013.

[14] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. Understanding and predicting image memorability at a large scale. In International Conference on Computer Vision (ICCV), 2015.

[15] A. Khosla, J. Xiao, A. Torralba, and A. Oliva. Memorability of image regions. In Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012.

[16] G. Kim, L. Sigal, and E. P. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 4225–4232, Washington, DC, USA, 2014. IEEE Computer Society.

[17] T. Konkle, T. Brady, G. Alvarez, and A. Oliva. Scene memory is more important than you think: the role of categories in visual long-term memory. In Psychological Science, volume 21, 2010.

[18] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, pages 1346–1353. IEEE Computer Society, 2012.

[19] D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2178–2190, 2010.

[20] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '13, pages 2714–2721, Washington, DC, USA, 2013. IEEE Computer Society.

[21] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In ECCV 2014 - European Conference on Computer Vision, 2014.

[22] M. R. Robertson. 500 Hours of Video Uploaded to YouTube Every Minute. http://www.reelseo.com/hours-minute-uploaded-youtube/.
[23] B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM Trans. Multimedia Comput. Commun. Appl., 3(1), Feb. 2007.

[24] YouTube. Statistics. https://www.youtube.com/yt/press/statistics.html.
