4. What is concert video mashup?
• Concert video mashup takes the videos captured from different locations in a concert hall and converts them into a complete, non-overlapping, seamless, high-quality video of the performance.
5. Why concert video mashup?
• To give people who could not attend the live concert a second chance to enjoy the performance at comparable quality.
6. Many problems to be solved!
• The videos were captured without coordination, so incompleteness and redundancy are common.
• The order in which to watch these videos often causes confusion.
• The videos were captured with handheld devices, so their visual/audio quality cannot be guaranteed.
7. Issues to be addressed
• The order to watch
• Visual quality optimization
• Seamless soundtrack connection
• No redundancy
• No missing video segments
• Mashup results follow the rules defined by the language of film
8. Potential Issues: The order to watch (1/5)
• Three video clips captured from three different angles at different distances: clips 1 and 2 partially overlap, while clip 3 is independent.
9. Potential Issues: Multiple audio sequence alignment (2/5)
• Case 1: partially overlapped
• Case 2: no overlap
10. Potential Issues (3/5)
• Among three videos coherent in time, which one should be chosen? (three different locations)
-- follow the rules of the language of film!
(Figure: medium shot, long shot, and extreme long shot examples)
11. Potential Issues (4/5)
• Among several qualified video clips taken at the same distance, which one should be chosen?
-- visual quality? audio quality?
(Figure: two extreme long shot examples)
12. Potential Issues (5/5)
• How can the emotion, ideas, and art of a music director be brought into a concert video mashup process? Can a CNN learn facial emotion?
13. Previous Effort
• The research area closest to "automatic video mashup" is "summarization of multi-view videos".
• The objective of the latter is to produce a reduced set of abstracted videos or a key-frame sequence that represents the most prominent parts of the input videos.
14. Literature related to video mashup (1/3)
• [Shrestha et al.] formulate video mashup as an optimization problem
- pros: optimizes visual quality and diversity constraints
- cons: does not take into account the professional view of a visual storytelling director
P. Shrestha et al., "Automatic mashup generation from multiple-camera concert recordings," ACM MM, 2010.
15. Literature related to video mashup (2/3)
• [Wu et al.] apply pre-defined rules to solve the frequent-shot-change problem
- pros: solves part of the shot-change problem
- cons: does not involve a visual storytelling director to guide the video mashup process
Wu et al., "MoVieUp: Automatic mobile video mashup," IEEE TCSVT, 2015.
16. Literature related to video mashup (3/3)
• [Saini et al.] introduce visual storytelling rules by dividing the audience seats into six shooting locations and then calculating statistics of shot transitions and lengths from professionally edited videos
- pros: a good start, introducing the views of professional experts
- cons: shot types are defined by the authors themselves, not by the rules of the language of film
Saini et al., "MoViMash: Online mobile video mashup," ACM MM, 2012.
17. Introduction
• An experienced movie director frequently uses camera-work practices in visual storytelling.
(Song structure: Intro → Verse → Chorus → Bridge → …)
21. Introduction
• By the definitions of the language of film [3], a concert video contains eight types of camera shots, e.g., the Musical Instrument Shot (MIS) and the Audience Shot (ADS).
22. Introduction
• Two images from an official concert video of the song "93 Million Miles" by Jason Mraz, live in Hong Kong, 2012.
25. Object Representation (VGG-Net)
• Object representation using a 16-layer VGG-Net
• We extract features from the output layer and the two fully-connected layers as the object representations; the feature dimensions are 1000-D, 4096-D, and 4096-D, respectively.
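As a rough illustration (not the authors' code), the three layer outputs can be kept as three separate representation vectors of the stated dimensions; the toy vectors and the L2 normalization below are assumptions for the sketch:

```python
import numpy as np

# Toy stand-in features for one frame; the real vectors would come from
# a 16-layer VGG-Net's output layer and two fully-connected layers.
rng = np.random.default_rng(0)
f_out = rng.random(1000)   # output layer, 1000-D
f_fc1 = rng.random(4096)   # first fully-connected layer, 4096-D
f_fc2 = rng.random(4096)   # second fully-connected layer, 4096-D

def l2norm(v):
    """L2-normalize a feature vector so scale differences between
    layers do not dominate later fusion."""
    return v / np.linalg.norm(v)

representations = [l2norm(f) for f in (f_out, f_fc1, f_fc2)]
dims = [r.shape[0] for r in representations]
print(dims)  # [1000, 4096, 4096]
```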
28. Literature related to Fusion Strategy
• Early fusion
– Pros:
Takes advantage of combining various feature cues
– Cons:
A high-dimensional feature set easily suffers from data sparseness and stresses computational resources.
29. Literature related to Fusion Strategy
• Late fusion
– Pros:
Does not increase dimensionality
Makes it possible to interpret the performance of different classifiers and gain insight into the role of each modality during emotional expression
– Cons:
The assumption of conditional independence among multiple modalities is inappropriate.
30. Shot Classification based on EW-Deep-CCM
• A novel fusion strategy named the Error-Weighted Deep Cross-Correlation Model (EW-Deep-CCM) is proposed to effectively combine the extracted multilayer object representations.
34. Conditional Random Field (CRF)-based Approach
• 1st trial: 30-frame fixed window size (not a systematic way to smooth the results)
• 2nd trial: Recurrent Neural Network (RNN)
-- Problem: an RNN needs pre-segmented data to achieve its best results, but the shot-type classification results are not well segmented
• 3rd trial: Conditional Random Field (CRF)
35. OUR METHOD – Coherent-Net
(Framework diagram: shot type refinement with a CRF)
36. OUR METHOD – Coherent-Net Framework
(Framework diagram: EW-Deep-CCM produces P(w'|O); CRF-based shot type refinement applies P(w|w').)

P(w|O) = Σ_{w'} P(w, w'|O)
       ≈ P(w|w') · P(w'|O)
       ≈ P(w|w') · ∏_{n=1}^{N} P(w'_n|o_n)
37. (EW-Deep-CCM)
The EW-Deep-CCM posterior for each shot type w' combines a cross-correlation term, the DNN posterior probabilities (likelihoods), and the empirical weights α and β:

P(w'|o) ≈ Σ_{i=1}^{C} Σ_{j=1}^{D} Σ_{k=1}^{K} β_ij α_k · P(w'|w_i^out, w_j^fc, Λ_k) · P(w_i^out|o, Λ_k) · P(w_j^fc|o, Λ_k)

(Framework diagram: EW-Deep-CCM outputs P(w'|O); CRF-based shot type refinement applies P(w|w') to obtain P(w|O).)
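As a much-simplified stand-in for the error-weighted fusion idea (a plain weighted sum, not the full cross-correlation model; the 3-class posteriors and validation error rates below are assumed toy numbers):

```python
import numpy as np

# Hypothetical per-layer classifiers: each VGG layer (output, fc1, fc2)
# feeds its own DNN, giving a posterior over shot types. Toy 3-class
# posteriors are used here for brevity.
p_out = np.array([0.6, 0.2, 0.2])
p_fc1 = np.array([0.5, 0.3, 0.2])
p_fc2 = np.array([0.3, 0.4, 0.3])

# Assumed validation error rates; the empirical weights make less
# reliable classifiers contribute less to the fused posterior.
errors = np.array([0.10, 0.20, 0.40])
alpha = (1.0 - errors) / (1.0 - errors).sum()

fused = alpha[0] * p_out + alpha[1] * p_fc1 + alpha[2] * p_fc2
fused /= fused.sum()
print(int(fused.argmax()))  # class 0 wins after fusion
```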
38. Shot Type Refinement (CRF)
Let w' = (w'_1, …, w'_t, …) be the observed label sequence from EW-Deep-CCM and w = (w_1, w_2, …, w_t) the refined sequence. The linear-chain CRF is

P(w|w') = (1/Z(w')) exp( Σ_j F_j(w, w') )

Z(w') = Σ_w exp( Σ_j F_j(w, w') )

P(w|w') ∝ (1/Z(w')) exp( Σ_t Σ_j λ_j t_j(w_{t-1}, w_t, w') + Σ_t Σ_j μ_j s_j(w_t, w') )

        ∝ (1/Z(w')) exp( Σ_t Σ_{m∈S} Σ_{n∈S} λ_mn 1{w_t = m} 1{w_{t-1} = n} + Σ_t Σ_{m∈S} Σ_{o∈O} μ_om 1{w_t = m} 1{w'_t = o} )

State-observation pair (unary potential):
s_j(w_t, w') = 1 when w_t = m and w'_t = o; 0 otherwise

State transition (pairwise potential):
t_j(w_{t-1}, w_t, w') = 1 when w_t = m and w_{t-1} = n; 0 otherwise
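A minimal sketch of how such a refinement can be decoded: once the per-frame posteriors and a transition model are fixed, Viterbi dynamic programming returns the most likely smoothed label sequence. The 2-class posteriors and the sticky transition matrix below are toy assumptions, not the trained model:

```python
import numpy as np

def viterbi_refine(log_post, log_trans):
    """Refine per-frame shot-type posteriors with a transition model,
    returning the most likely label sequence (Viterbi decoding)."""
    T, S = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans          # (prev, curr) scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 shot types; a sticky transition prior smooths a
# single-frame classifier flicker at t = 2.
post = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1]]))
trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(viterbi_refine(post, trans))  # [0, 0, 0, 0]
```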
42. Problem & Goal
• A concert video mashup process needs to align the videos taken by different audience members to a common timeline.
43. Literature Review
• Audio fingerprinting
• Problems
– Originally designed for audio identification rather than time alignment.
– Easily causes audio signal distortion
• Zhu et al. treat audio identification as an image matching problem (significant performance improvement).
B. Zhu et al., "A novel audio fingerprinting method robust to time scale modification and pitch shifting," ACM MM, 2010.
44. Our Method
• We modified Zhu's method to address the multiple audio sequence alignment problem.
– Auditory image (spectrogram) construction: the short-time Fourier transform converts the 1-D audio signal (waveform) into a 2-D auditory image, i.e. a time-frequency representation (spectrogram).
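The waveform-to-auditory-image step can be sketched with a plain short-time Fourier transform (the window size, hop, and sine test signal below are assumptions for illustration):

```python
import numpy as np

def spectrogram(signal, win=256, hop=128):
    """Short-time Fourier transform magnitude: turns a 1-D waveform
    into a 2-D time-frequency auditory image (freq x time)."""
    window = np.hanning(win)
    frames = [signal[s:s + win] * window
              for s in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T

# One second of a 440 Hz tone at an assumed 8 kHz sampling rate.
sr = 8000
t = np.arange(sr) / sr
img = spectrogram(np.sin(2 * np.pi * 440 * t))
print(img.shape)  # (129, 61): 129 frequency bins, 61 time frames
```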
45. Our Method
– Audio Sequences Alignment
(1) Boundary candidate selection (based on SIFT alignment)
- where a is a SIFT feature in audio sequence A, b is the closest feature to a in B, and b' is the second-closest feature to a in B.

BC = Yes, if D(a, b) < c · D(a, b'); No, otherwise

BC: boundary candidate
D(·): Euclidean distance
c: a constant (c = 0.7)
(Figure: yellow lines mark the boundary candidates)
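The selection rule is a ratio test, which can be sketched as follows (the 2-D toy "descriptors" stand in for real SIFT features):

```python
import numpy as np

def boundary_candidates(feats_a, feats_b, c=0.7):
    """Ratio-test matching in the spirit of step (1): a feature a in
    sequence A yields a boundary candidate only when its nearest
    neighbour b in B is clearly closer than the second-closest b',
    i.e. D(a, b) < c * D(a, b')."""
    kept = []
    for i, a in enumerate(feats_a):
        d = np.linalg.norm(feats_b - a, axis=1)  # Euclidean distances
        order = np.argsort(d)
        b, b2 = order[0], order[1]
        if d[b] < c * d[b2]:
            kept.append((i, int(b)))
    return kept

# Toy 2-D "descriptors": both features in A have one clear match in B.
A = np.array([[0.0, 0.0], [5.0, 5.0]])
B = np.array([[0.1, 0.0], [10.0, 10.0], [5.2, 5.0]])
print(boundary_candidates(A, B))  # [(0, 0), (1, 2)]
```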
46. Our Method
– Audio Sequences Alignment
(2) Boundary candidate refinement
- A window distortion measure (WDM) is computed for each boundary candidate.
47. Our Method
– Audio Sequences Alignment
(3) Final boundary decision
- The alignment result is determined by the refined boundary candidate with the minimum window distortion.
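Steps (2) and (3) can be sketched in miniature; the mean absolute spectrogram difference below is an assumed stand-in for the paper's WDM, and the toy spectrograms are random:

```python
import numpy as np

def best_boundary(spec_a, spec_b, candidates, win=8):
    """For each candidate offset into spec_a, compute a window
    distortion (here: mean absolute spectrogram difference, a stand-in
    for the WDM) and return the minimum-distortion candidate."""
    def wdm(off):
        return float(np.abs(spec_a[:, off:off + win] - spec_b[:, :win]).mean())
    return min(candidates, key=wdm)

# Toy spectrograms: spec_b is spec_a delayed by 5 frames, so offset 5
# gives zero distortion and should be chosen.
rng = np.random.default_rng(3)
spec_a = rng.random((16, 40))
spec_b = spec_a[:, 5:25]
print(best_boundary(spec_a, spec_b, candidates=[2, 5, 9]))  # 5
```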
48. DEMO 1
• "I'm Yours" by Jason Mraz, live in Singapore, 2012
– with context search (aligned in 49.8001 s)
(Timeline: 00:00:00–00:00:49.8001; Recording #4 and Recording #5, offset +0.4334 s)
49. DEMO 2
• "All I Ask" by Adele, live at the Birmingham Genting Arena, 2016
– with context search (aligned in 53.2169 s)
(Timeline: 00:00:00–00:00:53.2169; Recording #1 and Recording #2, offset +0.5502 s)
55. Motivations
• To develop an automatic tactics analysis tool for coaches, players, and the general public.
• To develop a new technique that can compete with existing tools, such as SportVU, but at a much lower price.
56. Methodology Adopted
• Analyze group behavior directly from the court view of an NBA broadcast video
• Detect and track each offense player, calculate their trajectories, and map these trajectories from the court view to a tactic board for analysis
62. Extracting features from an offense video clip
• Automatic player detection
• Automatic player tracking
• Map extracted trajectories from the basketball court to the tactic board
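The court-to-board mapping step is typically a planar homography; a minimal sketch (the diagonal scaling matrix below is an assumed toy H, whereas a real one would be estimated from court landmarks):

```python
import numpy as np

def apply_homography(H, pts):
    """Map 2-D points through a 3x3 planar homography H using
    homogeneous coordinates, e.g. from the court view onto the
    tactic board."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]             # back to 2-D

# Assumed toy homography: a pure scaling of court coordinates onto a
# smaller board.
H = np.diag([0.1, 0.1, 1.0])
traj = np.array([[100.0, 50.0], [120.0, 60.0]])
print(apply_homography(H, traj))
```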
63. Step 2: Derive correct player trajectories on the panorama court (3/3)
64. Step 3: Map trajectories from the panorama court to the tactic board
65. What's next?
– Tactics analysis based on the spatiotemporal trajectories of the 5 offense players
66. A Two-Stage Unsupervised Clustering for Tactic Analysis
• Stage 1: Unsupervised clustering of all available tactics based on their mutual distances
• Stage 2: Unsupervised clustering of all tactics assigned to the same cluster in Stage 1 (to separate the role of each offense player)
67. What techniques are needed?
• A spatiotemporal model that can describe the group behavior of 5 offense players
• Automatic clustering of group behaviors (screen-cut, Princeton, wing-wheel, etc.)
• A representation of each group behavior
• An appropriate metric for the distance between two arbitrary tactics
68. Trajectory Set Representation
S: the spatiotemporal matrix
P_ij = (x_ij, y_ij): 2-D coordinate of the j-th player in the i-th frame
V_j = [P_1j P_2j … P_Lj]^T
S = [V_1 V_2 V_3 V_4 V_5 (V_6)]
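The representation above can be sketched directly with arrays (random toy positions; a real clip would supply tracked coordinates):

```python
import numpy as np

# Toy spatiotemporal matrix S for one clip: L frames, 5 players
# (a 6th column could hold the ball). P[i, j] = (x_ij, y_ij).
L, players = 4, 5
rng = np.random.default_rng(1)
P = rng.random((L, players, 2))

# V_j = [P_1j P_2j ... P_Lj]^T stacks player j's positions over time.
V = [P[:, j, :] for j in range(players)]   # each V_j: (L, 2)
S = np.stack(V, axis=1)                    # S = [V_1 ... V_5]: (L, 5, 2)
print(S.shape)  # (4, 5, 2)
```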
69. Distance Measure of a Trajectory Set
• Problems
• Different time durations between two clips
• Ordering of the column vectors
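Both problems admit simple fixes, sketched here under assumptions (linear resampling to a common length for the duration mismatch, and a brute-force best matching over the 5! column orderings for the ordering ambiguity; this is an illustration, not the paper's exact metric):

```python
import numpy as np
from itertools import permutations

def resample(traj, n):
    """Linearly resample an (L, 2) trajectory to n frames, handling
    clips of different durations."""
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(t_new, t_old, traj[:, d])
                     for d in range(traj.shape[1])], axis=1)

def set_distance(S1, S2, n=16):
    """Distance between two trajectory sets: resample every trajectory
    to a common length, then take the best one-to-one matching of
    columns so the arbitrary ordering of players does not matter."""
    A = [resample(S1[:, j], n) for j in range(S1.shape[1])]
    B = [resample(S2[:, j], n) for j in range(S2.shape[1])]
    return min(sum(np.linalg.norm(A[j] - B[p[j]]) for j in range(len(A)))
               for p in permutations(range(len(B))))

# Identical sets whose player columns are permuted have distance 0.
S1 = np.random.default_rng(2).random((10, 3, 2))
S2 = S1[:, [2, 0, 1], :]
print(round(set_distance(S1, S2), 6))  # 0.0
```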
71. Clustering by Dominant Set
M. Pavan and M. Pelillo, "Dominant sets and pairwise clustering," IEEE TPAMI, 2007.
(Figure: Tactic 1, Tactic 2, Tactic 3 clusters)
72. Second stage: how to model an offense strategy?
• 8 different trajectory sets of the right-hawk tactic, each consisting of 5 trajectories generated by the 5 offense players
73. Clustering by Trajectory Distance
• Based on the distance between trajectories, each group of tactics can be separated into five groups of trajectories, each corresponding to a role (an offense player).
(Examples: Hawk, Wing, Wheel, Princeton)
74. Temporal Alignment
For each role, we model it by its velocities along the x- and y-directions, respectively, and use DTW to solve the alignment problem.
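The DTW step above can be sketched on a single 1-D velocity profile (a plain textbook DTW, not the authors' implementation; the two test sequences are assumed toy data):

```python
import numpy as np

def dtw(x, y):
    """Plain dynamic time warping on 1-D sequences (e.g. a role's
    x- or y-velocity), returning the minimal alignment cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The same velocity profile played at two speeds aligns with zero cost.
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0])
print(dtw(a, b))  # 0.0
```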