Internet-scale multimedia retrieval

RNDr. Jakub Lokoč, Ph.D.
Siret Research Group (www.siret.cz)
Department of SW Engineering
Faculty of Mathematics and Physics
Charles University in Prague
27.11.2012
Security seminar BIG DATA, Police Academy of the Czech Republic in Prague
http://royal.pingdom.com
statistics for 2011

2.1 billion – Internet users worldwide

3.146 billion – number of email accounts worldwide

800+ million – number of users on Facebook

555 million – number of websites (+300 million in 2011)

1 trillion – number of video playbacks on YouTube
48 hours – amount of video uploaded to YouTube every minute

Multimedia (MM) data
100 billion – estimated number of photos on Facebook
4.5 million – number of photos uploaded to Flickr each day
Storage
Scalability
Searching
Security
…
Accessibility

Text-based techniques
 • Advantage – scalable retrieval by inverted files
 • Problem – missing or misleading annotations

Content-based techniques
 • Advantage – no annotation needed, visual similarity
 • Problem – slow retrieval for complex similarity models

Hybrid techniques
 • Text-based query + content-based reranking/exploration
 • Content-based query + text-based filtering
 • Adapting content-based data for inverted files

Document vector model
 • User issues a keyword query (Google, Bing, …)
 • Efficient query evaluation using inverted files
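The inverted-file evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the names `build_inverted_file` and `keyword_query`, the whitespace tokenization, and the sample annotations are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_file(annotations):
    """Map each keyword to the set of object ids annotated with it."""
    index = defaultdict(set)
    for obj_id, text in annotations.items():
        for term in text.lower().split():
            index[term].add(obj_id)
    return index


def keyword_query(index, query):
    """Return ids of objects whose annotation contains all query keywords."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Intersect posting lists, starting from the shortest one.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical annotations, invented for this example.
annotations = {
    "img1": "prague castle night",
    "img2": "prague bridge",
    "img3": "castle in scotland",
}
index = build_inverted_file(annotations)
matches = keyword_query(index, "prague castle")
```

Only the posting lists of the query keywords are touched, which is why this kind of retrieval scales well; the quality of the result still depends entirely on the annotations.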

Problems
 • Manual annotation is feasible only for small data
 • Subjectivity of the annotation
 • Homonyms, etc.

Automatic annotation
 • Surrounding text + linguistic methods + ontologies
 • Content-based keyword assignment
 • Still a lot of problems to solve…

Text-based retrieval

All objects transformed into a similarity model
 • Objects represented by descriptors (histograms, signatures)
 • Descriptors compared by a distance measure d (Lp, SQFD, EMD)

User issues an example object as a query q
[Figure: feature extraction (query object and database objects) feeding similarity evaluation]
Objects x sorted according to the visual similarity d(q, x)
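As a baseline, ranking by d(q, x) can be done with a sequential scan over all descriptors. The sketch below assumes descriptors are plain lists of numbers and uses an Lp distance; the names `lp_distance` and `knn_sequential` and the toy histograms are hypothetical, chosen only for illustration.

```python
import heapq


def lp_distance(x, y, p=2.0):
    """Minkowski (Lp) distance between two equally long descriptors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_sequential(query, database, k, dist=lp_distance):
    """Compute d(q, x) for every object and return the k nearest as (distance, id)."""
    scored = ((dist(query, desc), obj_id) for obj_id, desc in database.items())
    return heapq.nsmallest(k, scored)


# Toy 3-bin histograms, invented for this example.
database = {
    "a": [0.7, 0.2, 0.1],
    "b": [0.1, 0.1, 0.8],
    "c": [0.6, 0.3, 0.1],
}
q = [0.7, 0.25, 0.05]
nearest = knn_sequential(q, database, k=2)
```

Every query costs one distance computation per database object, which is exactly the efficiency problem when d is an expensive measure such as SQFD or EMD.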

How to solve the efficiency problem?
 • Hybrid techniques – only part of the DB is searched in the content-based (CB) way
 • Distance-based indexes or filter-and-refine methods

Distributed architectures needed (storage, throughput, …)

Hybrid techniques – reranking page 1

Hybrid techniques – reranking page 2

Hybrid techniques – exploration
J. Lokoč, T. Grošup, T. Skopal, Image Exploration using Online Feature Extraction and Reranking, ICMR, 2012, Hong Kong, China, ACM
J. Lokoč, T. Grošup, T. Skopal, SIR: The Smart Image Retrieval Engine, SISAP, 2012, Toronto, Canada, Springer

When a distance measure is a metric, we can employ metric indexes for fast query processing

Ball partitioning
 • M-Tree, PM-Tree, LoC

Hyperplane partitioning
 • GNAT, M-Index

Mapping methods
 • LAESA, Omni family
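The idea shared by the mapping methods is that precomputed distances to a few pivots give a lower bound on d(q, o) through the triangle inequality. The sketch below is a simplified pivot table in the spirit of LAESA, not LAESA itself; the class name `PivotTable`, the Euclidean distance, and the toy data are assumptions for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


class PivotTable:
    """Precomputed object-to-pivot distances used as a filter."""

    def __init__(self, objects, pivots, dist=euclidean):
        self.objects = objects
        self.pivots = pivots
        self.dist = dist
        self.table = [[dist(o, p) for p in pivots] for o in objects]

    def range_query(self, q, radius):
        """Return (objects within radius, number of d(q, o) computations)."""
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        hits, computed = [], 0
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for each pivot p.
            lower_bound = max(abs(dq - do) for dq, do in zip(q_to_pivots, row))
            if lower_bound > radius:
                continue  # filtered out without computing d(q, o)
            computed += 1
            if self.dist(q, obj) <= radius:
                hits.append(obj)
        return hits, computed


# Toy 2-D data, invented for this example.
objects = [(0, 0), (1, 0), (5, 5), (9, 9), (10, 10)]
index = PivotTable(objects, pivots=[(0, 0), (10, 10)])
hits, computed = index.range_query((0.5, 0), radius=1.0)
```

The filter is only valid when d satisfies the triangle inequality, which is why the metric property matters for exact search.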
P. Zezula, G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach, Springer, 2006
J. Lokoč, P. Čech, J. Novák, T. Skopal, Cut-region: A Compact Building Block For Hierarchical Metric Indexing, SISAP, 2012, Toronto, Canada, Springer
D. Novak, M. Batko, P. Zezula, Metric Index: An efficient and scalable solution for precise and approximate similarity search, Information Systems, 2011, Elsevier

Efficiency depends mainly on the distance distribution in the distance space

Indicator of data “indexability”
 • Intrinsic dimensionality: iDIM = mean² / (2 · variance), where mean and variance are taken over the distribution of distances
 • High iDIM = bad indexability (curse of dimensionality)
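The indicator can be estimated from a sample of pairwise distances. The sketch below follows the formula iDIM = mean² / (2 · variance); the function names, the sample size, and the use of uniformly random vectors with Euclidean distance are assumptions made only to have something to measure.

```python
import math
import random
import statistics


def intrinsic_dimensionality(objects, dist, sample_pairs=2000, seed=0):
    """Estimate iDIM = mean^2 / (2 * variance) from sampled pairwise distances."""
    rng = random.Random(seed)
    distances = []
    for _ in range(sample_pairs):
        x, y = rng.sample(objects, 2)  # two distinct objects
        distances.append(dist(x, y))
    mean = statistics.fmean(distances)
    variance = statistics.pvariance(distances)
    return mean * mean / (2.0 * variance)


def random_vectors(n, dim, seed):
    """Hypothetical test data: n uniformly random vectors in the unit cube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


low = intrinsic_dimensionality(random_vectors(500, 2, seed=1), math.dist)
high = intrinsic_dimensionality(random_vectors(500, 32, seed=1), math.dist)
```

For the uniform data used here, the 32-dimensional set should yield a clearly higher estimate than the 2-dimensional one: its distances concentrate around the mean, so pivots and partitions discriminate poorly.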
[Figure: query q with objects o1, o2 and points p1, p2]
E. Chavez, G. Navarro, R. Baeza-Yates, and J. L. Marroquin
Searching in Metric Spaces, ACM Computing Surveys, 2001

Relaxing precision
 • Approximate search
 • Distance space transformation
 • Synergistic modeling

Distributed computing (brute force)
 • Peer-to-peer architecture
 • Parallel processing on local nodes

Based on various ideas
 • Early termination for good results
 • Reducing query radius
 • When time elapses
 • Accessing % of DB

Also distance modifications

However, for fast retrieval, the quality deteriorates rapidly

P. Zezula, G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach, Springer, 2006
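One possible reading of "accessing % of DB" is to rank objects by a cheap estimate and compute the real distance only for the most promising fraction. The sketch below assumes a pivot-based lower bound as the estimate; the function name `approximate_knn`, the `fraction` parameter, and the random test data are hypothetical.

```python
import heapq
import math
import random


def approximate_knn(query, objects, pivots, k, fraction, dist=math.dist):
    """Approximate kNN computing d(q, x) for only a fraction of the DB."""
    # In a real index the object-to-pivot distances would be precomputed.
    table = [[dist(o, p) for p in pivots] for o in objects]
    q_to_pivots = [dist(query, p) for p in pivots]
    bounds = [max(abs(dq - do) for dq, do in zip(q_to_pivots, row))
              for row in table]
    # Visit objects in order of increasing lower bound, then stop early.
    order = sorted(range(len(objects)), key=bounds.__getitem__)
    budget = max(k, int(len(objects) * fraction))
    scored = ((dist(query, objects[i]), i) for i in order[:budget])
    return [objects[i] for _, i in heapq.nsmallest(k, scored)]


rng = random.Random(42)
objects = [(rng.random(), rng.random()) for _ in range(200)]
pivots = objects[:4]
q = (0.5, 0.5)
exact = approximate_knn(q, objects, pivots, k=5, fraction=1.0)
approx = approximate_knn(q, objects, pivots, k=5, fraction=0.1)
recall = len(set(exact) & set(approx)) / 5
```

With `fraction=1.0` the result is exact; lowering it trades recall for fewer distance computations, which is the quality deterioration the slide warns about.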

Nonlinear transformations of the distance space
 • Monotonic transformation = same similarity ordering
 • Problems with metric properties
   ▪ If t(x) = x², then 2 + 2 ≥ 4 but 2² + 2² < 4²
   ▪ Approximate search with metric access methods (MAMs)
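The squaring example can be checked directly. The sketch below uses the slide's distances 2, 2, 4; the helper names and the sample d(q, x) values are invented for illustration.

```python
def is_triangle_ok(a, b, c):
    """True if the three distances satisfy all triangle inequalities."""
    return a + b >= c and a + c >= b and b + c >= a


def square(x):
    """The transformation t(x) = x^2 from the example."""
    return x * x


distances = [2.0, 2.0, 4.0]                   # 2 + 2 >= 4 holds
transformed = [square(d) for d in distances]  # 4 + 4 < 16

# A monotonically increasing transformation keeps the ranking of objects.
to_query = {"x1": 0.3, "x2": 1.2, "x3": 0.7}  # hypothetical d(q, x) values
rank_before = sorted(to_query, key=to_query.get)
rank_after = sorted(to_query, key=lambda name: square(to_query[name]))
```

The ordering survives, but the transformed distances no longer form a metric, so a metric index applied to them can wrongly filter out relevant objects; the search becomes approximate.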
T. Skopal, Unified framework for fast exact and approximate search in dissimilarity spaces, ACM Transactions on Database Systems, 2007
T. Skopal, J. Lokoč, NM-tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces, LNCS 5181, Springer, 2008, DEXA, Turin, Italy

Design an indexable space (not only for precision)
 • Join the world of the domain experts and focus also on iDIM

Many factors influence iDIM
 • Extracted features
   ▪ Sampled points
   ▪ Quantization
   ▪ Clustering
 • Similarity measure
   ▪ Linear combinations
   ▪ Inner parameters

Let us also remember the MAP graphs
[Figure: indexable space]
Ch. Beecks, J. Lokoč, T. Seidl, T. Skopal, Indexing the Signature Quadratic Form Distance for Efficient
Content-Based Multimedia Retrieval, ACM ICMR 2011, Trento, Italy, ACM
J. Lokoč, Ch. Beecks, T. Seidl, T. Skopal, Parameterized Earth Mover’s Distance for Efficient Metric
Space Indexing, SISAP 2011, Lipari, Italy, ACM

Peer-to-peer architecture
 • Chord protocol (efficient routing)

M-Chord, M-Index
 • Map objects from U to the real domain R
 • Use the Chord protocol for object distribution
 • A query is translated into interval queries; the results are merged
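The mapping and interval-query steps can be sketched as below. This assumes an iDistance-style key (index of the nearest pivot times a constant c, plus the distance to that pivot); whether M-Chord or M-Index use exactly this key is not stated on the slide. Chord routing is not modelled: the "nodes" are just contiguous slices of the sorted key space, and all names are hypothetical.

```python
import bisect
import math
import random


class IntervalMappedIndex:
    """Objects mapped to 1-D keys; a range query becomes interval queries."""

    def __init__(self, objects, pivots, num_nodes, dist=math.dist):
        self.pivots, self.dist = pivots, dist
        # c separates the key ranges of the clusters: every d(o, p) <= c - 1.
        self.c = 1.0 + max(dist(o, p) for o in objects for p in pivots)
        keyed = sorted((self.key(o), o) for o in objects)
        # Split the sorted key space into contiguous chunks, one per node.
        size = math.ceil(len(keyed) / num_nodes)
        self.nodes = [keyed[i:i + size] for i in range(0, len(keyed), size)]

    def key(self, obj):
        dists = [self.dist(obj, p) for p in self.pivots]
        i = min(range(len(dists)), key=dists.__getitem__)  # nearest pivot
        return i * self.c + dists[i]

    def range_query(self, q, radius):
        results = []
        for i, p in enumerate(self.pivots):
            dq = self.dist(q, p)
            # Objects of cluster i within the radius have keys in [lo, hi];
            # hi is clamped so the interval stays inside cluster i.
            lo = i * self.c + max(0.0, dq - radius)
            hi = i * self.c + min(dq + radius, self.c - 0.5)
            if lo > hi:
                continue
            for node in self.nodes:  # each node answers its part of the interval
                keys = [k for k, _ in node]
                start = bisect.bisect_left(keys, lo)
                end = bisect.bisect_right(keys, hi)
                results.extend(o for _, o in node[start:end]
                               if self.dist(q, o) <= radius)
        return results  # merged partial results


rng = random.Random(7)
objects = [(rng.random(), rng.random()) for _ in range(300)]
index = IntervalMappedIndex(objects, pivots=objects[:3], num_nodes=4)
found = index.range_query((0.5, 0.5), radius=0.1)
```

In a distributed setting only the nodes whose key ranges overlap an interval would need to be contacted; the loop over all nodes here is a simplification.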
D. Novak, P. Zezula, M-Chord: a scalable distributed similarity search structure
InfoScale, 2006, ACM
D. Novak, M. Batko, P. Zezula, Large-scale similarity data management with distributed
Metric Index, Information Processing & Management
• Synergistic modeling
• Distance modifications
• Distributed index
• Approximate search – limit routing
• Local node index
• Approximate search in local nodes
• Parallel processing
… any questions?