Part III Hierarchical Bayesian Models : Part III Hierarchical Bayesian Models
Slide2 : Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Grammar
Phrase structure
Utterance
Speech signal
Vision : (Han and Zhu, 2006) Vision
Word learning : Word learning Levels: Principles, Structure, Data
Principles:
Whole-object principle
Shape bias
Taxonomic principle
Contrast principle
Basic-level bias
Hierarchical Bayesian models : Hierarchical Bayesian models Can represent and reason about knowledge at multiple levels of abstraction.
Have been used by statisticians for many years.
Have been applied to many cognitive problems:
causal reasoning (Mansinghka et al, 06)
language (Chater and Manning, 06)
vision (Fei-Fei, Fergus, Perona, 03)
word learning (Kemp, Perfors, Tenenbaum, 06)
decision making (Lee, 06)
Outline : Outline A high-level view of HBMs
A case study: Semantic knowledge
Slide8 : Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Grammar: P(grammar | UG)
Phrase structure: P(phrase structure | grammar)
Utterance: P(utterance | phrase structure)
Speech signal: P(speech | utterance)
Slide9 : Hierarchical Bayesian model Figure: Universal Grammar (U), Grammar (G), phrase structures (s1 to s6), utterances (u1 to u6), linked by P(G|U), P(s|G), P(u|s)
Slide10 : Hierarchical Bayesian model A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:
P({ui}, {si}, G | U) = P({ui} | {si}) P({si} | G) P(G | U)
Knowledge at multiple levels : Knowledge at multiple levels Top-down inferences:
How does abstract knowledge guide inferences at lower levels?
Bottom-up inferences:
How can abstract knowledge be acquired?
Simultaneous learning at multiple levels of abstraction
Slide12 : Top-down inferences Given grammar G and a collection of utterances, construct a phrase structure for each utterance.
Slide13 : Top-down inferences Infer {si} given {ui}, G:
P({si} | {ui}, G) ∝ P({ui} | {si}) P({si} | G)
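The top-down rule above can be sketched on a toy discrete model. This is only an illustration: the two phrase structures (`sA`, `sB`), the two utterance types (`x`, `y`), and every probability are made-up assumptions, and the sketch further assumes that P({si} | G) and P({ui} | {si}) factor over i.

```python
# Hypothetical toy tables; every name and number is an illustrative assumption.
P_S_GIVEN_G = {"sA": 0.9, "sB": 0.1}           # P(s | G) for the known grammar G
P_U_GIVEN_S = {"sA": {"x": 0.8, "y": 0.2},     # P(u | s)
               "sB": {"x": 0.3, "y": 0.7}}


def posterior_over_structures(u):
    """Return P(s | u, G), which is proportional to P(u | s) P(s | G)."""
    scores = {s: P_U_GIVEN_S[s][u] * P_S_GIVEN_G[s] for s in P_S_GIVEN_G}
    total = sum(scores.values())
    return {s: score / total for s, score in scores.items()}


utterances = ["x", "y", "y"]
# Under the factorization assumed above, each s_i can be inferred separately.
posteriors = [posterior_over_structures(u) for u in utterances]
```

Because the grammar strongly favors `sA`, even the utterance `y` (which `sB` explains better) is still assigned mostly to `sA`: abstract knowledge shapes the lower-level inference.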
Slide14 : Bottom-up inferences Given a collection of phrase structures, learn a grammar G.
Slide15 : Bottom-up inferences Infer G given {si} and U:
P(G | {si}, U) ∝ P({si} | G) P(G | U)
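A minimal sketch of the bottom-up rule, assuming a hypothetical space of just two grammars (`G1`, `G2`) and phrase structures that are independent given the grammar. The tables are invented for illustration.

```python
from math import prod

# Hypothetical toy tables; every name and number is an illustrative assumption.
P_G_GIVEN_U = {"G1": 0.7, "G2": 0.3}                  # P(G | U)
P_S_GIVEN_G = {"G1": {"sA": 0.9, "sB": 0.1},          # P(s | G)
               "G2": {"sA": 0.2, "sB": 0.8}}


def posterior_over_grammars(structures):
    """Return P(G | {s_i}, U), proportional to P({s_i} | G) P(G | U)."""
    scores = {
        g: prod(P_S_GIVEN_G[g][s] for s in structures) * P_G_GIVEN_U[g]
        for g in P_G_GIVEN_U
    }
    total = sum(scores.values())
    return {g: score / total for g, score in scores.items()}


post = posterior_over_grammars(["sB", "sB", "sA"])
```

Here the prior favors `G1`, but three observed structures are enough for the posterior to favor `G2`.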
Slide16 : Simultaneous learning at multiple levels Given a set of utterances {ui} and innate knowledge U, construct a grammar G and a phrase structure for each utterance.
Slide17 : Simultaneous learning at multiple levels A chicken-or-egg problem:
Given a grammar, phrase structures can be constructed
Given a set of phrase structures, a grammar can be learned
Slide18 : Simultaneous learning at multiple levels Infer G and {si} given {ui} and U:
P(G, {si} | {ui}, U) ∝ P({ui} | {si}) P({si} | G) P(G | U)
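The joint rule can be sketched by brute-force enumeration, which sidesteps the chicken-or-egg problem on a space small enough to list. The grammars, structures, utterance types, and probabilities below are all hypothetical, and independence across i is assumed.

```python
from itertools import product
from math import prod

# Hypothetical toy tables; every name and number is an illustrative assumption.
P_G_GIVEN_U = {"G1": 0.7, "G2": 0.3}                  # P(G | U)
P_S_GIVEN_G = {"G1": {"sA": 0.9, "sB": 0.1},          # P(s | G)
               "G2": {"sA": 0.2, "sB": 0.8}}
P_U_GIVEN_S = {"sA": {"x": 0.8, "y": 0.2},            # P(u | s)
               "sB": {"x": 0.3, "y": 0.7}}


def joint_posterior(utterances):
    """Return P(G, {s_i} | {u_i}, U) as a dict keyed by (G, (s_1, ..., s_n))."""
    structures = list(P_U_GIVEN_S)
    scores = {}
    for g in P_G_GIVEN_U:
        for ss in product(structures, repeat=len(utterances)):
            p_u = prod(P_U_GIVEN_S[s][u] for s, u in zip(ss, utterances))
            p_s = prod(P_S_GIVEN_G[g][s] for s in ss)
            scores[(g, ss)] = p_u * p_s * P_G_GIVEN_U[g]
    total = sum(scores.values())
    return {key: score / total for key, score in scores.items()}


post = joint_posterior(["y", "y", "y"])
best_grammar, best_structures = max(post, key=post.get)
```

Enumeration grows exponentially with the number of utterances, so it only serves to make the formula concrete; it is not a claim about how inference is carried out in the models discussed here.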
Slide19 : Hierarchical Bayesian model P(G|U) P(s|G) P(u|s)
Knowledge at multiple levels : Knowledge at multiple levels Top-down inferences:
How does abstract knowledge guide inferences at lower levels?
Bottom-up inferences:
How can abstract knowledge be acquired?
Simultaneous learning at multiple levels of abstraction
Outline : Outline A high-level view of HBMs
A case study: Semantic knowledge
Folk Biology : Folk Biology R: principles S: structure D: data mouse squirrel chimp gorilla The relationships between living kinds are well described by tree-structured representations “Gorillas have hands”
Folk Biology : Folk Biology R: principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Outline : Outline A high-level view of HBMs
A case study: Semantic knowledge
Property induction
Learning structured representations
Learning the abstract organizing principles of a domain
Property induction : Property induction R: principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Property Induction : Property Induction R: Principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion Approach: work with the distribution P(D|S,R)
Property Induction : Property Induction Previous approaches: Rips (75), Osherson et al (90), Sloman (93), Heit (98)
Bayesian Property Induction : Hypotheses Bayesian Property Induction
Choosing a prior : Choosing a prior
Bayesian Property Induction : Bayesian Property Induction A challenge:
We have to specify the prior, which typically includes many numbers
An opportunity:
The prior can capture knowledge about the problem.
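To make the role of the prior concrete, here is a minimal sketch of Bayesian property induction by hypothesis enumeration. It assumes that each hypothesis is a set of species that have the property, that a hypothesis is either consistent with the premises or ruled out, and that the prior is a made-up weighting that penalizes hypotheses splitting an assumed cluster. None of these choices is taken from the slides.

```python
from itertools import combinations

SPECIES = ["mouse", "squirrel", "chimp", "gorilla"]
# Assumed grouping used only to build an illustrative prior.
CLUSTERS = [{"mouse", "squirrel"}, {"chimp", "gorilla"}]


def all_hypotheses():
    """Every non-empty subset of species is a candidate extension of the property."""
    return [frozenset(c) for r in range(1, len(SPECIES) + 1)
            for c in combinations(SPECIES, r)]


def prior_weight(h):
    """Toy unnormalized prior: each cluster that h splits costs a factor of 0.25."""
    splits = sum(1 for c in CLUSTERS if 0 < len(h & c) < len(c))
    return 0.25 ** splits


def generalize(premises, conclusion):
    """P(conclusion species has the property | premise species have it)."""
    consistent = [h for h in all_hypotheses() if set(premises) <= h]
    total = sum(prior_weight(h) for h in consistent)
    hit = sum(prior_weight(h) for h in consistent if conclusion in h)
    return hit / total


to_gorilla = generalize(["chimp"], "gorilla")
to_mouse = generalize(["chimp"], "mouse")
```

With this prior, a property of chimps generalizes more strongly to gorillas than to mice. The 15 hypothesis weights illustrate the "many numbers" challenge: here they come from one simple rule instead of being specified one by one.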
Property Induction : Property Induction R: Principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion
Biological properties : Biological properties Structure:
Living kinds are organized into a tree
Stochastic process:
Nearby species in the tree tend to share properties
Slide35 : Structure:
Slide36 : Structure:
Stochastic Process : Smooth Not smooth Stochastic Process Nearby species in the tree tend to share properties.
In other words, properties tend to be smooth over the tree.
Stochastic process : Hypotheses Stochastic process
Generating a property : Generating a property Generate a continuous vector y that tends to be smooth over the tree, then threshold y to obtain the binary property h
Slide40 : S
The diffusion process : The diffusion process y is drawn from a Gaussian with covariance K, and hi = Θ(yi), where Θ(yi) is 1 if yi ≥ 0 and 0 otherwise. The covariance K encourages y to be smooth over the graph S.
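One way to get a feel for the threshold construction is to sample from it. The sketch below does not build the covariance K explicitly. It swaps in a Brownian-motion-style process, where each node's value is its parent's value plus Gaussian noise, which also produces values that are smooth over a tree. The tree shape, branch lengths, and node names `rodents` and `primates` are hypothetical.

```python
import random

# Hypothetical tree: child -> (parent, branch length). The root has no entry.
TREE = {
    "rodents": ("root", 1.0), "primates": ("root", 1.0),
    "mouse": ("rodents", 0.2), "squirrel": ("rodents", 0.2),
    "chimp": ("primates", 0.2), "gorilla": ("primates", 0.2),
}
LEAVES = ["mouse", "squirrel", "chimp", "gorilla"]


def sample_property(rng, root_sd=1.0):
    """Sample continuous y over the tree, then threshold: h_i = 1 if y_i >= 0."""
    y = {"root": rng.gauss(0.0, root_sd)}

    def value(node):
        if node not in y:
            parent, length = TREE[node]
            # Child = parent + Gaussian increment whose variance is the branch length.
            y[node] = value(parent) + rng.gauss(0.0, length ** 0.5)
        return y[node]

    return {leaf: int(value(leaf) >= 0) for leaf in LEAVES}


rng = random.Random(0)
samples = [sample_property(rng) for _ in range(2000)]
agree_within = sum(s["chimp"] == s["gorilla"] for s in samples) / len(samples)
agree_across = sum(s["chimp"] == s["mouse"] for s in samples) / len(samples)
```

Species that are close in the tree (chimp and gorilla) end up sharing the sampled property more often than distant ones (chimp and mouse), which is the smoothness the slide describes.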
p(y|S,R): Generating a property : p(y|S,R): Generating a property Let yi be the feature value at node i (Zhu, Lafferty, Ghahramani 03)
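The density itself is not recoverable from the slide text. As a hedged illustration, one common Gaussian-field form is p(y) ∝ exp(-E(y)), with E(y) equal to half the sum over edges (i, j) of (yi - yj)^2, plus a term that keeps the values bounded. The sketch below assumes that form, unit edge weights, and a hypothetical tree with internal nodes `rodents` and `primates`.

```python
# Hypothetical tree as an undirected edge list with unit weights (assumptions).
EDGES = [("root", "rodents"), ("root", "primates"),
         ("rodents", "mouse"), ("rodents", "squirrel"),
         ("primates", "chimp"), ("primates", "gorilla")]


def energy(y, sigma=1.0):
    """E(y) = 0.5 * sum over edges (y_i - y_j)^2 + sum_i y_i^2 / (2 sigma^2)."""
    smoothness = 0.5 * sum((y[i] - y[j]) ** 2 for i, j in EDGES)
    shrinkage = sum(v ** 2 for v in y.values()) / (2 * sigma ** 2)
    return smoothness + shrinkage


# A vector that changes only between the two subtrees (smooth over the tree).
smooth_y = {"root": 0, "rodents": -1, "mouse": -1, "squirrel": -1,
            "primates": 1, "chimp": 1, "gorilla": 1}
# The same magnitudes, but with sign flips inside each subtree (not smooth).
rough_y = {"root": 0, "rodents": -1, "mouse": 1, "squirrel": -1,
           "primates": 1, "chimp": -1, "gorilla": 1}

smooth_energy = energy(smooth_y)
rough_energy = energy(rough_y)
```

Lower energy means higher density under the assumed form, so the smooth vector is the more probable of the two.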
Biological properties : Biological properties R: Principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion Approach: work with the distribution P(D|S,R)
Results : Results (Osherson et al) Model Human
Results : Results Cows have property P.
Elephants have property P.
Horses have property P.
All mammals have property P. Model Human
Spatial model : Spatial model R: principles S: structure D: data mouse squirrel chimp gorilla Structural form: 2D space
Stochastic process: diffusion
Slide48 : Structure:
Slide49 : Structure:
Tree vs 2D : Tree vs 2D “horse” “all mammals” Tree + diffusion 2D + diffusion
Biological Properties : Biological Properties R: Principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion
Three inductive contexts : Three inductive contexts (figure: classes A to G arranged as a tree, a chain, and a network)
R, S: tree + diffusion process. Example property: “has T4 cells”
R, S: chain + drift process. Example property: “can bite through wire”
R, S: network + causal transmission. Example property: “carries E. Spirus bacteria”
Threshold properties : Threshold properties “can bite through wire”
“has skin that is more resistant to penetration than most synthetic fibers”
Hippo Cat Lion Camel Elephant Poodle Collie Doberman (Osherson et al; Blok et al)
Threshold properties : Threshold properties Structure:
The categories can be organized along a single dimension
Stochastic process:
Categories towards one end of the dimension are more likely to have the novel property
Results : Results “has skin that is more resistant to penetration than most synthetic fibers” (Blok et al, Smith et al) 1D + drift 1D + diffusion
Three inductive contexts : Three inductive contexts (figure: classes A to G arranged as a tree, a chain, and a network)
R, S: tree + diffusion process. Example property: “has T4 cells”
R, S: chain + drift process. Example property: “can bite through wire”
R, S: network + causal transmission. Example property: “carries E. Spirus bacteria”
Causally transmitted properties : Causally transmitted properties (Medin et al; Shafto and Coley) Salmon Grizzly bear
Causally transmitted properties : Causally transmitted properties Structure:
The categories can be organized into a directed network
Stochastic process:
Properties are generated by a noisy transmission process
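A noisy transmission process over a directed network can be sketched with forward sampling plus rejection. Everything specific here is an assumption made for illustration: the three-species chain (with `herring` added as hypothetical prey), edges pointing from prey to predator, a background rate `base`, and a per-edge transmission probability `transmit`.

```python
import random

# Hypothetical food chain; edges point from prey to predator (assumption).
EDGES = [("herring", "salmon"), ("salmon", "grizzly bear")]
ORDER = ["herring", "salmon", "grizzly bear"]   # a topological order of the network


def sample_network(rng, base=0.1, transmit=0.8):
    """Sample which species carry the property under noisy transmission."""
    has = {}
    for node in ORDER:
        carries = rng.random() < base              # acquired from the background
        for src, dst in EDGES:
            if dst == node and has[src] and rng.random() < transmit:
                carries = True                      # acquired from infected prey
        has[node] = carries
    return has


def generalize(premise, conclusion, n=20000, seed=0):
    """Estimate P(conclusion carries it | premise carries it) by rejection sampling."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        has = sample_network(rng)
        if has[premise]:
            kept += 1
            hits += has[conclusion]
    return hits / kept


prey_to_predator = generalize("salmon", "grizzly bear")
predator_to_prey = generalize("grizzly bear", "salmon")
```

Under these assumed parameters the inference is asymmetric: learning that salmon carry the property supports the conclusion about grizzly bears more strongly than the reverse.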
Experiment: disease properties : Experiment: disease properties Island Mammals (Shafto et al)
Results: disease properties : Results: disease properties Mammals Island (model: web + transmission)
Three inductive contexts : Three inductive contexts (figure: classes A to G arranged as a tree, a chain, and a network)
R, S: tree + diffusion process. Example property: “has T4 cells”
R, S: chain + drift process. Example property: “can bite through wire”
R, S: network + causal transmission. Example property: “carries E. Spirus bacteria”
Property Induction : Property Induction R: Principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion Approach: work with the distribution P(D|S,R)
Conclusions : property induction : Conclusions : property induction Hierarchical Bayesian models help to explain how abstract knowledge can be used for induction
Outline : Outline A high-level view of HBMs
A case study: Semantic knowledge
Property induction
Learning structured representations
Learning the abstract organizing principles of a domain
Structure learning : Structure learning R: Principles S: structure D: data Structural form: tree
Stochastic process: diffusion mouse squirrel chimp gorilla
Structure learning : Structure learning R: principles S: structure D: data ? Goal: find S that maximizes P(S|D,R) Structural form: tree
Stochastic process: diffusion
Structure learning : Structure learning R: principles S: structure D: data ? Goal: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R) Structural form: tree
Stochastic process: diffusion
Structure learning : Structure learning R: principles S: structure D: data ? Goal: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R), where P(D|S,R) is the distribution previously used for property induction Structural form: tree
Stochastic process: diffusion
Generating features over the tree : mouse squirrel chimp gorilla Generating features over the tree
Structure learning : Structure learning R: principles S: structure D: data ? Goal: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R) Structural form: tree
Stochastic process: diffusion
P(S|R): Generating structures : P(S|R): Generating structures Consistent with R Inconsistent with R
P(S|R): Generating structures : P(S|R): Generating structures Complex Simple
P(S|R): Generating structures : P(S|R): Generating structures P(S|R) is zero if S is inconsistent with R. Otherwise, each structure is weighted by the number of nodes it contains.
Structure Learning : Structure Learning P(S|D,R) will be high when:
The features in D vary smoothly over S
S is a simple graph (a graph with few nodes)
R: principles S: structure D: data Aim: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R)
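The trade-off between smoothness and simplicity can be sketched as a score, log P(D|S,R) + log P(S|R), compared across candidate structures. This is a simplified stand-in, not the procedure behind the results shown: it assumes every node is an observed species (no latent internal nodes), a Gaussian likelihood whose precision matrix is the graph Laplacian plus I/SIGMA^2, a prior weight THETA^|S|, and invented feature values.

```python
import math

NODES = ["mouse", "squirrel", "chimp", "gorilla"]
# Two hypothetical candidate structures: chains over the same four nodes.
S1 = [("mouse", "squirrel"), ("squirrel", "chimp"), ("chimp", "gorilla")]
S2 = [("mouse", "chimp"), ("chimp", "squirrel"), ("squirrel", "gorilla")]
# Hypothetical continuous features, one value per node in NODES order.
D = [[1.0, 1.0, -1.0, -1.0], [-1.0, -0.8, 1.0, 1.2]]
THETA = 0.5   # assumed per-node penalty in the prior, 0 < THETA < 1
SIGMA = 1.0   # assumed scale of the Gaussian prior on feature values


def precision(edges):
    """Q = graph Laplacian + I / SIGMA^2 (assumed form of the inverse covariance)."""
    n = len(NODES)
    idx = {name: k for k, name in enumerate(NODES)}
    Q = [[(1.0 / SIGMA ** 2 if i == j else 0.0) for j in range(n)] for i in range(n)]
    for a, b in edges:
        i, j = idx[a], idx[b]
        Q[i][i] += 1.0
        Q[j][j] += 1.0
        Q[i][j] -= 1.0
        Q[j][i] -= 1.0
    return Q


def log_det_spd(Q):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(Q)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = Q[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def log_score(edges):
    """log P(D|S,R) + log P(S|R), with the prior left unnormalized."""
    Q, n = precision(edges), len(NODES)
    log_norm = 0.5 * log_det_spd(Q) - 0.5 * n * math.log(2 * math.pi)
    log_lik = 0.0
    for y in D:
        quad = sum(y[i] * Q[i][j] * y[j] for i in range(n) for j in range(n))
        log_lik += log_norm - 0.5 * quad
    log_prior = n * math.log(THETA)   # weight THETA^|S| with |S| = number of nodes
    return log_lik + log_prior


best = max([("S1", S1), ("S2", S2)], key=lambda item: log_score(item[1]))[0]
```

Both candidates have the same number of nodes, so the prior term ties and the features decide: S1 keeps similar species adjacent, so the features vary more smoothly over it and it scores higher.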
Structure learning example : Structure learning example (Osherson et al) Participants rated the goodness of 85 features for 48 animals
E.g., elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, inactive, vegetation, grazer, oldworld, bush, jungle, ground, timid, smart, group
Biological Data : Biological Data Features Animals
Slide79 : Tree:
Spatial model : Spatial model R: principles S: structure D: data mouse squirrel chimp gorilla Structural form: 2D space
Stochastic process: diffusion
Slide81 : 2D space:
Conclusions: structure learning : Conclusions: structure learning Hierarchical Bayesian models provide a unified framework for the acquisition and use of structured representations
Outline : Outline A high-level view of HBMs
A case study: Semantic knowledge
Property induction
Learning structured representations
Learning the abstract organizing principles of a domain
Learning structural form : Learning structural form R: principles S: structure D: data mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion
Which form is best? : Ostrich Robin Crocodile Snake Bat Orangutan Turtle Ostrich Robin Crocodile Snake Bat Orangutan Turtle Which form is best?
Structural forms : Structural forms Order Chain Ring Partition Hierarchy Tree Grid Cylinder
Learning structural form : Learning structural form R: principles S: structure D: data ? Goal: find S,F that maximize P(S,F|D)
Structural form: F (could be tree, 2D space, ring, ….)
Stochastic process: diffusion
Learning structural form : Learning structural form R: principles S: structure D: data ? Aim: find S,F that maximize P(S,F|D) ∝ P(D|S) P(S|F) P(F), where P(F) is a uniform distribution on the set of forms Structural form: F
Stochastic process: diffusion
Learning structural form : Learning structural form R: principles S: structure D: data ? Aim: find S,F that maximize P(S,F|D) ∝ P(D|S) P(S|F) P(F), where P(D|S) is the distribution used for property induction Structural form: F
Stochastic process: diffusion
Learning structural form : Learning structural form R: principles S: structure D: data ? Aim: find S,F that maximize P(S,F|D) ∝ P(D|S) P(S|F) P(F), where P(S|F) is the distribution used for structure learning Structural form: F
Stochastic process: diffusion
P(S|F): Generating structures from forms : P(S|F): Generating structures from forms P(S|F) is zero if S is inconsistent with F. Otherwise, each structure is weighted by the number of nodes it contains.
Slide92 : P(S|F): Generating structures from forms Simpler forms are preferred (figure: P(S|F) over all possible graph structures S on the objects A, B, C, D, for the chain and grid forms)
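Why simpler forms are preferred can be sketched by counting. The sketch assumes a per-node weight THETA^|S| normalized over the structures consistent with a form, one object per node, and a grid form that includes chains as degenerate grids alongside 2x2 grids. These are illustrative assumptions, not the exact definitions behind the figure.

```python
from itertools import permutations

OBJECTS = "ABCD"
THETA = 0.5   # assumed per-node penalty, 0 < THETA < 1


def chains():
    """All chains with one object per node; a chain equals its own reversal."""
    return {min(p, p[::-1]) for p in permutations(OBJECTS)}


def square_grids():
    """All 2x2 grids (4-cycles), identified up to rotation and reflection."""
    def canonical(p):
        rotations = [p[k:] + p[:k] for k in range(4)]
        return min(rotations + [r[::-1] for r in rotations])
    return {canonical(p) for p in permutations(OBJECTS)}


CHAIN_STRUCTS = [("chain", s) for s in sorted(chains())]
GRID_STRUCTS = CHAIN_STRUCTS + [("2x2", s) for s in sorted(square_grids())]
FORMS = {"chain": CHAIN_STRUCTS, "grid": GRID_STRUCTS}


def weight(structure):
    """Unnormalized weight THETA^|S|, where |S| is the number of nodes."""
    return THETA ** len(structure[1])


def p_structure_given_form(structure, form):
    """P(S|F): zero if S is inconsistent with F, else normalized weight."""
    if structure not in FORMS[form]:
        return 0.0
    return weight(structure) / sum(weight(s) for s in FORMS[form])


abcd_chain = ("chain", ("A", "B", "C", "D"))
under_chain = p_structure_given_form(abcd_chain, "chain")
under_grid = p_structure_given_form(abcd_chain, "grid")
```

The chain form can generate 12 structures here and the grid form 15, so the same chain receives prior 1/12 under the chain form but only 1/15 under the grid form. The more flexible form spreads its probability mass more thinly.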
Learning structural form : Learning structural form F: form S: structure D: data ? Goal: find S,F that maximize P(S,F|D) ?
Learning structural form : Learning structural form P(S,F|D) will be high when:
The features in D vary smoothly over S
S is a simple graph (a graph with few nodes)
F is a simple form (a form that can generate only a few structures)
F: form S: structure D: data Aim: find S,F that maximize P(S,F|D) ∝ P(D|S) P(S|F) P(F)
Form learning: Biological Data : Form learning: Biological Data Features Animals 33 animals, 110 features
Form learning: Biological Data : Form learning: Biological Data
Supreme Court (Spaeth) : Supreme Court (Spaeth) Votes on 1600 cases (1987-2005)
Color (Ekman) : Color (Ekman)
Outline : Outline A high-level view of HBMs
A case study: Semantic knowledge
Property induction
Learning structured representations
Learning the abstract organizing principles of a domain
Where do priors come from? : Where do priors come from?
Slide102 : mouse squirrel chimp gorilla Stochastic process: diffusion
Slide103 : mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion
Slide104 : mouse squirrel chimp gorilla Structural form: tree
Stochastic process: diffusion
Where do structural forms come from? : Order Chain Ring Partition Hierarchy Tree Grid Cylinder Where do structural forms come from?
Where do structural forms come from? : Where do structural forms come from? Form Form Process Process
Node-replacement graph grammars : Node-replacement graph grammars Production
(Chain) Derivation
Where do structural forms come from? : Where do structural forms come from? Form Form Process Process
The complete space of grammars : The complete space of grammars 1 4096 ... ...
When can we stop adding levels? : When can we stop adding levels? When the knowledge at the top level is simple or general enough that it can be plausibly assumed to be innate.
Conclusions : Conclusions Hierarchical Bayesian models provide a unified framework which can
Explain how abstract knowledge is used for induction
Explain how abstract knowledge can be acquired
Learning abstract knowledge : Learning abstract knowledge Applications of hierarchical Bayesian models at this conference:
Semantic knowledge: Schmidt et al.
Learning the M-constraint
Syntax: Perfors et al.
Learning that language is hierarchically organized
Word learning: Kemp et al.
Learning the shape bias