# Probabilistic Graphical Models

## Statistical and Algorithmic Foundations of Deep Learning

Posted by JoselynZhao on May 12, 2020


Author: Eric Xing

## 01 An overview of DL components

### Modern building blocks: units, layers, activation functions, loss functions, etc.

Activation functions:

• Linear and ReLU
• Sigmoid and tanh
• Etc.

Layer types:

• Fully connected
• Convolutional & pooling
• Recurrent
• ResNets
• Etc.
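
Purely as an illustration of how these building blocks compose, here is a minimal sketch (assuming PyTorch; the layer sizes, batch shape, and ten-class output are arbitrary choices) of a small convolutional classifier trained with a cross-entropy loss:

```python
import torch
import torch.nn as nn

# A small network combining common building blocks:
# convolution + pooling for feature extraction, then
# fully connected layers with ReLU, and a cross-entropy
# loss for classification.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),      # 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 64),       # fully connected
    nn.ReLU(),
    nn.Linear(64, 10),                # class logits
)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1, 28, 28)        # a dummy batch of images
y = torch.randint(0, 10, (32,))       # dummy labels
logits = model(x)
loss = loss_fn(logits, y)
loss.backward()                       # gradients via backpropagation
```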

Feature learning: successful learning of intermediate representations [Lee et al., ICML 2009; Lee et al., NIPS 2009]. Representation learning: the network learns increasingly abstract representations of the data, which become "disentangled", i.e., amenable to linear separation.

## 02 Similarities and differences between GMs and NNs

### Graphical models vs. computational graphs

Graphical models:

• A representation for encoding meaningful knowledge and the associated uncertainty in a graphical form
• Learning and inference are based on a rich toolbox of well-studied (structure-dependent) techniques (e.g., EM, message passing, VI, MCMC)
• Graphs represent models

Utility of the graph

• A tool for synthesizing a global loss function from local structure (potential functions, feature functions, etc.)
• A tool for designing sound and efficient inference algorithms (sum-product, mean field, etc.)
• A tool to inspire approximations and penalties (structured mean field, tree approximations, etc.)
• A tool for monitoring theoretical and empirical behavior and accuracy of inference

Utility of the loss function

• The main measure of quality of the learning algorithm and of the model

Deep neural networks:

• Learn representations that are useful for computation and for performance on the end metric (intermediate representations are not guaranteed to be meaningful)
• Learning is predominantly based on gradient descent (aka backpropagation); inference is often trivial and is done via a "forward pass"
• Graphs represent computation

Utility of the network

• A tool for conceptually synthesizing complex decision hypotheses (stage-wise projection and aggregation)
• A tool for organizing computational operations (stage-wise update of latent states)
• A tool for designing processing steps and computing modules (layer-wise parallelization)
• No obvious utility in evaluating DL inference algorithms

• Boltzmann machines (Hinton & Sejnowski, 1983)
• Restricted Boltzmann machines (Smolensky, 1986)
• Learning and inference in sigmoid belief networks (Neal, 1992)
• Fast learning in deep belief networks (Hinton, Osindero & Teh, 2006)
• Deep Boltzmann machines (Salakhutdinov & Hinton, 2009)

I: Restricted Boltzmann Machines. An RBM is a Markov random field represented by a bipartite graph: every node in one layer/part of the graph is connected to every node in the other layer, and there are no within-layer connections. The joint distribution is $p(v, h) = \frac{1}{Z} \exp\left(v^\top W h + b^\top v + c^\top h\right)$, and the log-likelihood of a single data point (with the unobserved variables marginalized out) is $\log p(v) = \log \sum_h \exp\left(v^\top W h + b^\top v + c^\top h\right) - \log Z$.

The gradient of the log-likelihood with respect to the model parameters (shown here for $W$; the biases are analogous) can be written, in its alternative form, as a difference of expectations: $\frac{\partial \log p(v)}{\partial W} = \mathbb{E}_{p(h|v)}[v h^\top] - \mathbb{E}_{p(v,h)}[v h^\top]$. Both expectations can be approximated by sampling: sampling from the posterior is exact (given $v$, the RBM factorizes over $h$), while sampling from the joint requires MCMC (e.g., Gibbs sampling).

• Computing the first term is called the clamped / wake / positive phase (the network is "awake" because it is conditioned on the visible variables)
• Computing the second term is called the unclamped / sleep / free / negative phase (the network is "asleep" because it samples the visible variables from the joint; metaphorically, it dreams up the visible input)
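
A minimal numpy sketch of one contrastive-divergence (CD-1) update for a binary RBM, making the positive (clamped) and negative (unclamped) phases above concrete; the layer sizes, learning rate, and random data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Binary RBM with 6 visible and 4 hidden units (illustrative sizes).
W = 0.01 * rng.standard_normal((6, 4))    # visible-hidden weights
b = np.zeros(6)                           # visible biases
c = np.zeros(4)                           # hidden biases
lr = 0.1

v0 = (rng.random(6) < 0.5).astype(float)  # a dummy binary data point

# Positive ("wake"/clamped) phase: the posterior p(h|v) factorizes,
# so sampling the hidden units given the data is exact.
ph0 = sigmoid(v0 @ W + c)
h0 = (rng.random(4) < ph0).astype(float)

# Negative ("sleep"/unclamped) phase: one step of block Gibbs sampling
# approximates the expectation under the model's joint distribution.
pv1 = sigmoid(h0 @ W.T + b)
v1 = (rng.random(6) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + c)

# CD-1 parameter update: data expectation minus model expectation.
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
b += lr * (v0 - v1)
c += lr * (ph0 - ph1)
```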

II: Sigmoid Belief Networks. Sigmoid belief nets are simple Bayesian networks whose binary variables have conditional probabilities given by the sigmoid function: $p(s_i = 1 | \mathrm{pa}(s_i)) = \sigma\big(\sum_{j \in \mathrm{pa}(s_i)} w_{ij} s_j + b_i\big)$. Bayesian networks exhibit a phenomenon called "explaining away": if cause A is found to account for the observed effect C, the probability of the alternative cause B decreases ⇒ A and B become dependent given C. Notably, because of explaining away, when we condition on the visible layer of a belief network, all of the hidden variables become dependent.
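
To make explaining away concrete, here is a tiny numeric sketch with a three-variable network A → C ← B; all the probability values are made up purely for illustration:

```python
import itertools

# Priors on the two causes A and B, and a CPT for the effect C
# given its parents (arbitrary numbers chosen to show the effect).
pA, pB = 0.1, 0.1
pC_given = {(0, 0): 0.01, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}

def joint(a, b, c):
    p = (pA if a else 1 - pA) * (pB if b else 1 - pB)
    pc = pC_given[(a, b)]
    return p * (pc if c else 1 - pc)

# P(A=1 | C=1): observing the effect raises belief in cause A.
num = sum(joint(1, b, 1) for b in (0, 1))
den = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
print("P(A=1 | C=1)      =", round(num / den, 3))

# P(A=1 | C=1, B=1): once B is known to be on, A becomes less likely.
# B "explains away" the observed C, so A and B are dependent given C.
num = joint(1, 1, 1)
den = joint(0, 1, 1) + joint(1, 1, 1)
print("P(A=1 | C=1, B=1) =", round(num / den, 3))
```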

### Sigmoid Belief Networks as graphical models

• We can sample exactly from the posterior in the first stage
• We run block Gibbs sampling to draw approximate samples from the joint distribution

### Deep Belief Networks and Boltzmann Machines

III: Deep Belief Nets. A DBN is a hybrid graphical model (a chain graph). Its joint probability distribution can be written as $p(v, h^1, h^2, h^3) = p(v | h^1)\, p(h^1 | h^2)\, p(h^2, h^3)$, where the top two layers form an undirected RBM and the lower layers are directed sigmoid belief networks.

• Greedy pre-training + ad-hoc fine-tuning; no proper joint training
• Approximate inference is feed-forward (bottom-up)

Layer-wise pre-training

• Pre-train and freeze the first RBM
• Stack another RBM on top and train it
• Weights two or more layers up remain tied
• We repeat this process: pre-train and untie (see the sketch below)
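
A schematic numpy sketch of the greedy layer-wise procedure: train one RBM with CD-1, freeze it, and use its hidden activations as the "data" for the RBM above. The layer sizes, dummy binary data, and the `train_rbm` helper are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=5):
    """Train a binary RBM on `data` with CD-1 and return (W, b, c)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + c)                      # positive phase
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + b)                    # negative phase
            ph1 = sigmoid(pv1 @ W + c)                     # (probabilities, a common CD-1 simplification)
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            b += lr * (v0 - pv1)
            c += lr * (ph0 - ph1)
    return W, b, c

# Greedy layer-wise pre-training of a 3-layer stack: each RBM is trained,
# frozen, and its hidden probabilities become the "data" for the RBM above it.
data = (rng.random((100, 20)) < 0.5).astype(float)  # dummy binary data
layer_sizes = [16, 12, 8]                           # illustrative
stack, layer_input = [], data
for n_hidden in layer_sizes:
    W, b, c = train_rbm(layer_input, n_hidden)
    stack.append((W, b, c))                         # freeze this layer
    layer_input = sigmoid(layer_input @ W + c)      # propagate up
```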

Fine-tuning

• Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic model per se
• However, the layers of representations could perhaps be useful for some other downstream tasks!
• We can further “fine-tune” a pre-trained DBN for some other task

Setting A: Unsupervised learning (DBN → autoencoder)

1. Pre-train a stack of RBMs in a greedy layer-wise fashion
2. “Unroll” the RBMs to create an autoencoder
3. Fine-tune the parameters by optimizing the reconstruction error (sketched below)
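
Continuing the sketch above (and reusing its `stack`, `data`, `sigmoid`, and `np`), the pre-trained stack can be "unrolled" into an encoder/decoder whose decoder reuses the transposed encoder weights; fine-tuning would then minimize the reconstruction error by backprop (not shown):

```python
# "Unroll" the pre-trained stack into an autoencoder: the encoder applies
# each RBM bottom-up, the decoder reuses the same (transposed) weights
# top-down. Fine-tuning would continue from these weights.
def encode(v, stack):
    for W, b, c in stack:
        v = sigmoid(v @ W + c)
    return v

def decode(h, stack):
    for W, b, c in reversed(stack):
        h = sigmoid(h @ W.T + b)
    return h

recon = decode(encode(data, stack), stack)
reconstruction_error = np.mean((data - recon) ** 2)
```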

Setting B: Supervised learning (DBN → classifier)

1. Pre-train a stack of RBMs in a greedy layer-wise fashion
2. “Unroll” the RBMs to create a feedforward classifier
3. Fine-tune the parameters by optimizing the classification error, e.g., cross-entropy (sketched below)
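
A corresponding sketch for the supervised setting, again reusing `stack`, `data`, `encode`, and `rng` from the earlier blocks; the number of classes and the labels are made up. The pre-trained layers feed a randomly initialized softmax output layer, and fine-tuning would minimize the cross-entropy by backprop (not shown):

```python
# Unroll the stack into a feedforward classifier: the pre-trained layers
# initialize the hidden layers, and a softmax output layer sits on top.
n_classes = 3
labels = rng.integers(0, n_classes, size=len(data))          # dummy labels
W_out = 0.01 * rng.standard_normal((stack[-1][0].shape[1], n_classes))
b_out = np.zeros(n_classes)

h = encode(data, stack)                                       # pre-trained features
logits = h @ W_out + b_out
logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
cross_entropy = -np.mean(np.log(probs[np.arange(len(data)), labels]))
```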

Deep Boltzmann Machines: DBMs are fully undirected models (Markov random fields). They can be trained similarly to RBMs via MCMC (Hinton & Sejnowski, 1983), or with a variational approximation of the data distribution for faster training (Salakhutdinov & Hinton, 2009). Like DBNs, they can be used to initialize other networks for downstream tasks.

A few ==critical points== to note about all these models:

• The primary goal of deep generative models is to represent the distribution of the observable variables. Adding layers of hidden variables makes it possible to represent increasingly complex distributions.
• Hidden variables are secondary (auxiliary) elements used to facilitate learning of complex dependencies between the observables.
• Training of the model is ad-hoc, but what matters is the quality of learned hidden representations.
• Representations are judged by their usefulness on a downstream task (the probabilistic meaning of the model is often discarded at the end).
• In contrast, classical graphical models are often concerned with the correctness of learning and inference of all variables

Conclusion

• DL & GM: the fields are similar in the beginning (structure, energy, etc.), and then diverge to their own signature pipelines
• DL: most effort is directed to comparing different architectures and their components (models are driven by evaluating empirical performance on downstream tasks)
• DL models are good at learning robust hierarchical representations from the data and suitable for simple reasoning (call it “low-level cognition”)
• GM: the effort is directed towards improving inference accuracy and convergence speed
• GMs are best for provably correct inference and suitable for high-level complex reasoning tasks (call it "high-level cognition")
• Convergence of both fields is very promising!

## 03 Combining DL methods and GMs

### Using outputs of NNs as inputs to GMs

Combining sequential NNs and GMs: for example, hybrids of neural networks with HMMs (hidden Markov models). Hybrid NNs + conditional GMs: in a standard CRF (conditional random field), each of the factor cells is a parameter; in a hybrid model, these values are computed by a neural network (sketched below).
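
A minimal numpy sketch of this hybrid idea for a linear-chain CRF: the per-position emission potentials are computed by a small neural network, the transition potentials remain an ordinary parameter table, and the forward algorithm yields the log partition function. The sequence length, label set, two-layer net, and random inputs are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_labels, n_features, n_hidden = 5, 3, 10, 16    # illustrative sizes

def logsumexp(a, axis=None):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

# A tiny neural net maps each observation x_t to per-label emission scores.
# In a plain CRF these scores are themselves parameters (or linear in the
# features); in the hybrid model they are the output of a learned network.
W1 = 0.1 * rng.standard_normal((n_features, n_hidden))
W2 = 0.1 * rng.standard_normal((n_hidden, n_labels))
transitions = 0.1 * rng.standard_normal((n_labels, n_labels))  # CRF parameter table

x = rng.standard_normal((T, n_features))     # dummy observation sequence
emissions = np.tanh(x @ W1) @ W2             # (T, n_labels) NN-computed potentials

# Forward algorithm: log of the partition function Z(x) over all label paths.
alpha = emissions[0]
for t in range(1, T):
    alpha = emissions[t] + logsumexp(alpha[:, None] + transitions, axis=0)
log_Z = logsumexp(alpha)

# Score and normalized log-probability of one particular label sequence.
y = np.array([0, 1, 1, 2, 0])
score = emissions[np.arange(T), y].sum() + transitions[y[:-1], y[1:]].sum()
log_prob = score - log_Z
```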

### GMs with potential functions represented by NNs; NNs with structured outputs

Using GMs as Prediction Explanations

How do we build a powerful predictive model whose predictions we can interpret in terms of semantically meaningful features?

#### Contextual Explanation Networks (CENs)

• The final prediction is made by a linear GM.
• Each coefficient assigns a weight to a meaningful attribute.
• Allows us to judge predictions in terms of GMs produced by the context encoder.

CEN implementation details. Workflow:

• Maintain a (sparse) dictionary of GM parameters.
• Process complex inputs (images, text, time series, etc.) using deep nets; use soft attention to either select or combine models from the dictionary.
• Use constructed GMs (e.g., CRFs) to make predictions.
• Inspect GM parameters to understand the reasoning behind predictions (a schematic sketch follows).
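
A schematic numpy sketch of this workflow; the encoder here is just a random stand-in for a deep net, and the dictionary size, attribute count, and inputs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_attrs = 4, 6      # dictionary size and number of interpretable attributes

# Dictionary of GM parameters: each row is one linear model over
# semantically meaningful attributes.
dictionary = 0.5 * rng.standard_normal((n_models, n_attrs))

def context_encoder(image):
    """Stand-in for a deep net: maps the complex context to attention logits."""
    flat = image.reshape(-1)
    W = 0.01 * rng.standard_normal((flat.size, n_models))
    return flat @ W

def cen_predict(image, attributes):
    logits = context_encoder(image)
    attention = np.exp(logits) / np.exp(logits).sum()   # soft attention over the dictionary
    explanation = attention @ dictionary                 # context-specific linear model
    score = explanation @ attributes                     # interpretable prediction
    prob = 1.0 / (1.0 + np.exp(-score))
    return prob, explanation                             # coefficients can be inspected

image = rng.standard_normal((8, 8))                      # dummy imagery context
attributes = rng.standard_normal(n_attrs)                # dummy meaningful features
prob, explanation = cen_predict(image, attributes)
```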

Results (imagery as context): based on the imagery, CEN learns to select different models for urban and rural areas.

Results on classical image & text datasets; CEN architectures for survival analysis.

## 04 Bayesian Learning of NNs

### Bayesian learning of NN parameters; deep kernel learning

A neural network as a probabilistic model: Likelihood: $p(y|x, \theta)$

• Categorical distribution for classification ⇒ cross-entropy loss
• Gaussian distribution for regression ⇒ squared loss

Prior: $p(\theta)$

• Gaussian prior ⇒ L2 regularization
• Laplace prior ⇒ L1 regularization (see the sketch below)
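
A short sketch making the prior/regularizer correspondence concrete: up to additive constants, the negative log-density of a Gaussian prior on the weights is an L2 penalty and that of a Laplace prior is an L1 penalty, so MAP estimation adds the familiar regularizers to the negative log-likelihood. The toy weights and prior scales are made up:

```python
import numpy as np

w = np.array([0.5, -1.2, 0.3])     # toy weight vector
sigma, b = 1.0, 1.0                # prior scales (illustrative)

# Gaussian prior: -log p(w) = ||w||^2 / (2 sigma^2) + const  ->  L2 regularization
neg_log_gaussian = np.sum(w ** 2) / (2 * sigma ** 2)

# Laplace prior:  -log p(w) = ||w||_1 / b + const             ->  L1 regularization
neg_log_laplace = np.sum(np.abs(w)) / b

# MAP estimate = argmin_w [ -log p(y|x, w) - log p(w) ], i.e., the negative
# log-likelihood (cross-entropy or squared loss) plus the matching regularizer.
print(neg_log_gaussian, neg_log_laplace)
```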

Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]