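Chapter 11: Causal Inference and Machine Learning – Deep Learning, NLP, and Beyond

Author

This chapter was written by Aleksander Molak. The core methods draw on the original TARNet paper by Shalit et al. (2017), SNet/FlexTENet/OffsetNet by Curth & van der Schaar (2021a, 2021b), the original CausalBert paper by Veitch et al. (2020), the original synthetic control paper by Abadie & Gardeazabal (2003), and the modern review by Abadie (2021). The code uses CATENets (PyTorch implementation by Curth & van der Schaar of the van der Schaar Lab), CausalBert (Reid Pryzant's PyTorch implementation), and CausalPy (PyMC Labs). The chapter closes Part 2 and sets up causal discovery in Part 3.

Overview

This chapter extends the Part 2 toolbox in three frontier directions: deep learning, natural language, and time series. It has three parts: (1) TARNet & SNet (Shalit et al. 2017; Curth & van der Schaar 2021a): neural CATE estimators with a shared representation and separate heads, combining the data efficiency of the S-Learner with the flexibility of the T-Learner; implemented with the CATENets library, with TARNet performing best on 20-dimensional non-linear synthetic data (MAPE 2.49%); (2) CausalBert & causal NLP (Veitch et al. 2020): a pretrained DistilBERT with adapted outcome/treatment heads and joint training, so that the text embedding retains causally relevant information; on the manga Reddit-like dataset the estimated ATE is -0.62 (true value -0.7); (3) Bayesian synthetic control (CausalPy / PyMC): synthetic control with a Dirichlet prior that constrains the weights to sum to 1, applied to Google search volume after Elon Musk's acquisition of Twitter. The common theme: modern extensions of the causal framework still rest on the core concepts of Ch 1–10 (SCMs, estimands, d-separation).

Core Equations and Concepts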

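TARNet (Treatment-Agnostic Representation Network, Shalit et al. 2017):
Architecture: input $X$ → shared representation layer ($\Phi$) → two disjoint regression heads ($h_1$ for $T=1$, $h_0$ for $T=0$) + an optional propensity head.
Training: all data pass through $\Phi$; each unit then passes only through the head for its own treatment arm. Minimize: $$\mathcal{L} = \frac{1}{N} \sum_i \ell(y_i, h_{T_i}(\Phi(X_i)))$$
CATE estimate: $\hat{\tau}_i = h_1(\Phi(X_i)) - h_0(\Phi(X_i))$.
Advantages: the shared $\Phi$ improves data efficiency; the disjoint heads make it impossible for the model to ignore the treatment (fixing the S-Learner's "treatment ignored" problem).
Optional discrepancy term (an Integral Probability Metric weighted by $\alpha$): penalizes the difference between the distributions of $\Phi(X)$ given $T=1$ and given $T=0$, a geometric form of balancing similar in spirit to the propensity score.
SNet (Curth & van der Schaar 2021a): a unifying generalization of TARNet, DragonNet, and DR-CFR.
Five parallel early representations (each responsible for a different aspect) → three disjoint heads (2 outcome + 1 propensity).
Regularization: the inputs to the different layers are forced to be orthogonal, which avoids duplicated representations.
On the author's data TARNet beats SNet (MAPE 2.49% vs. a higher value for SNet); SNet's extra flexibility may need a larger sample or longer training.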
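The TARNet forward pass, factual loss, and CATE estimate described above can be sketched in plain NumPy. This is a minimal illustration with random, untrained weights, not the CATENets implementation; the layer sizes, the `tanh` activation, and the use of a linear MMD (squared distance between representation means) as the discrepancy term are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random weights and zero bias for one fully connected layer (illustrative init)."""
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def tarnet_forward(X, params):
    """Return (phi, mu0, mu1): shared representation and both head outputs."""
    (W_phi, b_phi), (W0, b0), (W1, b1) = params
    phi = np.tanh(X @ W_phi + b_phi)   # shared representation Phi(X)
    mu0 = (phi @ W0 + b0).ravel()      # head h_0: predicted outcome under T=0
    mu1 = (phi @ W1 + b1).ravel()      # head h_1: predicted outcome under T=1
    return phi, mu0, mu1

def factual_loss(y, t, mu0, mu1):
    """MSE on the observed arm only: each unit is scored by its own head."""
    y_hat = np.where(t == 1, mu1, mu0)
    return float(np.mean((y - y_hat) ** 2))

def linear_mmd(phi, t):
    """Squared distance between treated and control representation means."""
    diff = phi[t == 1].mean(axis=0) - phi[t == 0].mean(axis=0)
    return float(diff @ diff)

# Toy data (assumed sizes): 200 units, 20 features, 8-dimensional representation
n, d, r = 200, 20, 8
X = rng.normal(size=(n, d))
t = rng.binomial(1, 0.5, size=n)
y = rng.normal(size=n)

params = [dense(d, r), dense(r, 1), dense(r, 1)]
phi, mu0, mu1 = tarnet_forward(X, params)
cate_hat = mu1 - mu0                   # tau_hat_i = h_1(Phi(x_i)) - h_0(Phi(x_i))
alpha = 1.0                            # hypothetical weight on the discrepancy term
loss = factual_loss(y, t, mu0, mu1) + alpha * linear_mmd(phi, t)
```

In a real model the parameters would be trained by gradient descent on `loss`; the sketch only shows which quantities enter it.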

CATENets library (Curth & van der Schaar 2021): implemented in both JAX and PyTorch; includes TARNet / SNet / FlexTENet / OffsetNet; sklearn-style API:

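# X (features), y (outcomes), T (binary treatment) and TRAIN_SIZE are assumed to be defined beforehand
from catenets.models.torch import TARNet, SNet
tarnet = TARNet(n_unit_in=X.shape[1], binary_y=False, n_units_out_prop=32, n_units_r=8, nonlin='selu')
tarnet.fit(X=X[:TRAIN_SIZE], y=y[:TRAIN_SIZE], w=T[:TRAIN_SIZE])  # w is the treatment indicator
# predict() returns a torch tensor of estimated effects; move it to CPU and convert to NumPy
effect_pred = tarnet.predict(X=X[TRAIN_SIZE:]).cpu().detach().numpy()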

20-dimensional non-linear synthetic data (DGP): $$Y_i = (50 T_i \cdot |X_i^{(0)}|)^{1.2} + \sum_{d=1}^{D} w^{(d)} X_i^{(d)}$$ where $w^{(d)} \sim \text{Gumbel}(5, 10)$ and $D = 19$. Key points:
Non-additive, non-linear treatment effect ($|X^{(0)}|^{1.2}$).
Only 1 of the 20 features ($X^{(0)}$) interacts with the treatment: high-dimensional data with a sparse interaction.
CATE models struggle most in low-density regions ($|X^{(0)}| > 1$): Causal Forest does poorly there, while TARNet does best.
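A minimal NumPy sketch of this DGP, following the formula exactly as written above. The sample size, the standard-normal features, and the 50/50 randomized treatment are assumptions for illustration and are not stated in these notes.

```python
import numpy as np

rng = np.random.default_rng(42)
N, N_FEATURES = 5000, 20                 # N is an assumption; 20 features as in the notes

X = rng.normal(0, 1, (N, N_FEATURES))    # assumed standard-normal features
T = rng.binomial(1, 0.5, N)              # assumed randomized binary treatment
w = rng.gumbel(5, 10, N_FEATURES - 1)    # w^(d) ~ Gumbel(5, 10), d = 1..19

def outcome(X, T, w):
    """Y = (50 * T * |X0|)^1.2 + sum_d w_d * X_d, as in the formula above."""
    return (50 * T * np.abs(X[:, 0])) ** 1.2 + X[:, 1:] @ w

y = outcome(X, T, w)

# True CATE: difference between the two potential outcomes for every unit
true_cate = outcome(X, np.ones(N), w) - outcome(X, np.zeros(N), w)

# Share of units in the low-density region |X0| > 1
low_density_share = float(np.mean(np.abs(X[:, 0]) > 1))
```

Because the second term does not involve the treatment, it cancels in `true_cate`, which depends on $X^{(0)}$ alone; this is the sparse interaction the notes refer to.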
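PyTorch training loop (on the author's data):
Batch size: 8 (suited to small data).
Optimizer: Adam with the default learning rate.
Loss: MSE for the outcome + cross-entropy for the treatment (SNet).
Reproducibility: pl.seed_everything(SEED) gives only "weak reproducibility"; full reproducibility also requires torch.use_deterministic_algorithms(True) and freezing the dataloader's randomness (which, according to these notes, PyTorch still does not fully support).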
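Benchmark on 20-dim data (Table 11.3):
TARNet: MAPE 2.49% (best)
Causal Forest DML: low variance, but poor in low-density regions
S-Learner / X-Learner: variance comparable to TARNet, but poor in low-density regions
Linear DR-Learner: fails (the linear-in-parameters assumption is violated)
SNet: high variance (the author suggests it may need longer training and a larger sample)
Benchmarking CATE models is hard (Curth et al. 2021): the same model can perform very differently under different DGPs.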
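The MAPE figures in the benchmark above compare predicted effects with the true effects from the simulation. A minimal sketch of the metric, with made-up numbers; the helper name `mape` is introduced here for illustration and is not taken from the book's code.

```python
import numpy as np

def mape(true_effect, pred_effect):
    """Mean absolute percentage error, returned as a fraction (0.05 = 5%)."""
    true_effect = np.asarray(true_effect, dtype=float)
    pred_effect = np.asarray(pred_effect, dtype=float)
    return float(np.mean(np.abs((true_effect - pred_effect) / true_effect)))

# Made-up effects: relative errors are 10%, 5% and 0%
true_effect = np.array([10.0, 20.0, 40.0])
pred_effect = np.array([11.0, 19.0, 40.0])
score = mape(true_effect, pred_effect)   # (0.10 + 0.05 + 0.00) / 3 = 0.05
```

MAPE divides by the true effect, so units whose true effect is close to zero can dominate the average; this is one reason to report it alongside an absolute error measure.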
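Three settings for NLP and causality (Feder et al. 2022):
Text as a treatment (e.g. copywriting style → sales): text has many features at once, which creates internal confounding (several aspects of the same text affect both treatment and outcome).
Text as an outcome (e.g. writing course → writing clarity): the treatment is easy to randomize, but the outcome is hard to measure; training the outcome model on the same sample violates consistency (Egami et al. 2018).
Text as a confounder (e.g. gender of a Reddit avatar → post upvotes): the core challenge is that the text is a confounder and at the same time part of a mediator.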
CausalBert (Veitch et al. 2020): a pretrained DistilBERT with adapted outcome and propensity heads.
Training: joint loss = $w_Q \cdot Q$ loss + $w_g \cdot g$ (propensity) loss + $w_{MLM} \cdot \text{MLM}$ loss.
Defaults: $w_Q = 1.0, w_g = 0.05, w_{MLM} = 0.05$.
Core mechanism: adapt the pretrained embedding so that it retains the information needed to predict treatment and outcome, i.e. a causally sufficient embedding.
Author's data: manga Reddit-like dataset, 221 observations, 5 features (text, subreddit, female_avatar, has_photo, upvote). Estimated ATE = -0.623 (true value = -0.7, relative error 11%).
DAG assumption: the text must not be a descendant of the treatment or the outcome (collider risk). If a user edits a post within 15 minutes (after observing a lack of upvotes), the text becomes a collider.
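The weighted joint loss and the reported 11% error can be checked with a few lines of arithmetic. The sketch below is generic: the batch losses and the head predictions are made-up numbers, and `joint_loss` and `plug_in_ate` are hypothetical helpers, not functions of the CausalBert implementation.

```python
import numpy as np

def joint_loss(q_loss, g_loss, mlm_loss, w_q=1.0, w_g=0.05, w_mlm=0.05):
    """Weighted sum of outcome (Q), propensity (g) and masked-language-model losses."""
    return w_q * q_loss + w_g * g_loss + w_mlm * mlm_loss

def plug_in_ate(q1, q0):
    """Average difference between predicted outcomes under T=1 and under T=0."""
    return float(np.mean(np.asarray(q1) - np.asarray(q0)))

# Hypothetical per-batch losses: 1.0*0.40 + 0.05*0.60 + 0.05*2.00 = 0.53
total = joint_loss(q_loss=0.40, g_loss=0.60, mlm_loss=2.00)

# Hypothetical outcome-head predictions for three posts
q1 = np.array([0.20, 0.10, 0.30])    # predicted outcome if T=1
q0 = np.array([0.80, 0.75, 0.90])    # predicted outcome if T=0
ate = plug_in_ate(q1, q0)

# Relative error of the estimate reported in the notes: |-0.623 - (-0.7)| / 0.7
rel_error = abs(-0.623 - (-0.7)) / abs(-0.7)
```

With the default weights the outcome loss dominates; the propensity and MLM terms act as small regularizers that keep the embedding informative about the treatment and the language.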
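CausalBert's "safe mode":
Assumptions: (a) no unobserved confounding; (b) positivity; (c) consistency.
Collider risk: is the text a confounder or a collider? Test: check whether the text is a descendant of $T$ or $Y$.
Reddit editing example: the user edits within 15 minutes ($Y$ affects the text) → the text becomes a collider.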
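The descendant check described above can be sketched with a small graph traversal over an assumed DAG. Both example graphs below are simplified illustrations of the Reddit scenario (the edge lists are assumptions), and the function names are hypothetical.

```python
def descendants(graph, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def text_is_safe_adjustment(graph, text="text", treatment="T", outcome="Y"):
    """True if `text` is not a descendant of the treatment or the outcome."""
    downstream = descendants(graph, treatment) | descendants(graph, outcome)
    return text not in downstream

# Assumed confounder DAG: the text affects both the treatment and the outcome
confounder_dag = {"text": ["T", "Y"], "T": ["Y"]}

# Assumed edited-post DAG: treatment and outcome both feed back into the text
edited_dag = {"T": ["Y", "text"], "Y": ["text"]}

safe_before_edit = text_is_safe_adjustment(confounder_dag)
safe_after_edit = text_is_safe_adjustment(edited_dag)
```

The check only works on a graph someone has already drawn; deciding which graph describes the data still takes domain knowledge.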
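CausalBert and the meta-SCM hypothesis (Willig et al. 2023): LLMs learn a "meta-SCM" from language data, which lets them do some causal reasoning but with limited generalization. GPT-4 reaches 92.44% accuracy on the CRASS counterfactual reasoning benchmark (humans: 98.18%) (Kıcıman et al. 2023): close to, but not above, human performance.
Quasi-experiments vs RCTs: quasi-experiments exploit naturally occurring events (e.g. the Twitter acquisition, policy changes). There is no randomization, but they can sometimes provide causal information. Quasi-experimental methods usually work with time series.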
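Synthetic control (Abadie & Gardeazabal 2003; Abadie 2021):
Core idea: use a donor pool (control units highly correlated with the treated unit in the pre-treatment period) to predict the treated unit's counterfactual outcome.
Constraints: $w_i \in [0, 1]$, $\sum_i w_i = 1$, which prevents extrapolation and limits overfitting.
Classical form: constrained optimization, $\min_w \sum_{t < T_0} (Y_t^{\text{treated}} - \sum_i w_i Y_t^{\text{donor}_i})^2$.
Reichenbach's common cause principle (1956): if two variables are correlated, there must be (a) a direct causal relationship between them or (b) a common cause; synthetic control relies on this principle.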
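The constrained least-squares problem in the classical form can be solved in several ways. The sketch below uses projected gradient descent with a Euclidean projection onto the simplex; this solver is a choice made for the example (it keeps the code NumPy-only), not the method used in the cited papers, and the toy data and function names are assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index that stays positive
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def fit_synthetic_control(y_pre, donors_pre, n_iter=5000):
    """Minimize the pre-treatment squared error over simplex-constrained weights."""
    n_donors = donors_pre.shape[1]
    w = np.full(n_donors, 1.0 / n_donors)      # start from equal weights
    # Step size 1/L, with L the Lipschitz constant of the gradient
    step = 1.0 / (2.0 * np.linalg.norm(donors_pre, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * donors_pre.T @ (donors_pre @ w - y_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Toy pre-treatment data: 100 periods, 3 donors, noiseless treated series
rng = np.random.default_rng(0)
true_w = np.array([0.6, 0.3, 0.1])
donors_pre = rng.normal(50, 10, (100, 3))
y_pre = donors_pre @ true_w

w_hat = fit_synthetic_control(y_pre, donors_pre)
```

The fitted weights are then applied to the donors' post-treatment values to obtain the counterfactual series for the treated unit.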
Bayesian synthetic control (CausalPy / PyMC):
A Dirichlet prior constrains the weights: $w \sim \text{Dirichlet}(\alpha)$, which automatically satisfies $w_i \geq 0, \sum w_i = 1$.
Model: $Y_t^{\text{treated}} \sim \mathcal{N}(\sum_i w_i Y_t^{\text{donor}_i}, \sigma^2)$, $w \sim \text{Dirichlet}(\alpha)$.
Advantage: the Bayesian framework quantifies uncertainty naturally (94% HDI for the coefficients).
Limitation: the donor values must span the range of the treated unit. If all donors lie below the treated unit, or all lie above it, synthetic control cannot work.
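Both the Dirichlet constraint and the range limitation can be seen by sampling weights from the prior. The sketch below uses a flat prior ($\alpha = 1$, an assumption) and a tiny made-up donor matrix; it draws from the prior only and does not fit a model.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.ones(3)                          # flat Dirichlet prior (assumed alpha)
w_samples = rng.dirichlet(alpha, size=1000)

# Made-up donor values: rows are time points, columns are donors
donors = np.array([[10.0, 12.0, 15.0],
                   [11.0, 13.0, 14.0]])

# One synthetic series per weight draw, shape (1000, 2)
synthetic = w_samples @ donors.T

upper = donors.max(axis=1)                  # largest donor value at each time point
lower = donors.min(axis=1)                  # smallest donor value at each time point
stays_in_range = bool(((synthetic <= upper + 1e-9) & (synthetic >= lower - 1e-9)).all())
```

Every draw is a convex combination of the donors, so a treated unit with a value of, say, 20 at the first time point could never be reproduced by these donors, whatever the posterior turns out to be.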
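Elon Musk × Twitter case study:
Treatment date: 2022-10-28, Musk's "The bird is freed" tweet.
Data: Google Trends, 181 days (166 pre-treatment + 15 post-treatment).
Donor pool: Google search volume for TikTok, LinkedIn, and Instagram.
Formula: twitter ~ 0 + tiktok + linkedin + instagram (no intercept, consistent with the synthetic control constraints).
Result: the instagram coefficient is 0.84 (largest weight, HDI [0.81, 0.87]); R² = 38.6% (pre-treatment fit). The conclusion drawn is that Musk's tweet significantly raised Twitter search volume.
Challenges: the donor pool has only 3 units (Abadie 2021 suggests 5-25); the low R² indicates an imperfect model; other confounders (e.g. media coverage) are not controlled for.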
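The quantities reported for this case (pre-treatment fit and post-treatment effect) can be computed from a treated series and its counterfactual as sketched below. All data here are simulated stand-ins, the weights are hypothetical, and the R² is the classical coefficient of determination, which may differ from the Bayesian R² a PyMC-based tool reports.

```python
import numpy as np

rng = np.random.default_rng(7)
N_PRE, N_POST = 166, 15                      # split reported in the notes
n_days = N_PRE + N_POST                      # 181 days in total

# Simulated stand-ins for the tiktok, linkedin and instagram series
donors = rng.normal(50, 5, (n_days, 3))
weights = np.array([0.1, 0.1, 0.8])          # hypothetical fitted weights
treated = donors @ weights + rng.normal(0, 2, n_days)
treated[N_PRE:] += 10.0                      # injected post-treatment lift (toy)

counterfactual = donors @ weights
pre, post = slice(0, N_PRE), slice(N_PRE, n_days)

# Pre-treatment fit: classical R^2 of the synthetic series
ss_res = np.sum((treated[pre] - counterfactual[pre]) ** 2)
ss_tot = np.sum((treated[pre] - treated[pre].mean()) ** 2)
r2_pre = 1.0 - ss_res / ss_tot

# Post-treatment effect: observed minus counterfactual, per day and summed
daily_effect = treated[post] - counterfactual[post]
cumulative_effect = float(daily_effect.sum())
```

Reporting `r2_pre` together with `cumulative_effect` and its interval gives a fuller picture than the fit statistic alone.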
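CausalPy methods (PyMC Labs) listed in these notes: synthetic control, interrupted time series, difference-in-differences, and regression discontinuity. Unified interface: pymc_experiments.Xxx + pymc_models.XxxFitter.
Fairness and causality (Plečko & Bareinboim 2023): the observed "gender citation gap" is confounding-driven, and structural analysis must precede conclusions; Pearl's causal framework is a necessary tool for fairness.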

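Key Takeaways

TARNet performs best on high-dimensional CATE estimation with a sparse interaction (MAPE 2.49% on the author's data): on this dataset the shared-representation, disjoint-heads design beat the pure S-/T-Learners.
CATE benchmarking is hard (Curth et al. 2021): the same model can perform very differently under different DGPs. In production: (i) run multiple simulations; (ii) compare against simple models; (iii) report uncertainty.
CausalBert makes causal NLP an engineering practice: adapting pretrained embeddings makes the text embedding causally sufficient (it retains the information needed to predict treatment and outcome). The ATE error on the author's data is 11%, usable for small data and a simple DGP.
LLMs have limited causal ability (Willig et al. 2023): GPT-4 answers some counterfactual questions correctly, but what it learns is a correlational "meta-SCM" rather than a true causal model. In production, an LLM should be an assistive tool, not a causal inference engine.
Synthetic control is a go-to tool when no RCT is available and is very useful for natural experiments (policy changes, product launches, acquisitions). Key limitations: the donor pool must be strongly correlated with the treated unit in the pre-treatment period, and must be diverse enough to span the range of the treated unit's values.
Bayesian synthetic control (CausalPy) provides uncertainty quantification out of the box: the Dirichlet prior satisfies the constraints automatically, and the 94% HDI serves as the Bayesian analog of a confidence interval.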

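Challenges and Open Questions

Reproducibility of TARNet/SNet (a PyTorch limitation): fully deterministic training requires changes to PyTorch internals and to the CATENets code. In production, record the range of results across multiple seeds.
Collider risk in CausalBert: in some settings the text may be a descendant of the treatment or the outcome; this is hard to determine and requires domain expertise.
CATE benchmarks are unreliable (Curth et al. 2021): a ranking from a single simulation cannot be trusted. In production, run multiple simulations and check against real-world interventions.
Causal "hallucinations" in LLMs (Kıcıman et al. 2023): high accuracy does not imply systematic causal reasoning, and failure modes are hard to predict. In production, restrict LLMs to assistive tasks such as proposing candidate confounders.
Donor pool selection for synthetic control: Abadie 2021 suggests 5-25 donors; too few leads to a low R², too many raises the risk of overfitting and false positives. There is no gold standard.
Low R² with CausalPy on small data: R² = 38.6% on the author's data, because Twitter search volume is driven by many unobserved factors. In real settings R² < 50% is common.
CausalBert and small samples: adapting the embedding may amplify small-sample bias. Joint training with small N (221 for the author) may cause the embedding to overfit. In production, aim for $N > 1000$.
Is GPT-4's meta-SCM "truly causal"? This is currently debated. One view is that it is "correlational pattern matching"; another is that "syntactic structure reflects some latent causality". In production, treat an LLM's causal judgment as a "hypothesis", not a "verdict".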

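Personal Reflections and Critical Analysis

This chapter is the "frontier extension" of Part 2: it extends the causal toolbox to DL, NLP, and time series, three hot topics of the 2020s. Several points are worth discussing:

TARNet and the "architecture vs. data" question: TARNet performs best on the author's data, but this is not an advantage of TARNet per se. It reflects (a) a shared representation suiting high-dimensional data with many uninformative features; (b) disjoint heads suiting sparse interactions; (c) the ability of neural networks to fit non-linearities. In production: (i) analyze the DGP first; (ii) choose an architecture that matches it; (iii) do not use DL for its own sake.
CausalBert and the "pretrain + adapt" engineering approach: it grafts mature pretrained NLP models onto a causal task, in line with frontier directions such as DECI (Ch 14) and Causal Attention. Advantage: it reuses the capabilities of large language models. Risk: pretrained embeddings may carry unidentified confounding (e.g. GPT embeddings may encode social biases, which is disastrous for fairness).
The "meta-SCM" hypothesis vs. true causal reasoning in LLMs: Willig et al. 2023 propose that LLMs learn a "correlational meta-SCM", which answers some causal questions correctly but cannot reason systematically. Kıcıman et al. 2023 show GPT-4 at 92.44% on CRASS vs. 98.18% for humans: close, but not above. Production advice: (a) do not use LLMs for critical causal judgments; (b) use LLMs to propose candidate confounders, followed by human expert review; (c) use LLMs to make the final CATE results more readable.
Causal fairness: the "gender citation gap" is confounding-driven. Plečko & Bareinboim 2023 stress that structural analysis must precede policy recommendations. This matches the Simpson's paradox discussion in Ch 1: draw the causal graph before concluding from observational data that "group XX has effect Y".
Synthetic control and "donor pool engineering": the Twitter case uses 3 social media platforms as donors. Why these 3? The author does not discuss this in depth, but Abadie 2021 gives a more systematic approach to donor pool selection (pre-treatment fit + parallel trends assumption). In production: (a) pre-screen a large set of potential donors (pre-treatment correlation > 0.5); (b) exclude units "affected by the treatment" (a SUTVA violation); (c) run several donor pool sizes and check the stability of the effect estimate.
CausalPy and "R² = 38.6% but a significant effect": a low R² does not by itself invalidate the causal effect estimate. As long as the pre-treatment fit gives directional information about the donor weights, the post-treatment counterfactual can still be estimated. In production, report "pre-treatment R² + post-treatment cumulative effect + 94% HDI" rather than R² alone.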
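From SCMs to LLMs, a "causal spectrum": the book moves from Pearl's pure SCMs of the 1990s (symbolic logic) to the LLMs of the 2020s (correlational pattern matching driven by language data), which is an engineering concession on the concept of causality. Pearl insists strictly that "causation ≠ association", but in engineering practice "approximate causality" has become a necessary compromise in the LLM era. Future direction: DECI (Ch 14), which integrates end-to-end causal discovery, intervention, and prediction in a single deep learning framework, is representative of this trend.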
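Implications for my own research: in my work on vascular biomechanics, a TARNet-style network could be used to estimate the HTE of different treatment options from radiomics and clinical features, but this needs (a) a large sample ($N > 1000$); (b) an explicit causal graph (experts + causal discovery); (c) sensitivity analysis. Synthetic control fits questions such as "if a treatment had been introduced 5 years ago, what would patients' vascular status be now". Medical ethics rule out running that experiment, and synthetic control is one of the few quasi-experimental tools available.
The author's "market-style analogy" narrative: Hanna (copywriting), Yìzé (a writer), and Catori and Stephen (Reddit manga) are fictional characters who run through the three NLP scenarios and give the technical narrative some warmth. This style is rare in textbook writing: a "user-centered" approach closer to an enterprise engineering blog than to a traditional academic textbook.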

Key References

[X1] Shalit, U., Johansson, F. D., & Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. ICML 2017, 3076–3085. The paper that introduced TARNet.
[X2] Curth, A., & van der Schaar, M. (2021a). Nonparametric Estimation of Heterogeneous Treatment Effects: From Theory to Learning Algorithms. AISTATS 2021. SNet and nonparametric CATE theory.
[X3] Curth, A., & van der Schaar, M. (2021b). On Inductive Biases for Heterogeneous Treatment Effect Estimation. NeurIPS 2021. FlexTENet / OffsetNet.
[X4] Veitch, V., Sridhar, D., & Blei, D. (2020). Adapting text embeddings for causal inference. UAI 2020, 919–928. The original CausalBert paper.
[X5] Shi, C., Blei, D., & Veitch, V. (2019). Adapting neural networks for the estimation of treatment effects. NeurIPS 2019. DragonNet, a TARNet-style architecture.
[X6] Hassanpour, N., & Greiner, R. (2020). Learning disentangled representations for counterfactual regression. ICLR 2020. DR-CFR, a neural network approach.
[X7] Abadie, A., & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review, 93(1), 113–132. The original synthetic control paper.
[X8] Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. Journal of Economic Literature, 59(2), 391–425. Modern review of synthetic control.
[X9] Curth, A., Svensson, D., Weatherall, J., & van der Schaar, M. (2021). Really Doing Great at Estimating CATE? A Critical Look at ML Benchmarking Practices in Treatment Effect Estimation. NeurIPS Datasets and Benchmarks. A critical review of CATE benchmarking.
[X10] Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., Yang, D. (2022). Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. TACL, 10, 1138–1158. Comprehensive review of causal NLP.
[X11] Kıcıman, E., Ness, R., Sharma, A., & Tan, C. (2023). Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. arXiv:2305.00050. A recent evaluation of LLM causal reasoning.
[X12] Willig, M., Zečević, M., Dhami, D. S., & Kersting, K. (2023). Causal Parrots: Large Language Models May Talk Causality But Are Not Causal. arXiv. The LLM "meta-SCM" hypothesis.
[X13] Zhang, C., et al. (2023). Understanding Causality with Large Language Models: Feasibility and Opportunities. arXiv. An LLM causality evaluation from a Microsoft team.
[X14] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv. Word2Vec, a milestone in word embeddings for NLP.
[X15] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. The original BERT paper.
[X16] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv. The lightweight BERT used by CausalBert.
[X17] Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. The Transformer architecture.
[X18] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Tech Report. GPT-2.
[X19] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. GPT-3.
[X20] OpenAI. (2023). GPT-4 Technical Report. arXiv. GPT-4.
[X21] Paszke, A., et al. (2017). Automatic differentiation in PyTorch. NeurIPS Autodiff Workshop. The original PyTorch paper.
[X22] Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-Normalizing Neural Networks. NeurIPS 2017. The SELU activation function.
[X23] Bradbury, J., et al. (2018). JAX: Composable Transformations of Python+NumPy Programs. http://github.com/google/jax. The basis of the CATENets JAX implementation.
[X24] Pryzant, R., Shen, K., Jurafsky, D., & Wagner, S. (2018). Deconfounded Lexicon Induction for Interpretable Social Science. NAACL 2018. Earlier related work preceding CausalBert.
[X25] Pryzant, R., Card, D., Jurafsky, D., Veitch, V., & Sridhar, D. (2021). Causal Effects of Linguistic Properties. NAACL 2021. Representative work on text as a treatment.
[X26] Pryzant, R., Chung, Y., & Jurafsky, D. (2017). Predicting Sales from the Language of Product Descriptions. eCOM@SIGIR. Representative study of how text affects sales.
[X27] Zhang, J., Mullainathan, S., & Danescu-Niculescu-Mizil, C. (2020). Quantifying the Causal Effects of Conversational Tendencies. ACM HCI, 4(CSCW2). Causal effects of conversational style.
[X28] Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M. (2018). How to make causal inferences using texts. arXiv. Representative work on text as an outcome.
[X29] Plečko, D., & Bareinboim, E. (2023). Causal Fairness Analysis. arXiv. A modern framework for causal fairness.
[X30] Dworkin, J. D., Linn, K. A., Teich, E. G., Zurn, P., Shinohara, R. T., & Bassett, D. S. (2020). The extent and drivers of gender imbalance in neuroscience reference lists. Nature Neuroscience, 23(8), 918–926. Representative study of the gender citation gap.
[X31] Benjamens, S., Banning, L. B. D., van den Berg, T. A. J., & Pol, R. A. (2020). Gender Disparities in Authorships and Citations in Transplantation Research. Transplantation Direct, 6(11), e614. A medical case of the gender citation gap.
[X32] Caplar, N., Tacchella, S., & Birrer, S. (2017). Quantitative evaluation of gender bias in astronomical publications from citation counts. Nature Astronomy, 1(6), 0141. An astronomy case of the gender citation gap.
[X33] Wittgenstein, L. (1922). Tractatus Logico-Philosophicus. Harcourt, Brace & Company. Source of this chapter's Wittgenstein quotation.
[X34] Wittgenstein, L. (1953). Philosophical Investigations. Macmillan. Source of "meaning is use".
[X35] Firth, J. (1957). A Synopsis of Linguistic Theory, 1930–55. Source of "you shall know a word by the company it keeps".
[X36] Molino, P., & Tagliabue, J. (2023). Wittgenstein's influence on artificial intelligence. arXiv. The chain of influence Wittgenstein → Masterman → Cambridge Language Research Unit.
[X37] Reichenbach, H. (1956). The Direction of Time. University of California Press. Source of Reichenbach's common cause principle.
[X38] Ferman, B., Pinto, C., & Possebom, V. (2020). Cherry Picking with Synthetic Controls. J. Policy Analysis and Management, 39, 510–532. The "cherry picking" pitfall in synthetic control.
[X39] Chernozhukov, V., Wuthrich, K., & Zhu, Y. (2022). A t-test for synthetic controls. arXiv. A statistical test for synthetic control.
[X40] Facure, M. (2020). Causal Inference for The Brave and True. https://matheusfacure.github.io/python-causality-handbook/. Practical Python tutorial covering synthetic control.
[X41] Gelman, A., Goodrich, B., Gabry, J., & Vehtari, A. (2018). R-squared for Bayesian regression models. The American Statistician. Bayesian R².
[X42] Martin, O. A., Kumar, R., & Lao, J. (2021). Bayesian Modeling and Computation in Python. Chapman and Hall/CRC. Practical guide to PyMC.
[X43] Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. Multilevel models.
[X44] Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. Collection of worked synthetic control cases.
[X45] Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. Background on counterfactuals and estimator theory for the whole book.
[X46] Pearl, J. (2009). Causality: Models, Reasoning and Inference (2nd ed.). Cambridge University Press. Source of the three-step counterfactual procedure.
[X47] Pearl, J., & Mackenzie, D. (2019). The Book of Why. Penguin Books. Popular-science source for the ladder of causation.
[X48] Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press. Theoretical foundations such as additive noise models (ANM).
[X49] Peters, M. E., et al. (2018). Deep Contextualized Word Representations. NAACL 2018. ELMo.
[X50] Rédei, M. (2002). Reichenbach's Common Cause Principle and Quantum Correlations. Springer. Reichenbach's principle in quantum mechanics.
[X51] Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal Inference in Statistics: A Primer. Wiley. Textbook foundations.
[X52] Frohberg, J., & Binder, F. (2022). CRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models. LREC 2022, 2126–2140. The CRASS benchmark.