Chapter 07: The Four-Step Process of Causal Inference

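Author

This chapter was written by Aleksander Molak. It relies on two core libraries, DoWhy (Sharma & Kiciman 2020) and EconML (Battocchi et al. 2019); the DML (Double Machine Learning) method comes from the seminal paper by Chernozhukov et al. (2016); the GCM API comes from Blöbaum et al. (2022). The chapter bridges the theory of Ch 6 and the engineering-oriented treatment of assumptions in Ch 8.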

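Overview

This chapter turns the concepts of Ch 1–6 into a repeatable engineering workflow, the four-step process of causal inference: (1) modeling (encoding assumptions with GML / networkx); (2) identifying the estimand (DoWhy automatically tries back-door / front-door / IV); (3) computing the estimate (with EconML DML, DoWhy front-door, DoWhy linear regression, etc.); (4) refutation tests (Popperian falsification via invariant / nullifying transformations). The content falls into four blocks: (1) a survey of the Python causal ecosystem (20+ libraries: DoWhy / EconML / gCastle / CausalML / causal-learn / DoubleML / PySensemakr, etc.); (2) the DoWhy + EconML API integration (the backdoor.econml.dml.DML string namespace); (3) the philosophical basis of refutation tests (Popper's 1935 falsificationism) and the reason cross-validation (CV), a rung-1 tool, does not apply to causal models; (4) a full example with 5 nodes (S, Q, X, Y, P), including the counterfactual capabilities of the GCM API (Blöbaum et al. 2022).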

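Core Equations and Concepts

Python causal ecosystem (as of 2022-09): the author lists 20 active libraries; the book mainly uses:
DoWhy (py-why.github.io/dowhy): DAG-based causal inference framework; backed by Microsoft and AWS.
EconML (microsoft/EconML): HTE estimation (meta-learners, DML, Causal Forest, DMLIV); deeply integrated with DoWhy.
gCastle (Huawei Noah's Ark Lab): comprehensive causal discovery library; used in Ch 13–14.
causal-learn (CMU): Python implementation of constraint-based causal discovery such as the PC algorithm (based on the Java Tetrad project).
Causica (Microsoft Research): DECI end-to-end causal discovery; used in Ch 14.
DoubleML (docs.doubleml.org): the standard implementation of DML.
CausalML (Uber): uplift modeling; can be integrated with DoWhy.
CATENets (Curth et al.): neural-network CATE estimators (JAX/PyTorch); used in Ch 11.
CausalPy (PyMC Labs): quasi-experimental methods (synthetic control, ITS, DiD, RD).
PySensemakr (in the style of Cinelli & Hazlett): causal sensitivity analysis.
Others: causalinference, causallib, CDT, Differences, GRAPL, LiNGAM, scikit-uplift, Semopy, tfcausalimpact, YLearn.
The four steps:
Model: define the causal graph with GML / networkx.
Identify: find the estimand via back-door / front-door / IV.
Estimate: compute the estimate with linear regression / DML / front-door two-stage regression / etc.
Refute: test robustness with random common cause / placebo / data subset / etc.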
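DoWhy 0.8 API zoo:
CausalModel (main API): DAG-based control of the full workflow.
Pandas API (high-level): performs interventions directly on a DataFrame.
GCM API (experimental): causal effect estimation, interventions, counterfactuals, and attribution in 2-3 lines of code.
GPSMemorySCM reused (the SCM from Ch 6): $X = $ GPS, $Z = $ hippocampus, $Y = $ memory, $U = $ motivation (unobserved). The GML string defines 4 nodes and 4 edges: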
```
graph [
  directed 1
  node [id "X" label "X"]
  node [id "Z" label "Z"]
  node [id "Y" label "Y"]
  node [id "U" label "U"]
  edge [source "X" target "Z"]
  edge [source "Z" target "Y"]
  edge [source "U" target "X"]
  edge [source "U" target "Y"]
]
```
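Because $U$ is unobserved, DoWhy automatically determines that front-door is the only viable estimand (back-door fails because $U$ cannot be adjusted for; IV fails because no suitable instrument exists).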
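
The graph reasoning behind that result can be sketched in plain Python, without DoWhy. This is an illustration only: the helper names are hypothetical, and the `parents("X") & parents("Y")` shortcut is enough for this 4-node graph but is not a general back-door check.

```python
# Edges of the GPSMemory graph from the GML above; U is unobserved.
EDGES = [("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")]
UNOBSERVED = {"U"}


def parents(node):
    """Direct causes of a node."""
    return {src for src, dst in EDGES if dst == node}


def directed_paths(start, end, prefix=()):
    """All directed paths from start to end (the graph is acyclic)."""
    path = prefix + (start,)
    if start == end:
        return [path]
    found = []
    for src, dst in EDGES:
        if src == start:
            found.extend(directed_paths(dst, end, path))
    return found


# Front-door requirement: Z intercepts every directed path from X to Y.
paths_x_to_y = directed_paths("X", "Y")
z_intercepts_all = all("Z" in path[1:-1] for path in paths_x_to_y)

# Back-door adjustment would have to condition on the common cause of X and Y.
common_causes = parents("X") & parents("Y")
backdoor_possible = common_causes.isdisjoint(UNOBSERVED)

print(paths_x_to_y, z_intercepts_all, common_causes, backdoor_possible)
```

The only directed path from $X$ to $Y$ runs through $Z$, while the only common cause is the unobserved $U$, which matches the estimand DoWhy reports.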

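Constructing the CausalModel object:

from dowhy import CausalModel

# df holds the observed data; gml_graph is the GML string shown above.
model = CausalModel(
  data=df, treatment='X', outcome='Y', graph=gml_graph
)
model.view_model()  # visualize the graph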

Automatic estimand identification:

estimand = model.identify_effect()
print(estimand)
# Output:
# Estimand type: nonparametric-ate
# Estimand 1: backdoor: No such variable(s) found!
# Estimand 2: iv: No such variable(s) found!
# Estimand 3: frontdoor
#   expression: Expectation(Derivative(Y,[Z]) * Derivative([Z],[X]))
#   assumption 1: Full-mediation (Z intercepts every directed path from X to Y)
#   assumption 2: First-stage-unconfoundedness (P(Z|X,U) = P(Z|X))
#   assumption 3: Second-stage-unconfoundedness (P(Y|Z,X,U) = P(Y|Z,X))

Front-door estimation:

estimate = model.estimate_effect(
  identified_estimand=estimand,
  method_name='frontdoor.two_stage_regression'
)
print(f'Estimate: {estimate.value}')  # → -0.4201 (true value: -0.42)

Internally, DoWhy implements the "two linear regressions, then multiply the coefficients" procedure from Ch 6.
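
The two-regression idea can be sketched by hand, without DoWhy. This is a minimal sketch under stated assumptions: the linear SCM and its coefficients are hypothetical (chosen so the true effect is $-0.6 \times 0.7 = -0.42$), and the helpers `slope` and `residuals` are illustrative names, not library functions.

```python
import random

random.seed(0)
N = 20_000


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after a simple linear regression on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


# Hypothetical linear SCM; U is generated but never used for estimation.
u = [random.gauss(0, 1) for _ in range(N)]
x = [0.7 * ui + random.gauss(0, 1) for ui in u]
z = [-0.6 * xi + random.gauss(0, 0.25) for xi in x]
y = [0.7 * zi + 0.25 * ui + random.gauss(0, 0.1) for zi, ui in zip(z, u)]

# Stage 1: effect of X on Z (no open back-door path between them).
effect_x_on_z = slope(x, z)
# Stage 2: effect of Z on Y, adjusting for X to block Z <- X <- U -> Y.
effect_z_on_y = slope(residuals(x, z), residuals(x, y))

frontdoor_estimate = effect_x_on_z * effect_z_on_y
naive_estimate = slope(x, y)  # confounded by U
print(f"front-door: {frontdoor_estimate:.3f}, naive: {naive_estimate:.3f}")
```

The naive regression of $Y$ on $X$ absorbs the $U$ path and drifts away from the true effect, whereas the product of the two stage coefficients stays close to it.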

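Refutation tests (Popperian falsification) (Sharma & Kiciman 2020):
Invariant transformations: transform the data; the estimate should not change.
- data_subset_refuter: drops a random fraction of $1 - $ subset_fraction of the rows, then re-estimates.
- random_common_cause: adds an independent random confounder ($X \leftarrow W \to Y$), then re-estimates.
Nullifying transformations: transform the data; the estimate should go to zero.
- placebo_treatment_refuter: replaces the treatment with a random placebo.

Example: on GPSMemorySCM, data_subset_refuter(subset_fraction=0.4) reports New effect: -0.4197, p=0.98. The estimate barely moves and the p-value is high, so the null hypothesis (that the new estimate is consistent with the original) is not rejected.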
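
To make the two kinds of transformation concrete, here is a hand-rolled sketch of a subset check and a placebo check. It does not call DoWhy's refuters: the data-generating process is hypothetical (one observed confounder `w`, true effect 0.7), `adjusted_effect` is an illustrative helper, and the placebo is implemented as a random permutation of the treatment, which is one possible choice.

```python
import random

random.seed(1)
N = 10_000


def adjusted_effect(x, y, w):
    """Coefficient of x when regressing y on x and w (via residualization)."""
    def resid(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        beta = (sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
                / sum((ai - ma) ** 2 for ai in a))
        return [(bi - mb) - beta * (ai - ma) for ai, bi in zip(a, b)]
    rx, ry = resid(w, x), resid(w, y)
    return sum(p * q for p, q in zip(rx, ry)) / sum(p * p for p in rx)


# Hypothetical data: W confounds X and Y; the true effect of X on Y is 0.7.
w = [random.gauss(0, 1) for _ in range(N)]
x = [0.5 * wi + random.gauss(0, 1) for wi in w]
y = [0.7 * xi + 0.4 * wi + random.gauss(0, 1) for xi, wi in zip(x, w)]

original = adjusted_effect(x, y, w)

# Invariant transformation: re-estimate on a random 40% subset of the rows.
keep = random.sample(range(N), int(0.4 * N))
subset = adjusted_effect([x[i] for i in keep], [y[i] for i in keep],
                         [w[i] for i in keep])

# Nullifying transformation: shuffle the treatment so it cannot affect Y.
placebo_x = random.sample(x, len(x))
placebo = adjusted_effect(placebo_x, y, w)

print(f"original={original:.3f} subset={subset:.3f} placebo={placebo:.3f}")
```

A healthy model shows a subset estimate near the original and a placebo estimate near zero; a large shift in the first or a clearly nonzero value in the second would count as a failed refutation.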

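CV does not apply to causal models: Bates, Hastie & Tibshirani (2021) show that what CV's "generalization error" estimate actually means is subtle; for causal models, CV only evaluates rung-1 predictive performance and cannot identify rung-2 causal direction. Several DAGs in the same Markov equivalence class (MEC) can produce the same observational distribution and all perform well under CV, yet they differ in causal direction. For causal evaluation, CV is only useful for assessing an estimator's finite-sample bias when the causal structure is already known (detailed in Ch 10); it cannot be used to identify the causal graph.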
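
A small simulation illustrates the gap between holdout performance and causal direction. Everything here is an assumption made for illustration: the ground truth is $Y \to X$, yet a model that predicts $Y$ from $X$ scores well on held-out data, while setting $X$ from outside leaves $Y$ untouched.

```python
import random

random.seed(2)
N = 5_000


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return my - b * mx, b


# Hypothetical ground truth: Y causes X, not the other way round.
y = [random.gauss(0, 1) for _ in range(N)]
x = [2.0 * yi + random.gauss(0, 0.5) for yi in y]

# Rung 1: predict Y from X and score on a held-out half.
half = N // 2
a, b = fit_line(x[:half], y[:half])
mean_test = sum(y[half:]) / (N - half)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x[half:], y[half:]))
sst = sum((yi - mean_test) ** 2 for yi in y[half:])
holdout_r2 = 1 - sse / sst

# Rung 2: under do(X), X is set from outside and Y keeps its own mechanism.
x_do = [random.gauss(0, 2) for _ in range(N)]
y_do = [random.gauss(0, 1) for _ in range(N)]
_, interventional_slope = fit_line(x_do, y_do)

print(f"holdout R^2={holdout_r2:.2f}, predictive slope={b:.2f}, "
      f"slope under do(X)={interventional_slope:.2f}")
```

The predictive slope is clearly positive and the holdout score is high, but the slope under intervention is close to zero, so a good validation score says nothing about what happens when $X$ is changed.
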
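Popper's falsificationism (Popper 1935/1959): a scientific theory cannot be verified (induction is unreliable, echoing Hume in Ch 1), but it can be falsified by a single counterexample. This is the philosophical basis of refutation tests: a causal model is a "micro-theory of the true data-generating process", and refutation tests modify the data or the model to see whether the estimate breaks down.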
Full example (5-node SCM):
$$S \sim \mathcal{U}(0,1)$$
$$Q \coloneqq 0.2 S + 0.67 \epsilon_Q$$
$$X \coloneqq 0.14 Q + 0.4 \epsilon_X$$
$$Y \coloneqq 0.7 X + 0.11 Q + 0.32 S + 0.24 \epsilon_Y$$
$$P \coloneqq 0.43 X + 0.21 Y + 0.22 \epsilon_P$$
The true causal effect of $X$ on $Y$ is 0.7. DoWhy finds a back-door estimand (adjusting for $Q$).
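
The effect of different adjustment sets in this SCM can be checked with a short simulation. This is a sketch under assumptions: the noise terms are taken to be standard normal (the notes do not state their distribution), and `ols` is a small illustrative least-squares helper rather than a library call.

```python
import random

random.seed(4)
N = 20_000


def ols(columns, target):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [(1.0,) + r for r in zip(*columns)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, target))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Simulate the 5-node SCM (noise assumed standard normal).
s = [random.random() for _ in range(N)]
q = [0.2 * si + 0.67 * random.gauss(0, 1) for si in s]
x = [0.14 * qi + 0.4 * random.gauss(0, 1) for qi in q]
y = [0.7 * xi + 0.11 * qi + 0.32 * si + 0.24 * random.gauss(0, 1)
     for xi, qi, si in zip(x, q, s)]
p = [0.43 * xi + 0.21 * yi + 0.22 * random.gauss(0, 1) for xi, yi in zip(x, y)]

# Coefficient of X (index 1) under three adjustment sets.
no_adjustment = ols([x], y)[1]
adjust_q = ols([x, q], y)[1]
adjust_q_and_p = ols([x, q, p], y)[1]

print(f"none={no_adjustment:.3f} Q={adjust_q:.3f} Q+P={adjust_q_and_p:.3f}")
```

Adjusting for $Q$ recovers a value near 0.7, skipping the adjustment inflates the estimate, and adding $P$ (a common effect of $X$ and $Y$) pulls it well below the true value.
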
DML (Double Machine Learning) estimator (Chernozhukov et al. 2016):
Fit $Y$ on the controls with an ML model (residuals $\tilde{Y} = Y - \hat{E}[Y|Q]$).
Fit $X$ on the controls with an ML model (residuals $\tilde{X} = X - \hat{E}[X|Q]$).
Regress $\tilde{Y}$ on $\tilde{X}$; the coefficient is the causal effect.

DoWhy integration:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

estimate = model.estimate_effect(
  identified_estimand=estimand,
  method_name='backdoor.econml.dml.DML',
  method_params={
    'init_params': {
      'model_y': GradientBoostingRegressor(),
      'model_t': GradientBoostingRegressor(),
      'model_final': LassoCV(fit_intercept=False),
    },
    'fit_params': {}
  }
)

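On the full 5-node example: DML estimate = 0.6999 (vs. the true 0.7, 0.02% error); linear regression = 0.6881 (1.7% error). DML is slightly better, but on data this simple the difference "may just be noise"; the author recommends starting with simple models on small datasets.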
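
The residual-on-residual recipe with cross-fitting can be sketched without EconML. Assumptions to note: plain linear fits stand in for the GradientBoostingRegressor nuisance models so that only the standard library is needed, two folds are used, the noise terms are taken to be standard normal, and the final stage is ordinary least squares without an intercept rather than LassoCV.

```python
import random

random.seed(3)
N = 20_000


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return my - b * mx, b


# Simulate S, Q, X, Y from the 5-node SCM (noise assumed standard normal).
s = [random.random() for _ in range(N)]
q = [0.2 * si + 0.67 * random.gauss(0, 1) for si in s]
x = [0.14 * qi + 0.4 * random.gauss(0, 1) for qi in q]
y = [0.7 * xi + 0.11 * qi + 0.32 * si + 0.24 * random.gauss(0, 1)
     for xi, qi, si in zip(x, q, s)]

# Two folds: nuisance models are fit on one fold and applied to the other.
folds = [range(0, N, 2), range(1, N, 2)]
x_res, y_res = [0.0] * N, [0.0] * N
for k, held_out in enumerate(folds):
    train = folds[1 - k]
    q_train = [q[i] for i in train]
    ay, by = fit_line(q_train, [y[i] for i in train])  # stand-in for model_y
    ax, bx = fit_line(q_train, [x[i] for i in train])  # stand-in for model_t
    for i in held_out:
        y_res[i] = y[i] - (ay + by * q[i])
        x_res[i] = x[i] - (ax + bx * q[i])

# Final stage: regress Y residuals on X residuals without an intercept.
theta = sum(a * b for a, b in zip(x_res, y_res)) / sum(a * a for a in x_res)
print(f"cross-fitted estimate: {theta:.3f}")
```

Each row's residuals come from models that never saw that row, which is the point of cross-fitting; with flexible learners this is what keeps overfitting in the nuisance models from leaking into the final coefficient.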

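Refutation tests in the full example:
random_common_cause: new effect = 0.6692, p = 0.14; the change is not significant, so the model passes.
placebo_treatment_refuter: new effect = 0.0, p = 2.0 as printed by DoWhy (a value above 1 is not a valid p-value, so it should be read as an output artifact); the effect is exactly 0, so the model passes.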
GCM API (Blöbaum et al. 2022): gcm.InvertibleStructuralCausalModel(graph_nx) replaces the GML string; a causal mechanism is set for each node (EmpiricalDistribution / AdditiveNoiseModel(linear_regressor)); after gcm.fit(causal_model, df), gcm.arrow_strength(causal_model, 'Y') gives each incoming edge's share of the variance of $Y$. GCM can also compute counterfactuals with gcm.counterfactual_samples(causal_model, {'X': lambda x: .21}, observed_data=...); this only requires an invertible SCM (matching the three steps of abduction, modification, and prediction).
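
The three counterfactual steps are easy to follow on a single invertible mechanism. This sketch does not use the GCM API: the mechanism $Y := 0.7X + \epsilon_Y$ and the observed values are assumptions for illustration, and the function names are hypothetical.

```python
def mechanism_y(x, noise):
    """Hypothetical additive-noise mechanism: Y := 0.7 * X + noise."""
    return 0.7 * x + noise


def abduct_noise(x, y):
    """Invert the mechanism to recover the noise behind an observation."""
    return y - 0.7 * x


observed = {"X": 1.0, "Y": 0.9}

# 1. Abduction: recover this unit's noise term from what was observed.
noise_y = abduct_noise(observed["X"], observed["Y"])
# 2. Modification: replace X with the counterfactual value.
x_cf = 0.21
# 3. Prediction: push the same noise through the modified model.
y_cf = mechanism_y(x_cf, noise_y)

print(f"noise={noise_y:.3f}, counterfactual Y={y_cf:.3f}")  # about 0.2 and 0.347
```

Invertibility matters in step 1: if the noise could not be recovered uniquely from the observation, the counterfactual would be a distribution rather than a single value.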

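Key Takeaways

DoWhy + EconML is currently the most mature Python stack for causal engineering: the former handles model / identify / refute, the latter handles estimate (HTE, CATE, DML, Causal Forest, IV). The two are deeply integrated (via the backdoor.econml.dml.DML string namespace).
The four-step process is a mandatory workflow for production code: model → identify → estimate → refute. Skipping any step is an anti-pattern, and it is the most common mistake ML practitioners make when doing causal inference.
CV does not apply to causal models: CV can only evaluate rung-1 predictive performance; rung-2 causal claims have to be probed with refutation tests (Popperian falsification) or actual interventions.
Refutation tests are the causal version of a "stress test": only when the three common tests (random confounder, placebo treatment, data subset) all pass does the model's robustness have a basic footing. But passing is not the same as being correct; it only means "nothing is obviously wrong".
The GCM API is the forward-looking direction within DoWhy 0.8: a shift from the four-step process (imperative) to "fit once, query many" (declarative). After a single gcm.fit(), it can answer intervention, counterfactual, attribution, and other queries.
DML is the "workhorse estimator" of Part 2: it uses ML to handle high-dimensional / nonlinear confounding while retaining $\sqrt{N}$-consistency (proved by Chernozhukov et al. 2016). On small datasets DML is not necessarily better than plain linear regression; the author recommends starting with a simple model as a baseline.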

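Challenges and Open Questions

Limits of refutation tests: passing refutation does not mean the causal model is correct; several DAGs within an MEC may all pass the same set of refutation tests. Truly confirming causal direction requires interventions or strong assumptions.
API instability from DoWhy 0.8 to 1.0: the author notes that the GCM API is "experimental"; between 0.8 and current versions (≥ 0.11) the API has changed significantly, so production code needs periodic updates.
Incomplete refutation tests for front-door: the author states explicitly that "DoWhy 0.8 has more refutation tests for back-door than for front-door", so fewer robustness checks are available for front-door models.
The "high variance vs. low bias" trade-off in DML: DML has low bias under nonlinear confounding, but high variance in the first-stage ML models propagates to the second stage. Chernozhukov et al. (2016) mitigate this with sample splitting / cross-fitting, but the implementation involves many engineering details.
Engineering friction in GML / networkx conversion: DoWhy's GML string and nx.DiGraph are not fully equivalent (edge labels, node attributes, visual attributes). Production code needs a hand-written conversion layer.
Extending to multiple treatments / outcomes: DoWhy's CausalModel assumes a single treatment and a single outcome by default. Real medical data often involve multiple treatments and multiple outcomes, which requires manual extension.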
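
On the GML conversion point above, a conversion layer can start as small as the following sketch. `edges_to_gml` is a hypothetical helper, not part of DoWhy or networkx; it writes only node ids, labels, and directed edges, so the attributes mentioned above (edge labels, node attributes, visual attributes) are deliberately dropped.

```python
def edges_to_gml(edges):
    """Render a directed edge list as a GML string (node ids double as labels)."""
    nodes = []
    for src, dst in edges:
        for name in (src, dst):
            if name not in nodes:
                nodes.append(name)  # keep first-seen order for stable output
    lines = ["graph [", "  directed 1"]
    lines += [f'  node [id "{n}" label "{n}"]' for n in nodes]
    lines += [f'  edge [source "{s}" target "{t}"]' for s, t in edges]
    lines.append("]")
    return "\n".join(lines)


# The GPSMemory graph from earlier in the chapter.
gml_graph = edges_to_gml([("X", "Z"), ("Z", "Y"), ("U", "X"), ("U", "Y")])
print(gml_graph)
```

Keeping the edge list as the single source of truth and generating the GML string from it avoids maintaining two hand-edited copies of the same graph.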

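Personal Reflections and Critical Analysis

This chapter is the "turning point" from the concepts of Ch 1–6 to engineering code, and for many ML practitioners it is the first entry point for putting causal inference into a production pipeline. Several aspects are worth discussing:

The philosophical value of DoWhy's "four-step process" far outweighs its technical details: the author stresses "step-by-step reproducibility", which is the biggest difference from the "end-to-end model" mindset of traditional ML. In traditional ML we care about "how the model performs on the holdout set"; in causal inference we care about "under which assumptions the model's approximation of the true DGP is correct". The four-step process forces the user to state every assumption explicitly, which is the hardest thing to guarantee in production code.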
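Bates, Hastie & Tibshirani (2021) is cited too shallowly: the author mentions in one sentence that "CV is much less understood than we tend to think". Bates et al. (2021) show that CV estimates the average prediction error over training sets drawn from the same population rather than the error of the specific fitted model, and that naive CV confidence intervals tend to be too narrow. In causal inference, CV is not only inapplicable but can be seriously misleading ("the model does well under CV" $\neq$ "the causal graph is correct"). For production teams this is a warning sign: many companies "do causal inference" by "predicting potential outcomes with an ML model", which is essentially rung-1 work that can look great under CV while being meaningless as causal inference.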
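The "non-sufficiency" of refutation tests: citing Popper's falsificationism is apt, but passing every refutation test cannot prove the model correct, which is consistent with the view that scientific theories can be falsified but not verified. The hidden practical problem: many teams declare "the causal model is valid" as soon as the refutation tests pass. In statistical terms this risks a type II error (failing to reject a model that is in fact wrong). A more robust practice is to add sensitivity analyses (for example, the omitted variable bias bounds of Cinelli & Hazlett 2020).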
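The GCM API's "fit once, query many" is a sign of ML-style design: the traditional four-step process is imperative (each query needs its own code), while the GCM API is declarative (fit once, then query freely). This shift matches the scikit-learn-style "fit, then predict" estimator pattern and is key to bringing causal tools to ML engineers. The cost is that the GCM API is still experimental, and its long-term stability remains to be seen.
DoWhy's "method-name namespace" is an engineering highlight: the string 'backdoor.econml.dml.DML' unifies "estimand type. library. submodule. class" and decouples it from Python import paths. The cost is that typos in the string are only caught at runtime (IDEs cannot type-check them easily). Production code often wraps the strings in an enum class.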
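
One way to do that wrapping is sketched below. `EstimatorMethod` and `parse_method` are hypothetical names; the DML and front-door strings are the ones quoted in this chapter, and 'backdoor.linear_regression' is assumed to be DoWhy's name for the linear regression estimator mentioned earlier.

```python
from enum import Enum


class EstimatorMethod(str, Enum):
    """DoWhy method_name strings, collected in one place."""
    LINEAR_REGRESSION = "backdoor.linear_regression"  # assumed name
    DML = "backdoor.econml.dml.DML"
    FRONTDOOR_TWO_STAGE = "frontdoor.two_stage_regression"


def parse_method(name: str) -> EstimatorMethod:
    """Turn a raw string into an enum member; unknown names raise ValueError."""
    return EstimatorMethod(name)


dml = parse_method("backdoor.econml.dml.DML")
try:
    parse_method("backdoor.econml.dml.DLM")  # deliberate typo
    typo_caught = False
except ValueError:
    typo_caught = True

print(dml.value, typo_caught)
```

Call sites would then pass `EstimatorMethod.DML.value` as `method_name`, so a misspelled member name is flagged by the IDE or fails at import time instead of deep inside an estimation run.
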
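The "why it works" of DML is under-explained: the author describes DML in three steps but does not explain why regressing residuals on residuals yields the causal effect. The core is the Frisch-Waugh-Lovell (FWL) theorem: in OLS, orthogonalizing with respect to the controls and then regressing is equivalent to the full regression that includes the controls. Chernozhukov et al. (2016) extend FWL to ML (using cross-fitting to avoid overfitting bias) and obtain the $\sqrt{N}$-consistency of DML. The practical lesson: DML is not as simple as "swapping OLS for ML"; cross-fitting is a required step.
Treating the back-door set that DoWhy returns as the only viable one in a larger graph is a common misconception: in the full example DoWhy automatically picks back-door adjustment for $Q$, but in the 5-node graph $P$ is a collider ($X \to P \leftarrow Y$) and must not be adjusted for; $S$ is an ancestor of the confounder $Q$ ($S \to Q$) and also a direct cause of $Y$, so adding $S$ to the adjustment set is valid but unnecessary, whereas adjusting for $S$ alone leaves $X \leftarrow Q \to Y$ open. DoWhy selects a minimal adjustment set by default, but in practice it is often worth running a sensitivity analysis over alternative sets.
Implications for my own research: in my work on vascular biomechanics, the standard pipeline is "collect imaging + clinical variables → run ML to predict wall stress → see which variables matter". This is rung-1 work: it may do well under CV but is causally uninformative (it cannot tell us whether changing X lowers the risk of wall rupture). DoWhy's four-step process plus EconML's DML offers a unified framework: first draw the causal graph (age, sex, blood pressure, medication → wall structure → rupture risk), identify the adjustment set via back-door, estimate the treatment effect with DML, and check it with refutation tests. This is a methodological upgrade from "feature importance" to "causal intervention".
The "fragmentation" of the Python causal ecosystem: the 20+ libraries each go their own way. The DoWhy/EconML pair is relatively mature, but gCastle / causal-learn / Causica / DoubleML are not mutually compatible. The stability of production code depends more on long-term support from Microsoft Research (DoWhy/EconML) than on algorithmic novelty, a typical case of "industrial-grade ≠ academic state of the art".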
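
The Frisch-Waugh-Lovell argument behind DML, raised in the reflections above, can be written out for the partially linear model. This is a sketch under the usual assumptions (the symbols $g$, $m$, $\varepsilon$, $\eta$ are introduced here and do not appear in the chapter summary):

$$Y = \theta X + g(Q) + \varepsilon, \qquad E[\varepsilon \mid X, Q] = 0$$
$$X = m(Q) + \eta, \qquad E[\eta \mid Q] = 0$$
$$\tilde{Y} = Y - E[Y \mid Q] = \theta\,(X - m(Q)) + \varepsilon = \theta \tilde{X} + \varepsilon$$
$$\theta = \frac{E[\tilde{X}\tilde{Y}]}{E[\tilde{X}^2]}$$

The third line uses $E[Y \mid Q] = \theta\, m(Q) + g(Q)$, and the last line follows because $\tilde{X} = \eta$ is a function of $(X, Q)$ and is therefore uncorrelated with $\varepsilon$. Nothing in the argument requires $g$ or $m$ to be linear, which is why the two residualization steps can be handed to flexible ML models.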

Key References

[X1] Sharma, A., & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. arXiv:2011.04216. The DoWhy paper; the implementation basis of this chapter's four-step process.
[X2] Battocchi, K., Dillon, E., Hei, M., Lewis, G., Oka, P., Oprescu, M., & Syrgkanis, V. (2019). EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation. https://github.com/microsoft/EconML. The EconML library; engineering implementation of CATE / DML / Causal Forest.
[X3] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2016). Double/Debiased Machine Learning for Treatment and Causal Parameters. arXiv:1608.00060. The seminal DML paper; cross-fitting and $\sqrt{N}$-consistency.
[X4] Blöbaum, P., Götz, P., Budhathoki, K., Mastakouri, A., & Janzing, D. (2022). DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models. arXiv. Source of the GCM API; invertible SCMs and automatic counterfactual generation.
[X5] Bates, S., Hastie, T., & Tibshirani, R. (2021). Cross-validation: what does it estimate and how well does it do it? arXiv:2104.00673. Analysis of the limitations of CV; source of this chapter's claim that CV does not apply to causal models.
[X6] Popper, K. (1935). Logik der Forschung. Springer. Falsificationist philosophy; theoretical basis of refutation tests.
[X7] Popper, K. (1959). The Logic of Scientific Discovery. Basic Books. Revised English edition of the 1935 work.
[X8] Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press. Source of the ANM (additive noise model); the theory underlying the AdditiveNoiseModel class in the GCM API.
[X9] Molak, A. (2022, September 27). Causal Python: 3 Simple Techniques to Jump-Start Your Causal Inference Journey Today. Towards Data Science. https://towardsdatascience.com/causal-kung-fu-in-python-3-basic-techniques-to-jump-start-your-causal-inference-journey-tonight-ae09181704f7. The author's blog post on three introductory DoWhy techniques.
[X10] Blöbaum, P., Janzing, D., Washio, T., Shimizu, S., & Schölkopf, B. (2018). Cause-Effect Inference by Comparing Regression Errors. AISTATS. Theoretical basis of ANMs in GCM.
[X11] Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. Implicit reference (medical background for refutation).
[X12] Pearl, J. (2009). Causality: Models, Reasoning and Inference (2nd ed.). Cambridge University Press. Source of the three-step counterfactual procedure; theoretical basis of the GCM API's counterfactual functionality.
[X13] Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3), 197–243. Bayesian network structure learning (implicit reference; BDeu score).