This research introduces a Causal-Driven Attribution (CDA) framework that estimates channel influence exclusively from aggregated impression-level data, requiring no user identifiers or click-path tracking. Our method integrates a causal discovery algorithm with a computational approach that identifies directional causal relationships from observational data using score-based methods. This process yields causal graphs and estimates of significant causal effects. For modelling details, please refer to the Causal-driven attribution section.
Moreover, we developed an approach for generating synthetic datasets. Our aim is not only to create a comprehensive and representative marketing dataset but also to establish the ground truth required for model validation, thereby bypassing the problem of insufficient data. For details on how to generate synthetic data for experiments, see the Data Generation section.
We provide an open-source Python package that allows users to simulate data from scratch. Here's how it works:
- Users input characteristics about their business, such as media channels and plausible ranges for conversion rates based on their historical experience with those channels.
- The simulation models the data according to the specified assumptions and formulas, adding statistical noise to each generated sample.
- The outputs are:
  - A synthetic dataset for causal marketing attribution
  - A causal graph showing relationships between channels and their individual effects on each other
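For intuition, the generation step can be sketched as a toy linear structural model with additive noise. The channel weights, noise scales, and the single Facebook-to-YouTube edge below are illustrative assumptions, not the package's actual formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 365                    # days, matching TIME_PERIODS in the config
base_range = (1000, 2000)  # matching BASE_RANGE

# Each channel draws a daily base impression volume plus Gaussian noise.
facebook = rng.integers(*base_range, size=T) + rng.normal(0, 50, T)
google = rng.integers(*base_range, size=T) + rng.normal(0, 50, T)

# A downstream channel partly driven by an upstream one (illustrative edge).
youtube = 0.3 * facebook + rng.integers(*base_range, size=T) + rng.normal(0, 50, T)

# Conversions as a noisy weighted sum of impressions (toy conversion rates).
conversion = 0.02 * facebook + 0.015 * google + 0.01 * youtube + rng.normal(0, 10, T)
```

The actual package draws its structure and rates from the user-supplied configuration rather than the hard-coded values shown here.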
```
git clone https://github.com/Tayerquach/causal_driven_attribution.git
cd causal_driven_attribution
```
After cloning the repository from GitHub, go to the main folder causal_driven_attribution and follow the steps below:
- If you have not installed Python, install it first: Install Python (Setup instruction).
  Note: this project uses Python 3.10.18.
- Install Conda (Conda Installation), Miniconda (Miniconda Installation), or a similar environment manager.
- Create a virtual environment:
  ```
  conda create --name [name of env] python==[version]
  ```
  Example:
  ```
  conda create --name causal python==3.10.18
  ```
- Check the list of conda environments:
  ```
  conda env list
  ```
- Activate the environment:
  ```
  conda activate [name of env]
  ```
- Install the Python packages:
  ```
  pip install -r requirements.txt
  ```
Note: installing the Python Graphviz bindings (via pip or poetry) is not enough. You must also install the system-level Graphviz engine.
MacOS installation
If you are running the project on macOS, install Graphviz using Homebrew after setting up the Python environment:
```
brew install graphviz
```
Ubuntu / Linux installation
If you are running the project on Ubuntu or another Linux distribution, install Graphviz using:
```
sudo apt update
sudo apt install graphviz
```
To generate marketing attribution data as shown in the Data Generation section, please refer to the notebook at reproduce/examples/example_simulation.ipynb. The simulated attribution data is shown below.
For running multiple simulations, first configure the number of seeds (N_SEEDS) in the reproduce/utils/configs.py file. If you would like to modify any parameter values, such as the number of channels or conversion rates, please do so in the same configs.py file.
reproduce/utils/configs.py
```python
# Node definitions
NODE_LOOKUP = {
    0: "Facebook",
    1: "Google Ads",
    2: "TikTok",
    3: "Youtube",
    4: "Affiliates",
    5: "conversion",
}

# Name of the activity (e.g., "impression", "click", "spend")
ACTIVITY_NAME = "impression"

# Target sink node
TARGET_NODE = "conversion"

# Time series length (days)
TIME_PERIODS = 365  # 1 year of daily data

# Base value range for impressions
BASE_RANGE = (1000, 2000)

# Number of random seeds to generate
N_SEEDS = 1000
```
Then, run the following command:
```
python -m reproduce.generate_data
```
After the simulation, the generated data will be saved in reproduce/results/data.
The choice of time lag in PCMCI determines how far back in time the algorithm looks when identifying causal relationships between time series variables. Selecting an appropriate lag is important for producing meaningful and computationally feasible causal models. In marketing, incrementality lift tests are commonly used to measure causal impact, i.e. to determine what actually caused a change in performance.
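Before fixing a maximum lag, one generic sanity check (not part of the CDA pipeline itself) is to inspect lagged cross-correlations between a channel and conversions; the location of the peak suggests a lower bound for the lag window. A minimal sketch on synthetic data with a built-in 7-day delay:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 365
impressions = rng.normal(1500, 100, T)

# Conversions respond to impressions with a 7-day delay plus noise (toy data).
conversion = np.empty(T)
conversion[:7] = rng.normal(30, 5, 7)
conversion[7:] = 0.02 * impressions[:-7] + rng.normal(0, 1, T - 7)

def lagged_corr(x, y, lag):
    """Correlation between x delayed by `lag` days and y."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]

corrs = [lagged_corr(impressions, conversion, lag) for lag in range(31)]
best_lag = int(np.argmax(corrs))
print(best_lag)  # peaks at the true 7-day delay
```

In real data the peak is rarely this sharp, but a tau_max well below the observed peak would prevent PCMCI from finding the corresponding edge at all.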
To re-run the experiment, run the following command:
```
python -m reproduce.run_experiment --seed_min=0 --seed_max=999 --tau_min=0 --tau_max=60 --priority=False
```
About the --priority Flag
- --priority=False (default): No prioritization is applied when evaluating candidate causal parents; all nodes are treated equally during causal discovery.
- --priority=True: Gives higher weight to edges directed toward the target node, effectively prioritizing causal detection for that variable.
This option is useful when the research focus is specifically on improving the accuracy of causal discovery for a designated outcome variable.
In this study, we used the default setting --priority=False, meaning no prioritization was applied.
To evaluate the causal graph discovery results (including optimal lag selection), open and run:
run_evaluation.ipynb
This notebook provides evaluation metrics for all 1,000 predicted causal DAGs, including:
- SHD (Structural Hamming Distance)
- SID (Structural Intervention Distance)
- AUC (Area Under the ROC Curve)
- TPR (True Positive Rate)
- FPR (False Positive Rate)
- F0.5 (Precision-weighted F-score)
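As an illustration of the first metric, SHD can be computed over adjacency matrices by counting missing, extra, and reversed edges, with a reversal counted once. This is a minimal sketch, not necessarily the notebook's exact implementation:

```python
import numpy as np

def shd(true_adj, pred_adj):
    """Structural Hamming Distance: missing + extra + reversed edges,
    where a reversal (i->j predicted as j->i) counts as one error."""
    diff = np.abs(np.asarray(true_adj) - np.asarray(pred_adj))
    # A reversal produces two disagreements, (i, j) and (j, i);
    # symmetrize and clip so each such pair counts once.
    sym = diff + diff.T
    sym[sym > 1] = 1
    return int(np.triu(sym).sum())

# Toy 3-node example: true graph 0->1, 1->2; predicted 1->0 (reversed), 1->2.
true_adj = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]])
pred_adj = np.array([[0, 0, 0],
                     [1, 0, 1],
                     [0, 0, 0]])
print(shd(true_adj, pred_adj))  # 1 (one reversed edge)
```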
Please refer to the paper for a detailed explanation of these parameter choices.
To estimate causal effects for a single simulated dataset, explore the example notebooks located in the examples/ directory:
examples/
│
├── example_simulation.ipynb
└── example_causal_driven_attribution.ipynb
- example_simulation.ipynb: Allows you to generate any synthetic dataset based on the structural model described in the paper.
- example_causal_driven_attribution.ipynb: Demonstrates how to apply the full CDA workflow to compute causal effect estimates for the generated sample.
These notebooks walk through the complete pipeline, from data generation to causal discovery and causal effect estimation, enabling you to reproduce or extend the experimental results. The figure below shows the example's resulting causal DAG and CDA estimates.
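To see why the discovered graph matters for effect estimation, here is a generic backdoor-adjustment illustration, not the package's actual estimator: regressing conversions on a channel alone is confounded by its parent, while adjusting for the parent recovers the true coefficient. All variable names and coefficients are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 365

# Toy structural model: Facebook drives both YouTube and conversions.
facebook = rng.normal(1500, 100, T)
youtube = 0.5 * facebook + rng.normal(0, 100, T)
conversion = 0.02 * facebook + 0.01 * youtube + rng.normal(0, 2, T)

# Naive estimate: regress conversions on YouTube alone (confounded).
naive = np.polyfit(youtube, conversion, 1)[0]

# Adjusted estimate: include YouTube's parent (Facebook) as a covariate,
# i.e. a simple backdoor adjustment given the known graph.
X = np.column_stack([youtube, facebook, np.ones(T)])
coef, *_ = np.linalg.lstsq(X, conversion, rcond=None)
adjusted = coef[0]

print(naive, adjusted)  # naive is biased upward; adjusted is close to 0.01
```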
To run the full multi-simulation experiment for the Causal-Driven Attribution (CDA) model, run:
```
python -m reproduce.run_cda_multi_simulation --num_per_layers=[N] --priority=[True/False] --tau_max=[T]
```
Parameter Description
- --num_per_layers=[N]: Specifies the number of synthetic samples to generate for each DAG depth (from 2 to 6 layers). For example, N=200 generates 200 datasets per depth.
- --priority=[True/False]: Determines whether PCMCI prioritizes edges pointing to the target node during causal discovery.
  - True: prioritization enabled
  - False: no prioritization (default)
- --tau_max=[T]: Maximum time lag considered in causal discovery using PCMCI. Set T according to your data's temporal structure.

Example:
```
python -m reproduce.run_cda_multi_simulation --num_per_layers=200 --priority=True --tau_max=45
```
Note: Use this script when you want to reproduce the large-scale benchmarking experiment reported in the paper.
This command generates synthetic datasets and evaluates the CDA framework across a range of causal structures. For each DAG depth from two to six layers, the script creates 200 samples per layer and applies PCMCI for causal discovery with the specified --priority and --tau_max settings.
The output is a dictionary containing, for each simulation seed:
- The number of causal layers (n_layers)
- The layered DAG structure (layers)
- Evaluation metrics under the true DAG: Relative RMSE, MAPE, Spearman correlation
- Evaluation metrics under the predicted DAG: Relative RMSE, MAPE, Spearman correlation
All results are stored in: reproduce/results/eval
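For reference, the three metrics can be computed as follows. The definitions here are the standard ones and may differ in minor details from those used in the paper:

```python
import numpy as np

def relative_rmse(true, pred):
    """RMSE normalized by the root mean square of the true values."""
    true, pred = np.asarray(true, float), np.asarray(pred, float)
    return np.sqrt(np.mean((true - pred) ** 2)) / np.sqrt(np.mean(true ** 2))

def mape(true, pred):
    """Mean absolute percentage error, in percent."""
    true, pred = np.asarray(true, float), np.asarray(pred, float)
    return float(np.mean(np.abs((true - pred) / true)) * 100)

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks (no ties in this toy data).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

# Toy example: true vs estimated causal effects for five channels.
true_effects = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
pred_effects = np.array([0.38, 0.27, 0.14, 0.10, 0.09])

print(relative_rmse(true_effects, pred_effects))  # ~0.073
print(mape(true_effects, pred_effects))           # ~9.8 (%)
print(spearman(true_effects, pred_effects))       # 1.0 (ranking preserved)
```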
After running the multi-simulation experiment, you can summarize and analyze the full set of results using reproduce/evaluate_multi_sample.ipynb.
This notebook performs the following tasks:
- Aggregates results from all simulated datasets across varying DAG depths.
- Compares the performance of CDA when using:
- the true causal graph, and
- the predicted causal graph obtained through PCMCI.
- Evaluates differences across causal layer depths (e.g., from 2 to 6 layers), showing how structural complexity affects estimation accuracy.
- Computes summary statistics such as mean and standard deviation for key metrics (RRMSE, MAPE, Spearman correlation).
- Visualizes performance trends, helping users understand how CDA behaves under different causal structures.
Use this notebook to reproduce the main benchmarking analysis reported in the paper and to inspect how the model performs across a broad range of synthetic causal environments.
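The aggregation step can be sketched roughly as follows; the per-seed result keys here are illustrative, modelled on the output dictionary described above rather than copied from the notebook:

```python
import pandas as pd

# Illustrative per-seed results (keys are assumptions, not the notebook's exact schema).
results = {
    0: {"n_layers": 2, "rrmse_true": 0.05, "rrmse_pred": 0.08},
    1: {"n_layers": 2, "rrmse_true": 0.06, "rrmse_pred": 0.10},
    2: {"n_layers": 3, "rrmse_true": 0.07, "rrmse_pred": 0.12},
    3: {"n_layers": 3, "rrmse_true": 0.05, "rrmse_pred": 0.14},
}

# One row per seed, one column per field.
df = pd.DataFrame.from_dict(results, orient="index")

# Mean and standard deviation of each metric, grouped by DAG depth.
summary = df.groupby("n_layers")[["rrmse_true", "rrmse_pred"]].agg(["mean", "std"])
print(summary)
```

The same grouping applies to MAPE and Spearman correlation, and plotting the grouped means against n_layers gives the performance-versus-depth trends discussed above.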
If you find CDA useful for your work, please cite our paper:
To be updated upon publication.
If you encounter an issue or have a specific request for CDA, please raise an issue.
This project welcomes contributions and suggestions. For a guide to contributing and a list of all contributors, check out CONTRIBUTING.md
- Georgios Filippou (Marsci, info@mar-sci.com)
- Boi Mai Quach (Marsci, boi@mar-sci.com)





