![DeepDB Overview](baselines/plots/overview.png "DeepDB Overview")

# Setup
Tested with Python 3.7 and Python 3.8.
```
git clone https://github.com/DataManagementLab/deepdb-public.git
cd deepdb-public
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```

For Python 3.8: the spflow installation sometimes fails. In this case, remove spflow from requirements.txt, install the remaining requirements, and then run
```
pip3 install spflow --no-deps
```

# Reproduce Experiments

## Cardinality Estimation
Download the [JOB dataset](http://homepages.cwi.nl/~boncz/job/imdb.tgz).
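
The README gives no command for this step; as a small illustration, the following Python sketch downloads and unpacks the archive. The target directory is an assumption, not something specified here; it should match the `--csv_path` you pass to the later commands.
```python
# Hedged sketch (not part of the repository): fetch and unpack the IMDB data
# used by the JOB benchmark.
import tarfile
import urllib.request

URL = "http://homepages.cwi.nl/~boncz/job/imdb.tgz"
TARGET_DIR = "../imdb-benchmark"  # assumed location; use whatever --csv_path you pass later

archive_path, _ = urllib.request.urlretrieve(URL, "imdb.tgz")
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(TARGET_DIR)
```
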
python3 maqp.py --evaluate_confidence_intervals
    --confidence_upsampling_factor 100
    --confidence_sample_size 10000000
```

### TPC-DS (Single Table) Pipeline
As an additional example of how to work with DeepDB, we provide a pipeline for a single table of the TPC-DS schema, using the queries in `./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql`. As a prerequisite, you need a sample of 10 million tuples of the store_sales table at `../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv`. Afterwards, you can run the following commands. To compute the ground truth, you need a Postgres instance containing a 1TB TPC-DS dataset.

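The repository does not ship this sample. As a rough, hedged illustration, the following sketch shows how such a pipe-separated sample could be drawn with pandas from a full store_sales dump (e.g. the `store_sales.dat` file produced by the TPC-DS dsdgen tool). The source file name, the total row count, and the header handling are assumptions; match the separator and columns to what your schema file expects.
```python
# Minimal sketch (not part of the repository): draw a ~10M-tuple sample of
# store_sales and write it where the TPC-DS single-table pipeline expects it.
import pandas as pd

SOURCE = "../mqp-data/tpc-ds-benchmark/store_sales.dat"          # assumed full dump (pipe-separated)
TARGET = "../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv"  # path used by the commands below
SAMPLE_SIZE = 10_000_000
TOTAL_ROWS = 2_879_987_999  # approximate store_sales cardinality at 1TB scale (assumption)

frac = SAMPLE_SIZE / TOTAL_ROWS
sampled_chunks = []
# Stream the dump in chunks so the full table never has to fit in memory;
# sampling each chunk with the same fraction yields ~SAMPLE_SIZE rows overall.
for chunk in pd.read_csv(SOURCE, sep="|", header=None, chunksize=5_000_000):
    sampled_chunks.append(chunk.sample(frac=frac, random_state=1))

pd.concat(sampled_chunks).to_csv(TARGET, sep="|", header=False, index=False)
```
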
Generate HDF files from the CSV files
```
python3 maqp.py --generate_hdf
    --dataset tpc-ds-1t
    --csv_seperator |
    --csv_path ../mqp-data/tpc-ds-benchmark
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
```

Learn the ensemble
```
python3 maqp.py --generate_ensemble
    --dataset tpc-ds-1t
    --samples_per_spn 10000000
    --ensemble_strategy single
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
    --ensemble_path ../mqp-data/tpc-ds-benchmark/spn_ensembles
    --rdc_threshold 0.3
    --post_sampling_factor 10
```

Compute ground truth
```
python3 maqp.py --aqp_ground_truth
    --dataset tpc-ds-1t
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
    --target_path ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
    --database_name tcpds
```

Evaluate the AQP queries
```
python3 maqp.py --evaluate_aqp_queries
    --dataset tpc-ds-1t
    --target_path ./baselines/aqp/results/deepDB/tpcds1t_model_based.csv
    --ensemble_location ../mqp-data/tpc-ds-benchmark/spn_ensembles/ensemble_single_tpc-ds-1t_10000000.pkl
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
```

# How to experiment with DeepDB on a new Dataset
- Specify a new schema in the schemas folder (a hypothetical sketch is shown after this list)
- Due to the current implementation, make sure to declare
  - the primary key,
  - the filename of the csv sample file,
  - the correct table size and sample rate,
  - the relationships among tables if you do not just run queries over a single table,
  - any non-key functional dependencies (this is rather an implementation detail),
  - and include all columns in the no-compression list by default (as done for the IMDB benchmark)
- To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
- Generate the HDF/sampled HDF files and learn the RSPN ensemble
- Use the RSPN ensemble to answer queries
- For reference, please check the commands to reproduce the results of the paper
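
To illustrate what such a schema declaration might look like, here is a hypothetical, simplified sketch for a two-table dataset. It only mirrors the structure of the files already in the schemas folder; the import path, class names, and constructor parameters (`SchemaGraph`, `Table`, `csv_file_location`, `table_size`, `sample_rate`, `primary_key`, `no_compression`, `add_relationship`) are assumptions and should be checked against an existing schema such as the IMDB one before use.
```python
# Hypothetical sketch of a schema file in the schemas folder; table names,
# attributes, and parameter names are assumptions modeled on the existing
# benchmark schemas, not taken from the repository.
from ensemble_compilation.graph_representation import SchemaGraph, Table  # assumed import path


def gen_my_dataset_schema(csv_path):
    schema = SchemaGraph()

    # Fact table: declare attributes, primary key, the csv sample file, the
    # true table size and sample rate, and (by default) put all columns on
    # the no-compression list, as done for the IMDB benchmark.
    schema.add_table(Table(
        'orders',
        attributes=['o_id', 'o_customer_id', 'o_amount', 'o_date'],
        csv_file_location=csv_path.format('orders_sampled'),
        table_size=1_000_000_000,                 # true cardinality of the full table
        sample_rate=10_000_000 / 1_000_000_000,   # fraction covered by the csv sample
        primary_key=['o_id'],
        no_compression=['o_id', 'o_customer_id', 'o_amount', 'o_date'],
    ))

    # Dimension table, here assumed to be stored without sampling.
    schema.add_table(Table(
        'customers',
        attributes=['c_id', 'c_region'],
        csv_file_location=csv_path.format('customers'),
        table_size=10_000_000,
        sample_rate=1.0,
        primary_key=['c_id'],
        no_compression=['c_id', 'c_region'],
    ))

    # Declare relationships only if your queries span more than one table
    # (assumed signature: child table/attribute referencing parent table/attribute).
    schema.add_relationship('orders', 'o_customer_id', 'customers', 'c_id')

    return schema
```
You would also have to make the new dataset name known wherever maqp.py resolves the --dataset argument to a schema, analogous to the existing benchmarks.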