Commit 84ee930

ECAI 23 - code and data.
1 parent 9251949 commit 84ee930

174 files changed, +32336 -0 lines changed

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
*~
*#
#*
.#*
.DS_Store*

README.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# ECAI 2023 - Graph Neural Networks For Mapping Variables Between Programs

This repository contains the code and data used for the paper "*Graph Neural Networks For Mapping Variables Between Programs*", accepted at ECAI 2023.

We present a novel graph program representation that is agnostic to variable names: for each variable in the program, it contains a representative variable node connected to all of that variable's occurrences. Furthermore, we use GNNs for mapping variables between programs based on our program representation, ignoring the variables' identifiers.
We represent each program as a graph using the script gen_progs_repr.py, as explained in [1, 2].

To understand the entire pipeline, from the dataset generation to the repair process, the interested reader is referred to our main script 'run_all.sh'.

The generation of the datasets (training, validation and evaluation) and the training of our GNN have been commented out in that script, since they take a few hours to run. Only the mapping computations and the program repair tasks are left uncommented. Only the evaluation dataset is available in this repo.

- How to execute:

```
chmod +x run_all.sh
bash run_all.sh
```

## Installation Requirements

The following script creates a new conda environment named 'gnn_env' and installs all the required dependencies in it.

```
chmod +x config_gnn.sh
bash config_gnn.sh
```

## Introductory Programming Assignments (IPAs) Dataset


To generate our evaluation set of C programs, we used [MultIPAs](https://github.com/pmorvalho/MultIPAs) [3], a program transformation framework, to augment our dataset of IPAs, [C-Pack-IPAs](https://github.com/pmorvalho/C-Pack-IPAs) [4].

## References

[1] Pedro Orvalho, Jelle Piepenbrock, Mikoláš Janota, and Vasco Manquinho. Graph Neural Networks For Mapping Variables Between Programs. ECAI 2023. [PDF](). *[Accepted for Publication]*

[2] Pedro Orvalho, Jelle Piepenbrock, Mikoláš Janota, and Vasco Manquinho. Project Proposal: Learning Variable Mappings to Repair Programs. AITP 2022. [PDF](http://aitp-conference.org/2022/abstract/AITP_2022_paper_15.pdf).

[3] Pedro Orvalho, Mikoláš Janota, and Vasco Manquinho. MultIPAs: Applying Program Transformations to Introductory Programming Assignments for Data Augmentation. In 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022. [PDF](https://dl.acm.org/doi/10.1145/3540250.3558931).

[4] Pedro Orvalho, Mikoláš Janota, and Vasco Manquinho. C-Pack of IPAs: A C90 Program Benchmark of Introductory Programming Assignments. 2022. [PDF](https://arxiv.org/pdf/2206.08768.pdf).
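
The README's description of the representation (one representative node per variable, linked to every occurrence, with identifiers left out) can be illustrated with a toy sketch. This is not the code in gen_progs_repr.py, which is not shown here: the token-list input, the type labels `VAR_OCC` and `VAR_REP`, and the edge label `occurrence` are all assumptions made for illustration.

```python
def build_toy_graph(tokens, variables):
    """Toy sketch: build a name-agnostic graph from a flat token list.

    tokens    -- list of strings (a stand-in for a real C AST)
    variables -- set of the token strings that are variable names
    Returns (node_types, edges, rep_nodes).
    """
    node_types = []   # node id -> type label; identifiers never appear here
    edges = []        # (source, target, label) triples
    occurrences = {}  # variable name -> ids of its occurrence nodes

    # First pass: one node per token; variable occurrences are anonymised.
    for tok in tokens:
        node_id = len(node_types)
        if tok in variables:
            node_types.append("VAR_OCC")
            occurrences.setdefault(tok, []).append(node_id)
        else:
            node_types.append(tok)

    # Second pass: one representative node per variable, in order of first
    # appearance, connected to all of that variable's occurrences.
    rep_nodes = {}
    for name, occ_ids in occurrences.items():
        rep_id = len(node_types)
        node_types.append("VAR_REP")
        rep_nodes[name] = rep_id
        for occ in occ_ids:
            edges.append((rep_id, occ, "occurrence"))
    return node_types, edges, rep_nodes


# Renaming the variables consistently should leave the graph unchanged.
prog_a = ["i", "=", "0", ";", "s", "=", "s", "+", "i", ";"]
renaming = {"i": "k", "s": "total"}
prog_b = [renaming.get(t, t) for t in prog_a]

types_a, edges_a, reps_a = build_toy_graph(prog_a, {"i", "s"})
types_b, edges_b, reps_b = build_toy_graph(prog_b, {"k", "total"})
same_graph = (types_a == types_b) and (edges_a == edges_b)
print(same_graph)
```

In this toy setting, mapping variables between two programs amounts to pairing up their `VAR_REP` nodes, which is the role the GNN plays in the repository.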

config_gnn.sh

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
#!/usr/bin/env bash
#Title : config_gnn.sh
#Usage : bash config_gnn.sh
#Author : pmorvalho
#Date : July 29, 2022
#Description : Jelle's commands to config the GNN's environment, plus the python packages needed to run MultIPAs
#Notes :
# (C) Copyright 2022 Pedro Orvalho.
#==============================================================================

# load conda's shell functions so that 'conda activate' works inside this script
source ~/anaconda3/etc/profile.d/conda.sh
conda create -n gnn_env python=3.9
conda activate gnn_env
#conda install pytorch torchvision torchaudio cpuonly -c pytorch
#pip3 install torch torchvision torchaudio
pip install torch==1.11.0+cpu torchvision==0.12.0+cpu torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cpu
# the wheels here are CPU-only; for a GPU build, check with nvidia-smi which cuda tag (e.g. 'cu102') should replace 'cpu'
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cpu.html
pip install pycparser==2.21
conda activate gnn_env


eval-dataset.zip

10.7 MB
Binary file not shown.

eval.py

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
import argparse
from sys import argv
import gzip
import pickle
from gnn import VariableMappingGNN
import torch
import time


def preprocess_data_test_time(left_ast, right_ast):
    left_edge_index_pairs = []
    left_edge_types = []
    for triple in left_ast['edges']:
        left_edge_index_pairs.append([triple[0], triple[1]])
        left_edge_types.append(triple[2])

    left_node_types = [left_ast['nodes2types'][k] for k in left_ast['nodes2types']]

    right_edge_index_pairs = []
    right_edge_types = []

    for triple in right_ast['edges']:
        right_edge_index_pairs.append([triple[0], triple[1]])
        right_edge_types.append(triple[2])

    right_node_types = [right_ast['nodes2types'][k] for k in right_ast['nodes2types']]

    # var_norm_index = {k: e for (e, k) in enumerate(left_ast['vars2id'])}
    # var_norm_index2 = {k: e for (e, k) in enumerate(right_ast['vars2id'])}

    left_node_types = torch.as_tensor(left_node_types)
    right_node_types = torch.as_tensor(right_node_types)

    left_edge_index_pairs = torch.as_tensor(left_edge_index_pairs)
    right_edge_index_pairs = torch.as_tensor(right_edge_index_pairs)

    left_edge_types = torch.as_tensor(left_edge_types)
    right_edge_types = torch.as_tensor(right_edge_types)

    return ((left_node_types, left_edge_index_pairs, left_edge_types, left_ast),
            (right_node_types, right_edge_index_pairs, right_edge_types, right_ast))

def load_model(model_location):
    device = "cpu"
    with gzip.open("types2int.pkl.gz", 'rb') as f:
        node_type_mapping = pickle.load(f)
    print(node_type_mapping)
    # find maximum index
    num_types = node_type_mapping['diff_types']
    print(num_types)

    gnn = VariableMappingGNN(num_types, device).to(device)

    gnn.load_state_dict(
        torch.load(model_location, map_location=torch.device('cpu')))

    return gnn

def predict(gnn_model, left_ast_file, right_ast_file):

    with gzip.open(left_ast_file, 'rb') as f:
        left_ast = pickle.load(f)

    with gzip.open(right_ast_file, 'rb') as f:
        right_ast = pickle.load(f)

    left_ast, right_ast = preprocess_data_test_time(left_ast, right_ast)

    op_var_dict, op_dist_dict = gnn_model.test_time_output((left_ast, right_ast))

    return op_var_dict, op_dist_dict

def save_var_maps(var_dict, p_name):
    # write the mapping as a gzip-compressed pickle
    with gzip.open(p_name, 'wb') as fp:
        pickle.dump(var_dict, fp)


def parser():
    parser = argparse.ArgumentParser(prog='eval.py', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-ia', '--inc_ast', help='Incorrect program\'s AST.')
    parser.add_argument('-ca', '--cor_ast', help='Correct program\'s AST.')
    parser.add_argument('-m', '--var_map', help='Variable mapping path.')
    parser.add_argument('-md', '--var_map_dist', help='Path for the distribution of each variable mapping.')
    parser.add_argument('-gm', '--gnn_model', help='GNN model to use.')
    parser.add_argument('-t', '--time', help='File where the time spent predicting the mapping will be written to.')
    parser.add_argument('-v', '--verbose', action='store_true', default=False, help='Prints debugging information.')
    args = parser.parse_args(argv[1:])
    return args


if __name__ == "__main__":
    args = parser()

    model_location = args.gnn_model
    model = load_model(model_location)

    buggy_ast_file = args.inc_ast
    correct_ast_file = args.cor_ast

    time_0 = time.time()
    model_output, model_output_distributions = predict(model, buggy_ast_file, correct_ast_file)
    time_f = time.time()-time_0
    with open(args.time, 'w+') as writer:
        writer.write("Time: {t}".format(t=round(time_f,3)))

    print(model_output)
    save_var_maps(model_output, args.var_map)
    save_var_maps(model_output_distributions, args.var_map_dist)
    # TODO Use the {model_output_distributions} to sample
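
The AST files that eval.py reads and the variable mappings it writes are gzip-compressed pickles. The torch-free sketch below round-trips a toy AST dictionary through that format and splits its edge triples the same way `preprocess_data_test_time` does. Only the keys `edges` and `nodes2types` come from the code in this commit; the dictionary contents, the helper `split_edges`, and the file name `toy_ast.pkl.gz` are assumptions for illustration.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical toy AST: (source, target, edge_type) triples plus a
# node -> integer type id mapping. Real files are produced elsewhere.
toy_ast = {
    "edges": [(0, 1, 0), (1, 2, 1), (0, 2, 0)],
    "nodes2types": {0: 3, 1: 5, 2: 5},
}


def split_edges(ast):
    """Separate edge triples into index pairs and edge types."""
    pairs = [[src, dst] for (src, dst, _) in ast["edges"]]
    types = [etype for (_, _, etype) in ast["edges"]]
    return pairs, types


# Round-trip through a gzip-compressed pickle in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_ast.pkl.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(toy_ast, f)
    with gzip.open(path, "rb") as f:
        loaded = pickle.load(f)

pairs, edge_types = split_edges(loaded)
node_types = list(loaded["nodes2types"].values())
print(pairs, edge_types, node_types)
```

In eval.py the three resulting lists are then converted with `torch.as_tensor` before being handed to the model.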

gen_eval_dataset.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/usr/bin/python
#Title : gen_eval_dataset.py
#Usage : python gen_eval_dataset.py -h
#Author : pmorvalho
#Date : July 25, 2022
#Description : For each program mutation and mutilation configuration (input dir), this script randomly chooses a mutilated program for each student and guarantees that this program is semantically incorrect.
#Notes :
#Python Version: 3.8.5
# (C) Copyright 2022 Pedro Orvalho.
#==============================================================================

import argparse
from sys import argv
import os
import pathlib
import random
from helper import *


def choose_incorrect_programs(input_dir, output_dir):
    # note: args.ipa and args.verbose are read from the module-level args set in __main__
    subfolders = [f.path for f in os.scandir(input_dir) if f.is_dir()]
    for sf in subfolders:
        stu_id = str(sf).split("/")[-1]
        muts = list(pathlib.Path(sf+"/").glob('*/')) # mutations directories
        prog_found = False
        while len(muts) > 0 and not prog_found:
            mut = random.sample(muts, 1)[0]
            muts.remove(mut)
            m = int(str(mut).split("/")[-1].replace("-",""))
            if m == 0: # ignoring programs that were not mutated (index 0, 00, 000). E.g. programs that have a for-loop and are inside the for2while directory.
                print("Ignoring ", mut)
                continue
            progs = list(pathlib.Path("{d}".format(d=mut)).glob('*.c')) # mutilated programs
            while len(progs) > 0:
                p = random.sample(progs, 1)[0]
                progs.remove(p)
                p = str(p)
                if not check_program(p, args.ipa):
                    prog_found=True
                    if args.verbose:
                        print("Program ",p, " has been chosen.")
                    p=p.split("/")[-1][:-2]
                    os.system("cp {d}/{p}.c {o}/{s}.c".format(d=mut, p=p, o=output_dir, s=stu_id))
                    os.system("cp {d}/ast-{p}.pkl.gz {o}/ast-{s}.pkl.gz".format(d=mut, p=p, o=output_dir, s=stu_id))
                    os.system("cp {d}/bug_map-{z}-{p}.pkl.gz {o}/bug_map-{s}.pkl.gz".format(d=mut, z="0"*len(p),p=p, o=output_dir, s=stu_id))
                    os.system("cp {d}/var_map-{z}_{p}.pkl.gz {o}/var_map-{s}.pkl.gz".format(d=mut, z="0"*len(p),p=p, o=output_dir, s=stu_id))
                    os.system("rm {d}/*.o".format(d=mut))
                    break
            # os.system("rm {d}/*.o".format(d=mut))


def parser():
    parser = argparse.ArgumentParser(prog='gen_eval_dataset.py', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-d', '--input_dir', nargs='?',help='input directory.')
    parser.add_argument('-e', '--ipa', help='Name of the lab and exercise (IPA) so we can check the IO tests.')
    parser.add_argument('-i', '--ignore', action='store_true', default=False, help='ignores...')
    parser.add_argument('-o', '--output_dir', nargs='?', help='the name of the output dir.')
    parser.add_argument('-v', '--verbose', action='store_true', default=False, help='Prints debugging information.')
    args = parser.parse_args(argv[1:])
    return args

if __name__ == '__main__':
    args = parser()
    choose_incorrect_programs(args.input_dir, args.output_dir)
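
The selection logic in gen_eval_dataset.py draws candidates at random, without replacement, until one fails the correctness check. The standalone sketch below shows the same idea in isolation. `is_correct` is a hypothetical stand-in for `check_program` from `helper` (which is not shown here), and the candidate file names are invented.

```python
import random


def pick_first_incorrect(candidates, is_correct, rng=random):
    """Return a randomly chosen candidate that fails is_correct, or None.

    Shuffling a copy and scanning it is equivalent to repeatedly sampling
    one element and removing it from the pool.
    """
    pool = list(candidates)
    rng.shuffle(pool)
    for cand in pool:
        if not is_correct(cand):
            return cand
    return None


# Invented example data: only one of the three programs passes the tests.
programs = ["p-0-1.c", "p-0-2.c", "p-0-3.c"]
passing = {"p-0-2.c"}

chosen = pick_first_incorrect(programs, lambda p: p in passing, random.Random(0))
none_left = pick_first_incorrect(["p-0-2.c"], lambda p: p in passing)
print(chosen, none_left)
```

For the copy step, `shutil.copy` is an alternative to `os.system("cp ...")` that avoids shell quoting problems with unusual path names.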

gen_gnn_variable_mappings.sh

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
#!/usr/bin/env bash
#Title : gen_gnn_variable_mappings.sh
#Usage : bash gen_gnn_variable_mappings.sh
#Author : pmorvalho
#Date : July 29, 2022
#Description : Generates a var mapping for each pair of incorrect/correct programs in the eval dataset using the GNN
#Notes :
# (C) Copyright 2022 Pedro Orvalho.
#==============================================================================

# to activate the gnn_env conda environment
if [ -f ~/opt/anaconda3/etc/profile.d/conda.sh ]; then
    # if running on MacOS
    source ~/opt/anaconda3/etc/profile.d/conda.sh
else
    # if running on ARSR Machines (Linux)
    source ~/anaconda3/etc/profile.d/conda.sh
fi

conda activate gnn_env

initial_dir=$(pwd)
data_dir="eval-dataset"
gnn_models_dir="gnn_models"
var_maps_dir=$initial_dir"/variable_mappings"
TIMEOUT_REPAIR=600 # in seconds

#labs=("lab02" "lab03" "lab04") # we are not considering lab05 for this dataset, and only year 2020/2021 has lab05.
labs=("lab02")
# we will only use lab02 of the second year as the evaluation dataset
# models=("wco" "vm" "ed" "all")
#models=("wco")
#models=("vm")
#models=("ed")
models=("all")

for((m=0;m<${#models[@]};m++));
do
    model=${models[$m]}
    echo $model
    results_dir="results/var_maps-"$model
    mkdir -p $results_dir
    for((l=0;l<${#labs[@]};l++));
    do
        lab=${labs[$l]}
        for ex in $(find $data_dir/incorrect_submissions/$lab/ex* -maxdepth 0 -type d);
        do
            ex=$(echo $ex | rev | cut -d '/' -f 1 | rev)
            for mut_dir in $(find $data_dir/incorrect_submissions/$lab/$ex/* -maxdepth 0 -mindepth 0 -type d);
            do
                mut=$(echo $mut_dir | rev | cut -d '/' -f 1 | rev)
                for mutl_dir in $(find $data_dir/incorrect_submissions/$lab/$ex/$mut/* -maxdepth 0 -mindepth 0 -type d);
                do
                    mutl=$(echo $mutl_dir | rev | cut -d '/' -f 1 | rev)
                    echo "Dealing with "$mutl_dir
                    for p in $(find $data_dir/incorrect_submissions/$lab/$ex/$mut/$mutl/*.c -maxdepth 0 -mindepth 0 -type f);
                    do
                        stu_id=$(echo $p | rev | cut -d '/' -f 1 | rev)
                        # strip only the trailing ".c" extension (an unescaped "." would match any character)
                        stu_id=$(echo $stu_id | sed 's/\.c$//')
                        d=$results_dir/$lab/$ex/$mut/$mutl/$stu_id-$model
                        mkdir -p $d $initial_dir/variable_mappings/$model/$lab/$ex/$mut/$mutl
                        c_prog_ast=$(find $data_dir/correct_submissions/$lab/$ex/"ast-"$stu_id* -type f | tail -n 1)
                        i_prog_ast=$mutl_dir/"ast-"$stu_id".pkl.gz"
                        gnn_model=$(find $gnn_models_dir/$model*.pt -type f | tail -1 )
                        # /home/pmorvalho/runsolver/src/runsolver -o $d/out.o -w $d/watcher.w -v $d/var.v -W $TIMEOUT_REPAIR --rss-swap-limit 32000 \
                        python3 eval.py -ia $i_prog_ast -ca $c_prog_ast -m $var_maps_dir/$model/$lab/$ex/$mut/$mutl"/var_map-"$stu_id".pkl.gz" -md $var_maps_dir/$model/$lab/$ex/$mut/$mutl"/var_map_distributions-"$stu_id".pkl.gz" -gm $gnn_model -t $d/var_map_time.txt > $d/out.o
                    done
                    # wait
                done
            done
        done
    done
done
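
The innermost loop of gen_gnn_variable_mappings.sh derives a student id from each `.c` file name and builds the eval.py arguments from the lab, exercise, mutation and mutilation directory levels. The sketch below reproduces that path arithmetic in Python with `pathlib`. The helper `mapping_paths` and the example path (including the directory names `mut-1` and `mutl-2` and the id `stu_cc1`) are made up; only the directory nesting and the file name prefixes follow the script.

```python
from pathlib import PurePosixPath


def mapping_paths(prog_path, model, var_maps_dir="variable_mappings"):
    """Derive the student id and the eval.py file arguments for one program."""
    p = PurePosixPath(prog_path)
    # stem removes only the final suffix, so ids that themselves contain
    # the letter "c" are left intact.
    stu_id = p.stem
    # The four directory levels directly above the file.
    lab, ex, mut, mutl = p.parts[-5:-1]
    out_dir = PurePosixPath(var_maps_dir, model, lab, ex, mut, mutl)
    return {
        "stu_id": stu_id,
        "incorrect_ast": str(p.parent / f"ast-{stu_id}.pkl.gz"),
        "var_map": str(out_dir / f"var_map-{stu_id}.pkl.gz"),
        "var_map_dist": str(out_dir / f"var_map_distributions-{stu_id}.pkl.gz"),
    }


paths = mapping_paths(
    "eval-dataset/incorrect_submissions/lab02/ex01/mut-1/mutl-2/stu_cc1.c",
    "all",
)
for key, value in paths.items():
    print(key, value)
```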
