Hi,
I read your code and found two problems that are likely holding back its performance.
First, as far as I know, previous papers use the head word of each entity mention as the candidate argument, but you use the mention's whole word sequence, which hurts argument-level performance a lot.
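To illustrate the first point, here is a minimal sketch of what I mean. The `Mention` record, the `candidate_span` helper, and the example sentence are all made up for illustration and are not taken from your code; I am also assuming token offsets with an exclusive end.

```python
from typing import NamedTuple, Tuple


class Mention(NamedTuple):
    # Hypothetical entity-mention record (not from the repository).
    # Offsets are token indices: inclusive start, exclusive end.
    extent: Tuple[int, int]  # the whole word sequence of the mention
    head: Tuple[int, int]    # the head word(s) only


def candidate_span(mention: Mention, use_head: bool = True) -> Tuple[int, int]:
    """Return the token span used as the argument candidate."""
    return mention.head if use_head else mention.extent


tokens = ["the", "former", "president", "of", "France", "resigned"]
mention = Mention(extent=(0, 5), head=(2, 3))

head_candidate = candidate_span(mention, use_head=True)
full_candidate = candidate_span(mention, use_head=False)

# Text covered by each choice of candidate span.
head_text = " ".join(tokens[head_candidate[0]:head_candidate[1]])
full_text = " ".join(tokens[full_candidate[0]:full_candidate[1]])
```

If the scorer compares candidates against head-word offsets, the full-extent span above would not count as a match even though it covers the right entity.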
Second, during training you train the argument-level classifier on predicted triggers. I believe it should be trained on the gold triggers instead.
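For the second point, the change I have in mind is roughly the following. This is only a sketch: `Trigger`, `triggers_for_argument_step`, and the example values are hypothetical names I chose here, not identifiers from your code.

```python
from typing import List, Tuple

# Hypothetical trigger representation: (token index, event type).
Trigger = Tuple[int, str]


def triggers_for_argument_step(
    gold: List[Trigger],
    predicted: List[Trigger],
    training: bool,
) -> List[Trigger]:
    """Pick the triggers that the argument-level classifier conditions on.

    Gold triggers are used while training; predicted triggers are used
    at evaluation or inference time, when gold labels are unavailable.
    """
    return gold if training else predicted


gold = [(5, "End-Position")]
predicted = [(4, "Transport")]  # an example of a wrong trigger prediction

train_triggers = triggers_for_argument_step(gold, predicted, training=True)
eval_triggers = triggers_for_argument_step(gold, predicted, training=False)
```

With this switch, trigger prediction errors do not leak into the argument classifier's training examples, while evaluation still runs on the full predicted pipeline.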