-
Notifications
You must be signed in to change notification settings - Fork 205
Description
The below function:
def get_count_risk_rolling_window(terminal_transactions, delay_period=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"):
terminal_transactions=terminal_transactions.sort_values('TX_DATETIME')
terminal_transactions.index=terminal_transactions.TX_DATETIME
NB_FRAUD_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').sum()
NB_TX_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').count()
for window_size in windows_size_in_days:
NB_FRAUD_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').sum()
NB_TX_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').count()
NB_FRAUD_WINDOW=NB_FRAUD_DELAY_WINDOW-NB_FRAUD_DELAY
NB_TX_WINDOW=NB_TX_DELAY_WINDOW-NB_TX_DELAY
RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW
terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
terminal_transactions[feature+'_RISK_'+str(window_size)+'DAY_WINDOW']=list(RISK_WINDOW)
terminal_transactions.index=terminal_transactions.TRANSACTION_ID
# Replace NA values with 0 (all undefined risk scores where NB_TX_WINDOW is 0)
terminal_transactions.fillna(0,inplace=True)
return terminal_transactions
May assign features to the wrong transactions.
Looking at the steps:
- It first assigns 'TX_DATETIME' as an index to [terminal_transactions] (which is necessary for the .rolling() operation
terminal_transactions.index = terminal_transactions.TX_DATETIME
-
It then creates new columns by converting a Series to lists
terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW'] = list(NB_TX_WINDOW) -
The [NB_TX_WINDOW] is a Pandas Series indexed by 'TX_DATETIME'. When converted to a list, the index is stripped away and the values are assigned based completely on pure row order. If the DataFrame is already manipulated this won't work.
-
The index is reassigned to a different column 'TRANSACTION_ID'. This new feature columns that were added before '_NB_TX..', 'RISK...', were calculated and placed in rows based on the old 'TX_DATETIME' index. Now, the rows are moved to a new index so the features will be assigned to the wrong transactions.
terminal_transactions.index = terminal_transactions.TRANSACTION_ID
Example: The feature calculated for a transaction at 2023-01-05 10:00:00 might now be attached to a transaction with a different TRANSACTION_ID that happened on a completely different day.
This can be corrected (in case this mis-assignment comes up) by using the .loc() and the index directly:
df.loc[df_index_datetime.index, feature + '_NB_TX_' + str(window_size) + 'DAY_WINDOW'] = NB_TX_WINDOW.values
df.loc[df_index_datetime.index, feature + '_RISK_' + str(window_size) + 'DAY_WINDOW'] = RISK_WINDOW.values
PS: This book is an amazing resource. Precise, short, to the point.