Skip to content

Index on pandas dataframes #23

@KonstantinosTsoumas

Description

@KonstantinosTsoumas

The below function:

def get_count_risk_rolling_window(terminal_transactions, delay_period=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"):
    
    terminal_transactions=terminal_transactions.sort_values('TX_DATETIME')
    
    terminal_transactions.index=terminal_transactions.TX_DATETIME
    
    NB_FRAUD_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').sum()
    NB_TX_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').count()
    
    for window_size in windows_size_in_days:
    
        NB_FRAUD_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').sum()
        NB_TX_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').count()
    
        NB_FRAUD_WINDOW=NB_FRAUD_DELAY_WINDOW-NB_FRAUD_DELAY
        NB_TX_WINDOW=NB_TX_DELAY_WINDOW-NB_TX_DELAY
    
        RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW
        
        terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        terminal_transactions[feature+'_RISK_'+str(window_size)+'DAY_WINDOW']=list(RISK_WINDOW)
        
    terminal_transactions.index=terminal_transactions.TRANSACTION_ID
    
    # Replace NA values with 0 (all undefined risk scores where NB_TX_WINDOW is 0) 
    terminal_transactions.fillna(0,inplace=True)
    
    return terminal_transactions

May assign features to the wrong transactions.
Looking at the steps:

  • It first assigns 'TX_DATETIME' as an index to [terminal_transactions] (which is necessary for the .rolling() operation
terminal_transactions.index = terminal_transactions.TX_DATETIME
  • It then creates new columns by converting a Series to lists

     terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW'] = list(NB_TX_WINDOW)
    
  • The [NB_TX_WINDOW] is a Pandas Series indexed by 'TX_DATETIME'. When converted to a list, the index is stripped away and the values are assigned based completely on pure row order. If the DataFrame is already manipulated this won't work.

  • The index is reassigned to a different column 'TRANSACTION_ID'. This new feature columns that were added before '_NB_TX..', 'RISK...', were calculated and placed in rows based on the old 'TX_DATETIME' index. Now, the rows are moved to a new index so the features will be assigned to the wrong transactions.

terminal_transactions.index = terminal_transactions.TRANSACTION_ID  

Example: The feature calculated for a transaction at 2023-01-05 10:00:00 might now be attached to a transaction with a different TRANSACTION_ID that happened on a completely different day.

This can be corrected (in case this mis-assignment comes up) by using the .loc() and the index directly:

        df.loc[df_index_datetime.index, feature + '_NB_TX_' + str(window_size) + 'DAY_WINDOW'] = NB_TX_WINDOW.values
        df.loc[df_index_datetime.index, feature + '_RISK_' + str(window_size) + 'DAY_WINDOW'] = RISK_WINDOW.values

PS: This book is an amazing resource. Precise, short, to the point.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions