Index on pandas dataframes

The function below:

```
def get_count_risk_rolling_window(terminal_transactions, delay_period=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"):
    
    terminal_transactions=terminal_transactions.sort_values('TX_DATETIME')
    
    terminal_transactions.index=terminal_transactions.TX_DATETIME
    
    NB_FRAUD_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').sum()
    NB_TX_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').count()
    
    for window_size in windows_size_in_days:
    
        NB_FRAUD_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').sum()
        NB_TX_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').count()
    
        NB_FRAUD_WINDOW=NB_FRAUD_DELAY_WINDOW-NB_FRAUD_DELAY
        NB_TX_WINDOW=NB_TX_DELAY_WINDOW-NB_TX_DELAY
    
        RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW
        
        terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        terminal_transactions[feature+'_RISK_'+str(window_size)+'DAY_WINDOW']=list(RISK_WINDOW)
        
    terminal_transactions.index=terminal_transactions.TRANSACTION_ID
    
    # Replace NA values with 0 (all undefined risk scores where NB_TX_WINDOW is 0) 
    terminal_transactions.fillna(0,inplace=True)
    
    return terminal_transactions

```

It may assign features to the wrong transactions.
Looking at the steps:

- It first assigns 'TX_DATETIME' as the index of `terminal_transactions`, which is necessary for the time-based .rolling() operation:
```
terminal_transactions.index = terminal_transactions.TX_DATETIME
```

- It then creates new columns by converting each Series to a list:
```
terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW'] = list(NB_TX_WINDOW)
```

- `NB_TX_WINDOW` is a pandas Series indexed by 'TX_DATETIME'. Converting it to a list strips the index, so the values are assigned purely by row position. This only works while the DataFrame's row order matches the order in which the Series was computed. If the rows have been reordered or filtered in between, the values land on the wrong rows.


- The index is then reassigned to a different column, 'TRANSACTION_ID'. The new feature columns added earlier ('_NB_TX_...', '_RISK_...') were calculated against the old 'TX_DATETIME' index but placed by position rather than by label. Any mismatch in row order therefore carries over to the new index, and the features end up attached to the wrong transactions:
```
terminal_transactions.index = terminal_transactions.TRANSACTION_ID  
```



Example: The feature calculated for a transaction at 2023-01-05 10:00:00 might now be attached to a transaction with a different TRANSACTION_ID that happened on a completely different day.
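
A minimal sketch of this failure mode, using made-up toy data (the values and the `NB_FRAUD_2DAY` column name are illustrative assumptions, not taken from the book). The frame is deliberately left unsorted so that the row order differs from the order of the rolling result. It compares assigning via `list(...)` with assigning a Series keyed by 'TRANSACTION_ID':

```python
import pandas as pd

# Toy data (hypothetical values): three transactions stored out of time order.
tx = pd.DataFrame(
    {
        "TRANSACTION_ID": [101, 102, 103],
        "TX_DATETIME": pd.to_datetime(["2023-01-07", "2023-01-05", "2023-01-06"]),
        "TX_FRAUD": [0, 1, 0],
    }
)

# Feature computed on a time-sorted, datetime-indexed copy.
by_time = tx.sort_values("TX_DATETIME").set_index("TX_DATETIME")
nb_fraud = by_time["TX_FRAUD"].rolling("2d").sum()

# Positional assignment: values in time order are written onto unsorted rows.
positional = tx.copy()
positional["NB_FRAUD_2DAY"] = list(nb_fraud)

# Label-based assignment: re-key the result by TRANSACTION_ID and let pandas align.
nb_fraud_by_id = pd.Series(
    nb_fraud.to_numpy(), index=by_time["TRANSACTION_ID"].to_numpy()
)
aligned = tx.set_index("TRANSACTION_ID")
aligned["NB_FRAUD_2DAY"] = nb_fraud_by_id

print(positional[["TRANSACTION_ID", "NB_FRAUD_2DAY"]])
print(aligned["NB_FRAUD_2DAY"])
```

In this toy case the positional version gives transaction 101 the value that belongs to transaction 102, while the label-aligned version keeps each value with its own transaction.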


If this mis-assignment comes up, it can be corrected by assigning through .loc with the index directly:
```
df.loc[df_index_datetime.index, feature + '_NB_TX_' + str(window_size) + 'DAY_WINDOW'] = NB_TX_WINDOW.values
df.loc[df_index_datetime.index, feature + '_RISK_' + str(window_size) + 'DAY_WINDOW'] = RISK_WINDOW.values
```
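
Another possible way to avoid the index swap altogether is sketched below. It is only an illustration under stated assumptions, not the book's implementation: the function name `add_rolling_risk` and the toy data are hypothetical. It relies on the `on=` parameter of pandas' `DataFrame.rolling`, which drives the time window from 'TX_DATETIME' while the results keep the frame's own index, so the new columns are aligned by 'TRANSACTION_ID' label instead of by position. Unlike the original, it fills NaN only in the risk column.

```python
import pandas as pd


def add_rolling_risk(transactions, delay_period=7, window_sizes=(1, 7, 30),
                     feature="TERMINAL_ID"):
    """Hypothetical variant that keeps TRANSACTION_ID as the index throughout."""
    # Time-based rolling needs rows in time order; key them by TRANSACTION_ID.
    out = transactions.sort_values("TX_DATETIME").set_index(
        "TRANSACTION_ID", drop=False
    )

    def rolled(days):
        # Window follows TX_DATETIME; results stay indexed by TRANSACTION_ID.
        return out.rolling(f"{days}d", on="TX_DATETIME")["TX_FRAUD"]

    nb_fraud_delay = rolled(delay_period).sum()
    nb_tx_delay = rolled(delay_period).count()

    for w in window_sizes:
        nb_fraud_window = rolled(delay_period + w).sum() - nb_fraud_delay
        nb_tx_window = rolled(delay_period + w).count() - nb_tx_delay
        # 0/0 yields NaN when no transaction falls in the window; use 0 instead.
        risk_window = (nb_fraud_window / nb_tx_window).fillna(0)
        # Series assignment aligns on the TRANSACTION_ID index, not on position.
        out[f"{feature}_NB_TX_{w}DAY_WINDOW"] = nb_tx_window
        out[f"{feature}_RISK_{w}DAY_WINDOW"] = risk_window
    return out


# Toy usage with made-up values.
tx = pd.DataFrame(
    {
        "TRANSACTION_ID": [1, 2, 3],
        "TX_DATETIME": pd.to_datetime(
            ["2023-01-10 00:00", "2023-01-01 12:00", "2023-01-09 00:00"]
        ),
        "TX_FRAUD": [0, 1, 0],
    }
)
result = add_rolling_risk(tx, delay_period=7, window_sizes=(1,))
print(result)
```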


PS: This book is an amazing resource. Precise, short, to the point.
