AI-Based Trial Balance Forecasting using Random Forest
Objective:
The primary goal of this notebook is to forecast financial figures—specifically Credit, Debit, and Balance amounts—for various account codes on a monthly basis for the upcoming fiscal year. By utilizing machine learning, specifically Random Forest Regression, the notebook automates financial predictions based on patterns found in historical trial balance data. This enables organizations to proactively plan finances, allocate budgets, and identify irregularities or potential risks. The outputs are organized into Excel sheets for user-friendly review and analysis.
Data Source:
The input to this notebook is an Excel workbook consisting of multiple sheets. Each sheet corresponds to a particular month and includes:
- Account Code (N-Code or equivalent)
- Account Description
- Debit Amount
- Credit Amount
- Balance (net value)
- Other metadata such as opening balance (OB), which is dropped during cleaning
The sheet names are used to extract the month information for time-based modeling.
Data Preprocessing:
Key steps include:
- Sheet Consolidation: All monthly sheets are read into pandas and concatenated into one comprehensive
DataFrame. - Column Cleanup: Unnecessary columns like ‘OB’ (Opening Balance) are removed.
- Null Handling: Missing values are filled using appropriate techniques (e.g., forward fill, zeros).
- Date Engineering: Sheet names are parsed to extract month-year, which is then converted to numeric
format for modeling. - Categorical Encoding: Account codes and descriptions are transformed to numerical values where necessary for model compatibility.
- Lag Feature Engineering: Previous month values are incorporated for trend learning (e.g., last month’s balance influencing the next).
Model Architecture:
- Model: RandomForestRegressor from the scikit-learn library.
- Model details:
- Trained separately for each target variable: Debit, Credit, and Balance.
- Input features include: account code, month index, and lagged historical values.
- Model Parameters: Number of estimators (trees), max depth, and random state are set to ensure reproducibility.
- Training involves splitting historical data (e.g., FY 2022–2023) and predicting for future months (e.g., FY 2023–2024).
- The model’s goal is to minimize forecasting error and generalize well to unseen months.
Output:
The notebook generates several key deliverables:
- Forecast File: An Excel workbook containing predicted Credit, Debit, and Balance for each account by month.
- Sheet-wise Organization: Optionally, each predicted month can be exported as an individual Excel sheet.
- Graphs and Visualizations: Includes line charts comparing real vs. predicted values for top account codes.
- Performance Logs: Accuracy, precision, recall, and F1-score for classification of variances (if
implemented).
Technologies Used:
- Python: The base language for all preprocessing, modeling, and output automation.
- Pandas: Handles tabular data, Excel input/output, and cleaning routines.
- NumPy: Used for numerical operations and array handling.
- scikit-learn: Powers the Random Forest Regression model and provides evaluation tools.
- Matplotlib / Seaborn: Generates trend plots for visual comparison of actual vs. forecasted data.
- OpenPyXL: Automates structured writing of predictions into formatted Excel files.
- Gradio: (If integrated) Builds a no-code user interface for data upload and result download.