{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Improving Numerical Integration Methods\n", "\n", "The solution and estimation of Dynamic Discrete Choice Models (DDCMs) are often constrained by computational feasibility. For instance, Keane and Wolpin (1997) and subsequent work that estimates DDCMs of post-graduation career dynamics abstract from many important determinants of earnings and mobility dynamics. Examples include the absence of match heterogeneity, of permanent skill shocks, and of multidimensional skill structures.\n", "\n", "Keane and Wolpin (1994, 1997) split their occupational classes into white- and blue-collar occupations. Nevertheless, empirical evidence suggests that skill requirements vary substantially within blue- and white-collar occupations. Arguably, any aggregation of occupational classes should be able to account for meaningful skill differences. Acemoglu and Autor (2011) suggest four aggregate groups that are explicitly derived from the task content of classified three-digit occupations in the US data.[1](#fn1)\n", "\n", "Adding such elements enlarges the computational burden of solving a DDCM enormously, and as Keane and Wolpin (1994) already noted, \"[...] for problems of the size we would like to consider a [...] simulation strategy is not computationally practicable\". A bottleneck in solving and estimating DDCMs is the computation of the expected value function, the so-called $EMax(\\cdot)$. Adding new features to the model increases the required number of function evaluations, which are the costly operations in numerical integration. Results from applied mathematics suggest methods that are more efficient and thus enable a performance increase: for the same number of function evaluations (and hence computational cost), quasi-Monte Carlo methods achieve a significantly higher accuracy.\n", "\n", "With **respy** it is possible to employ quasi-Monte Carlo methods to solve and estimate DDCMs. 
This notebook contains a tutorial on how to specify the numerical integration options in **respy**. For expositional purposes, we rely on the Keane and Wolpin (1994) model. The subsequent sections:\n", "\n", "- explain how to choose a Monte Carlo method in **respy**,\n", "- describe how to choose the number of draws, and\n", "- provide a simulation study: how does the integration method affect the solution of the model?\n", "\n", "The following [slides](https://github.com/HumanCapitalAnalysis/student-project-rafael_suchy/blob/master/slides/numerical_integration_ddcm.pdf) provide a gentle introduction to the Keane and Wolpin (1994) model and highlight its main components. Additionally, they outline the reasoning and basic intuition behind the use of Monte Carlo and quasi-Monte Carlo methods." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import respy as rp\n", "import numpy as np\n", "import chaospy as cp\n", "import matplotlib.pyplot as plt\n", "\n", "from matplotlib import ticker\n", "from _numerical_integration import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical Integration in respy\n", "\n", "The current functionality of **respy** entails two main methods for the numerical approximation of the $EMax(\\cdot)$:\n", "\n", "- Monte Carlo Simulation: chooses points randomly in the domain\n", "- Quasi-Monte Carlo Simulation: chooses points from one of two low-discrepancy sequences\n", " - Sobol\n", " - Halton\n", " \n", "A brief introduction to the nature of low-discrepancy sequences is provided in the following [notebook](https://github.com/HumanCapitalAnalysis/student-project-rafael_suchy/blob/master/notebooks/98_low_discrepancy_sequences_application_integration.ipynb).\n", "\n", "Now it is finally time to get our hands on the implementation in **respy**. 
The main method affected is ``create_base_draws`` within ``shared.py``, which, as the name suggests, creates a set of draws from the standard normal distribution. The draws are either drawn randomly (Monte Carlo Simulation) or taken from low-discrepancy sequences (Quasi-Monte Carlo Simulation). The resulting draws are used to calculate the $EMax(\\cdot)$ in the solution *and* the choice probabilities in the maximum likelihood estimation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to choose the Monte Carlo Method in respy\n", "\n", "As mentioned, we will use the Keane and Wolpin (1994) model within this tutorial. First, we retrieve the parametrization. For details, you are welcome to consult the previous tutorials of this README. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "params, options = rp.get_example_model(\"kw_94_one\", with_data=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The set of *options* helps to define the underlying model and characterizes the solution methods as well as some functional forms. Inspecting the content, we recognize the relevant options `solution_draws` and `monte_carlo_sequence`. 
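\n", "\n", "The mechanics behind ``create_base_draws`` can be illustrated with a small sketch (our own illustration, not **respy**'s actual implementation): a low-discrepancy sequence on the unit interval is mapped through the inverse standard normal CDF to obtain quasi-random $N(0, 1)$ draws.\n", "\n", "```python\n", "import numpy as np\n", "from scipy.special import ndtri  # inverse standard normal CDF\n", "\n", "def halton(n_points, base=2):\n", "    \"\"\"One-dimensional Halton (van der Corput) sequence in the given base.\"\"\"\n", "    points = np.zeros(n_points)\n", "    for i in range(n_points):\n", "        f, index = 1.0, i + 1\n", "        while index > 0:\n", "            f /= base\n", "            points[i] += f * (index % base)\n", "            index //= base\n", "    return points\n", "\n", "uniform_points = halton(8, base=2)  # 0.5, 0.25, 0.75, 0.125, ...\n", "normal_draws = ndtri(uniform_points)  # low-discrepancy draws from N(0, 1)\n", "```\n", "\n", "In **respy** itself all of this is handled internally; the sketch only illustrates the principle.\n", "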
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'estimation_draws': 200,\n", " 'estimation_seed': 500,\n", " 'estimation_tau': 500,\n", " 'interpolation_points': -1,\n", " 'n_periods': 40,\n", " 'simulation_agents': 1000,\n", " 'simulation_seed': 132,\n", " 'solution_draws': 500,\n", " 'solution_seed': 1,\n", " 'monte_carlo_sequence': 'random',\n", " 'core_state_space_filters': [\"period > 0 and exp_{i} == period and lagged_choice_1 != '{i}'\",\n", " \"period > 0 and exp_a + exp_b + exp_edu == period and lagged_choice_1 == '{j}'\",\n", " \"period > 0 and lagged_choice_1 == 'edu' and exp_edu == 0\",\n", " \"lagged_choice_1 == '{k}' and exp_{k} == 0\",\n", " \"period == 0 and lagged_choice_1 == '{k}'\"],\n", " 'covariates': {'constant': '1',\n", " 'exp_a_square': 'exp_a ** 2',\n", " 'exp_b_square': 'exp_b ** 2',\n", " 'at_least_twelve_exp_edu': 'exp_edu >= 12',\n", " 'not_edu_last_period': \"lagged_choice_1 != 'edu'\",\n", " 'edu_ten': 'exp_edu == 10'}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "options" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The option `solution_draws` specifies how many points are simulated, i.e., how many function evaluations are performed, in order to solve the model. Ceteris paribus, choosing more points leads to higher precision but also to higher computational costs. The option `monte_carlo_sequence` determines the method.[2](#fn2) As we will see, the implementation of quasi-Monte Carlo methods makes it possible to relax this trade-off: for a given number of points, accuracy increases when switching from Monte Carlo to quasi-Monte Carlo methods.\n", "\n", "It is possible to choose `monte_carlo_sequence` from the set $\\{$\"random\", \"halton\", \"sobol\"$\\}$. In the following, we use the Sobol sequence with 200 draws to solve the model (`rp.solve`) by changing the option parameters as follows." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "options[\"monte_carlo_sequence\"], options[\"solution_draws\"] = [\"sobol\", 200]\n", "state_space = rp.solve(params, options)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To verify, we can briefly check whether the options were changed correctly." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('sobol', 200)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "options[\"monte_carlo_sequence\"], options[\"solution_draws\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we have covered all the information needed to change the integration options. The eager reader can start to experiment with their own model (or a different Keane and Wolpin specification). In the next section, we provide an example and simulate the Keane and Wolpin (1994) model with different options. We elaborate on how the employed Monte Carlo method affects the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Simulation study: How does the integration method affect the solution of the model?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will simulate the model with each of the methods (\"random\", \"halton\", \"sobol\") and vary the number of points used in the solution of the model over a logarithmically spaced grid of ``NUM_EVALS`` values from ``10 ** POINTS_MIN`` to ``10 ** POINTS_MAX``. A reference value is calculated using a very large number of points, ``POINTS_TRUTH``. Note that repeatedly solving the whole model with ``POINTS_TRUTH`` draws would be computationally infeasible. 
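\n", "\n", "Before running the full study, the basic intuition can be checked on a toy example (an illustrative sketch with a simple one-dimensional integrand, not the model's $EMax(\\cdot)$):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def van_der_corput(n_points, base=2):\n", "    \"\"\"One-dimensional low-discrepancy sequence on the unit interval.\"\"\"\n", "    points = np.zeros(n_points)\n", "    for i in range(n_points):\n", "        f, index = 1.0, i + 1\n", "        while index > 0:\n", "            f /= base\n", "            points[i] += f * (index % base)\n", "            index //= base\n", "    return points\n", "\n", "n_points = 1024\n", "integrand = lambda u: u ** 2  # true integral over [0, 1] is 1/3\n", "\n", "rng = np.random.default_rng(seed=0)\n", "mc_error = abs(integrand(rng.uniform(size=n_points)).mean() - 1 / 3)\n", "qmc_error = abs(integrand(van_der_corput(n_points)).mean() - 1 / 3)\n", "```\n", "\n", "At the same number of points, the quasi-Monte Carlo estimate is typically more accurate than the Monte Carlo estimate by one or more orders of magnitude, which mirrors the accuracy gains claimed for the $EMax(\\cdot)$ approximation.\n", "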
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# For a real application you might increase the values as commented\n", "POINTS_TRUTH = 1_000  # e.g. 1_000_000\n", "\n", "# np.logspace interprets the bounds as base-10 exponents, so the grid\n", "# runs from 10 ** POINTS_MIN to 10 ** POINTS_MAX points.\n", "POINTS_MIN = 2  # 10 ** 2 = 100 points\n", "POINTS_MAX = 4  # 10 ** 4 = 10_000 points\n", "NUM_EVALS = 10  # e.g. 100\n", "POINTS_GRID = np.logspace(\n", "    POINTS_MIN,\n", "    POINTS_MAX,\n", "    NUM_EVALS,\n", "    dtype=int,\n", ")\n", "\n", "METHODS = [\"random\", \"halton\", \"sobol\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we will compute a reference value for the $EMax(\\cdot)$ values by using a large number of points from the \"sobol\" sequence. We extract the \"true\" $EMax(\\cdot)$ values into the object `emax_true`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
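Given approximated and reference $EMax(\\cdot)$ values, a natural accuracy criterion is the root-mean-squared error. The helper below is a sketch under this assumption; the functions imported from ``_numerical_integration`` may use a different criterion.\n", "\n", "```python\n", "import numpy as np\n", "\n", "def rmse(emax_approx, emax_true):\n", "    \"\"\"Root-mean-squared deviation of the approximation from the reference.\"\"\"\n", "    diff = np.asarray(emax_approx) - np.asarray(emax_true)\n", "    return np.sqrt(np.mean(diff ** 2))\n", "```\n", "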