Jianhao Lin, Jiacheng Fan, Yifan Zhang, and Liangyuan Chen, "Real-time Macroeconomic Projection Using Narrative Central Bank Communication", Journal of Applied Econometrics, Vol. 38, No. 2, 2023, pp. 202-221.

This file contains information on the data and code used to obtain the main results in the paper. All files are zipped in the file "lin-fan-zhang-chen-files.zip", which contains 98 files totaling 106 MB. Unix/Linux users should *not* use "unzip -a" because many files are binary. The rest of this readme describes what is in each directory and how to run the code to generate the results.

# Brief Description

There are four folders in the zipped file:

1) "Data" contains all the data sets used in the paper, of three kinds: (i) the raw macroeconomic and textual data (in "Raw_Data"); (ii) the data generated from the models (in "Temp_Data"), e.g., PBC communication indices (PCIs) or macro factors (MFs); and (iii) the processed data used to replicate the main figures and tables (in "Data"), e.g., the forecasting results for the six target variables.

2) "Functions" provides three MATLAB functions to estimate the dynamic factor model (DFM) using the Bayesian approach.

3) "Programs" gives the main code to (i) estimate the Bayesian DFM to get the MFs, (ii) construct the textual PCIs using the hurdle distributed multinomial regression (HDMR), (iii) conduct the real-time forecasting exercise combining mixed-frequency target variables and predictors, and (iv) replicate the figures and tables in the paper. The software used by the code includes MATLAB (version 2021a), Julia (version 1.6.1), Python (version 3.8.8) and Stata (version 16).

4) "Results" contains the replications of Figures 2-5 and Tables 4-6 produced by the code in "Programs". Note that Figure 1 is drawn by hand and provided as a pdf file, and Tables 1 and 2 are summary statistics of the textual and macro data, computed in Excel.

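As a hedged illustration of the extraction warning above, the following Python sketch unpacks the archive without any text-mode conversion (the conversion that "unzip -a" would apply) and reports the top-level folders it finds. The function name `extract_archive` and the output folder name are assumptions made for illustration; only the archive name comes from this readme.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member byte-for-byte and return the sorted top-level names.

    Hypothetical helper, not part of the replication package. Python's zipfile
    module never rewrites line endings, so binary files are left intact.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # A member such as "Data/Raw_Data/x.csv" contributes the top-level name "Data".
    return sorted({name.replace("\\", "/").split("/")[0] for name in names if name})


if __name__ == "__main__":
    # Archive name as given in this readme; the output folder name is an assumption.
    archive_name = "lin-fan-zhang-chen-files.zip"
    if Path(archive_name).exists():
        print(extract_archive(archive_name, "replication"))
```

If the archive is laid out as described, the printed list should include the four folders named above; the sketch makes no claim about any additional top-level entries.
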
We now provide more details about the data and how to use them to obtain the main results in the paper.

# Data

There are two folders ("Raw_Data" and "Temp_Data") and 10 additional files in the "Data" folder.

## Raw_Data

The "Raw_Data" folder contains the raw series of the 6 target variables, 151 monthly macroeconomic indicators, and the term-frequency matrix of PBC communication texts with 13,507 unique phrases. There are 11 files in total in this folder.

### Macroeconomic Data

The macro data (all the target variables and indicators) are downloaded from the China Economic Information Network (CEInet) statistical database, covering 2003:Q1 to 2019:Q2 for quarterly variables and 2003:M1 to 2019:M6 for monthly variables. Thus, we have 68 observations for quarterly variables or 198 for monthly variables in total. Related files are:

-->> The six csv files, "Target_GDP.csv", "Target_IVA.csv", "Target_FAI.csv", "Target_CPI.csv", "Target_M2.csv" and "Target_PPI.csv", are the time series of the 6 target variables: quarterly year-on-year real gross domestic product growth rate (GDP), monthly industrial value added growth (IVA), monthly fixed asset investment growth (FAI), monthly consumer price index (CPI), monthly broad money supply growth (M2) and monthly producer price index (PPI), respectively.

-->> The Excel file, "Macro_Data.xlsx", is the time series of 151 monthly macroeconomic indicators. The first column presents the variable names in Chinese; their English equivalents can be found in "Tab_2.xlsx" in the "Results" folder.

-->> The other two csv files, "Targets_Quarter.csv" and "Targets_Month.csv", simply combine the quarterly series and the monthly series, respectively, and are also used in the code.

### Textual Data

The People's Bank of China (PBC) communication texts are collected manually by the authors.
For written communications, we download all the quarterly Monetary Policy Executive Reports (MPERs) from the PBC's official website (http://www.pbc.gov.cn/en/3688006/index.html). For oral communications, we collect the speeches, press conferences, articles of the governor and interviews by searching Baidu News. Since the original texts are in Chinese, we do not provide the raw texts here. Related files are:

-->> The csv file, "PBC_Text_Matrix.csv", is the term-frequency matrix of PBC communication texts. The texts are organized by month. The first row of this file gives the time index, where 0 represents the first month (2003:M1) and 197 represents the last month (2019:M6). There are 13,507 unique phrases in total. Thus, the matrix has 13,507 rows and 198 columns, and each element is the count of a phrase in a month.

-->> The Excel file, "PBC_Word_Chinese.xlsx", lists the phrases in Chinese after text preprocessing. The index number of each phrase is the same as its row number in the csv file.

## Temp_Data

The "Temp_Data" folder contains the data generated from the models (DFM and HDMR) and the forecasting results of the individual models (mixed-data sampling, MIDAS). There are 45 files in total in this folder.

### Data Generated from the Model

Macro factors from the DFM, PBC communication indices (PCIs) and phrase coefficients from the HDMR are included. Related files are:

-->> The two csv files, "MFactors_Fullsample.csv" and "MFactors_Realtime.csv", are the five monthly macro factors based on full-sample and real-time estimations, respectively. Full-sample estimates start at 2003:M1 and end at 2019:M6. Real-time estimates initially span 2003:M1 to 2009:M12; we then recursively re-estimate the DFM to get the real-time vintages of the macro factors. In these files, the first to fifth macro factors are listed in order, and the missing values of the real-time series are filled with 0.
The real-time vintages simply repeat these columns recursively. This padding does not affect the final results, because real-time forecasting never uses data that are not yet available in the out-of-sample period (i.e., 2010:M1 - 2019:M6).

-->> The csv files, e.g., "PCIs_CPI_Fullsample.csv" and "PCIs_CPI_Realtime.csv", are the monthly PCIs with CPI as the target variable based on full-sample and real-time estimations, respectively. The estimation procedure is the same as that of the macro factors described above. In each file, the first four columns are the word-repetition PCI, the word-inclusion PCI, the total unique phrases (inclusion) of the text and the total phrases (repetition) of the text, respectively. In addition, the csv file, "PCIs_CPI_Realtime_Nowcast.csv", contains the real-time PCIs and the 1-month-ahead estimate of the PCI used in the nowcasting exercise. Our refinement of the HDMR for computing the PCI estimates is described in subsection 2.2 of the paper. The csv file, "PCIs_CPI_Realtime_OneColumn.csv", extracts the values of the out-of-sample (OOS) real-time series into a 198-dimensional column vector, for convenience in plotting Figure 5. There are also PCIs with the other target variables, i.e., GDP, IVA, FAI, M2 and PPI.

-->> The csv file, "Coef_CPI_Repetition_Realtime.csv", reports the phrase coefficients of the word-repetition PCI with CPI as the target variable. The PCI is estimated in real time; hence, the coefficients are also updated in real time, as presented in the columns. Similarly, a second csv file reports the phrase coefficients of the word-inclusion PCI with CPI as the target variable. There are also phrase coefficients of the PCIs with the other target variables, i.e., GDP, IVA, FAI, M2 and PPI.

-->> The Excel file, "Pivotal_Phrase_Realtime.xlsx", presents the importance of each phrase (13,507 phrases in total).
This file identifies the important phrases in the real-time OOS fit; the method for computing phrase importance is described in subsection F.3 of the Appendix. There are four columns in this file. The first column, "Word", is the phrase in Chinese; the second column, "Phrase_imp", is the value of phrase importance; the third column, "Count", is the total phrase count in the whole sample; and the last column, "Inclusion", is the total occurrence (measured in percentages) of a phrase in the sample. There are 12 sheets reporting the phrase importance of the 12 PCIs. "GDP", "CPI", etc., denote the target variables, and "Repetition" ("Inclusion") denotes the word-repetition (word-inclusion) index.

### Forecasting Results of Individual Model

Forecasting results of the individual MIDAS regressions, each with one high-frequency predictor, are provided here. These results are listed in order to obtain the optimal lag-order combination of the predictors. Related files are:

-->> The csv file, "Forecast_CPI_Realtime.csv", contains the forecasting results of 17 individual regressions with CPI as the target variable. The high-frequency predictors are 5 macro factors and 12 PCIs (see Table 3 for the model specification). In this file, "Ylag" is the lag order of the low-frequency target variable, "Xlag" is the lag order of the high-frequency predictor, and "df", "zpos" and "zzero" denote the macro factor, word-repetition PCI and word-inclusion PCI as the predictor, respectively. The root mean square error (RMSE) of each model is provided. The OOS forecasting begins at 2010:M1. The first sheet, "rec_now", reports the nowcasting result, and the remaining sheets report the forecasting results, where the number represents the forecast horizon. The in-sample fitted values are padded with 0. There are also forecasting results for the other target variables, i.e., GDP, IVA, FAI, M2 and PPI.

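The conventions described above (month index 0 for 2003:M1, out-of-sample forecasts from 2010:M1, in-sample entries padded with 0) can be sketched in Python as follows. This is an illustrative sketch only: the helper names `month_index` and `oos_rmse` and the toy numbers in the usage lines are assumptions, and they do not reproduce the layout of the csv files or any result in the paper.

```python
import math

SAMPLE_START = (2003, 1)   # 2003:M1 is month index 0 in this readme
OOS_START = (2010, 1)      # out-of-sample forecasting begins at 2010:M1


def month_index(year, month, start=SAMPLE_START):
    """Return the 0-based month index relative to the sample start."""
    return (year - start[0]) * 12 + (month - start[1])


def oos_rmse(actual, forecast, first_oos):
    """RMSE over positions first_oos onward; earlier (zero-padded) entries are skipped."""
    errors = [a - f for a, f in zip(actual[first_oos:], forecast[first_oos:])]
    if not errors:
        raise ValueError("no out-of-sample observations")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


if __name__ == "__main__":
    print(month_index(2019, 6))      # index of the last sample month
    print(month_index(*OOS_START))   # index of the first out-of-sample month
    # Toy series: two zero-padded in-sample slots, then two forecasts.
    print(oos_rmse([1.0, 2.0, 3.0, 5.0], [0.0, 0.0, 2.0, 7.0], first_oos=2))
```

Skipping the padded positions matters because including the zeros would treat the in-sample placeholders as forecast errors. For the quarterly GDP target the same idea applies with a quarterly index in place of the monthly one.
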
## Remaining Data

There are 12 remaining files in the "Data" folder, containing (i) forecasting results of the combination models, (ii) professional forecasts (PFs) and (iii) other files used in drawing the figures. All these files are in Excel format.

-->> The Excel file, "Forecast_CPI_Best_Realtime.xlsx", reports the model combination of individual MIDAS regressions with the optimal lag order for each predictor. The combination models and the meanings of the column names are illustrated in Table 3, where "mf" denotes the benchmark model (MF), "zp" denotes the combination of MF and the forecast of the word-repetition index (MF + PCI (z+)), "zz" denotes the combination of MF and the forecast of the word-inclusion index (MF + PCI (z0)), "cb" denotes the combination of MF and the forecasts of the word-repetition and word-inclusion indices (MF + PCI (z+ & z0)) and "mfcb" denotes the combination of MF and 12 PCIs. The notations "ew", "bicw" and "msfew" denote the equal, Bayesian information criterion (BIC) and mean squared forecast error (MSFE) weight schemes, respectively. The OOS forecasting begins at 2010:M1. The first sheet, "rec_now", reports the nowcasting result, and the remaining sheets report the forecasting results, where the number represents the forecast horizon. The in-sample fitted values are padded with 0. There are also forecasting results for the other target variables, i.e., GDP, IVA, FAI, M2 and PPI.

-->> The Excel file, "Forecast_Text_PF.xlsx", reports the professional forecasts (PFs) of GDP, IVA, CPI, M2 and PPI. PFs are available from 2010:Q1 (quarterly) or 2010:M1 (monthly). The column "Mean" is the mean forecast of the professional forecasters, the column "Max" is the maximum forecast and the column "Min" is the minimum forecast. The remaining columns show the best real-time forecasts generated from the statistical models used in this paper.

-->> The Excel file, "Oral_Frequency.xlsx", records the quarterly occurrence of four kinds of oral communication, i.e., speech, press, article and interview.
The sample period is 2003:Q1 to 2019:Q2. Figure 2 is drawn using the data in this file.

-->> The Excel files, "PCIs_Fullsample.xlsx" and "PCIs_Fullsample_Realtime.xlsx", combine the PCIs and target variables. For convenient display, the real-time PCIs use the one-column data from the "Temp_Data" folder. The sample period is 2003:Q1 (or 2003:M1) to 2019:Q2 (or 2019:M6).

# Programs and Results

There are 14 programs in the "Programs" folder, which produce the main results of the paper in the "Results" folder. The final results are obtained in 4 steps:

## Step 1: Estimate the Bayesian DFM to Generate the Macro Factors

1) Run the MATLAB code "Step_1A_Bayes_DFM_Fullsample.m"
   ==>> Full-sample estimation of macro factors (location: "Data\Temp_Data\MFactors_Fullsample.csv")

2) Run the MATLAB code "Step_1B_Bayes_DFM_Realtime.m"
   ==>> Real-time estimation of macro factors (location: "Data\Temp_Data\MFactors_Realtime.csv")

## Step 2: Estimate the HDMR to Generate the PCIs

1) Run the Julia code "Step_2A_Construct_PCIs_Fullsample.ipynb"
   ==>> Full-sample estimation of PCIs (location: "Data\Temp_Data\PCIs_CPI_Fullsample.csv", etc.)

2) Run the Julia code "Step_2B_Construct_PCIs_Realtime.ipynb"
   ==>> Real-time estimation of PCIs used in nowcasting (location: "Data\Temp_Data\PCIs_CPI_Realtime_Nowcast.csv", etc.)

3) Run the Julia code "Step_2C_Transform_PCIs_Realtime.ipynb"
   ==>> Transform real-time PCIs used in forecasting (location: "Data\Temp_Data\PCIs_CPI_Realtime.csv", etc.) and one-column PCIs used to draw Figure 5 (location: "Data\Temp_Data\PCIs_CPI_Realtime_OneColumn.csv", etc.)

4) Run the Julia code "Step_2D_Pivotal_Phrases_Realtime.ipynb"
   ==>> Compute phrase importance (location: "Data\Temp_Data\Pivotal_Phrase_Realtime.xlsx", etc.)
and generate Table 5 (location: "Results\Tab_5.xlsx")

## Step 3: Estimate the MIDAS Model to Generate the Real-time OOS Forecasts

1) Run the R code "Step_3A_Forecast_GDP_Realtime.R"
   ==>> Real-time OOS forecasting results of the best combination model for GDP (location: "Data\Forecast_GDP_Best_Realtime.xlsx")

2) Run the R code "Step_3B_Forecast_IVA_Realtime.R"
   ==>> Real-time OOS forecasting results of the best combination model for IVA (location: "Data\Forecast_IVA_Best_Realtime.xlsx")

3) Run the R code "Step_3C_Forecast_FAI_Realtime.R"
   ==>> Real-time OOS forecasting results of the best combination model for FAI (location: "Data\Forecast_FAI_Best_Realtime.xlsx")

4) Run the R code "Step_3D_Forecast_CPI_Realtime.R"
   ==>> Real-time OOS forecasting results of the best combination model for CPI (location: "Data\Forecast_CPI_Best_Realtime.xlsx")

5) Run the R code "Step_3E_Forecast_M2_Realtime.R"
   ==>> Real-time OOS forecasting results of the best combination model for M2 (location: "Data\Forecast_M2_Best_Realtime.xlsx")

6) Run the R code "Step_3F_Forecast_PPI_Realtime.R"
   ==>> Real-time OOS forecasting results of the best combination model for PPI (location: "Data\Forecast_PPI_Best_Realtime.xlsx")

## Step 4: Replicate the Main Results in the Paper

1) Run the Stata do file "Step_4A_Figures.do"
   ==>> Figures 2-5 (location: "Results\Fig_2.pdf", etc.)

2) Run the Stata do file "Step_4B_Tables.do"
   ==>> Tables 4, 6 (location: "Results\Tab_4.xlsx", etc.)
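
The run order above can be summarized as a small manifest. The Python sketch below only lists the 14 programs named in this readme, grouped by step, and reports which ones are missing from a given "Programs" folder; it does not execute MATLAB, Julia, R or Stata. The names `PIPELINE`, `ordered_programs` and `missing_programs` are assumptions introduced for illustration and are not part of the replication package.

```python
from pathlib import Path

# Targets in the order used by Step 3 (letters A-F), as listed in this readme.
STEP3_TARGETS = ["GDP", "IVA", "FAI", "CPI", "M2", "PPI"]

# Step -> (tool, program files); dict insertion order is the run order.
PIPELINE = {
    "Step 1": ("MATLAB", ["Step_1A_Bayes_DFM_Fullsample.m",
                          "Step_1B_Bayes_DFM_Realtime.m"]),
    "Step 2": ("Julia", ["Step_2A_Construct_PCIs_Fullsample.ipynb",
                         "Step_2B_Construct_PCIs_Realtime.ipynb",
                         "Step_2C_Transform_PCIs_Realtime.ipynb",
                         "Step_2D_Pivotal_Phrases_Realtime.ipynb"]),
    "Step 3": ("R", [f"Step_3{letter}_Forecast_{target}_Realtime.R"
                     for letter, target in zip("ABCDEF", STEP3_TARGETS)]),
    "Step 4": ("Stata", ["Step_4A_Figures.do", "Step_4B_Tables.do"]),
}


def ordered_programs():
    """Return all program file names in the order they should be run."""
    return [name for _tool, names in PIPELINE.values() for name in names]


def missing_programs(programs_dir):
    """Return the listed programs that are not present in programs_dir."""
    folder = Path(programs_dir)
    return [name for name in ordered_programs() if not (folder / name).is_file()]


if __name__ == "__main__":
    for step, (tool, names) in PIPELINE.items():
        print(step, tool, len(names))
    # "Programs" is the folder name given in this readme; its location is assumed.
    print(missing_programs("Programs"))
```

An empty list from `missing_programs` indicates that every program named in the four steps is present; it says nothing about whether the required software is installed.
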