Jianhao Lin, Jiacheng Fan, Yifan Zhang, and Liangyuan Chen, "Real-time
Macroeconomic Projection Using Narrative Central Bank Communication",
Journal of Applied Econometrics, Vol. 38, No. 2, 2023, pp. 202-221.

This file contains information on the data and codes used to obtain
the main results in the paper. All files are zipped in the file
"lin-fan-zhang-chen-files.zip", which contains 98 files with 106 MB in
total. Unix/Linux users should *not* use "unzip -a" because many files
are binary. The rest of this readme spells out what is in each
directory and how to run the code to generate the results. 

# Brief Description

There are four folders in the zipped file:

1) "Data" contains all the data sets used in the paper with three
kinds of data, (i) the raw macroeconomic and textual data (in
"Raw_Data"); (ii) the data generated from the model (in "Temp_Data"),
e.g., PBC communication indices (PCIs) or macro factors (MFs); and
(iii) the processed data used to replicate the main figures or tables
(in "Data"), e.g., the forecasting results for six target variables.

2) "Functions" provides three MATLAB functions to estimate the dynamic
factor model (DFM) using the Bayesian approach.

3) "Programs" gives the main codes to (i) estimate the Bayesian DFM to
get the MFs, (ii) construct the textual PCIs using the hurdle
distributed multiple regression (HDMR), (iii) conduct the real-time
predicting exercise combining mixed-frequency target variables and
predictors, and (iv) replicate the figures and tables in the paper.
The software used by the code includes MATLAB (version 2021a), Julia
(version 1.6.1), Python (version 3.8.8) and Stata (version 16).

4) "Results" contains the replications of Figures 2-5 and Tables 4-6
using the code provided in "Programs". Note that Figure 1 is drawn by
hand and provided as a pdf file, and Tables 1 and 2 are summary
statistics of the textual and macro data, computed in Excel.


We now provide more details about the data and how to use them to
obtain the main results in this paper.


# Data

There are two folders ("Raw_Data" and "Temp_Data") and 10 additional
files in the "Data" folder.


## Raw_Data

The "Raw_Data" folder contains the raw series of the 6 target
variables, 151 monthly macroeconomic indicators, and the
term-frequency matrix of PBC communication texts with 13,507 unique
phrases. There are 11 files in total in this folder.


### Macroeconomic Data

The macro data (all the target variables and indicators) are
downloaded from the China Economic Information Network (CEInet)
statistical database from 2003:Q1 to 2019:Q2 for quarterly variables
or 2003:M1 to 2019:M6 for monthly variables. Thus, we have 66
observations for quarterly variables or 198 for monthly variables in
total. Related files are:

-->> The six csv files, "Target_GDP.csv", "Target_IVA.csv",
"Target_FAI.csv", "Target_CPI.csv", "Target_M2.csv" and "Target_PPI.csv",
are the time series of the 6 target variables: quarterly year-on-year
real gross domestic product growth rate (GDP), monthly industrial
value added growth (IVA), monthly fixed asset investment growth (FAI),
monthly consumer price index (CPI), monthly broad money supply growth
(M2) and monthly producer price index (PPI), respectively.

-->> The excel file, "Macro_Data.xlsx", is the time series of 151
monthly macroeconomic indicators. The first column presents the
variable names in Chinese, and the corresponding English names can be
found in "Tab_2.xlsx" in the "Results" folder.

-->> The other two csv files, "Targets_Quarter.csv" and
"Targets_Month.csv", combine the quarterly and the monthly target
series, respectively, and are also used in the programs.

### Textual Data

The People's Bank of China (PBC) communication texts are collected by
the authors manually. For written communications, we download all the
quarterly Monetary Policy Executive Reports (MPERs) from the PBC's
official website (http://www.pbc.gov.cn/en/3688006/index.html). For
oral communications, we collect the speeches, press conferences,
articles of the governor and interviews by searching from Baidu News.
Since the original texts are in Chinese, we do not provide the raw
texts here. Related files are:

-->> The csv file, "PBC_Text_Matrix.csv", is the term-frequency
matrix of PBC communication texts. The texts are organized by month.
The first row of this file denotes the time index, where 0 represents
the first month (2003:M1) and 197 represents the last month (2019:M6).
In addition, we have 13,507 unique phrases in total. Thus, we have a
matrix with the row size of 13,507 and column size of 198, and each
element is the count of a phrase in a month.
 
-->> The excel file, "PBC_Word_Chinese.xlsx", lists the phrases in
Chinese after the text preprocessing. The index number of the phrase
is the same as the row number in the csv file.
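
The time index used in "PBC_Text_Matrix.csv" can be converted to
calendar months as in the following minimal Python sketch. It is not
part of the replication package; the function names are hypothetical
and only the convention stated above (0 is 2003:M1, 197 is 2019:M6) is
taken from this readme.

```python
START_YEAR = 2003  # index 0 corresponds to 2003:M1


def month_label(index):
    """Map a 0-based month index to a label such as '2003:M1'."""
    years, month0 = divmod(index, 12)
    return f"{START_YEAR + years}:M{month0 + 1}"


def month_index(label):
    """Inverse mapping: '2010:M1' -> 0-based month index."""
    year, month = label.split(":M")
    return (int(year) - START_YEAR) * 12 + int(month) - 1


print(month_label(0))          # first month of the sample
print(month_label(197))        # last month of the sample
print(month_index("2010:M1"))  # first out-of-sample month
```

The same mapping gives index 84 for 2010:M1, the month at which the
out-of-sample period begins.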


## Temp_Data

The "Temp_Data" folder contains the data generated from the model (DFM
and HDMR) and the forecasting results of individual models (mixed-data
sampling, MIDAS). There are 45 files in total in this folder.

### Data Generated from the Model

Macro factors from DFM, PBC communication indices (PCIs) and phrase
coefficients from HDMR are included.  Related files are:

-->> The two csv files, "MFactors_Fullsample.csv" and
"MFactors_Realtime.csv", are the five monthly macro factors based on
full-sample and real-time estimations, respectively. Full-sample
estimates start at 2003:M1 and end at 2019:M6. The first real-time
estimates start at 2003:M1 and end at 2009:M12; we then recursively
re-estimate the DFM to get the real-time vintages of macro factors. In
these files, the first to fifth macro factors are listed in order and
the missing values of the real-time series are filled with 0. The
real-time vintages repeat these five columns, one block per
re-estimation. The zero padding does not affect the final result, as
real-time forecasting never uses data that is not yet available in the
out-of-sample period (i.e., 2010:M1 - 2019:M6).
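
The recursive scheme can be pictured with the following Python sketch.
It is an illustration only: it assumes that the estimation sample
expands by one month per vintage, which this readme does not state
explicitly, and the function name is hypothetical. Indices are 0-based
months with 0 = 2003:M1.

```python
FIRST_OOS = 84    # 2010:M1, first out-of-sample month
LAST_MONTH = 197  # 2019:M6, last month of the sample


def expanding_windows(first_oos=FIRST_OOS, last=LAST_MONTH):
    """Return (sample_start, sample_end, target_month) per vintage.

    Assumption: the sample always starts at month 0 and ends one
    month before the target month (an expanding window).
    """
    return [(0, t - 1, t) for t in range(first_oos, last + 1)]


windows = expanding_windows()
print(len(windows))   # number of out-of-sample months
print(windows[0])     # first vintage: sample ends at 2009:M12
print(windows[-1])    # last vintage: target month is 2019:M6
```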

-->> The csv files, e.g., "PCIs_CPI_Fullsample.csv" and
"PCIs_CPI_Realtime.csv", are the monthly PCIs with CPI as target
variable based on full-sample and real-time estimations, respectively.
The estimation procedure is the same as that of macro factors
described above. In each file, the first four columns are
word-repetition PCI, word-inclusion PCI, total unique phrases
(inclusion) of the text and total phrases (repetition) of the text,
respectively. In addition, the csv file,
"PCIs_CPI_Realtime_Nowcast.csv", contains the real-time data of PCIs
and the 1-month-ahead estimate of PCI used in the nowcasting exercise.
See subsection 2.2 in the paper for how we refine HDMR to compute the
estimates of PCI. The csv file,
"PCIs_CPI_Realtime_OneColumn.csv", extracts the values of the
out-of-sample (OOS) real-time series into a 198-dimensional column
vector, for the convenience of plotting Figure 5. There are also PCIs
with other target variables, i.e., GDP, IVA, FAI, M2 and PPI.
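
To illustrate the third and fourth columns, the Python sketch below
computes the number of unique phrases and the total number of phrases
per month from a toy term-frequency matrix. It assumes these columns
are plain column-wise summaries of a matrix laid out like
"PBC_Text_Matrix.csv" (rows are phrases, columns are months); the toy
data and the function name are hypothetical.

```python
def monthly_totals(matrix):
    """matrix[i][t] is the count of phrase i in month t.

    Returns two lists with one entry per month:
    unique phrases (inclusion) and total phrases (repetition).
    """
    n_months = len(matrix[0])
    unique = [sum(1 for row in matrix if row[t] > 0)
              for t in range(n_months)]
    total = [sum(row[t] for row in matrix) for t in range(n_months)]
    return unique, total


# Toy example: 3 phrases observed over 3 months.
toy = [
    [2, 0, 1],
    [0, 0, 3],
    [1, 1, 0],
]
unique, total = monthly_totals(toy)
print(unique)
print(total)
```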

-->> The csv file, "Coef_CPI_Repetition_Realtime.csv", reports the
phrase coefficients of word-repetition PCI with CPI as target variable.
The PCI is estimated in real time; hence, the coefficients are also
updated in real time, as presented in the columns. Similarly, the csv
file, "Coef_CPI_Inclusion_Realtime.csv", reports the phrase
coefficients of word-inclusion PCI with CPI as target variable. There
are also phrase coefficients of PCIs with other target variables,
i.e., GDP, IVA, FAI, M2 and PPI.

-->> The Excel file, "Pivotal_Phrase_Realtime.xlsx", presents the
importance of each phrase (13,507 phrases in total). The aim of this
file is to obtain the important phrases in the real-time OOS fit; see
subsection F.3 in the Appendix for the method used to compute phrase
importance. There are four columns in this file. The first
column "Word" is the word in Chinese, the second column "Phrase_imp"
is the value of phrase importance, the third column "Count" is the
total phrase count in the whole sample, and the last "Inclusion" is
the total occurrence (measured in percentages) of a phrase in the
sample. There are 12 sheets reporting the phrase importance of 12
PCIs. "GDP", "CPI", etc., denote the target variables, and
"Repetition" ("Inclusion") denotes the word-repetition (inclusion)
index.
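
The "Count" and "Inclusion" columns can be read as per-phrase
summaries of the term-frequency matrix. The Python sketch below
assumes that "Inclusion" is the percentage of months in which the
phrase appears at least once; this readme does not spell out the
denominator, so treat that as an assumption. The function name and
toy data are hypothetical.

```python
def phrase_summary(row):
    """row[t] is the count of one phrase in month t.

    Returns (total count, percentage of months with count > 0).
    """
    count = sum(row)
    inclusion = 100.0 * sum(1 for c in row if c > 0) / len(row)
    return count, inclusion


# Toy phrase observed over 4 months.
count, inclusion = phrase_summary([2, 0, 1, 0])
print(count, inclusion)
```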


### Forecasting Results of Individual Model

Forecasting results of individual MIDAS with one high-frequency
predictor in the regression are provided here. The aim of listing
these results is to obtain the optimal lag order combination of
predictors. Related files are:

-->> The csv file, "Forecast_CPI_Realtime.csv", contains the
forecasting results of 17 individual regressions with CPI as target
variable. The high-frequency predictors are 5 macro factors and 12
PCIs (see Table 3 for the model specification). In this file, "Ylag"
is the lag order of the low-frequency target variable, "Xlag" is the
lag order of the high-frequency predictor, and "df", "zpos" and
"zzero" denote the macro factor, word-repetition PCI and
word-inclusion PCI as the predictor, respectively. The root mean
square error (RMSE) of each model is provided. The OOS forecasting
begins at 2010:M1. The first sheet,
"rec_now", reports the nowcasting result, and the remaining sheets
report the forecasting results where the number represents the
forecast horizon. The in-sample fitted values are padded with 0. There
are also forecasting results for other target variables, i.e., GDP,
IVA, FAI, M2 and PPI. 
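
Because the in-sample rows are padded with 0, an RMSE should be
computed on the OOS rows only. The Python sketch below shows one way
to do this on toy data; it is not the code in "Programs", and the
function name and numbers are hypothetical.

```python
import math


def oos_rmse(actual, forecast, first_oos):
    """RMSE over positions first_oos, first_oos + 1, ... only.

    Rows before first_oos are in-sample (zero-padded forecasts)
    and are skipped.
    """
    errors = [a - f
              for a, f in zip(actual[first_oos:], forecast[first_oos:])]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


actual = [1.0, 2.0, 3.0, 4.0]
forecast = [0.0, 0.0, 2.0, 6.0]  # first two rows are zero padding
rmse = oos_rmse(actual, forecast, first_oos=2)
print(rmse)
```

For the monthly files, the OOS start 2010:M1 corresponds to 0-based
row 84 when the first row is 2003:M1.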


## Remaining Data

There are 12 remaining files in the "Data" folder, containing (i)
forecasting results of combination model, (ii) professional forecast
(PF) and (iii) other files used in drawing the figures. All these
files are in Excel format.

-->> The excel file, "Forecast_CPI_Best_Realtime.xlsx", reports the
model combination of individual MIDAS regressions with the optimal lag
order for each predictor. The combination models and the meanings of
the column names are illustrated in Table 3, where "mf" denotes the
benchmark model (MF), "zp" denotes the combination of MF and the
forecast of word-repetition index (MF + PCI (z+)), "zz" denotes the
combination of MF and the forecast of word-inclusion index (MF + PCI
(z0)), "cb" denotes the combination of MF and the forecasts of
word-repetition and inclusion indices (MF + PCI (z+ & z0)) and "mfcb"
denotes the combination of MF and 12 PCIs. Notations "ew", "bicw" and
"msfew" denote the equal, Bayesian information criterion (BIC) and
mean squared forecast error (MSFE) weight schemes, respectively. The
OOS forecasting begins at 2010:M1. The first sheet, "rec_now", reports
the nowcasting result, and the remaining sheets report the forecasting
results where the number represents the forecast horizon. The
in-sample fitted values are padded with 0. There are also forecasting
results for other target variables, i.e., GDP, IVA, FAI, M2 and PPI. 
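
The three weight schemes can be sketched in Python as follows. These
are common textbook forms (equal weights, weights proportional to
exp(-BIC/2), and weights proportional to inverse MSFE) and are given
only as an assumption; the exact formulas used in the paper and in
the R code may differ. All function names are hypothetical.

```python
import math


def equal_weights(n):
    """Equal weight 1/n for each of n models ("ew")."""
    return [1.0 / n] * n


def bic_weights(bic):
    """Weights proportional to exp(-0.5 * BIC) ("bicw").

    The minimum BIC is subtracted first for numerical stability.
    """
    base = min(bic)
    raw = [math.exp(-0.5 * (b - base)) for b in bic]
    total = sum(raw)
    return [r / total for r in raw]


def msfe_weights(msfe):
    """Weights proportional to 1 / MSFE ("msfew")."""
    inverse = [1.0 / m for m in msfe]
    total = sum(inverse)
    return [v / total for v in inverse]


def combine(forecasts, weights):
    """Weighted average of individual model forecasts."""
    return sum(w * f for w, f in zip(weights, forecasts))


w = msfe_weights([1.0, 3.0])  # the more accurate model gets more weight
print(w)
print(combine([2.0, 4.0], w))
```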

-->> The excel file, "Forecast_Text_PF.xlsx", reports the
professional forecasts (PFs) of GDP, IVA, CPI, M2 and PPI. PF is given
from 2010:Q1 or 2010:M1. The column "Mean" is the mean forecast across
professional forecasters, the column "Max" is the maximum forecast and
the column "Min" is the minimum forecast. The remaining columns show
the best
real-time forecasts generated from the statistical models used in this
paper.

-->> The excel file, "Oral_Frequency.xlsx", records the quarterly
occurrence of four kinds of oral communication, i.e., speech, press,
article and interview. The sample period is 2003:Q1 to 2019:Q2. 
Figure 2 is drawn by using the data in this file. 

-->> The excel files, "PCIs_Fullsample.xlsx" and
"PCIs_Fullsample_Realtime.xlsx", combine the PCIs and target
variables. For ease of display, the real-time PCIs are taken from the
one-column files in the "Temp_Data" folder. The sample period is
2003:Q1 (or 2003:M1) to 2019:Q2 (or 2019:M6).


# Programs and Results

There are 14 program files in the "Programs" folder, which produce
the main results of the paper in the "Results" folder. The final
results are obtained in 4 steps:


## Step 1: Estimate the Bayes DFM to Generate the Macro Factors

1) Run the MATLAB code "Step_1A_Bayes_DFM_Fullsample.m" ==>>
Full-sample estimation of macro factors (location:
"Data\Temp_Data\MFactors_Fullsample.csv")

2) Run the MATLAB code "Step_1B_Bayes_DFM_Realtime.m" ==>> Real-time
estimation of macro factors (location:
"Data\Temp_Data\MFactors_Realtime.csv")


## Step 2: Estimate the HDMR to Generate the PCIs

1) Run the Julia code "Step_2A_Construct_PCIs_Fullsample.ipynb" ==>>
Full-sample estimation of PCIs (location:
"Data\Temp_Data\PCIs_CPI_Fullsample.csv", etc.)

2) Run the Julia code "Step_2B_Construct_PCIs_Realtime.ipynb" ==>>
Real-time estimation of PCIs used in nowcasting (location:
"Data\Temp_Data\PCIs_CPI_Realtime_Nowcast.csv", etc.)
 
3) Run the Julia code "Step_2C_Transform_PCIs_Realtime.ipynb" ==>>
Transform real-time PCIs used in forecasting (location:
"Data\Temp_Data\PCIs_CPI_Realtime.csv", etc.) and one-column PCIs used
to draw Figure 5 (location:
"Data\Temp_Data\PCIs_CPI_Realtime_OneColumn.csv", etc.)

4) Run the Julia code "Step_2D_Pivotal_Phrases_Realtime.ipynb" ==>>
Compute phrase importance (location:
"Data\Temp_Data\Pivotal_Phrase_Realtime.xlsx", etc.) and generate
Table 5 (location: "Results\Tab_5.xlsx")


## Step 3: Estimate the MIDAS Model to Generate the Real-time OOS Forecasts

1) Run the R code "Step_3A_Forecast_GDP_Realtime.R" ==>> Real-time OOS
forecasting results of the best combination model for GDP (location:
"Data\Forecast_GDP_Best_Realtime.xlsx")

2) Run the R code "Step_3B_Forecast_IVA_Realtime.R" ==>> Real-time OOS
forecasting results of the best combination model for IVA (location:
"Data\Forecast_IVA_Best_Realtime.xlsx")

3) Run the R code "Step_3C_Forecast_FAI_Realtime.R" ==>> Real-time OOS
forecasting results of the best combination model for FAI (location:
"Data\Forecast_FAI_Best_Realtime.xlsx")

4) Run the R code "Step_3D_Forecast_CPI_Realtime.R" ==>> Real-time OOS
forecasting results of the best combination model for CPI (location:
"Data\Forecast_CPI_Best_Realtime.xlsx")

5) Run the R code "Step_3E_Forecast_M2_Realtime.R" ==>> Real-time OOS
forecasting results of the best combination model for M2 (location:
"Data\Forecast_M2_Best_Realtime.xlsx")

6) Run the R code "Step_3F_Forecast_PPI_Realtime.R" ==>> Real-time OOS
forecasting results of the best combination model for PPI (location:
"Data\Forecast_PPI_Best_Realtime.xlsx")


## Step 4: Replicate the Main Results in the Paper

1) Run the Stata do file "Step_4A_Figures.do" ==>> Figures 2-5
(location: "Results\Fig_2.pdf", etc.)

2) Run the Stata do file "Step_4B_Tables.do" ==>> Tables 4 and 6
(location: "Results\Tab_4.xlsx", etc.)
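
The run order above can be summarised in a short Python sketch. The
file names are those listed in Steps 1-4; the helper "portable" is
hypothetical and only converts the Windows-style locations used in
this readme into forward-slash paths for Unix/Linux users. The sketch
does not launch any of the programs.

```python
from collections import Counter
from pathlib import PureWindowsPath

# (program file, tool) in the order in which they should be run.
PIPELINE = [
    ("Step_1A_Bayes_DFM_Fullsample.m", "MATLAB"),
    ("Step_1B_Bayes_DFM_Realtime.m", "MATLAB"),
    ("Step_2A_Construct_PCIs_Fullsample.ipynb", "Julia"),
    ("Step_2B_Construct_PCIs_Realtime.ipynb", "Julia"),
    ("Step_2C_Transform_PCIs_Realtime.ipynb", "Julia"),
    ("Step_2D_Pivotal_Phrases_Realtime.ipynb", "Julia"),
    ("Step_3A_Forecast_GDP_Realtime.R", "R"),
    ("Step_3B_Forecast_IVA_Realtime.R", "R"),
    ("Step_3C_Forecast_FAI_Realtime.R", "R"),
    ("Step_3D_Forecast_CPI_Realtime.R", "R"),
    ("Step_3E_Forecast_M2_Realtime.R", "R"),
    ("Step_3F_Forecast_PPI_Realtime.R", "R"),
    ("Step_4A_Figures.do", "Stata"),
    ("Step_4B_Tables.do", "Stata"),
]


def portable(location):
    """Convert a backslash-separated location to forward slashes."""
    return PureWindowsPath(location).as_posix()


files_per_tool = Counter(tool for _, tool in PIPELINE)
print(len(PIPELINE), dict(files_per_tool))
print(portable(r"Data\Temp_Data\MFactors_Fullsample.csv"))
```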