hostcrafts.blogg.se - Stata drop duplicates

The codes not only remove duplicate GVKEY-DATADATE pairs but also remove duplicate GVKEY-FYQ (or GVKEY-CYQ) pairs, which will then allow us to use the tsset command and do lag and change calculations in Stata, e.g., to get beginning-of-quarter total assets or calculate seasonal changes in sales. By the way, these are the cleverest Stata codes I’ve ever written 🙂 The codes also create a new variable FYQ (or CYQ) that represents fiscal year and quarter (or calendar year and quarter). Suppose we have converted FUNDQ raw file to Stata format, the following Stata codes will implement the above strategy and save the results in FUNDQ_NODUP. This means, in the above example, we should delete the following observations: In my opinion, the best strategy is to retain the GVKEY-DATADATE that has reflected the most recent change of fiscal year end. Unfortunately, I find that the suggested rule is not the best strategy, because COMPUSTAT seems to set DATAFQTR as missing or non-missing in an inconsistent way. Rule: Select records from the co_idesind data group where datafqtr is not null, to view as fiscal data. Compustat delivers those multiple records with the same datadate but each record relates to a different fiscal year-end period. Note: Companies that undergo a fiscal-year change may have multiple records with the same datadate.

In the definition of DATAFQTR, COMPUSTAT notes that, That said, if we want to remove duplicates in a more careful way, COMPUSTAT gives the following clue:

This suggests that no matter how we deal with duplicates, even simply delete all of them, our results probably won’t change in a noticeable way. So, what’s the best strategy to remove duplicate GVKEY-DATADATE pairs?īefore we answer this question, let’s take a closer look at duplicate GVKEY-DATADATE pairs in FUNDQ, which reveals that 99.8% of GVKEY-DATADATE pairs in FUNDQ are unique as of December 5, 2107. In contrast, the observation for fiscal 2010Q1 indicates NIQ = NIY, because both are single-quarter net income in this case. Next, if we compare NIQ and NIY as highlighted in the red rectangle, the observation for fiscal 2009Q4 indicates NIY > NIQ, which makes sense because NIY is four-quarter sum and NIQ is single-quarter net income. FUNDQ also reports additional duplicates for the subsequent two quarters (I don’t know why). As a result, appeared twice in FUNDQ, one is fiscal 2009Q4 (based on the old fiscal year end) and the other is 2010Q1 (based on the new fiscal year end). The data suggest that on March 31, 2010, the firm changed its fiscal year end from March 31 to December 31 (i.e., FYR changed from 3 to 12). In this example, duplicates exist for three DATATEs:, , and. Variable definition: FYEARQ – fiscal year FQTR – fiscal quarter FYR – fiscal year end month DATACQTR – calendar quarter DATAFQTR – fiscal quarter ATQ – total assets NIQ – quarterly net income NIY – year-to-date net income. I use the following example as an illustration: The root cause of duplicate GVKEY-DATADATE pairs is a firm changing its fiscal year end. But duplicate GVKEY-DATADATE pairs still exist in FUNDQ. In fact, 99.7% observations in FUNDQ have already satisfied these conditions. First of all, applying the same conditions won’t work. The quarterly data (FUNDQ) is a little bit complicated. This command will return “no duplicates”. If we have converted FUNDA to Stata format, the uniqueness of GVKEY-DATADATE can be verified using the following Stata command:ĭuplicates report gvkey datadate if indfmt="INDL" & datafmt="STD" & popsrc="D" & consol="C"

Indfmt="INDL" & datafmt="STD" & popsrc="D" & consol="C" The annual data (FUNDA) is easy to deal with, we just need to apply the following conditions: