Part 1

Problem: The 13F files does not contain TICKER SYMBOL which acts as standard identifers for the asset.

Solve: SEC contains other (not 13F) datasets that have TICKER SYMBOLS and CUSIPS. CUSIPS are avaiabile in the 13F files so we can join them together.

"Fails-to Deliver" dataset contains SYMBOL AND CUSIPS.

fail_deliver-2.jpg

On SEC, the file name and download links have a discerable pattern so lets organize a table with 2 columns: name, url via a dataframe or excel.

Download the files into a folder.

Loop through each file, unzip it (using the package ZipFile), and concatenate it into a single dataframe.

Loop through each item:

Download and unzip the data

Append the data onto a dataframe called symboldf

We have all the Fail to Deliver data. All 1.2m rows of it.

Remove duplicates and only keep the relevant columns, CUSIP and SYMBOL.

Part 2

Download the 13F quarterly data (Mar, June, Sept, Dec) from SEC

Each quarterly file contains 8 TSV files

Study SEC data dictionary for info on each file

For this excersice, I use the SUBMISSION, COVERPAGE & INFOTABLE.

sec_files.JPG

Join each using the function pd.merge on the primary identifier 'ACCESSION_NUMBER'

Example:

sub_cover_june = pd.merge(sub_june, cover_june, on='ACCESSION_NUMBER', how='inner')

sub_cover_it_june = pd.merge(sub_cover_june, it_june, on='ACCESSION_NUMBER', how='inner')

Recap:

We have full year 13F data aggregated in dataframe x

We have CUSIP and SYMBOL info in the dataframe symboldf

The last step is merging the two data together.

Finally, we have a dataset of all holdings filed with the SEC with the SYMBOL column which provides a unified and simple way to query the underlying assets.