Commit d376d8b

Add Datasets class, DownloadApi, and bulk dataset download/sync support

- Add Datasets class with get_all(), get_dataset_details(), show_all(), show_dataset_details(), download(), and sync() methods
- Downloads are atomic (.tmp + rename) with incremental sync support: only new or updated containers are re-downloaded, based on file size
- Helpful error on misspelled dataset names listing all available datasets
- Add DownloadApi class (alias for RenderApi, backward compatible)
- Update filing download endpoint to edgar-mirror.sec-api.io
- Add Quick Start section, Bulk Datasets docs with download/sync examples
- Add tests for Datasets, DownloadApi (42 tests total)

1 parent a576af5 commit d376d8b

6 files changed: +388 -6 lines changed


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -11,4 +11,5 @@ sec_api/__pycache__
 requirements.txt
 tmp
 .superseded
-TODO.md
+TODO.md
+.claude

README.md

Lines changed: 177 additions & 4 deletions
@@ -499,9 +499,9 @@ Variants such as `ConsolidatedStatementsofOperations` or `ConsolidatedStatements
 There are 3 ways to convert XBRL to JSON:
 
 - `htm_url`: Provide the URL of the filing ending with `.htm`.
-Example URL: https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm
+Example URL: [sec.gov/.../tsla-10k_20201231.htm](https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm)
 - `xbrl_url`: Provide the URL of the XBRL file ending with `.xml`. The XBRL file URL can be found in the `dataFiles` array returned by our query API. The array item has the description `EXTRACTED XBRL INSTANCE DOCUMENT` or similar.
-Example URL: https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231_htm.xml
+Example URL: [sec.gov/.../tsla-10k_20201231_htm.xml](https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231_htm.xml)
 - `accession_no`: Provide the accession number of the filing, e.g. `0001564590-21-004599`
 
 ```python
@@ -775,14 +775,187 @@ Download complete datasets for offline analysis and large-scale processing. All
 | Form 10-Q - Quarterly Reports | 10-Q, 10-Q/A | 1993-present | ZIP (HTML, TXT) |
 | Form 8-K Exhibit 99 - Press Releases | 8-K, 8-K/A | 1994-present | ZIP (HTML, TXT, PDF) |
 | Earnings Results (Item 2.02) | 8-K, 8-K/A | 2004-present | ZIP (HTML, TXT, PDF) |
-| Form 3 - Initial Ownership | 3, 3/A | 2009-present | JSONL |
 | Form 4 - Changes in Ownership | 4, 4/A | 2009-present | JSONL |
-| Form 5 - Annual Ownership | 5, 5/A | 2009-present | JSONL |
 | Form 13F - Institutional Holdings | 13F-HR, 13F-HR/A | 2013-present | JSONL |
 | Form N-PORT - Fund Holdings | NPORT, NPORT/A | 2019-present | JSONL |
 | Form DEF 14A - Proxy Statements | DEF 14A | 1994-present | ZIP (HTML, TXT) |
 | [View all datasets...](https://sec-api.io/datasets) | | | |
 
+### Download a Dataset
+
+Downloads are atomic: each file is written to a `.tmp` file first and renamed on completion, so an interrupted download never leaves a partial file under its final name, and the next run automatically continues with whatever is still missing. Only new or updated files are downloaded; existing files are skipped if their size matches the remote. This makes it easy to keep a local copy of any dataset in sync with a single line of code.
+
+```python
+from sec_api import Datasets
+
+datasets = Datasets(api_key="YOUR_API_KEY")
+
+# first run: downloads all containers to ./sec-api-datasets/form-10k-content/
+datasets.download("form-10k-content")
+
+# subsequent runs: only downloads new or updated containers, skips the rest
+datasets.sync("form-10k-content")
+
+# specify a custom directory
+datasets.download("form-10k-content", path="./my-data/form-10k-content")
+```
+
+Alternatively, download the entire dataset as a single ZIP file:
+
+```python
+datasets.download("form-10k-content", strategy="zip")
+```
+
+Set up a daily cron job or scheduled task to keep your local dataset up to date:
+
+```python
+# sync.py - run daily via cron, e.g.: 0 6 * * * python sync.py
+from sec_api import Datasets
+
+datasets = Datasets(api_key="YOUR_API_KEY")
+datasets.sync("form-10k-content", path="./my-data/form-10k-content")
+```
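
The `.tmp` + rename behavior described in this section can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's actual implementation: `atomic_write` and `interrupted_source` are hypothetical names, and the real `Datasets.download()` may handle cleanup and retries differently.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def atomic_write(dest: Path, chunks: Iterable[bytes]) -> None:
    """Write chunks to '<dest>.tmp', then rename onto dest once complete.

    Hypothetical helper: if the chunk source fails midway, dest is never
    created, so a later run can tell the file is still missing.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".tmp")
    with open(tmp, "wb") as fh:  # "wb" truncates any stale .tmp from an earlier run
        for chunk in chunks:
            fh.write(chunk)
    # os.replace is atomic when tmp and dest are on the same filesystem
    os.replace(tmp, dest)


def interrupted_source() -> Iterator[bytes]:
    """Simulate a download that drops after the first chunk."""
    yield b"partial"
    raise ConnectionError("simulated network drop")


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "2026" / "2026-04.zip"

    try:
        atomic_write(target, interrupted_source())
    except ConnectionError:
        pass
    exists_after_failure = target.exists()  # only the .tmp file was written

    atomic_write(target, [b"complete ", b"payload"])
    exists_after_success = target.exists()
    final_bytes = target.read_bytes()
```

Because the final name only appears after the rename, a size check against the remote never sees a half-written file under that name.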
+
+### List Available Datasets
+
+```python
+from sec_api import Datasets
+
+datasets = Datasets()
+
+# no API key required - returns raw JSON list
+all_datasets = datasets.get_all()
+```
+
+<details>
+<summary>Example Response (shortened)</summary>
+
+```json
+[
+  {
+    "datasetId": "1f11ba9b-e03a-6950-a464-a23fcc53ee6f",
+    "datasetIdInUrl": "audit-fees",
+    "name": "Audit Fees",
+    "description": "Structured dataset of annual audit fees extracted from SEC filings...",
+    "formTypes": ["DEF 14A"],
+    "containerFormat": ".jsonl.gz",
+    "fileTypes": ["JSONL"],
+    "updatedAt": "2026-04-09T05:00:01.000Z",
+    "earliestSampleDate": "2001-03-01",
+    "totalRecords": null,
+    "totalSize": 9792910
+  },
+  {
+    "datasetId": "1f12abbc-262c-65a0-8b3e-1288c41dcc76",
+    "datasetIdInUrl": "earnings-results-form-8-k-item-2-02",
+    "name": "Earnings Results - Form 8-K, Item 2.02 (2004-Present)",
+    "description": "The Form 8-K Item 2.02 Results Dataset contains all disclosures filed on EDGAR...",
+    "formTypes": ["8-K", "8-K/A"],
+    "containerFormat": "ZIP",
+    "fileTypes": ["HTML", "JSON", "TXT", "GIF", "JPG", "PDF"],
+    "updatedAt": "2026-04-09T07:07:44.885Z",
+    "earliestSampleDate": "2004-08-01",
+    "totalRecords": 2242018,
+    "totalSize": 154607640756
+  }
+]
+```
+
+</details>
+
+<br>
+
+Or use `show_all()` for formatted terminal output:
+
+```python
+datasets.show_all()
+```
+
+```
+ID Name Format Size
+────────────────────────────────────────────────── ─────────────────────────────────────────────────────── ────────── ────────────
+audit-fees Audit Fees .jsonl.gz 9.8 MB
+earnings-results-form-8-k-item-2-02 Earnings Results - Form 8-K, Item 2.02 (2004-Present) ZIP 154.6 GB
+form-10k-content Form 10-K - Annual Reports - Filing Contents ZIP 33.8 GB
+form-4 Form 4 – Statement of Changes in Beneficial Ownership .jsonl.gz 912.2 MB
+...
+
+28 datasets available. Browse all at https://sec-api.io/datasets
+```
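
The Size column in this output is consistent with decimal (1000-based) units applied to `totalSize`: 9792910 bytes is shown as 9.8 MB and 154607640756 bytes as 154.6 GB. A small sketch of such a formatter follows; `format_size` is a hypothetical helper written for illustration, and the library's own formatting code may differ.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count with decimal (1000-based) units, one decimal place."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1000,
        # or at the largest unit available.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} {units[-1]}"  # not reached; keeps the return type explicit


print(format_size(9792910))       # 9.8 MB   (audit-fees totalSize)
print(format_size(154607640756))  # 154.6 GB (earnings-results totalSize)
```

The same helper can be applied to the raw list returned by `get_all()` when a custom report is needed instead of the built-in terminal output.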
+
+### Get Dataset Details
+
+```python
+# returns raw JSON dict
+details = datasets.get_dataset_details("form-10k-content")
+```
+
+<details>
+<summary>Example Response (shortened)</summary>
+
+```json
+{
+  "datasetId": "1f11bb55-d58b-6080-bace-e7a62567f4b9",
+  "datasetDownloadUrl": "https://api.sec-api.io/datasets/form-10k-content.zip",
+  "name": "Form 10-K - Annual Reports - Filing Contents",
+  "description": "HTML and TXT files of all Form 10-K filings published since 1993...",
+  "updatedAt": "2026-04-09T07:07:57.058Z",
+  "earliestSampleDate": "1993-10-01",
+  "totalRecords": 303021,
+  "totalSize": 33809939825,
+  "formTypes": [
+    "10-K",
+    "10-K/A",
+    "10-K405",
+    "10-K405/A",
+    "10-KSB",
+    "10-KSB/A",
+    "10-KT",
+    "10-KT/A"
+  ],
+  "containerFormat": "ZIP",
+  "fileTypes": ["TXT", "JSON", "HTML", "PAPER"],
+  "containers": [
+    {
+      "downloadUrl": "https://api.sec-api.io/datasets/form-10k-content/2026/2026-04.zip",
+      "key": "2026/2026-04.zip",
+      "size": 15593008,
+      "records": 167,
+      "updatedAt": "2026-04-09T07:07:57.058Z"
+    },
+    {
+      "downloadUrl": "https://api.sec-api.io/datasets/form-10k-content/2026/2026-03.zip",
+      "key": "2026/2026-03.zip",
+      "size": 616726590,
+      "records": 6468,
+      "updatedAt": "2026-04-02T02:52:01.741Z"
+    }
+  ]
+}
+```
+
+</details>
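
The `containers` array in this response carries enough information (`key` and `size`) to reproduce the size-based skip rule described for `sync()`. The sketch below is an assumption about how such a check could look, not the library's code: `containers_to_sync` is a hypothetical helper, and the local layout (one file per `key` under a base directory) is assumed for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def containers_to_sync(details: Dict, local_dir: Path) -> List[Dict]:
    """Return containers whose local copy is missing or differs in size."""
    pending = []
    for container in details.get("containers", []):
        local_file = local_dir / container["key"]
        if not local_file.exists() or local_file.stat().st_size != container["size"]:
            pending.append(container)
    return pending


# Trimmed copy of the example response (only the fields used here)
details = {
    "containers": [
        {"key": "2026/2026-04.zip", "size": 15593008},
        {"key": "2026/2026-03.zip", "size": 616726590},
    ]
}

with tempfile.TemporaryDirectory() as workdir:
    local_dir = Path(workdir)
    # Pretend 2026-04.zip was downloaded earlier: create a file of matching size
    existing = local_dir / "2026" / "2026-04.zip"
    existing.parent.mkdir(parents=True)
    with open(existing, "wb") as fh:
        fh.truncate(15593008)

    pending_keys = [c["key"] for c in containers_to_sync(details, local_dir)]

print(pending_keys)  # ['2026/2026-03.zip']
```

Comparing sizes is cheap but cannot detect a container whose content changed without changing its length; comparing `updatedAt` against a stored timestamp would be one way to tighten the check.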
+
+<br>
+
+Or use `show_dataset_details()` for formatted terminal output:
+
+```python
+datasets.show_dataset_details("form-10k-content")
+```
+
+```
+Name: Form 10-K - Annual Reports - Filing Contents
+Description: HTML and TXT files of all Form 10-K filings published since 1993...
+Updated: 2026-04-09T07:07:57.058Z
+Earliest data: 1993-10-01
+Form types: 10-K, 10-K/A, 10-K405, 10-K405/A, 10-KSB, 10-KSB/A, 10-KT, 10-KT/A
+Format: ZIP
+Total records: 303,021
+Total size: 33.8 GB
+Containers: 390
+```
+
 ## Form ADV API
 
 Search and access Form ADV data for registered investment advisers, including firm information, individual advisors, direct/indirect owners, private fund data, and brochures.

sec_api/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -58,6 +58,9 @@
 # EDGAR Index
 from sec_api.index import EdgarIndexApi
 
+# Bulk Datasets
+from sec_api.index import Datasets
+
 # Other APIs
 from sec_api.index import EdgarEntitiesApi
 from sec_api.index import MappingApi
