diff --git a/docs/source/index.rst b/docs/source/index.rst
index c6cf09cb..d0d9e7f7 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -28,6 +28,7 @@ This documentation covers PySUS 2.0+.
 
    Installation <installation>
    Data Sources <databases/data-sources>
+   Working with DATASUS data <working-with-datasus-data>
    Tutorials <tutorials>
    API Reference <api>
 
diff --git a/docs/source/working-with-datasus-data.rst b/docs/source/working-with-datasus-data.rst
new file mode 100644
index 00000000..a5e3b4cb
--- /dev/null
+++ b/docs/source/working-with-datasus-data.rst
@@ -0,0 +1,428 @@
+.. _working-with-datasus-data-gotchas--field-semantics:
+
+Working with DATASUS data: gotchas & field semantics
+====================================================
+
+   Practical field notes for anyone building pipelines on DATASUS microdata with PySUS. Every item below is a real trap encountered while processing the SIH-RD, SIH-SP, SIM, SIA-AM, SIA-PA, CNES-EQ
+   and CNES-PF systems for all 27 Brazilian states, 2008–2025 (CNES since 2005). DATASUS does not standardize conventions across systems and does not publish a changelog for its file layouts, so most
+   of these can only be learned the hard way.
+
+.. contents:: On this page
+   :local:
+   :depth: 1
+
+.. _1-the-systems-at-a-glance:
+
+1. The systems at a glance
+--------------------------
+
+============================================= ==================== ========================================= =============== ========================
+System                                        File pattern         Content                                   Granularity     Typical volume
+============================================= ==================== ========================================= =============== ========================
+**SIH-RD** (hospital admissions, reduced AIH) ``RD{UF}{YYMM}.dbc`` Hospitalizations                          Monthly / state ~500K–2M rec/state/yr
+**SIH-SP** (hospital professional services)   ``SP{UF}{YYMM}.dbc`` Secondary professional acts per admission Monthly / state ~1M–5M rec/state/mo
+**SIM** (mortality)                           ``DO{UF}{YYYY}.dbc`` Death certificates                        Annual / state  ~50K–300K rec/state/yr
+**SIA-AM** (outpatient, APAC medications)     ``AM{UF}{YYMM}.dbc`` High-cost drug dispensations              Monthly / state ~100K–500K rec/state/mo
+**SIA-PA** (outpatient production)            ``PA{UF}{YYMM}.dbc`` Outpatient procedures                     Monthly / state **1M–17M+ rec/state/mo**
+**CNES-EQ** (facilities, equipment)           ``EQ{UF}{YYMM}.dbc`` Equipment inventory                       Monthly / state ~50K–200K rec/state/mo
+**CNES-PF** (facilities, professionals)       ``PF{UF}{YYMM}.dbc`` Health professionals                      Monthly / state ~200K–800K rec/state/mo
+============================================= ==================== ========================================= =============== ========================
+
+Each system has its own column names, encodings and date formats. **There is no standardization across them.**
+
+--------------
+
+.. _2-the-dbc-format:
+
+2. The DBC format
+-----------------
+
+``.dbc`` is a proprietary compression used only by DATASUS: internally a dBASE ``.dbf`` file compressed with a variant of PKWare's ``blast``/implode. There is no complete official spec.
+
+-  **SIGSEGV on corrupt files.** The C decompressor (``blast.c``, used by ``pyreaddbc``) can crash the host process with a segmentation fault on corrupt or malformed headers. This kills the process —
+   ``try/except`` cannot catch it, so a single bad file can abort a whole batch. Running the DBC→DBF step in a subprocess (with retries) contains the blast radius to one file.
+-  **Unpredictable expansion ratio.** A 110 MB SIA-PA ``.dbc`` can expand to ~1.1 GB of ``.dbf``. You cannot preallocate based on the compressed size.
+-  **Platform/build sensitivity.** ``pyreaddbc`` compiles C; on Alpine-based Docker images it needs build packages (``gcc``, ``musl-dev``), and ARM (Apple Silicon) can surface additional issues.
+
+.. _3-encoding-always-latin-1:
+
+3. Encoding: always latin-1
+---------------------------
+
+All DATASUS files use **Latin-1 (ISO-8859-1)**, but routinely contain:
+
+-  embedded NUL bytes (``\x00``) inside character fields,
+-  control characters outside the printable range,
+-  raw binary where text is expected (corruption).
+
+Decode defensively on every field, and strip NULs before persisting:
+
+.. code:: python
+
+   value = raw_bytes.decode("latin-1", "replace").strip().replace("\x00", "")
+
+.. warning::
+
+   Reading with ``utf-8`` (Python's default) fails — silently or with ``UnicodeDecodeError`` — on a large fraction of files. Always force ``latin-1``.
+
+.. _4-file-naming-and-the-ftp-layout:
+
+4. File naming and the FTP layout
+---------------------------------
+
+======= ====================== ================ ===========
+System  Name pattern           Example          Year digits
+======= ====================== ================ ===========
+SIH-RD  ``RD{UF}{YY}{MM}.dbc`` ``RDSP2301.dbc`` **2**
+SIH-SP  ``SP{UF}{YY}{MM}.dbc`` ``SPSP2301.dbc`` **2**
+SIM     ``DO{UF}{YYYY}.dbc``   ``DOSP2023.dbc`` **4**
+SIA-AM  ``AM{UF}{YY}{MM}.dbc`` ``AMSP2301.dbc`` **2**
+SIA-PA  ``PA{UF}{YY}{MM}.dbc`` ``PASP2301.dbc`` **2**
+CNES-EQ ``EQ{UF}{YY}{MM}.dbc`` ``EQSP2301.dbc`` **2**
+CNES-PF ``PF{UF}{YY}{MM}.dbc`` ``PFSP2301.dbc`` **2**
+======= ====================== ================ ===========
+
+-  **Two-digit years need a pivot.** SIH/SIA/CNES abbreviate the year, so ``23`` → 2023 requires a pivot (e.g. ``< 50`` → 20xx, else 19xx). **SIM uses four digits** — applying the pivot to SIM
+   produces garbage years.
+-  **Alphabetic version suffix.** Some files carry a suffix: ``RDSP2301a.dbc``, ``RDSP2301b.dbc``. Filename regexes must allow an optional trailing letter, e.g. ``RD{UF}\d{4}[a-zA-Z]?\.dbc``
+   (case-insensitive).
+-  **Shared directories.** SIH-RD and SIH-SP live in the *same* FTP directory, distinguished only by the ``RD``/``SP`` prefix; SIA-AM and SIA-PA likewise share a directory (``AM``/``PA``).
+-  **Different start dates.** SIH/SIA/SIM begin at ``200801_``; CNES begins earlier at ``200508_`` (May 2005). Pre-2008 SIH/SIA data live elsewhere with a different name format.
+
+FTP base: ``ftp://ftp.datasus.gov.br/dissemin/publicos/``
+
+============ =========================
+System       Path
+============ =========================
+SIH (RD, SP) ``SIHSUS/200801_/Dados``
+SIM (DO)     ``SIM/CID10/DORES``
+SIA (AM, PA) ``SIASUS/200801_/Dados``
+CNES-EQ      ``CNES/200508_/Dados/EQ``
+CNES-PF      ``CNES/200508_/Dados/PF``
+============ =========================
+
+.. _5-the-datasus-ftp-server:
+
+5. The DATASUS FTP server
+-------------------------
+
+-  **Anonymous access**, IIS/Windows-style listings, recommended timeout ~120 s.
+-  **Connections drop**, especially on files > 50 MB. Wrap transfers with automatic reconnect + retry rather than failing the whole job on one dropped socket.
+-  **Directory listing is expensive.** The SIA directory holds thousands of files (27 states × 12 months × 20+ years × 2 prefixes). Issue a **single ``LIST``** per session and cache
+   ``{filename: size}`` instead of one ``LIST``/``SIZE`` per state.
+-  **Verify integrity by size.** A cached local file whose byte count differs from the server's is truncated/corrupt — re-download it. Existence alone is not enough (see also the staleness point
+   below).
+-  **Recent months get revised.** DATASUS retroactively updates the most recent months. A naive "already downloaded → skip" cache silently serves stale data. Always re-fetch a trailing window (e.g.
+   the last 6 months) even if a same-named file already exists; older consolidated files can be skipped safely.
+
+.. _6-large-files-and-memory-sia-pa:
+
+6. Large files and memory (SIA-PA)
+----------------------------------
+
+SIA-PA is orders of magnitude larger than the other systems:
+
+========== ============= ============== ============
+System     ``.dbc`` size ``.dbf`` size  Records/file
+========== ============= ============== ============
+SIH        5–30 MB       50–300 MB      200K–1.5M
+SIM        2–15 MB       20–150 MB      50K–300K
+SIA-AM     5–40 MB       50–400 MB      100K–500K
+**SIA-PA** **50–110 MB** **0.5–1.1 GB** **5M–17M+**
+========== ============= ============== ============
+
+Loading one São Paulo SIA-PA file into pandas to then filter can consume 8–16 GB and OOM. The fix is to **filter before materializing**: scan the fixed-width records at the byte level on just the
+column(s) you filter on (e.g. the CID columns), collect matching row indices, and only build a DataFrame from those.
+
+Observed on a real file (SIA-PA, São Paulo, 2023-01, 17.2M records → 1,284 rows with neurological CIDs): full read + filter ≈ **12 min / 14 GB**; byte-level pre-filter ≈ **45 s / 0.8 GB**.
+
+Other practical guards: batch inserts (e.g. 5,000 rows) with savepoints so one bad batch does not roll back the rest, and generous worker timeouts for full 27-state pulls.
+
+.. _7-column-layouts-drift-across-years:
+
+7. Column layouts drift across years
+------------------------------------
+
+DATASUS does not version file layouts. Columns are added, renamed or removed silently between years. An ETL written against one year's files may break — months later — on another year's. Audited
+deltas between the oldest and newest available files:
+
+-  **SIA-AM:** 50 → 51 columns (``AP_NATJUR`` added ~2017).
+-  **SIA-PA:** 54 → 60 (+6: ``PA_INE``, ``PA_NAT_JUR``, ``PA_SRV_C``, ``PA_VL_CF``, ``PA_VL_CL``, ``PA_VL_INC``).
+-  **SIH:** 86 → 113 (+27), incl. detailed secondary diagnoses ``DIAGSEC1``–``DIAGSEC9`` and their types ``TPDISEC1``–``TPDISEC9``, ICU markers, finance breakdowns.
+-  **SIM:** 54 → 87 (+37, −4). Added: 2010 schooling codes, the death-investigation module, maternal-death detail, codification/version fields. Removed: ``CODBAIOCOR``, ``CODBAIRES``, ``TPASSINA``,
+   ``UFINFORM``.
+
+Implications:
+
+1. Access columns defensively (``df.get("COL", default)``), never assume presence.
+2. The only reliable source of truth for a year is **the DBC header itself**.
+3. Test your pipeline against the **oldest and newest** files before a full run.
+
+.. _8-do-not-trust-third-party-data-dictionaries:
+
+8. Do not trust third-party data dictionaries
+---------------------------------------------
+
+Column names in non-official dictionaries, forums and even some semi-official docs do **not** always match the real DBC field names. With ``df.get("wrong_name")`` the result is a silently empty column
+— no error, no warning.
+
+Real examples (SIA-PA, verified absent in DBC from 2010 and 2024):
+
+================== =============== ============
+Documented (wrong) Actual DBC name Symptom
+================== =============== ============
+``PA_CNPJCC``      ``PA_CNPJ_CC``  always empty
+``PA_CNSPROF``     ``PA_CNSMED``   always empty
+``PA_VPABRE``      ``NU_VPA_TOT``  always 0
+``PA_PAESSION``    ``NU_PA_TOT``   always 0
+``PA_IND_PA``      ``PA_INDICA``   always empty
+================== =============== ============
+
+"Phantom" columns are the dual hazard — names that never existed in any year (e.g. SIM ``STNOVA`` confused with ``STDONOVA``; ``NUMERODO``, which has never existed in SIM disseminação files). Because
+``df.get(...)`` returns NaN, a researcher concludes "DATASUS never fills this field," when in fact the field name is wrong.
+
+**Rule of thumb:** if a field is 100% null, verify the real column name in the DBC header before concluding the field is unused. Dump the header to check:
+
+.. code:: python
+
+   # field descriptors live in the DBF header right after DBC decompression
+   for name, ftype, length, decimals in dbf_fields:
+       print(f"{name:20s} type={ftype} len={length}")
+
+.. _9-same-concept-different-column-names:
+
+9. Same concept, different column names
+---------------------------------------
+
+The same concept is named differently in every system — a Rosetta table:
+
+====================== =========================== ================== ====================== ============== ============== ============ ============
+Concept                SIH-RD                      SIH-SP             SIM                    SIA-AM         SIA-PA         CNES-EQ      CNES-PF
+====================== =========================== ================== ====================== ============== ============== ============ ============
+Primary CID            ``DIAG_PRINC``              ``SP_CIDPRI``      ``CAUSABAS``           ``AP_CIDPRI``  ``PA_CIDPRI``  —            —
+Secondary CID          ``DIAG_SECUN``              —                  ``LINHAA``–``LINHAD``  ``AP_CIDSEC``  ``PA_CIDSEC``  —            —
+Associated CID         ``CID_ASSO``                —                  ``LINHAII``            ``AP_CIDCAS``  ``PA_CIDCAS``  —            —
+Sex                    ``SEXO``                    —                  ``SEXO``               ``AP_SEXO``    ``PA_SEXO``    —            —
+Age                    ``COD_IDADE``\ +\ ``IDADE`` —                  ``IDADE`` (prefixed)   ``AP_NUIDADE`` ``PA_IDADE``   —            —
+Residence municipality ``MUNIC_RES``               —                  ``CODMUNRES``          ``AP_MUNPCN``  ``PA_MUNPCN``  —            —
+Facility (CNES)        ``CNES``                    ``SP_CNES``        —                      ``AP_CODUNI``  ``PA_CODUNI``  ``CNES``     ``CNES``
+Procedure              ``PROC_REA``                ``SP_PROCREA``     —                      ``AP_PRIPAL``  ``PA_PROC_ID`` —            —
+Amount (R$)            ``VAL_TOT``                 ``SP_VALATO``      —                      ``AP_VL_AP``   ``PA_VALAPR``  —            —
+Race/colour            ``RACA_COR``                —                  ``RACACOR``            ``AP_RACACOR`` ``PA_RACACOR`` —            —
+Date/competence        ``DT_INTER`` (YYYYMMDD)     —                  ``DTOBITO`` (DDMMYYYY) competence     competence     ``COMPETEN`` ``COMPETEN``
+Record number          ``N_AIH``                   ``SP_NAIH`` (→ RD) —                      —              —              —            —
+Occupation (CBO)       ``CBOR``                    ``SP_PF_CBO``      —                      —              ``PA_CBOCOD``  —            ``CBO``
+====================== =========================== ================== ====================== ============== ============== ============ ============
+
+.. warning::
+
+   **Date formats are inverted between systems.** SIH uses ``YYYYMMDD``; SIM uses ``DDMMYYYY``. Swapping them produces silently wrong dates (``20230115`` vs ``15012023``).
+
+The SIH carries diagnoses across **up to 15 columns** (``DIAG_PRINC``, ``DIAG_SECUN``, ``DIAGSEC1``–``DIAGSEC9``, ``CID_ASSO``, ``CID_MORTE``, ``CID_NOTIF``) — analyses that read only ``DIAG_PRINC``
+miss most of the clinical picture.
+
+.. _10-sex-encoding-three-different-maps:
+
+10. Sex encoding: three different maps
+--------------------------------------
+
+.. code:: python
+
+   SEXO_MAP_SIH = {"1": "M", "3": "F", "0": "I", "9": "I"}  # SIH: Female = 3 (!)
+   SEXO_MAP_SIM = {"1": "M", "2": "F", "0": "I", "9": "I"}  # SIM: standard numeric
+   SEXO_MAP_SIA = {"M": "M", "F": "F", "I": "I"}            # SIA: already letters
+
+.. warning::
+
+   In SIH, **Female is ``3``, not ``2``** — and ``2`` does not exist. Reusing the SIM map for SIH drops ~half of the female admissions to "unknown".
+
+.. _11-age-encoding-three-different-schemes:
+
+11. Age encoding: three different schemes
+-----------------------------------------
+
+-  **SIH — two fields.** ``COD_IDADE`` (2 = hours, 3 = months, 4 = years, 5 = ≥100, 0 = ignored) + ``IDADE`` (the value). Ignoring ``COD_IDADE`` turns a 6-month-old (``COD_IDADE=3, IDADE=6``) into a
+   6-year-old.
+-  **SIM — single prefixed field.** ``IDADE`` packs the code as the first digit: ``"4065"`` → 65 years, ``"3006"`` → 6 months, ``"501"`` → 101 years.
+-  **SIA — plain numeric.** ``AP_NUIDADE`` / ``PA_IDADE`` are already in years (may be NaN).
+
+Cap implausible values (e.g. 120 years) to catch corruption.
+
+.. _12-cid-10-codes-validity-multicausal-fields-prefix-matching:
+
+12. CID-10 codes: validity, multicausal fields, prefix matching
+---------------------------------------------------------------
+
+-  **~31% garbage.** Across millions of records, a large share of distinct CID values are invalid: ``0000``, empty, binary bytes, no leading letter, trailing letters. Validate before aggregating — a
+   valid CID-10 is one uppercase letter + 2–3 digits: ``^[A-Z]\d{2,3}$``. Counting distinct CIDs without this inflates the count by ~⅓.
+-  **SIM death certificates are multicausal.** Causes span ``LINHAA``–``LINHAD`` (causal chain) plus ``LINHAII`` (contributing conditions); ``CAUSABAS`` is only the single underlying cause picked by
+   the mortality rules. Analyses limited to ``CAUSABAS`` miss conditions recorded as associated/contributing. Moreover, **one field can hold several space-separated CIDs** (``LINHAB = "G200 F032"``),
+   so these fields must be tokenized with ``split()``.
+-  **Prefix vs exact matching.** A 3-char CID (``G40``) should match ``G400/G401/…``, but a 4-char CID must match exactly: matching ``E104`` by the ``E10`` prefix wrongly captures ``E109``. Only
+   3-char codes may be prefix-matched.
+
+.. _13-municipality-codes-6-vs-7-digits:
+
+13. Municipality codes: 6 vs 7 digits
+-------------------------------------
+
+IBGE municipality codes have **7 digits** (last is a check digit); DATASUS stores the **first 6**. A direct join to IBGE tables returns zero rows.
+
+::
+
+   IBGE:    3550308  (São Paulo, 7 digits)
+   DATASUS: 355030   (São Paulo, 6 digits)
+
+Index your municipality lookup by both the full 7-digit code and its 6-digit prefix. The **first 2 digits are the state (UF)** code (``35`` → São Paulo), which is handy for deriving UF without a
+separate field.
+
+.. _14-missing-data-sentinels:
+
+14. Missing-data sentinels
+--------------------------
+
+DATASUS rarely uses NULL; empty values are encoded as sentinels:
+
+============ ==================================
+Field        "Null" sentinel
+============ ==================================
+CEP          ``00000000``
+CNS          ``000000000000000``
+CID          ``0000``
+Generic text ``nan``, ``NAN``, ``None``, ``""``
+Numeric      ``0``, ``9`` (context-dependent)
+============ ==================================
+
+Normalize these to real NULL/NaN on ingest; otherwise counts and joins are skewed.
+
+.. _15-data-types-codes-money-plausibility:
+
+15. Data types: codes, money, plausibility
+------------------------------------------
+
+-  **Codes are text, not integers.** ``codigo_ibge``, ``cnes``, ``cep`` must be ``VARCHAR`` — storing them as INTEGER drops leading zeros (``01001000`` → ``1001000``, São Paulo CEPs become invalid).
+   The same applies to type inference in Excel/R imports.
+-  **Money is decimal, not float.** Store monetary values as ``NUMERIC``/``DECIMAL``; floats introduce ``10.50 → 10.4999…`` drift.
+-  **Cap for plausibility** to flag corruption (e.g. amounts above a sane ceiling, length-of-stay above ~10 years).
+
+.. _16-disappearing-fields-ap_tippre:
+
+16. Disappearing fields: AP_TIPPRE
+----------------------------------
+
+``AP_TIPPRE`` (provider type: public/private/philanthropic) in SIA stops being populated **from 2016 on** — every record becomes ``00``. Byte-level inspection confirms the zeros come from DATASUS, not
+from conversion. From ~2017 the replacement is ``AP_NATJUR`` / ``PA_NAT_JUR`` / ``NAT_JUR`` (legal-nature code, 4 digits, CONCLA/IBGE classification — e.g. ``1023`` State Autarchy, ``3069`` Private
+Foundation).
+
+Consequence: provider-type analysis is only viable for 2008–2015 via ``AP_TIPPRE``; 2017+ uses ``*_NATJUR`` (different granularity); **2016 is a blind year** for both.
+
+.. _17-boolean-encodings-sn-vs-01:
+
+17. Boolean encodings: S/N vs 0/1
+---------------------------------
+
+Boolean fields are not consistently encoded — sometimes not even within one system:
+
+-  SIA-AM ``AM_GESTANT``, ``AM_TRANSPL`` use **``S``/``N``** (text).
+-  SIA-AM ``AP_OBITO``, ``AP_ENCERR``, ``AP_PERMAN``, ``AP_ALTA``, ``AP_TRANSF`` use **``0``/``1``** — in the same files.
+-  SIH boolean fields use ``0``/``1``.
+
+.. warning::
+
+   Testing ``== "1"`` on ``AM_GESTANT`` yields zero positives (its values are ``S``/``N``, never ``1``). Never assume the encoding — check the actual values.
+
+.. _18-auxiliary-reference-tables-cnv-cbo-equipment:
+
+18. Auxiliary reference tables (.CNV, CBO, equipment)
+-----------------------------------------------------
+
+Several coded fields are meaningless without DATASUS auxiliary tables, which ship as fixed-width ``.CNV`` files inside ``*.zip`` bundles on the FTP (e.g. ``SIASUS/200801_/Auxiliar/TAB_SIA.zip``).
+
+-  **Indigenous ethnicity** (``AP_ETNIA`` / ``PA_ETNIA``): SIASI numeric codes for 264 ethnicities, mapped by ``ETNIA.CNV`` (name in cols 0–39, code in cols 40–50; skip the header line). Special codes
+   ``X100``/``X900`` mean "not identified"/"not informed". Only filled for self-declared indigenous patients.
+-  **CBO occupation codes** (e.g. ``225125`` = Neurologist): from the Ministry of Labour's CBO table, not the standard DATASUS dictionary. **Normalize length** — CBO appears with 4, 5 or 6 digits
+   depending on era/state; filtering without normalizing causes false negatives.
+-  **Equipment** (``TIPEQUIP`` + ``CODEQUIP``, CNES-EQ): identify equipment by the **combination** of both fields — ``CODEQUIP`` means different things under different ``TIPEQUIP``.
+
+``.CNV`` files have a proprietary fixed-width format; parse by column offsets, trim, then cast the code.
+
+.. _19-deduplication-and-natural-keys:
+
+19. Deduplication and natural keys
+----------------------------------
+
+-  **SIH:** the same AIH can recur across months (admission crossing month boundary, retroactive correction). Dedup on ``(aih_numero, competence)``.
+-  **SIA-AM / SIA-PA:** there is **no natural per-record key**. Same patient + same drug + same month can be legitimately distinct records. You cannot dedup by content — track processed files instead
+   (mark each DBC as ingested).
+-  **SIH-SP:** no natural key; dedup on ``(file, sequence_number)`` — but ``SEQUENCIA`` can be null in older files, leaving those rows un-deduplicated.
+
+.. _20-system-notes-cnes-and-sih-sp:
+
+20. System notes: CNES and SIH-SP
+---------------------------------
+
+**CNES is a monthly snapshot, not a transactional log.** Each file is one record per equipment (or professional) per facility. Notable traps:
+
+-  **``COMPETEN`` lives in the data, not the filename.** Read ``df["COMPETEN"].iloc[0]`` (e.g. ``"202301"``); you cannot reliably infer competence from the filename. If the column is missing/empty the
+   file should be skipped.
+-  **``IND_SUS`` / ``IND_NSUS`` are text ``"1"``/``"0"``.** Summing them directly concatenates strings (``"111"`` instead of ``3``); cast/compare as strings: ``(s == "1").sum()``.
+-  **Aggregate before loading** if you want municipality-level facts (e.g. group CNES-EQ by ``(CODUFMUN, TIPEQUIP, CODEQUIP)``).
+-  CNES history starts in 2005 (``200508_``), 3 years earlier than SIH/SIA.
+
+**SIH-SP details the SP component of an admission (a parent–child of SIH-RD):**
+
+-  One AIH (RD) maps to **many SP rows** (one per professional act). Never ``GROUP BY`` without accounting for the 1:N relationship.
+-  **Load RD before SP.** SP rows whose AIH is absent from the admissions table are silently dropped — order matters.
+-  ``SP_CIDPRI`` (the act's CID) can **differ** from the admission's ``DIAG_PRINC`` (e.g. a comorbidity treated during the stay) — useful for comorbidity analysis.
+-  The sum of ``SP_VALATO`` for an AIH may not equal the RD's ``VAL_SP``, due to DATASUS rounding/adjustments.
+
+.. _21-population-denominators-ibge:
+
+21. Population denominators (IBGE)
+----------------------------------
+
+Rates "per 100,000" need population by at least UF and year, but IBGE censuses run every 10 years (last: 2022) and annual estimates have gaps. IBGE publishes projections via the SIDRA API (table
+6579), but some years are missing and must be **interpolated**. Using a 2020 population for 2023 rates underestimates rates in high-growth regions — interpolate between the nearest available years.
+
+.. _22-researcher-checklist:
+
+22. Researcher checklist
+------------------------
+
+**Data preparation**
+
+-  ☐ Read DBF/DBC as **latin-1**, never utf-8; strip NUL bytes.
+-  ☐ Validate CIDs with ``^[A-Z]\d{2,3}$`` before aggregating.
+-  ☐ Use the **correct sex map** for the system (SIH: Female = 3).
+-  ☐ Decode age correctly (SIH two-field; SIM prefixed; SIA plain).
+-  ☐ Decide 6- vs 7-digit municipality codes and index both.
+-  ☐ Treat sentinels (``00000000``, ``0000``, …) as NULL.
+-  ☐ Parse the right date format per system (YYYYMMDD vs DDMMYYYY).
+-  ☐ Store codes as text (leading zeros), money as decimal.
+
+**Volume & performance**
+
+-  ☐ Never load a full SIA-PA file into memory — pre-filter at the byte level.
+-  ☐ Isolate DBC decompression in a subprocess (SIGSEGV protection).
+-  ☐ Retry FTP transfers with reconnection; one ``LIST`` per session.
+-  ☐ Re-download a trailing window of recent months (retroactive revisions).
+
+**Evolution & compatibility**
+
+-  ☐ Test against the **oldest and newest** files before a full run.
+-  ☐ Verify column names against the **real DBC header**, not third-party docs.
+-  ☐ If a field is 100% null, suspect a wrong column name first.
+-  ☐ ``AP_TIPPRE`` only to 2015; use ``*_NATJUR`` from 2017 (2016 is blind).
+-  ☐ Check boolean encoding per field (S/N vs 0/1).
+-  ☐ Decode coded fields with the right ``.CNV`` auxiliary table; normalize CBO length.
+
+**Analysis**
+
+-  ☐ Dedup SIH on ``(aih_numero, competence)``; SIA has no natural key.
+-  ☐ For SIM, read ``LINHAA``–``LINHAD`` and ``LINHAII``, not only ``CAUSABAS``; tokenize multi-CID fields.
+-  ☐ Check IBGE population availability for the year; interpolate gaps.
+
+--------------
+
+*Derived from real ETL experience over SIH, SIH-SP, SIM, SIA-AM, SIA-PA, CNES-EQ and CNES-PF for all 27 Brazilian states (2008–2025; CNES from 2005). Contributed to PySUS as practical guidance for the
+community.*