Merged
72 changes: 72 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,72 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

mapper-csv is a set of Python tools for mapping CSV files into JSON format for loading into Senzing entity resolution software. The tools analyze CSV files, generate column statistics, and produce either JSON mapping files or standalone Python mapper modules.

## Commands

### Install Dependencies

```bash
python -m pip install --group all .
```

### Linting

```bash
pylint $(git ls-files '*.py' ':!:docs/source/*')
```

### Run CSV Analyzer (generate Python module - recommended)

```bash
python src/csv_analyzer.py -i input/file.csv -o output/analysis.csv -p mappings/file.py
```

### Run CSV Analyzer (generate mapping file)

```bash
python src/csv_analyzer.py -i input/file.csv -o output/analysis.csv -m mappings/file.map
```

### Run CSV Mapper with mapping file

```bash
python src/csv_mapper.py -i input/file.csv -m mappings/file.map -o output/file.json -l output/stats.json
```

### Run standalone Python mapper module

```bash
python mappings/file.py -i input/file.csv -o output/file.json -l output/stats.json
```

## Architecture

### Core Scripts (src/)

- **csv_analyzer.py**: Analyzes CSV files to generate column statistics (percent populated, percent unique, top 5 values). Outputs either a JSON mapping file (`-m`) or a standalone Python mapper module (`-p`) based on `python_template.py`.

- **csv_mapper.py**: Processes CSV files using either a JSON mapping file (`-m`) or a Python mapper module (`-p`) to produce Senzing-compatible JSON output. Supports calculations, filters, and attribute aggregation.

- **csv_functions.py**: Utility class providing date formatting, value cleaning, and Senzing attribute detection. Loads its configuration from `csv_functions.json`, which defines garbage values and valid Senzing attributes.

- **python_template.py**: Template used by csv_analyzer to generate standalone mapper modules. Contains the `mapper` class with methods for cleaning values, computing record hashes, formatting dates, and capturing statistics.
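
As a rough illustration of the column statistics described for `csv_analyzer.py` (percent populated, percent unique, top 5 values), the sketch below computes them with the standard library. The function name `column_stats`, the sample data, and the choice to measure uniqueness against populated values only are assumptions made for this example, not the analyzer's actual implementation.

```python
import csv
import io
from collections import Counter


def column_stats(rows, top_n=5):
    """Return percent populated, percent unique, and top values per column."""
    rows = list(rows)
    total = len(rows)
    stats = {}
    for column in (rows[0].keys() if rows else []):
        # Treat empty or whitespace-only cells as unpopulated.
        values = [(row[column] or "").strip() for row in rows]
        populated = [value for value in values if value]
        counts = Counter(populated)
        stats[column] = {
            "percent_populated": round(100.0 * len(populated) / total, 1),
            # Assumption: uniqueness is measured against populated values only.
            "percent_unique": (
                round(100.0 * len(counts) / len(populated), 1) if populated else 0.0
            ),
            "top_values": counts.most_common(top_n),
        }
    return stats


# Hypothetical sample input, used only to exercise the function.
SAMPLE = "name,type\nAcme,company\nJane Doe,person\nAcme,company\n,person\n"
stats = column_stats(csv.DictReader(io.StringIO(SAMPLE)))
print(stats["name"])
```

Statistics like these help decide which columns are worth mapping: a sparsely populated column, or one with very few distinct values, is rarely useful for matching.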

### Mapping File Structure

JSON mapping files have three sections:

- **input**: File settings (inputFileName, fieldDelimiter, columnHeaders)
- **calculations**: Python expressions to create derived columns (e.g., `{"name_org": "rowData['name'] if rowData['type'] == 'company' else ''"}`)
- **outputs**: Data source, record type, record ID, filters, and attribute mappings to Senzing attributes
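
To make the **calculations** section more concrete, here is a minimal sketch of how such expressions could be applied to one row. It assumes calculations are stored as a list of single-key dictionaries and evaluated in order with `eval`. The helper name `apply_calculations` and the second calculation (`name_full`) are hypothetical, and `csv_mapper.py` may do this differently.

```python
def apply_calculations(row_data, calculations):
    """Evaluate each calculation in order and store the result as a new column."""
    row = dict(row_data)  # work on a copy so the caller's row is untouched
    for calculation in calculations:
        for new_column, expression in calculation.items():
            # eval() executes arbitrary code: only use trusted mapping files.
            # Emptying __builtins__ limits, but does not fully sandbox, the call.
            row[new_column] = eval(expression, {"__builtins__": {}}, {"rowData": row})
    return row


# The first expression comes from the text above; the second is hypothetical.
calculations = [
    {"name_org": "rowData['name'] if rowData['type'] == 'company' else ''"},
    {"name_full": "rowData['name'] if rowData['type'] != 'company' else ''"},
]

company = apply_calculations({"name": "Acme", "type": "company"}, calculations)
person = apply_calculations({"name": "Jane Doe", "type": "person"}, calculations)
print(company["name_org"], "|", person["name_full"])
```

Because each result is written back into the row before the next expression runs, a later calculation can reference a column created by an earlier one.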

### Key Concepts

- Senzing attributes (NAME_FULL, NAME_ORG, SSN_NUMBER, DATE_OF_BIRTH, ADDR_LINE1, etc.) are used for entity resolution
- Non-Senzing attributes are preserved but not used for matching
- The `<ignore>` attribute tag excludes columns from output
- Calculations use `rowData['columnName']` syntax to reference column values
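
The concepts above can be sketched as a small mapping function. The column-to-attribute dictionary, the `map_row` helper, and the sample column names are assumptions made for illustration and do not reflect the real mapping-file schema. `DATA_SOURCE` and `RECORD_ID` are the standard top-level keys of a Senzing JSON record.

```python
import json

# Hypothetical column-to-attribute mapping, not the real mapping-file format.
COLUMN_TO_ATTRIBUTE = {
    "name": "NAME_FULL",               # Senzing attribute, used for matching
    "dob": "DATE_OF_BIRTH",            # Senzing attribute, used for matching
    "customer_tier": "customer_tier",  # non-Senzing attribute, preserved only
    "internal_notes": "<ignore>",      # excluded from the output
}


def map_row(row_data, column_to_attribute, data_source, record_id_column):
    """Build one output record from one CSV row."""
    record = {"DATA_SOURCE": data_source, "RECORD_ID": row_data[record_id_column]}
    for column, attribute in column_to_attribute.items():
        value = (row_data.get(column) or "").strip()
        # Skip ignored columns and empty values.
        if attribute == "<ignore>" or not value:
            continue
        record[attribute] = value
    return record


row = {
    "id": "1001",
    "name": "Jane Doe",
    "dob": "1980-02-03",
    "customer_tier": "gold",
    "internal_notes": "call back Tuesday",
}
record = map_row(row, COLUMN_TO_ATTRIBUTE, "CUSTOMERS", "id")
print(json.dumps(record))
```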
3 changes: 0 additions & 3 deletions .claude/commands/senzing-code-review.md

This file was deleted.

3 changes: 3 additions & 0 deletions .claude/commands/senzing.md
@@ -0,0 +1,3 @@
# Senzing

- Perform the steps specified by <https://raw.githubusercontent.com/senzing-factory/claude/refs/tags/v1/commands/senzing.md>
File renamed without changes.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,4 +1,4 @@
-# Default code owner
+# Default code owner

 * @Senzing/senzing-mappers

16 changes: 10 additions & 6 deletions .github/dependabot.yml
@@ -3,11 +3,15 @@

 version: 2
 updates:
-  - package-ecosystem: "github-actions"
-    directory: "/"
+  - package-ecosystem: github-actions
+    cooldown:
+      default-days: 21
+    directory: /
     schedule:
-      interval: "daily"
-  - package-ecosystem: "pip"
-    directory: "/"
+      interval: daily
+  - package-ecosystem: pip
+    cooldown:
+      default-days: 21
+    directory: /
     schedule:
-      interval: "daily"
+      interval: daily
2 changes: 1 addition & 1 deletion .github/workflows/add-labels-standardized.yaml
@@ -1,4 +1,4 @@
-name: add labels standardized
+name: Add labels standardized

 on:
   issues:
2 changes: 1 addition & 1 deletion .github/workflows/add-to-project-senzing-dependabot.yaml
@@ -1,4 +1,4 @@
-name: add to project senzing github organization dependabot
+name: Add to project senzing github organization dependabot

 on:
   pull_request:
2 changes: 1 addition & 1 deletion .github/workflows/add-to-project-senzing.yaml
@@ -1,4 +1,4 @@
-name: add to project senzing github organization
+name: Add to project senzing github organization

 on:
   issues:
8 changes: 4 additions & 4 deletions .github/workflows/claude-pr-review.yaml
@@ -1,13 +1,13 @@
 name: Claude PR Review

-concurrency:
-  group: ${{ github.workflow }}-${{ github.head_ref || github.ref_name }}
-  cancel-in-progress: true
-
 on:
   pull_request:
     types: [opened, synchronize]

+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.ref_name }}
+  cancel-in-progress: true
+
 permissions: {}

 jobs:
4 changes: 4 additions & 0 deletions .github/workflows/dependabot-approve-and-merge.yaml
@@ -4,6 +4,10 @@ on:
   pull_request:
     branches: [main]

+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.ref_name }}
+  cancel-in-progress: true
+
 permissions: {}

 jobs:
8 changes: 5 additions & 3 deletions .github/workflows/lint-workflows.yaml
@@ -1,11 +1,13 @@
-name: lint workflows
+name: Lint workflows

 on:
-  push:
-    branches-ignore: [main]
   pull_request:
     branches: [main]

+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.ref_name }}
+  cancel-in-progress: true
+
 permissions: {}

 jobs:
2 changes: 1 addition & 1 deletion .github/workflows/move-pr-to-done-dependabot.yaml
@@ -1,4 +1,4 @@
-name: move pr to done dependabot
+name: Move pr to done dependabot

 on:
   pull_request:
17 changes: 12 additions & 5 deletions .github/workflows/pylint.yaml
@@ -1,6 +1,12 @@
-name: pylint
+name: Pylint

-on: [push]
+on:
+  pull_request:
+    branches: [main]
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.ref_name }}
+  cancel-in-progress: true

 permissions: {}

@@ -12,8 +18,10 @@ jobs:
       contents: read
     runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
+    timeout-minutes: 10

     steps:
       - name: Checkout repository
@@ -32,8 +40,7 @@ jobs:
           source ./venv/bin/activate
           echo "PATH=${PATH}" >> "${GITHUB_ENV}"
           python -m pip install --upgrade pip
-          python -m pip install --requirement development-requirements.txt
-          python -m pip install --requirement requirements.txt
+          python -m pip install --group all .

       - name: Analysing the code with pylint
         run: |
6 changes: 5 additions & 1 deletion .github/workflows/spellcheck.yaml
@@ -1,9 +1,13 @@
-name: spellcheck
+name: Spellcheck

 on:
   pull_request:
     branches: [main]

+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.ref_name }}
+  cancel-in-progress: true
+
 permissions: {}

 jobs:
27 changes: 22 additions & 5 deletions .vscode/cspell.json
@@ -2,31 +2,48 @@
"version": "0.2",
"language": "en",
"words": [
"analysing",
"applehelp",
"autodoc",
"autodocsumm",
"bugtracker",
"CCLA",
"CODEOWNER",
"DSRC",
"ICLA",
"Senzing",
"analysing",
"cooldown",
"dateutil",
"devhelp",
"DSRC",
"elif",
"esbenp",
"htmlhelp",
"ICLA",
"isort",
"jquery",
"jsmath",
"kwargs",
"mult",
"mypy",
"newcolumn",
"probablepeople",
"psutil",
"pydev",
"pylint",
"pytest",
"pythonic",
"qthelp",
"remoteliteralinclude",
"Senzing",
"serializinghtml",
"setuptools",
"shellcheck",
"sphinxcontrib",
"sphinxext",
"stackoverflow",
"statpack",
"subrecord",
"venv"
"typehints",
"venv",
"virtualenv"
],
"ignorePaths": [
".git/**",
9 changes: 6 additions & 3 deletions CHANGELOG.md
@@ -2,9 +2,8 @@

 All notable changes to this project will be documented in this file.

-The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
-[markdownlint](https://dlaa.me/markdownlint/),
-and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+The changelog format is based on [Keep a Changelog] and [CommonMark].
+This project adheres to [Semantic Versioning].

 ## [Unreleased]

@@ -27,3 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 - Thing 2
 - Thing 1
+
+[CommonMark]: https://commonmark.org/
+[Keep a Changelog]: https://keepachangelog.com/
+[Semantic Versioning]: https://semver.org/