SOC Classification Library

Standard Occupational Classification (SOC) library, initially developed for the Survey Assist API but usable elsewhere.

Overview

A library of utilities for classifying occupations to SOC codes, based on the official ONS SOC 2020 structure and coding index.

Features

  • SOC Lookup. A utility that maps job titles to SOC classification codes using a well-known set of mappings.
  • SOC Classification. A retrieval-augmented generation (RAG) approach to SOC classification that combines input data, semantic search and an LLM.
  • SOC Rephrase. Packaged example data and SOCRephraseLookup for mapping soc_code values to respondent-friendly rephrased descriptions.

Prerequisites

Ensure you have the following installed on your local machine:

  • Python 3.12 (Recommended: use pyenv to manage versions)
  • poetry (for dependency management)
  • Colima (if running locally with containers)
  • Terraform (for infrastructure management)
  • Google Cloud SDK (gcloud) with appropriate permissions
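
As a quick sanity check of the prerequisites above, a small script along these lines could report which tools are missing from PATH. This is a sketch, not part of the repository: the executable names (poetry, colima, terraform, gcloud) are assumed to match the default names those tools install under, and the script does not check gcloud permissions.

```python
"""Hypothetical prerequisite check; not shipped with this repository."""
import shutil
import sys

# Assumed executable names for the tools listed in Prerequisites.
REQUIRED_TOOLS = ["poetry", "colima", "terraform", "gcloud"]


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_recent(minimum: tuple[int, int] = (3, 12)) -> bool:
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_recent():
        print("Python 3.12 or later is expected")
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"Missing from PATH: {tool}")
```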

Local Development Setup

The Makefile defines a set of commonly used commands and workflows. Where possible, use the commands defined in the Makefile.

Clone the repository

git clone https://github.com/ONSdigital/soc-classification-library.git
cd soc-classification-library

Install Dependencies

poetry install

Add Git Hooks

Git hooks can be used to check code before each commit. To install them, run:

pre-commit install

Run Locally

Example source for the SOC Lookup functionality is in soc_lookup_example.py. To run it:

poetry run python src/occupational_classification/lookup/soc_lookup_example.py

The library also ships with small packaged example datasets used by downstream services (e.g. survey-assist-api) for end-to-end testing:

  • SOC lookup example CSV: src/occupational_classification/data/example_soc_lookup_data.csv
  • SOC rephrase example CSV: src/occupational_classification/data/example_rephrased_soc_data.csv
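
The column layout of these files is not described here, so one way to get familiar with them is to print the header and the first few rows. The sketch below uses only the standard library; the preview_csv helper is hypothetical (not part of the library), and UTF-8 encoding is an assumption.

```python
"""Hypothetical helper for previewing a packaged example CSV."""
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path: Path, max_rows: int = 3) -> tuple[list[str], list[dict[str, str]]]:
    """Return the header names and up to `max_rows` data rows."""
    # Assumes the file is UTF-8 encoded and has a header row.
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, max_rows))
        headers = list(reader.fieldnames or [])
    return headers, rows


if __name__ == "__main__":
    # Path taken from the list above; run from the repository root.
    example = Path("src/occupational_classification/data/example_soc_lookup_data.csv")
    if example.exists():
        headers, rows = preview_csv(example)
        print(headers)
        for row in rows:
            print(row)
```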

GCP Setup

TODO

Code Quality

Code quality and static analysis will be enforced using isort, black, ruff, mypy and pylint. Security checking will be enhanced by running bandit.

To check code quality and report errors without auto-fixing them, run:

make check-python-nofix

To check code quality and automatically fix errors where possible, run:

make check-python

Documentation

Documentation is available in the docs/ folder and can be viewed using mkdocs:

make run-docs

Testing

Pytest is used for testing alongside pytest-cov for coverage testing. /tests/conftest.py defines config used by the tests.

Unit tests for utility functions are in /tests/tests_utils.py. Run them with:

make unit-tests

All tests can be run using:

make all-tests

Environment Variables

This library is designed to be consumed as a Python package and does not require any environment variables on its own. Downstream services (such as survey-assist-api) may define their own configuration around it.
