Update dependency chardet to v6 #68

Merged
timatlee merged 1 commit from renovate/chardet-6.x into main 2026-02-22 11:05:35 -07:00
Collaborator

This PR contains the following updates:

Package | Update | Change
chardet | major  | ==5.2.0 → ==6.0.0.post1

Release Notes

chardet/chardet (chardet)

v6.0.0

Compare Source

Features
  • Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case Latin1Prober and MacRomanProber heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
  • 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
  • EncodingEra filtering: The new encoding_era parameter to detect filters candidates by an EncodingEra flag enum (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL), allowing callers to restrict detection to encodings from a specific era. detect() and detect_all() default to MODERN_WEB, which should drastically improve accuracy for users who are not working with legacy data. The tiers are:
    • MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
    • LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
    • LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
    • LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
    • DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
    • MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
  • --encoding-era CLI flag: The chardetect CLI now accepts -e/--encoding-era to control which encoding eras are considered during detection.
  • max_bytes and chunk_size parameters: detect(), detect_all(), and UniversalDetector now accept max_bytes (default 200KB) and chunk_size (default 64KB) parameters for controlling how much data is examined. (#​314, @​bysiber)
  • Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
  • Charset metadata registry: New chardet.metadata.charsets module provides structured metadata about all supported encodings, including their era classification and language filter.
  • should_rename_legacy now defaults intelligently: When set to None (the new default), legacy renaming is automatically enabled when encoding_era is MODERN_WEB.
  • Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.
  • EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
  • Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.
  • Python 3.12, 3.13, and 3.14 support (#​283, @​hugovk; #​311)
  • GitHub Codespace support (#​312, @​oxygen-dioxide)
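A rough sketch of how the era filter and the new should_rename_legacy default interact, using a stdlib Flag mock (the member names follow the release notes, but this class and both helper functions are illustrative, not chardet's actual code):

```python
from enum import Flag

class EncodingEra(Flag):
    """Illustrative mock of chardet's EncodingEra flag enum."""
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME

def era_allowed(charset_era: EncodingEra, requested: EncodingEra) -> bool:
    """A detector can keep only charsets whose era intersects the filter."""
    return bool(charset_era & requested)

def resolve_rename_legacy(should_rename_legacy, encoding_era) -> bool:
    """None (the new default) auto-enables legacy renaming for MODERN_WEB."""
    if should_rename_legacy is None:
        return encoding_era == EncodingEra.MODERN_WEB
    return bool(should_rename_legacy)
```

For example, a Windows-125x charset (MODERN_WEB) passes a MODERN_WEB | LEGACY_ISO filter, while a DOS code page such as CP437 does not.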
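The max_bytes / chunk_size limits can be pictured as a capped read loop (the default values below match the notes; the function itself is a hypothetical stand-in for feeding UniversalDetector, not chardet's implementation):

```python
import io

def read_for_detection(stream, max_bytes=200 * 1024, chunk_size=64 * 1024):
    """Yield chunks of at most chunk_size bytes until max_bytes are consumed,
    mirroring how a detector can cap how much of a large file it examines."""
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # end of input before hitting the cap
        remaining -= len(chunk)
        yield chunk
```

Each yielded chunk would then be passed to the detector's feed() method, stopping early once it reports a result.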
Fixes
  • Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#​268, @​nenw)
  • Fix SJIS distribution analysis: Fixed SJISDistributionAnalysis discarding the valid second-byte range >= 0x80. (#​315, @​bysiber)
  • Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a MIN_RATIO threshold alongside the existing EXPECTED_RATIO.
  • Fix get_charset crash: Resolved a crash when looking up unknown charset names.
  • Fix GB18030 char_len_table: Corrected the character length table for GB18030 multi-byte sequences.
  • Fix UTF-8 state machine: Updated to be more spec-compliant.
  • Fix detect_all() returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
  • Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.
  • Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
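Combined with the era-preference tie-breaking above, the final selection step behaves roughly like this sketch (the 0.2 threshold and the function are illustrative, not chardet's internals):

```python
def pick_encoding(candidates, utf8_ruled_out, minimum_threshold=0.2):
    """candidates: list of (encoding, confidence) pairs from active probers.
    Return the most confident candidate above the threshold; otherwise fall
    back to UTF-8 when UTF-8 has not been ruled out, else None."""
    viable = [c for c in candidates if c[1] >= minimum_threshold]
    if viable:
        return max(viable, key=lambda c: c[1])[0]
    if not utf8_ruled_out:
        return "utf-8"
    return None
```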
Breaking changes
  • Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#​283, @​hugovk)
  • Removed Latin1Prober and MacRomanProber: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by SingleByteCharSetProber with trained language models, giving better accuracy and language identification.
  • Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.
  • LanguageFilter.NONE removed: Use specific language filters or LanguageFilter.ALL instead.
  • Enum types changed: InputState, ProbingState, MachineState, SequenceLikelihood, and CharacterCategory are now IntEnum (previously plain classes or Enum). LanguageFilter values changed from hardcoded hex to auto().
  • detect() default behavior change: detect() now defaults to encoding_era=EncodingEra.MODERN_WEB and should_rename_legacy=None (auto-enabled for MODERN_WEB), whereas previously it defaulted to considering all encodings with no legacy renaming.
Misc changes
  • Switched from Poetry/setuptools to uv + hatchling: Build system modernized with hatch-vcs for version management.
  • License text updated: Updated LGPLv2.1 license text and FSF notices to use URL instead of mailing address. (#​304, #​307, @​musicinmybrain)
  • CulturaX-based model training: The create_language_model.py training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher quality bigram frequency models.
  • Language class converted to frozen dataclass: The language metadata class now uses @dataclass(frozen=True) with num_training_docs and num_training_chars fields replacing wiki_start_pages.
  • Test infrastructure: Added pytest-timeout and pytest-xdist for faster parallel test execution. Reorganized test data directories.
Contributors

Thank you to everyone who contributed to this release!

  • @dan-blanchard (Dan Blanchard)
  • @bysiber (Kadir Can Ozden)
  • @musicinmybrain (Ben Beasley)
  • @hugovk (Hugo van Kemenade)
  • @oxygen-dioxide
  • @nenw

And a special thanks to @​helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever the PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

renovate-bot added 1 commit 2026-02-21 21:00:18 -07:00
Update dependency chardet to v6
All checks were successful
Build Docker Image / build (pull_request) Successful in 1m57s
ac68a94d10
renovate-bot force-pushed renovate/chardet-6.x from ac68a94d10 to 1d4ea2436d 2026-02-22 09:00:24 -07:00 Compare
timatlee merged commit 5cf1d2e933 into main 2026-02-22 11:05:35 -07:00
Reference: timatlee/cloudflare-ddns-docker-updated#68