Update dependency chardet to v6 #68
This PR contains the following updates:
==5.2.0 → ==6.0.0.post1
Release Notes
chardet/chardet (chardet)
v6.0.0
Features
- [...] Latin1Prober and MacRomanProber heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
- EncodingEra filtering: New encoding_era parameter to detect allows filtering by an EncodingEra flag enum (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL), letting callers restrict detection to encodings from a specific era. detect() and detect_all() default to MODERN_WEB. The new MODERN_WEB default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
  - MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
  - LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
  - LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
  - LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
  - DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
  - MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
- --encoding-era CLI flag: The chardetect CLI now accepts -e/--encoding-era to control which encoding eras are considered during detection.
- max_bytes and chunk_size parameters: detect(), detect_all(), and UniversalDetector now accept max_bytes (default 200KB) and chunk_size (default 64KB) parameters for controlling how much data is examined. (#314, @bysiber)
- chardet.metadata.charsets module provides structured metadata about all supported encodings, including their era classification and language filter.
- should_rename_legacy now defaults intelligently: When set to None (the new default), legacy renaming is automatically enabled when encoding_era is MODERN_WEB.
Fixes
- SJISDistributionAnalysis discarding valid second-byte range >= 0x80. (#315, @bysiber)
- [...] MIN_RATIO threshold alongside the existing EXPECTED_RATIO.
- get_charset crash: Resolved a crash when looking up unknown charset names.
- char_len_table: Corrected the character length table for GB18030 multi-byte sequences.
- detect_all() returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
Breaking changes
- Latin1Prober and MacRomanProber: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by SingleByteCharSetProber with trained language models, giving better accuracy and language identification.
- LanguageFilter.NONE removed: Use specific language filters or LanguageFilter.ALL instead.
- InputState, ProbingState, MachineState, SequenceLikelihood, and CharacterCategory are now IntEnum (previously plain classes or Enum). LanguageFilter values changed from hardcoded hex to auto().
- detect() default behavior change: detect() now defaults to encoding_era=EncodingEra.MODERN_WEB and should_rename_legacy=None (auto-enabled for MODERN_WEB), whereas previously it defaulted to considering all encodings with no legacy renaming.
Misc changes
- [...] hatch-vcs for version management.
- create_language_model.py training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher quality bigram frequency models.
- Language class converted to frozen dataclass: The language metadata class now uses @dataclass(frozen=True) with num_training_docs and num_training_chars fields replacing wiki_start_pages.
- [...] pytest-timeout and pytest-xdist for faster parallel test execution. Reorganized test data directories.
Contributors
Thank you to everyone who contributed to this release!
And a special thanks to @helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.
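For reviewers weighing the detect() default change in the release notes above, here is a minimal sketch of how a flag enum can restrict a candidate list by era. It does not import chardet: the EncodingEra class below is a stand-in that only reuses the member names from the notes, and the CHARSET_ERA table and candidates() helper are hypothetical, built from the tier examples listed under Features.

```python
from enum import Flag, auto


class EncodingEra(Flag):
    """Stand-in for the flag enum named in the notes (not chardet's own class)."""

    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    # Assumption: ALL is modelled here as the union of every tier.
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME


# Hypothetical lookup table, taken from the tier examples in the release notes.
CHARSET_ERA = {
    "UTF-8": EncodingEra.MODERN_WEB,
    "Windows-1252": EncodingEra.MODERN_WEB,
    "ISO-8859-1": EncodingEra.LEGACY_ISO,
    "KOI8-R": EncodingEra.LEGACY_ISO,
    "MacRoman": EncodingEra.LEGACY_MAC,
    "KZ1048": EncodingEra.LEGACY_REGIONAL,
    "CP437": EncodingEra.DOS,
    "CP037": EncodingEra.MAINFRAME,
}


def candidates(era=EncodingEra.MODERN_WEB):
    """Return the charsets whose era overlaps the requested flag set."""
    return [name for name, tag in CHARSET_ERA.items() if tag & era]


if __name__ == "__main__":
    print(candidates())                                        # default tier only
    print(candidates(EncodingEra.DOS | EncodingEra.MAINFRAME)) # combined tiers
    print(candidates(EncodingEra.ALL))                         # every tier
```

In this sketch the default excludes ISO-8859-1, KOI8-R, and the other legacy entries, which is the kind of shift callers should expect after this upgrade. Going by the notes, code that depends on the 5.x behaviour would presumably need to pass encoding_era=EncodingEra.ALL and should_rename_legacy=False to detect(); the import path for EncodingEra is not stated here and should be checked against the chardet documentation.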
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Renovate Bot.
ac68a94d10 to 1d4ea2436d