fix(wikipedia): accept mobile (m.wikipedia.org) URLs #1850
Open
GopalGB wants to merge 1 commit into microsoft:main from
…erter
The `accepts()` regex required the host to match
`[a-zA-Z]{2,3}\.wikipedia\.org`, which rejected mobile Wikipedia URLs
of the form `https://en.m.wikipedia.org/...`. As a result, mobile
Wikipedia pages fell through to the generic `HtmlConverter`, producing
noisy navigation/footer markdown instead of the clean main-content
extraction that `WikipediaConverter` performs.
Updated the regex to allow an optional `.m` between the language code
and `.wikipedia.org`:
```
before: ^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/
after: ^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/
```
This is the minimal, surgical change. It preserves rejection of
non-Wikipedia hosts (e.g. `wikipedia.com`) and the `^https?` anchor.
How to test:
```python
import re
pat = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"
assert re.search(pat, "https://en.wikipedia.org/wiki/X")
assert re.search(pat, "https://en.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://de.m.wikipedia.org/wiki/X")
assert not re.search(pat, "https://wikipedia.com/x")
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author comment:
Friendly ping: license/CLA is signed and the change is small (3-line URL parser tweak to accept …
Summary
`WikipediaConverter.accepts()` rejected mobile Wikipedia URLs (e.g. `https://en.m.wikipedia.org/wiki/Foo`) because its regex required the language subdomain (2-3 letters) to be immediately followed by `.wikipedia.org`. Mobile URLs have an extra `.m` segment between the language code and `.wikipedia.org`, so they fell through to the generic `HtmlConverter`, producing noisy navigation/footer markdown instead of the clean main-content extraction `WikipediaConverter` provides.
What changed
One-line regex change in `_wikipedia_converter.py`. The optional `(\.m)?` allows mobile subdomains while preserving:
- the `^https?:` anchor
- rejection of `wikipedia.com`
- rejection of `xyz.example.com`
How to test
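A minimal before/after check, as a sketch: it uses only the two patterns quoted in this PR. The URL lists and the names `BEFORE`, `AFTER`, and `accepted` are illustrative assumptions, not part of the repository's test suite.

```python
import re

# Patterns as quoted in this PR (the dot in "wikipedia.org" is kept as written).
BEFORE = re.compile(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/")
AFTER = re.compile(r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/")

# Illustrative URLs chosen for this sketch.
DESKTOP = ["https://en.wikipedia.org/wiki/Foo", "http://de.wikipedia.org/wiki/Foo"]
MOBILE = ["https://en.m.wikipedia.org/wiki/Foo", "https://de.m.wikipedia.org/wiki/Foo"]
REJECTED = ["https://wikipedia.com/x", "https://xyz.example.com/wiki/Foo"]


def accepted(pattern: re.Pattern, url: str) -> bool:
    """Return True if the URL matches the given host pattern."""
    return pattern.search(url) is not None


# Desktop URLs are accepted by both patterns (no regression).
assert all(accepted(BEFORE, u) and accepted(AFTER, u) for u in DESKTOP)
# Mobile URLs are rejected before the change and accepted after it.
assert all(not accepted(BEFORE, u) and accepted(AFTER, u) for u in MOBILE)
# Non-Wikipedia hosts stay rejected.
assert not any(accepted(AFTER, u) for u in REJECTED)
```

Comparing both patterns on the same inputs checks the new behaviour and the absence of regressions in one pass.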
Why this is small + safe
- URLs that were accepted before are still accepted (the new group `(\.m)?` is optional)
- the `convert()` path is unchanged
Notes
This is orthogonal to PR #1723's hyphenated-language-code fix (`be-tarask`, `zh-classical`, etc.). Mobile URLs are a separate gap not covered by that PR's regex `[a-zA-Z0-9-]+\.wikipedia.org`, because the `.m.` segment still has to fit between the language code and `.wikipedia.org`. The two fixes can compose if both are merged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
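One way the two fixes might compose, as a sketch only: `COMBINED` below is a hypothetical pattern that joins the host class quoted from PR #1723 with this PR's optional `(\.m)?` group. The `^https?:\/\/` prefix and trailing `\/` on the #1723 pattern are assumed to match the existing converter regex; neither PR is confirmed to use this exact form.

```python
import re

# PR #1723's host class as quoted in the notes; the anchor and trailing slash are assumed.
PR_1723 = re.compile(r"^https?:\/\/[a-zA-Z0-9-]+\.wikipedia.org\/")
# Hypothetical combination: hyphenated language codes plus an optional ".m" segment.
COMBINED = re.compile(r"^https?:\/\/[a-zA-Z0-9-]+(\.m)?\.wikipedia.org\/")


def is_wikipedia_url(url: str) -> bool:
    """Return True if the URL matches the hypothetical combined pattern."""
    return COMBINED.search(url) is not None


# The #1723 pattern alone still rejects mobile hosts.
assert PR_1723.search("https://en.m.wikipedia.org/wiki/X") is None
# The combined pattern accepts hyphenated codes, mobile hosts, and both together.
assert is_wikipedia_url("https://be-tarask.wikipedia.org/wiki/X")
assert is_wikipedia_url("https://en.m.wikipedia.org/wiki/X")
assert is_wikipedia_url("https://zh-classical.m.wikipedia.org/wiki/X")
# Non-Wikipedia hosts stay rejected.
assert not is_wikipedia_url("https://wikipedia.com/x")
```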