fix(wikipedia): accept mobile (m.wikipedia.org) URLs #1850

Open
GopalGB wants to merge 1 commit into microsoft:main from GopalGB:fix/wikipedia-mobile-url-support

GopalGB commented Apr 30, 2026

Summary

WikipediaConverter.accepts() rejected mobile Wikipedia URLs
(e.g. https://en.m.wikipedia.org/wiki/Foo) because its regex required
the language subdomain (2-3 letters) to be immediately followed by
.wikipedia.org. Mobile URLs have an extra .m segment between the
language code and .wikipedia.org, so they fell through to the
generic HtmlConverter, producing noisy navigation/footer markdown
instead of the clean main-content extraction WikipediaConverter
provides.
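
The rejection can be reproduced in isolation. This is a minimal sketch, assuming the pre-change pattern is exactly the one on the removed line of the diff under "What changed"; `OLD_PATTERN` and `old_accepts` are illustrative names, not identifiers from the codebase:

```python
import re

# Pre-change pattern, copied from the removed line of the diff in this PR.
OLD_PATTERN = r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/"


def old_accepts(url: str) -> bool:
    """Hypothetical stand-in for the URL check inside accepts()."""
    return re.search(OLD_PATTERN, url) is not None


# Desktop URL: "en" is followed directly by ".wikipedia.org".
assert old_accepts("https://en.wikipedia.org/wiki/Foo")

# Mobile URL: after "en" comes ".m", so "\.wikipedia" cannot match there.
assert not old_accepts("https://en.m.wikipedia.org/wiki/Foo")
```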

What changed

One-line regex change in _wikipedia_converter.py:

- if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
+ if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/", url):

The optional (\.m)? allows mobile subdomains while preserving:

  • The ^https?: anchor
  • Rejection of non-Wikipedia hosts like wikipedia.com
  • Rejection of unrelated hosts like xyz.example.com

How to test

import re
pat = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"

# Should match
assert re.search(pat, "https://en.wikipedia.org/wiki/X")
assert re.search(pat, "https://en.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://de.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://fr.m.wikipedia.org/wiki/X")

# Should not match
assert not re.search(pat, "https://wikipedia.com/x")
assert not re.search(pat, "https://en.example.org/x")
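
A few further cases may be worth checking during review. The sketch below uses only the pattern from this PR; the URLs are made-up examples. It suggests that hyphenated language codes are still rejected (the gap the Notes section attributes to PR #1723) and that the unescaped `.` before `org` matches any character, a property the pattern already had before this change:

```python
import re

pat = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"

# Plain http and uppercase language codes are accepted.
assert re.search(pat, "http://EN.m.wikipedia.org/wiki/X")

# A host that merely starts with a Wikipedia domain is rejected,
# because "/" must follow "org" immediately.
assert not re.search(pat, "https://en.m.wikipedia.org.example.com/x")

# Hyphenated language codes are still rejected by this pattern.
assert not re.search(pat, "https://zh-classical.m.wikipedia.org/wiki/X")

# The "." before "org" is unescaped in both the old and new pattern,
# so it matches any single character, not only a literal dot.
assert re.search(pat, "https://en.wikipediaXorg/wiki/X")
```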

Why this is small + safe

  • Single line of production code changed
  • The change is purely additive: every previously-accepted URL is still
    accepted (the new group (\.m)? is optional)
  • No behavior change for the convert() path
  • No new dependencies
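
The "purely additive" claim can be spot-checked by running both patterns over the same inputs. This is an illustrative sketch rather than a proof: `OLD`, `NEW`, and `samples` are names chosen here, and the sample URLs are arbitrary.

```python
import re

OLD = r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/"
NEW = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"

# Arbitrary sample of URLs; extend as needed.
samples = [
    "https://en.wikipedia.org/wiki/X",
    "http://de.wikipedia.org/wiki/X",
    "https://en.m.wikipedia.org/wiki/X",
    "https://wikipedia.com/x",
    "https://en.example.org/x",
]

# Everything the old pattern accepted must still be accepted.
for url in samples:
    if re.search(OLD, url):
        assert re.search(NEW, url), url

# URLs accepted only by the new pattern.
newly_accepted = [
    u for u in samples if re.search(NEW, u) and not re.search(OLD, u)
]
```

In this sample, the only URL in `newly_accepted` is the mobile one.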

Notes

This is orthogonal to PR #1723's hyphenated-language-code fix
(be-tarask, zh-classical, etc.) - mobile URLs are a separate gap
not covered by that PR's regex [a-zA-Z0-9-]+\.wikipedia.org because
the .m. segment still has to fit between the language code and
.wikipedia.org. The two fixes can compose if both are merged.
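
One way the two fixes might compose is sketched below. It assumes PR #1723's host pattern is the `[a-zA-Z0-9-]+\.wikipedia.org` fragment quoted above and that the same `^https?:\/\/` prefix and trailing `\/` are kept. `COMBINED` is an illustrative name, and this exact pattern does not appear in either PR:

```python
import re

# Hypothetical combination: the language-code class quoted from PR #1723
# plus the optional ".m" segment from this PR.
COMBINED = r"^https?:\/\/[a-zA-Z0-9-]+(\.m)?\.wikipedia.org\/"

# Desktop and mobile URLs with a plain language code.
assert re.search(COMBINED, "https://en.wikipedia.org/wiki/X")
assert re.search(COMBINED, "https://en.m.wikipedia.org/wiki/X")

# Desktop and mobile URLs with a hyphenated language code.
assert re.search(COMBINED, "https://be-tarask.wikipedia.org/wiki/X")
assert re.search(COMBINED, "https://zh-classical.m.wikipedia.org/wiki/X")

# Non-Wikipedia hosts are still rejected.
assert not re.search(COMBINED, "https://wikipedia.com/x")
```

The character class `[a-zA-Z0-9-]` does not include `.`, which is why the `.m` segment needs its own optional group rather than being absorbed by the language-code part.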

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…erter

The `accepts()` regex required the host to match
`[a-zA-Z]{2,3}\.wikipedia.org`, which rejected mobile Wikipedia URLs
of the form `https://en.m.wikipedia.org/...`. As a result, mobile
Wikipedia pages fell through to the generic `HtmlConverter`, producing
noisy navigation/footer markdown instead of the clean main-content
extraction that `WikipediaConverter` performs.

Updated the regex to allow an optional `.m` between the language code
and `.wikipedia.org`:

```
before: ^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/
after:  ^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/
```

This is the minimal, surgical change. It preserves rejection of
non-Wikipedia hosts (e.g. `wikipedia.com`) and the `^https?` anchor.

How to test:

```python
import re
pat = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"
assert re.search(pat, "https://en.wikipedia.org/wiki/X")
assert re.search(pat, "https://en.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://de.m.wikipedia.org/wiki/X")
assert not re.search(pat, "https://wikipedia.com/x")
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GopalGB commented May 7, 2026

Friendly ping — license/cla is signed and the change is small (3-line URL parser tweak to accept m.wikipedia.org mobile URLs). Tests included. cc maintainers when cycles allow.
