fix(wikipedia): accept mobile (m.wikipedia.org) URLs by GopalGB · Pull Request #1850 · microsoft/markitdown

GopalGB · 2026-04-30T02:23:02Z

Summary

WikipediaConverter.accepts() rejected mobile Wikipedia URLs
(e.g. https://en.m.wikipedia.org/wiki/Foo) because its regex required
the language subdomain (2-3 letters) to be immediately followed by
.wikipedia.org. Mobile URLs have an extra .m segment between the
language code and .wikipedia.org, so they fell through to the
generic HtmlConverter, producing noisy navigation/footer markdown
instead of the clean main-content extraction WikipediaConverter
provides.
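
The gap can be reproduced with the pre-change pattern quoted under "What changed". This is a minimal sketch: `old_accepts` is a hypothetical helper that wraps only the regex check and does not mirror the real `accepts()` signature.

```python
import re

# Pre-change pattern, copied from the diff in this PR.
OLD_PATTERN = r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/"


def old_accepts(url: str) -> bool:
    """Hypothetical stand-in for the URL check only."""
    return re.search(OLD_PATTERN, url) is not None


# Desktop URL: the language code is directly followed by .wikipedia.org.
print(old_accepts("https://en.wikipedia.org/wiki/Foo"))    # True
# Mobile URL: the extra .m segment breaks the match.
print(old_accepts("https://en.m.wikipedia.org/wiki/Foo"))  # False
```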

What changed

One-line regex change in _wikipedia_converter.py:

- if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
+ if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/", url):

The optional (\.m)? allows mobile subdomains while preserving:

The ^https?: anchor
Rejection of non-Wikipedia hosts like wikipedia.com
Rejection of unrelated hosts like xyz.example.com

How to test

import re
pat = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"

# Should match
assert re.search(pat, "https://en.wikipedia.org/wiki/X")
assert re.search(pat, "https://en.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://de.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://fr.m.wikipedia.org/wiki/X")

# Should not match
assert not re.search(pat, "https://wikipedia.com/x")
assert not re.search(pat, "https://en.example.org/x")

Why this is small + safe

Single line of production code changed
The change is purely additive: every previously-accepted URL is still
accepted (the new group (\.m)? is optional)
No behavior change for the convert() path
No new dependencies
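
The "purely additive" claim can be spot-checked by comparing the old and new patterns over a sample of URLs. The sketch below assumes an illustrative URL list (not taken from the repository's test suite), and the helper names `regressions` and `newly_accepted` are hypothetical.

```python
import re

# Both patterns are copied from the diff in this PR.
OLD = re.compile(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/")
NEW = re.compile(r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/")

# Illustrative sample only.
SAMPLE_URLS = [
    "https://en.wikipedia.org/wiki/X",
    "http://de.wikipedia.org/wiki/X",
    "https://en.m.wikipedia.org/wiki/X",
    "https://wikipedia.com/x",
    "https://en.example.org/x",
]


def regressions(urls):
    """URLs the old pattern accepted but the new one rejects."""
    return [u for u in urls if OLD.search(u) and not NEW.search(u)]


def newly_accepted(urls):
    """URLs only the new pattern accepts."""
    return [u for u in urls if NEW.search(u) and not OLD.search(u)]


print(regressions(SAMPLE_URLS))     # []
print(newly_accepted(SAMPLE_URLS))  # ['https://en.m.wikipedia.org/wiki/X']
```

An empty `regressions` list over a sample is evidence, not proof; the stronger argument is structural, since an optional group can always match the empty string.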

Notes

This is orthogonal to PR #1723's hyphenated-language-code fix
(be-tarask, zh-classical, etc.). Mobile URLs are a separate gap
not covered by that PR's regex [a-zA-Z0-9-]+\.wikipedia.org: its
character class contains no dot, so it cannot span the .m. segment
that sits between the language code and .wikipedia.org. The two
fixes can compose if both are merged.
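
One way the two fixes might compose is sketched below. Only the host fragment of PR #1723's regex is quoted here, so the `^https?:\/\/` anchor and trailing `\/` around it are assumptions, and `COMBINED` is a hypothetical pattern rather than code from either PR.

```python
import re

# Assumed full form of PR #1723's pattern (only the host fragment is quoted above).
HYPHEN_ONLY = re.compile(r"^https?:\/\/[a-zA-Z0-9-]+\.wikipedia.org\/")

# Hypothetical composition: hyphenated language codes plus the optional .m segment.
COMBINED = re.compile(r"^https?:\/\/[a-zA-Z0-9-]+(\.m)?\.wikipedia.org\/")

# The hyphen fix alone still rejects mobile URLs.
assert not HYPHEN_ONLY.search("https://en.m.wikipedia.org/wiki/X")

# The composed pattern accepts desktop and mobile URLs for both kinds of language code.
assert COMBINED.search("https://en.m.wikipedia.org/wiki/X")
assert COMBINED.search("https://be-tarask.wikipedia.org/wiki/X")
assert COMBINED.search("https://be-tarask.m.wikipedia.org/wiki/X")

# Non-Wikipedia hosts are still rejected.
assert not COMBINED.search("https://wikipedia.com/x")
```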

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

…erter

The `accepts()` regex required the host to match `[a-zA-Z]{2,3}\.wikipedia\.org`, which rejected mobile Wikipedia URLs of the form `https://en.m.wikipedia.org/...`.

As a result, mobile Wikipedia pages fell through to the generic `HtmlConverter`, producing noisy navigation/footer markdown instead of the clean main-content extraction that `WikipediaConverter` performs.

Updated the regex to allow an optional `.m` between the language code and `.wikipedia.org`:

```
before: ^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/
after:  ^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/
```

This is the minimal, surgical change. It preserves rejection of non-Wikipedia hosts (e.g. `wikipedia.com`) and the `^https?` anchor.

How to test:

```python
import re
pat = r"^https?:\/\/[a-zA-Z]{2,3}(\.m)?\.wikipedia.org\/"
assert re.search(pat, "https://en.wikipedia.org/wiki/X")
assert re.search(pat, "https://en.m.wikipedia.org/wiki/X")
assert re.search(pat, "https://de.m.wikipedia.org/wiki/X")
assert not re.search(pat, "https://wikipedia.com/x")
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GopalGB · 2026-05-07T14:39:51Z

Friendly ping — license/cla is signed and the change is small (3-line URL parser tweak to accept m.wikipedia.org mobile URLs). Tests included. cc maintainers when cycles allow.

GopalGB wants to merge 1 commit into microsoft:main from GopalGB:fix/wikipedia-mobile-url-support
