Commit 41823e6

miss-islington, gpshead, serhiy-storchaka, orsenthil authored and committed
[3.9] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (pythonGH-25595) (pythonGH-25725)
* bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (pythonGH-25595)

Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
(cherry picked from commit 76cd81d)

Co-authored-by: Senthil Kumaran <skumaran@gatech.edu>

(backported to Python 2.7 by Michał Górny)

1 parent f9e5d7a commit 41823e6

4 files changed

Lines changed: 55 additions & 0 deletions

Doc/library/urlparse.rst

Lines changed: 13 additions & 0 deletions
@@ -267,6 +267,9 @@ The :mod:`urlparse` module defines the following functions:
    decomposed before parsing, or is not a Unicode string, no error will be
    raised.

+   Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
+   ``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
+
    .. versionadded:: 2.2

    .. versionchanged:: 2.5
@@ -276,6 +279,10 @@ The :mod:`urlparse` module defines the following functions:
       Characters that affect netloc parsing under NFKC normalization will
       now raise :exc:`ValueError`.

+   .. versionchanged:: 2.7.18_p9 (Gentoo)
+      ASCII newline and tab characters are stripped from the URL.
+
+.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser

 .. function:: urlunsplit(parts)

@@ -327,6 +334,10 @@ The :mod:`urlparse` module defines the following functions:

 .. seealso::

+   `WHATWG`_ - URL Living standard
+      Working Group for the URL Standard that defines URLs, domains, IP addresses, the
+      application/x-www-form-urlencoded format, and their API.
+
    :rfc:`3986` - Uniform Resource Identifiers
       This is the current standard (STD66). Any changes to urlparse module
       should conform to this. Certain deviations could be observed, which are
@@ -351,6 +362,8 @@ The :mod:`urlparse` module defines the following functions:
    :rfc:`1738` - Uniform Resource Locators (URL)
       This specifies the formal syntax and semantics of absolute URLs.

+.. _WHATWG: https://url.spec.whatwg.org/
+

 .. _urlparse-result-object:

Lib/test/test_urlparse.py

Lines changed: 29 additions & 0 deletions
@@ -543,6 +543,35 @@ def test_telurl_params(self):
         self.assertEqual(p1.params, 'phone-context=+1-914-555')


+    def test_urlsplit_remove_unsafe_bytes(self):
+        # Remove ASCII tabs and newlines from input
+        url = "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
+        p = urlparse.urlsplit(url)
+        self.assertEqual(p.scheme, "http")
+        self.assertEqual(p.netloc, "www.python.org")
+        self.assertEqual(p.path, "/javascript:alert('msg')/")
+        self.assertEqual(p.query, "")
+        self.assertEqual(p.fragment, "frag")
+        self.assertEqual(p.username, None)
+        self.assertEqual(p.password, None)
+        self.assertEqual(p.hostname, "www.python.org")
+        self.assertEqual(p.port, None)
+        self.assertEqual(p.geturl(), "http://www.python.org/javascript:alert('msg')/#frag")
+
+        # Remove ASCII tabs and newlines from input as bytes.
+        url = b"http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
+        p = urlparse.urlsplit(url)
+        self.assertEqual(p.scheme, b"http")
+        self.assertEqual(p.netloc, b"www.python.org")
+        self.assertEqual(p.path, b"/javascript:alert('msg')/")
+        self.assertEqual(p.query, b"")
+        self.assertEqual(p.fragment, b"frag")
+        self.assertEqual(p.username, None)
+        self.assertEqual(p.password, None)
+        self.assertEqual(p.hostname, b"www.python.org")
+        self.assertEqual(p.port, None)
+        self.assertEqual(p.geturl(), b"http://www.python.org/javascript:alert('msg')/#frag")
+
     def test_attributes_bad_port(self):
         """Check handling of non-integer ports."""
         p = urlparse.urlsplit("http://www.example.net:foo")

Lib/urlparse.py

Lines changed: 7 additions & 0 deletions
@@ -62,6 +62,9 @@
                 '0123456789'
                 '+-.')

+# Unsafe bytes to be removed per WHATWG spec
+_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']
+
 MAX_CACHE_SIZE = 20
 _parse_cache = {}

@@ -198,6 +201,10 @@ def urlsplit(url, scheme='', allow_fragments=True):
     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
         clear_cache()
     netloc = query = fragment = ''
+
+    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
+        url = url.replace(b, "")
+
     i = url.find(':')
     if i > 0:
         if url[:i] == 'http': # optimize the common case
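
The stripping step in this hunk can be sketched outside the module as follows. This is a minimal illustration written for Python 3, not the patched 2.7 code: `strip_unsafe_url_chars` and `UNSAFE_URL_CHARS` are hypothetical names, and the explicit bytes branch is an assumption needed because Python 3 keeps `str` and `bytes` separate.

```python
from urllib.parse import urlsplit

# Characters the commit removes before parsing: tab, carriage return, newline.
UNSAFE_URL_CHARS = ("\t", "\r", "\n")


def strip_unsafe_url_chars(url):
    """Hypothetical helper: drop ASCII tabs and newlines from a str or bytes URL."""
    for ch in UNSAFE_URL_CHARS:
        if isinstance(url, bytes):
            # bytes.replace() only accepts bytes arguments.
            url = url.replace(ch.encode("ascii"), b"")
        else:
            url = url.replace(ch, "")
    return url


# Same input as the test case added in this commit.
cleaned = strip_unsafe_url_chars(
    "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
)
parts = urlsplit(cleaned)
print(parts.path)      # /javascript:alert('msg')/
print(parts.fragment)  # frag
```

In Python 2.7 a plain string literal is already a byte string, which is presumably why the patch can use a single list of literals for both the text and the bytes test inputs.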
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+The presence of newline or tab characters in parts of a URL could allow
+some forms of attacks.
+
+Following the controlling specification for URLs defined by WHATWG
+:func:`urllib.parse` now removes ASCII newlines and tabs from URLs,
+preventing such attacks.
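
One way such an attack could look, as a hedged sketch rather than a case taken from the bug report: a filter that inspects the raw string misses a scheme split by a newline, while a consumer that follows the WHATWG parser removes the newline and sees `javascript:`. The function names and the payload below are hypothetical, and the stripping is done explicitly with `str.translate` so the result does not depend on which Python release is installed.

```python
from urllib.parse import urlsplit

# Maps tab, newline and carriage return to None so str.translate() deletes them.
_STRIP_TABLE = {ord(c): None for c in "\t\n\r"}


def naive_is_javascript(url):
    """Hypothetical filter that only looks at the leading text of the raw URL."""
    return url.lower().startswith("javascript:")


def is_javascript_after_stripping(url):
    """Same check, applied to the scheme parsed from the stripped URL."""
    return urlsplit(url.translate(_STRIP_TABLE)).scheme == "javascript"


payload = "java\nscript:alert(1)"
print(naive_is_javascript(payload))            # False
print(is_javascript_after_stripping(payload))  # True
```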
