Exploiting URL Parsing Confusion Vulnerabilities

We discussed this vulnerability during Episode 111 on 17 January 2022.

Different URL parsers may handle malformed URLs differently, and those differences in behaviour can be exploited. This research paper focused on five areas where parsers disagreed on how to interpret a URL:

  1. Scheme Confusion. The scheme is missing or invalid. Some parsers simply assume an implicit http scheme, but the more interesting differences are in how they parse what follows, such as treating what would naturally be the host as part of the path (and the whole thing as a relative path).
  2. Slash Confusion. The URL has the wrong number of slashes. Browsers in particular normalize the URL and accept the extras, which can trip up other parsers into thinking they are parsing a path instead of the hostname.
  3. Backslash Confusion. This one comes from a difference of opinion between the WHATWG URL specification (which most browsers follow) and RFC 3986 (which most libraries seem to follow). In WHATWG the \ and / characters are treated as the same, but in the RFC they are different.
  4. URL-Encoded Data Confusion. Hostnames (and any part of the URL other than the scheme) can include URL-encoded data. A library making a request will usually decode it to the proper domain, but many parsing libraries return the still-encoded string to the developer, who has to be aware of the issue and decode the URL before use.
  5. Scheme Mixup. While non-HTTP URLs might look similar, their specifications can differ slightly. In http:// the special character # marks the start of the fragment section; in ldap://, however, # receives no such special treatment, so parsing a URL with the rules of the wrong scheme can lead to behaviour differences. This was the case in one of the recent log4j issues, where ldap://127.0.0.1#.evilhost.com:1389/a could bypass the LDAP host whitelist: an HTTP parser would think the host is 127.0.0.1 (allowed), but the actual lookup would respect the ldap scheme and make a request to 127.0.0.1#.evilhost.com instead.
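
Several of these disagreements are easy to reproduce with a single general-purpose parser. The sketch below uses Python's standard urllib.parse purely as an illustration; it is not one of the parser pairings examined in the paper, and the example.com URLs are made up for the demo. The naive_ldap_authority helper is hypothetical: it only stands in for a parser that gives # no special meaning, and is not how log4j or any real LDAP client extracts the host.

```python
from urllib.parse import urlsplit, unquote

# 1. Scheme confusion: with no scheme, the would-be host ends up in the path.
no_scheme = urlsplit("example.com/path")
print(repr(no_scheme.netloc), repr(no_scheme.path))

# 2. Slash confusion: one extra slash leaves the authority empty and pushes
#    the intended hostname into the path.
extra_slash = urlsplit("http:///example.com/path")
print(repr(extra_slash.netloc), repr(extra_slash.path))

# 4. URL-encoded data confusion: the parser hands back the host still
#    percent-encoded, so the caller has to decode it before comparing it.
encoded = urlsplit("http://exam%70le.com/")
raw_host = encoded.hostname
decoded_host = unquote(raw_host)
print(raw_host, "->", decoded_host)

# 5. Scheme mixup: a generic parser treats '#' as the start of a fragment,
#    so the host it reports is only the part before the '#'.
log4j_url = "ldap://127.0.0.1#.evilhost.com:1389/a"
generic_host = urlsplit(log4j_url).hostname


def naive_ldap_authority(url: str) -> str:
    """Hypothetical parser that gives '#' no special meaning.

    It takes everything between '://' and the next '/' as the authority.
    """
    rest = url.split("://", 1)[1]
    return rest.split("/", 1)[0]


print(generic_host, "vs", naive_ldap_authority(log4j_url))
```

If an allow-list check is done on generic_host while the connection is made by something behaving like naive_ldap_authority, the two components are looking at different hosts, which is the shape of the bypass described in item 5.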