XSStrike is probably the most advanced XSS scanner among today's open-source projects, and reading through such an excellent project offers plenty of valuable lessons in both automated XSS detection techniques and project development.
XSStrike Overview
XSStrike mainly targets reflected XSS and DOM XSS. It does not drive an automated browser; every request goes through the requests library. That means it cannot test JavaScript-rendered pages, and its DOM XSS analysis is carried out entirely through semantic analysis.
XSStrike is organized into a Scan mode (testing a given URL directly), a Crawling mode, a Fuzz mode, and a Bruteforcer mode; together with a handful of auxiliary features, these cover XSS scanning well.
Of these, Crawling mode crawls links out of the target page and tests each of them. This mode exercises all of XSStrike's core functionality, which is why I chose it as the first part to analyze.
Crawling the Links to Test
This functionality lives in core/photon.py. The agreed-upon return value is a list of form records. To make the description easier, here is the shape of a form expressed as a Golang struct (static typing is a joy):
type form struct {
action Url
method string
inputs httpParams
}
For a given target, the target itself is first parsed into the form format:
def rec(target):
... ...
url = getUrl(target, True)
params = getParams(target, '', True)
if '=' in target: # if there's a = in the url, there should be GET parameters
inps = []
for name, value in params.items():
inps.append({'name': name, 'value': value})
forms.append({0: {'action': url, 'method': 'get', 'inputs': inps}})
It then fetches the target's page content and parses the <form> tags in it to pull out each form:
response = requester(url, params, headers, True, delay, timeout).text
forms.append(zetanize(response))
def zetanize(response):
... ...
forms = {}
matches = re.findall(r'(?i)(?s)<form.*?</form.*?>',
response) # extract all the forms
num = 0
for match in matches: # everything else is self explanatory if you know regex
page = re.search(r'(?i)action=[\'"](.*?)[\'"]', match)
method = re.search(r'(?i)method=[\'"](.*?)[\'"]', match)
forms[num] = {}
forms[num]['action'] = d(e(page.group(1))) if page else ''
forms[num]['method'] = d(
e(method.group(1)).lower()) if method else 'get'
forms[num]['inputs'] = []
inputs = re.findall(r'(?i)(?s)<input.*?>', response)
for inp in inputs:
inpName = re.search(r'(?i)name=[\'"](.*?)[\'"]', inp)
if inpName:
inpType = re.search(r'(?i)type=[\'"](.*?)[\'"]', inp)
inpValue = re.search(r'(?i)value=[\'"](.*?)[\'"]', inp)
inpName = d(e(inpName.group(1)))
inpType = d(e(inpType.group(1)))if inpType else ''
inpValue = d(e(inpValue.group(1))) if inpValue else ''
if inpType.lower() == 'submit' and inpValue == '':
inpValue = 'Submit Query'
inpDict = {
'name': inpName,
'type': inpType,
'value': inpValue
}
forms[num]['inputs'].append(inpDict)
num += 1
return forms
Note the regular expression used here to extract the <form> tag's attributes. Take r'(?i)action=[\'"](.*?)[\'"]' as an example: it only matches the action="xxxx" or action='xxxx' forms. In practice, however, attribute values sometimes come with no surrounding quotes at all, and such cases are not handled correctly; this is one spot that could be improved.
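As a minimal sketch of a fix (my own suggestion, not XSStrike code), the pattern can accept either a quoted value or, failing that, a bare run of characters up to the next whitespace or >:

import re

# Accepts action='...', action="..." and the unquoted action=... form.
action_re = re.compile(r'(?i)action\s*=\s*(?:([\'"])(.*?)\1|([^\s>]+))')

def extract_action(form_html):
    match = action_re.search(form_html)
    if not match:
        return ''
    # group(2) holds a quoted value, group(3) an unquoted one
    return match.group(2) if match.group(1) else match.group(3)

print(extract_action('<form action=/search method=get>'))  # /search
print(extract_action('<form action="/post">'))             # /post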
After that, the <a> tags in the page content are matched, and the links that survive the static-file-extension filter are fed back into the rec function to crawl the next layer of targets. (The regex here suffers from the same problem described above.)
matches = re.findall(r'<[aA].*href=["\']{0,1}(.*?)["\']', response)
for link in matches: # iterate over the matches
# remove everything after a "#" to deal with in-page anchors
link = link.split('#')[0]
if link.endswith(('.pdf', '.png', '.jpg', '.jpeg', '.xls', '.xml', '.docx', '.doc')):
pass
else:
... ...
And with that, the links to be tested have been fully collected.
PS: the author has extracted both features used above, crawling the page's test links and parsing <form> tags, into standalone Python libraries: Photon and Zetanize.
Running the Tests
This part lives in modes/crawl.py. The main logic is:
- iterate over the parameters, locating each reflection point and recording its context
- based on the context, pick the special characters that could lead to XSS and check whether they are filtered
- generate the payloads
Finding the Reflection Points
It iterates over every parameter of the target, sets the parameter's value to a special marker string, and sends the request.
def crawl(scheme, host, main_url, form, blindXSS, blindPayload, headers, delay, timeout, encoding):
if form:
for each in form.values():
url = each['action']
if url:
... ...
if url not in core.config.globalVariables['checkedForms']:
core.config.globalVariables['checkedForms'][url] = []
method = each['method']
GET = True if method == 'get' else False
inputs = each['inputs']
paramData = {}
for one in inputs:
paramData[one['name']] = one['value']
for paramName in paramData.keys():
if paramName not in core.config.globalVariables['checkedForms'][url]:
core.config.globalVariables['checkedForms'][url].append(paramName)
paramsCopy = copy.deepcopy(paramData)
paramsCopy[paramName] = xsschecker
response = requester(url, paramsCopy, headers, GET, delay, timeout)
The response is then fed into htmlParser for analysis, which counts the reflection points via the special marker string and strips out the HTML comments (remember this operation):
occurences = htmlParser(response, encoding)
def htmlParser(response, encoding):
rawResponse = response # raw response returned by requests
response = response.text # response content
if encoding: # if the user has specified an encoding, encode the probe in that
response = response.replace(encoding(xsschecker), xsschecker)
reflections = response.count(xsschecker)
position_and_context = {}
environment_details = {}
clean_response = re.sub(r'<!--[.\s\S]*?-->', '', response)
It first looks for reflection points inside <script></script>, checking whether each reflection is wrapped in quotes:
script_checkable = clean_response
for script in extractScripts(script_checkable):
occurences = re.finditer(r'(%s.*?)$' % xsschecker, script)
if occurences:
for occurence in occurences:
thisPosition = occurence.start(1)
position_and_context[thisPosition] = 'script'
environment_details[thisPosition] = {}
environment_details[thisPosition]['details'] = {'quote' : ''}
for i in range(len(occurence.group())):
currentChar = occurence.group()[i]
if currentChar in ('/', '\'', '`', '"') and not escaped(i, occurence.group()):
environment_details[thisPosition]['details']['quote'] = currentChar
elif currentChar in (')', ']', '}', '}') and not escaped(i, occurence.group()):
break
script_checkable = script_checkable.replace(xsschecker, '', 1)
It then looks for reflection points inside tags, working out whether each reflection lands in an attribute name, an attribute value, or a flag (a valueless attribute):
if len(position_and_context) < reflections:
attribute_context = re.finditer(r'<[^>]*?(%s)[^>]*?>' % xsschecker, clean_response)
for occurence in attribute_context:
match = occurence.group(0)
thisPosition = occurence.start(1)
parts = re.split(r'\s', match)
tag = parts[0][1:]
for part in parts:
if xsschecker in part:
Type, quote, name, value = '', '', '', ''
if '=' in part:
quote = re.search(r'=([\'`"])?', part).group(1)
name_and_value = part.split('=')[0], '='.join(part.split('=')[1:])
if xsschecker == name_and_value[0]:
Type = 'name'
else:
Type = 'value'
name = name_and_value[0]
value = name_and_value[1].rstrip('>').rstrip(quote).lstrip(quote)
else:
Type = 'flag'
position_and_context[thisPosition] = 'attribute'
environment_details[thisPosition] = {}
environment_details[thisPosition]['details'] = {'tag' : tag, 'type' : Type, 'quote' : quote, 'value' : value, 'name' : name}
Finally, it looks for reflection points that sit outside any tag or inside comment markers:
if len(position_and_context) < reflections:
html_context = re.finditer(xsschecker, clean_response)
for occurence in html_context:
thisPosition = occurence.start()
if thisPosition not in position_and_context:
position_and_context[occurence.start()] = 'html'
environment_details[thisPosition] = {}
environment_details[thisPosition]['details'] = {}
if len(position_and_context) < reflections:
comment_context = re.finditer(r'<!--[\s\S]*?(%s)[\s\S]*?-->' % xsschecker, response)
for occurence in comment_context:
thisPosition = occurence.start(1)
position_and_context[thisPosition] = 'comment'
environment_details[thisPosition] = {}
environment_details[thisPosition]['details'] = {}
This part has a few problems of its own:
- As noted above, attribute values do not necessarily come wrapped in quotes.
- The reference frame of thisPosition shifts. Notice that when the reflection point is inside a <script> tag, thisPosition is its offset within that script's own text, whereas when the reflection point is inside a tag, thisPosition is its offset within the whole clean_response string. thisPosition values can therefore collide, overwriting records in position_and_context and environment_details; it also means that when the reflections outside tags are tallied, the ones wrapped in <script> get counted a second time, which can cause reflection points inside comments to be ignored, on top of the wasted work.
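A minimal sketch of such a collision (a hypothetical page, not XSStrike code; xsschecker is XSStrike's marker string 'v3dm0s'):

xsschecker = 'v3dm0s'

# For script context, the offset is taken inside the script body ...
script_body = 'var x="v3dm0s";'
script_offset = script_body.find(xsschecker)       # 7

# ... but for attribute context, the offset is taken in clean_response.
clean_response = '<a bb="v3dm0s"><script>' + script_body + '</script>'
attr_offset = clean_response.find(xsschecker)      # 7 as well

position_and_context = {script_offset: 'script'}
position_and_context[attr_offset] = 'attribute'    # overwrites the script record
print(position_and_context)                        # {7: 'attribute'}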
Of course, not every reflection point can be exploited directly, so the next step marks the reflection points that are wrapped inside certain special tags (badTag):
bad_contexts = re.finditer(r'(?s)(?i)<(style|template|textarea|title|noembed|noscript)>[.\s\S]*(%s)[.\s\S]*</\1>' % xsschecker, response)
non_executable_contexts = []
for each in bad_contexts:
non_executable_contexts.append([each.start(), each.end(), each.group(1)])
if non_executable_contexts:
for key in database.keys():
position = database[key]['position']
badTag = isBadContext(position, non_executable_contexts)
if badTag:
database[key]['details']['badTag'] = badTag
else:
database[key]['details']['badTag'] = ''
The final database is the dictionary holding all the reflection-point information we need.
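For reference, a single entry of that dictionary looks roughly like this (a hypothetical reflection in an attribute value; the field names follow the code above):

occurence = {
    'position': 7,                   # offset of the reflection
    'context': 'attribute',          # script / attribute / html / comment
    'details': {
        'tag': 'a', 'type': 'value', 'quote': '"',
        'name': 'href', 'value': 'v3dm0s', 'badTag': ''
    }
}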
Checking How the Reflection Points Are Filtered
We have already captured each reflection point's context in detail: whether it is wrapped inside a <script> tag, whether it sits inside a tag, whether it lands in an attribute name or an attribute value, and whether it is wrapped in single quotes, double quotes, or backticks. Following the usual manual XSS-hunting approach, the next step is obviously to work out, context by context, whether each reflection can break out and trigger XSS.
This is the job of filterChecker:
efficiencies = filterChecker(url, paramsCopy, headers, GET, delay, occurences, timeout, encoding)
First, the characters needed for an escape are chosen according to the context:
def filterChecker(url, params, headers, GET, delay, occurences, timeout, encoding):
positions = occurences.keys()
sortedEfficiencies = {}
# adding < > to environments anyway because they can be used in all contexts
environments = set(['<', '>'])
for i in range(len(positions)):
sortedEfficiencies[i] = {}
for i in occurences:
occurences[i]['score'] = {}
context = occurences[i]['context']
if context == 'comment':
environments.add('-->')
elif context == 'script':
environments.add(occurences[i]['details']['quote'])
environments.add('</scRipT/>')
elif context == 'attribute':
if occurences[i]['details']['type'] == 'value':
if occurences[i]['details']['name'] == 'srcdoc': # srcdoc attribute accepts html data with html entity encoding
environments.add('<') # so let's add the html entity
environments.add('>') # encoded versions of < and >
if occurences[i]['details']['quote']:
environments.add(occurences[i]['details']['quote'])
These are then passed to checker to test whether they get filtered:
for environment in environments:
if environment:
efficiencies = checker(url, params, headers, GET, delay, environment, positions, timeout, encoding)
efficiencies.extend([0] * (len(occurences) - len(efficiencies)))
for occurence, efficiency in zip(occurences, efficiencies):
occurences[occurence]['score'][environment] = efficiency
return occurences
Let's first think for ourselves about what checker ought to do. Naturally, it should send each special character as the parameter value and check whether the response contains an unescaped reflection of it; and to make the reflections easy to locate, the character under test should be wrapped in distinctive markers.
That seems like simple logic, yet checker's actual implementation is puzzling, and it harbors a fatal bug.
Look first at checker's return value, list(filter(None, efficiencies)). It returns a list of likelihoods that the special character survived unfiltered, with the zero-probability entries filtered out.
Reading this together with the part of filterChecker that consumes checker's return value, the crucial requirement is that the entries of list(filter(None, efficiencies)) be mapped back onto the right reflection points:
for occurence, efficiency in zip(occurences, efficiencies):
occurences[occurence]['score'][environment] = efficiency
Achieving that mapping, however, runs into problems. For instance, if the test string gets blocked in a way that wipes out an entire reflection point (rather than merely filtering the character), one reflection point disappears and one efficiency naturally goes missing with it; returning the list as-is at that point would throw the reflection-to-efficiency correspondence out of alignment. The fillHoles function exists to solve exactly this:
def fillHoles(original, new):
filler = 0
filled = []
for x, y in zip(original, new):
if int(x) == (y + filler):
filled.append(y)
else:
filled.extend([0, y])
filler += (int(x) - y)
return filled
def checker(url, params, headers, GET, delay, payload, positions, timeout, encoding):
checkString = 'st4r7s' + payload + '3nd'
response = requester(url, replaceValue(params, xsschecker, checkString, copy.deepcopy), headers, GET, delay, timeout).text.lower()
reflectedPositions = []
for match in re.finditer('st4r7s', response):
reflectedPositions.append(match.start())
filledPositions = fillHoles(positions, reflectedPositions)
As you can see, to deal with this, fillHoles takes the original reflection positions as a parameter and compares them one by one against the newly observed positions: if newPosition < oldPosition, a reflection point must have gone missing ahead of this new one, so a 0 is padded in.
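A quick illustration of the intended behavior, reusing the fillHoles definition above (the offsets are made up): the probe originally reflected at offsets 10, 50 and 90, but the second reflection was swallowed entirely, so only two offsets come back and a 0 is padded in for the missing one.

original = [10, 50, 90]            # offsets of all three reflections
new = [10, 90]                     # the second reflection disappeared
print(fillHoles(original, new))    # [10, 0, 90]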
But here a fatal problem appears, similar to the one described earlier: the positions are measured against different reference strings. Every oldPosition is the reflection's offset within clean_response (the response with its HTML comments stripped out), while every newPosition is an offset within the raw response, so almost every oldPosition falls out of step with its newPosition (the comment-stripped offsets are generally the smaller ones). And fixing this takes more than accounting for the stripped comments; plenty of other factors influence these positions, which I won't go into here.
Let's keep reading:
num = 0
efficiencies = []
for position in filledPositions:
allEfficiencies = []
try:
reflected = response[reflectedPositions[num]
:reflectedPositions[num]+len(checkString)]
efficiency = fuzz.partial_ratio(reflected, checkString.lower())
allEfficiencies.append(efficiency)
except IndexError:
pass
if position:
reflected = response[position:position+len(checkString)]
if encoding:
checkString = encoding(checkString.lower())
efficiency = fuzz.partial_ratio(reflected, checkString)
if reflected[:-2] == ('\\%s' % checkString.replace('st4r7s', '').replace('3nd', '')):
efficiency = 90
allEfficiencies.append(efficiency)
efficiencies.append(max(allEfficiencies))
else:
efficiencies.append(0)
num += 1
return list(filter(None, efficiencies))
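The efficiency score comes from fuzzywuzzy's fuzz.partial_ratio, a 0-100 fuzzy-match score between the reflected slice and the probe string: 100 means the marker-wrapped character came back intact, while an encoded or mangled reflection scores lower. A tiny illustration (assuming the fuzzywuzzy package; the exact score in the encoded case is not important):

from fuzzywuzzy import fuzz

checkString = 'st4r7s<3nd'
print(fuzz.partial_ratio('st4r7s<3nd', checkString))     # 100: reflected intact
print(fuzz.partial_ratio('st4r7s&lt;3nd', checkString))  # below 100: '<' was encoded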
The key is that last line: the return indiscriminately filters out the falsy values... Combined with the problems above, this has a fatal consequence: when filterChecker finally pairs the reflection points with the efficiencies, the mapping gets scrambled, and reflection points that cannot actually be escaped are mistaken for ones that can.
The cause and effect here span a lot of code, and it may be hard to grasp from this description alone, so I am providing a DEMO that can be used to verify this bug.
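A minimal sketch of how the mapping gets scrambled (my own reconstruction, heavily simplified from the real code paths): the first reflection's '<' was filtered (efficiency 0) and is dropped by filter(None, ...), so the second reflection's score slides over to the first.

# two reflection points, in the order htmlParser recorded them
occurences = {7: {'score': {}}, 42: {'score': {}}}

# checker computed [0, 100] but returns list(filter(None, ...))
efficiencies = list(filter(None, [0, 100]))               # [100]

# filterChecker pads the list and zips, as in its code above
efficiencies.extend([0] * (len(occurences) - len(efficiencies)))
for occurence, efficiency in zip(occurences, efficiencies):
    occurences[occurence]['score']['<'] = efficiency

# position 7 (actually filtered) now scores 100; position 42 (intact) scores 0
print(occurences)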
Generating the Payloads
Setting aside the problems above, let's continue. We now have every reflection point and know whether the characters each one needs for its escape are filtered, so we can naturally move on to generating payloads.
vectors = generator(occurences, response.text)
The idea in this part is fairly simple, so I won't go through all of the code; let's take the case where the reflection lands outside any tag as the example:
def generator(occurences, response):
scripts = extractScripts(response)
index = 0
vectors = {11: set(), 10: set(), 9: set(), 8: set(), 7: set(),6: set(), 5: set(), 4: set(), 3: set(), 2: set(), 1: set()}
for i in occurences:
context = occurences[i]['context']
if context == 'html':
lessBracketEfficiency = occurences[i]['score']['<']
greatBracketEfficiency = occurences[i]['score']['>']
ends = ['//']
badTag = occurences[i]['details']['badTag'] if 'badTag' in occurences[i]['details'] else ''
if greatBracketEfficiency == 100:
ends.append('>')
if lessBracketEfficiency:
payloads = genGen(fillings, eFillings, lFillings,
eventHandlers, tags, functions, ends, badTag)
for payload in payloads:
vectors[10].add(payload)
elif context == 'attribute':
... ...
elif context == 'comment':
... ...
elif context == 'script':
... ...
return vectors
The function to focus on is genGen, the one that actually assembles the payloads. It takes the global variables below and, combining them with the characters this particular reflection point needs for its escape, concatenates them into payloads:
tags = ('html', 'd3v', 'a', 'details') # HTML Tags
# "Things" that can be used between js functions and breakers e.g. '};alert()//
jFillings = (';')
# "Things" that can be used before > e.g. <tag attr=value%0dx>
lFillings = ('', '%0dx')
# "Things" to use between event handler and = or between function and =
eFillings = ('%09', '%0a', '%0d', '+')
fillings = ('%09', '%0a', '%0d', '/+/') # "Things" to use instead of space
eventHandlers = { # Event handlers and the tags compatible with them
'ontoggle': ['details'],
'onpointerenter': ['d3v', 'details', 'html', 'a'],
'onmouseover': ['a', 'html', 'd3v']
}
functions = ( # JavaScript functions to get a popup
'[8].find(confirm)', 'confirm()',
'(confirm)()', 'co\u006efir\u006d()',
'(prompt)``', 'a=prompt,a()')
def genGen(fillings, eFillings, lFillings, eventHandlers, tags, functions, ends, badTag=None):
vectors = []
r = randomUpper # randomUpper randomly converts chars of a string to uppercase
for tag in tags:
if tag == 'd3v' or tag == 'a':
bait = xsschecker
else:
bait = ''
for eventHandler in eventHandlers:
# if the tag is compatible with the event handler
if tag in eventHandlers[eventHandler]:
for function in functions:
for filling in fillings:
for eFilling in eFillings:
for lFilling in lFillings:
for end in ends:
if tag == 'd3v' or tag == 'a':
if '>' in ends:
end = '>' # we can't use // as > with "a" or "d3v" tag
breaker = ''
if badTag:
breaker = '</' + r(badTag) + '>'
vector = breaker + '<' + r(tag) + filling + r(
eventHandler) + eFilling + '=' + eFilling + function + lFilling + end + bait
vectors.append(vector)
return vectors
Two details are worth calling out here: one is the handling of badTag, the other is the structure of vectors. Payloads are divided into 11 priority levels, from high to low; the higher the level, the more effective the payload (for example, a payload that needs no user interaction ranks higher). Since the vectors dict is built from key 11 down to 1 and Python dicts (3.7+) preserve insertion order, iterating vectors.items() naturally reports the higher-level payloads first:
if vectors:
for confidence, vects in vectors.items():
try:
payload = list(vects)[0]
logger.vuln('Vulnerable webpage: %s%s%s' %
(green, url, end))
logger.vuln('Vector for %s%s%s: %s' %
(green, paramName, end, payload))
break
except IndexError:
pass
Summary
This basically wraps up the analysis of Crawling mode, and with it the core principles of how XSStrike scans for and uncovers vulnerabilities. What remains is the functionality of the other modes, along with some auxiliary modules (such as WafDetect), which I will cover later.
As you can see, the principle behind XSStrike's XSS detection is actually quite simple; you could say it translates our normal manual XSS-hunting process directly into code. Compared with the flashy implementations of other scanners, it is simpler, yet more effective. (One more aside: XSStrike's code quality is genuinely high, with a clear project structure and clean, concise code, far better than that of quite a few other projects...)
That said, XSStrike clearly still has some bugs and room for improvement. Admittedly these are hard problems for the whole field; I have thought about it for a long time without coming up with an effective replacement for fillHoles, and I look forward to discussing it with anyone who has ideas.
PS: having written this far, I find I am still not very good at writing technical articles, let alone source-code notes like these. I keep wanting to describe my understanding in my own words, only to feel the description never comes out clearly, which just makes everything long-winded. More practice, I suppose~