B
o��] i � @ s� d Z ddlmZ ddlZddlZyddlmZ ddlmZ W n$ e k
r` ddl
mZmZ Y nX ddlmZ ddl
mZ dd l
mZmZ dd
l
mZmZ ye W n ek
r� eZY nX ye W n ek
r� eZY nX ye W n ek
�r eefZY nX ddd
ddddgZe�dejejB �Ze�dej�Ze�dej�j Z!e�dej�j Z"dd� Z#e�d�j$Z%e�dejejB �Z&e�'d�Z(ej'ddeid�Z)G dd
� d
e*�Z+e+� Z,e,j-Z-e�dej�e�d ej�gZ.d!d"d#d$d%d&gZ/e�d'ej�e�d(ej�e�d)�gZ0d*gZ1e.e/e0e1fd+d�Z2d,d-� Z3d.d� Z4e2j e4_ d"d!d#gZ5d/gZ6d0e5e6ed1�fd2d�Z7d3d� Z8d4d5� Z9e�d6ej�Z:d7d8� Z;dS )9zcA cleanup tool for HTML.
Removes unwanted tags and content. See the `Cleaner` class for
details.
� )�absolute_importN)�urlsplit)�unquote_plus)r r )�etree)�defs)�
fromstring�XHTML_NAMESPACE)�
xhtml_to_html�_transform_result�
clean_html�clean�Cleaner�autolink�
autolink_html�
word_break�word_break_htmlzexpression\s*\(.*?\)z
@\s*importz^data:image/.+;base64z<(?:javascript|jscript|livescript|vbscript|data|about|mocha):c C s t | �rd S t| �S )N)�_is_image_dataurl�_is_possibly_malicious_scheme)�s� r �B/opt/alt/python37/lib64/python3.7/site-packages/lxml/html/clean.py�_is_javascript_schemeN s r z[\s\x00-\x08\x0B\x0C\x0E-\x19]+z\[if[\s\n\r]+.*?][\s\n\r]*>zdescendant-or-self::*[@style]z�descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']�x)Z
namespacesc @ s� e Zd ZdZdZdZdZdZdZdZ dZ
dZdZdZ
dZdZdZdZdZdZdZdZejZdZdZddhZdd � Zed
ddd
gd
d
d
dd�Zdd� Zdd� Zdd� Z dd� Z!dd� Z"d"dd�Z#dd� Z$e%�&de%j'�j(Z)dd� Z*d d!� Z+dS )#r
a
Instances clean the document of each of the possible offending
elements. The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.
``scripts``:
Removes any ``<script>`` tags.
``javascript``:
Removes any JavaScript, like an ``onclick`` attribute. Also removes stylesheets
as they could contain JavaScript.
``comments``:
Removes any comments.
``style``:
Removes any style tags.
``inline_style``:
Removes any style attributes. Defaults to the value of the ``style`` option.
``links``:
Removes any ``<link>`` tags
``meta``:
Removes any ``<meta>`` tags
``page_structure``:
Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
``processing_instructions``:
Removes any processing instructions.
``embedded``:
Removes any embedded objects (flash, iframes)
``frames``:
Removes any frame-related tags
``forms``:
Removes any form tags
``annoying_tags``:
Tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.
``remove_tags``:
A list of tags to remove. Only the tags will be removed,
their content will get pulled up into the parent tag.
``kill_tags``:
A list of tags to kill. Killing also removes the tag's content,
i.e. the whole subtree, not just the tag itself.
``allow_tags``:
A list of tags to include (default include all).
``remove_unknown_tags``:
Remove any tags that aren't standard parts of HTML.
``safe_attrs_only``:
If true, only include 'safe' attributes (specifically the list
from the feedparser HTML sanitisation web site).
``safe_attrs``:
A set of attribute names to override the default list of attributes
considered 'safe' (when safe_attrs_only=True).
``add_nofollow``:
If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.
``host_whitelist``:
A list or set of hosts that you can use for embedded content
(for content like ``<object>``, ``<link rel="stylesheet">``, etc).
You can also implement/override the method
``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
implement more complex rules for what can be embedded.
Anything that passes this test will be shown, regardless of
the value of (for instance) ``embedded``.
Note that this parameter might not work as intended if you do not
make the links absolute before doing the cleaning.
Note that you may also need to set ``whitelist_tags``.
``whitelist_tags``:
A set of tags that can be included with ``host_whitelist``.
The default is ``iframe`` and ``embed``; you may wish to
include other tags like ``script``, or you may want to
implement ``allow_embedded_url`` for more control. Set to None to
include all tags.
This modifies the document *in place*.
TFNr �iframe�embedc K sZ x:|� � D ].\}}t| |�s,td||f ��t| ||� q
W | jd krVd|krV| j| _d S )NzUnknown parameter: %s=%r�inline_style)�items�hasattr� TypeError�setattrr �style)�self�kw�name�valuer r r �__init__� s
zCleaner.__init__�src�href�code�object)�script�link�appletr r �layer�ac C s� t |d�r|�� }t|� x|�d�D ]
}d|_q&W | jsD| �|� t| jpNd�}t| j p\d�}t| j
pjd�}| jr~|�d� | j
r�t| j�}x:|�tj�D ]*}|j}x|�� D ]}||kr�||= q�W q�W | j�r2| j
r�| jtjk�s(x@|�tj�D ]0}|j}x$|�� D ]}|�d��r||= �qW q�W |j| jdd� | j�s�x`t|�D ]T}|�d �} t�d
| �}
t�d
|
�}
| �|
��r�|jd = n|
| k�rJ|�d |
� �qJW | j�s2x�t |�d ��D ]t}|�dd
��!� �"� dk�r�|�#� �q�|j$�p�d
} t�d
| �}
t�d
| �}
| �|
��rd
|_$n|
| k�r�|
|_$�q�W | j�sB| j%�rN|�tj&� | j%�rb|�tj'� | j�rt|�d � | j�r�t�(|d � | j)�r�|�d� nT| j�s�| j�r�xBt |�d��D ]0}d|�dd
��!� k�r�| �*|��s�|�#� �q�W | j+�r|�d� | j,�r|�-d� | j.�r�x\t |�d��D ]J}d}|�/� }x$|dk �r`|jdk�r`|�/� }�q>W |dk�r,|�#� �q,W |�-d� |�-d� | j0�r�|�-tj1� | j2�r�|�d� |�-d� | j3�r�|�-d� g }
g }x`|�� D ]T}|j|k�r| �*|��r�q�|�4|� n&|j|k�r�| �*|��r*�q�|
�4|� �q�W |
�rj|
d |k�rj|
�5d�}d|_|j�6� n8|�r�|d |k�r�|�5d�}|jdk�r�d|_|�6� |�7� x|D ]}|�#� �q�W x|
D ]}|�8� �q�W | j9�r�|�r�t:d��ttj;�}|�rtg }x(|�� D ]}|j|k�r|�4|� �qW |�rt|d |k�r\|�5d�}d|_|j�6� x|D ]}|�8� �qbW | j<�r�xdt=|�D ]X}| �>|��s�|�d�}|�r�d|k�r�d d!| k�rq�d"| }nd}|�d|� �q�W dS )#z&
Cleans the document.
�getrootZimageZimgr r* �onF)Zresolve_base_hrefr � �typeztext/javascriptz
/* deleted */r+ Z
stylesheet�rel�meta)�head�html�title�paramN)r, r) )r, )r r r- r) r8 Zform)Zbutton�input�select�textarea)ZblinkZmarqueer Zdivr6 zIIt does not make sense to pass in both allow_tags and remove_unknown_tagsZnofollowz
nofollow z %s z%s nofollow)?r r/ r �iter�tag�comments�kill_conditional_comments�set� kill_tags�remove_tags�
allow_tags�scripts�add�safe_attrs_only�
safe_attrsr ZElement�attrib�keys�
javascriptr �
startswithZ
rewrite_links�_remove_javascript_linkr �_find_styled_elements�get�_css_javascript_re�sub�_css_import_re�_has_sneaky_javascriptr �list�lower�strip� drop_tree�text�processing_instructions�CommentZProcessingInstructionZstrip_attributes�links�
allow_elementr4 �page_structure�update�embeddedZ getparent�framesZ
frame_tags�forms�
annoying_tags�append�pop�clear�reverseZdrop_tag�remove_unknown_tags�
ValueErrorZtags�add_nofollow�_find_external_links�allow_follow)r! �doc�elrA rB rC rG rH Zaname�old�newZfound_parent�parent�_removeZ_kill�badr3 r r r �__call__� s
zCleaner.__call__c C s dS )zF
Override to suppress rel="nofollow" on some anchors.
Fr )r! �anchorr r r rj � s zCleaner.allow_followc C s� |j | jkrdS | j|j }t|ttf�r^x.|D ]&}|�|�}|sFdS | �||�s0dS q0W dS |�|�}|spdS | �||�S dS )z�
Decide whether an element is configured to be accepted or rejected.
:param el: an element.
:return: true to accept the element or false to reject/discard it.
FTN)r= �_tag_link_attrs�
isinstancerS �tuplerN �allow_embedded_url)r! rl �attrZone_attr�urlr r r r[ � s
zCleaner.allow_elementc C s^ | j dk r|j| j krdS t|�\}}}}}|�� �dd�d }|dkrLdS || jkrZdS dS )a
Decide whether a URL that was found in an element's attributes or text
is configured to be accepted or rejected.
:param el: an element.
:param url: a URL found on the element.
:return: true to accept the URL and false to reject it.
NF�:� r )ZhttpZhttpsT)�whitelist_tagsr= r rT �split�host_whitelist)r! rl ry ZschemeZnetloc�pathZqueryZfragmentr r r rw � s
zCleaner.allow_embedded_urlc C s g }| � |dd� tj� dS )z�
IE conditional comments basically embed HTML that the parser
doesn't normally see. We can't allow anything like that, so
we'll kill any comments that could be conditional.
c S s t �| j�S )N)�_conditional_comment_re�searchrW )rl r r r �<lambda>� � z3Cleaner.kill_conditional_comments.<locals>.<lambda>N)�_kill_elementsr rY )r! rk rq r r r r? � s z!Cleaner.kill_conditional_commentsc C sD g }x$|� |�D ]}||�r|�|� qW x|D ]}|�� q0W d S )N)r<