[Python] BeautifulSoup Library

BeautifulSoup Library

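- A Python library for parsing HTML and XML documents and for navigating, searching, and modifying the resulting parse tree. It does not make HTTP requests itself; the markup is usually fetched with a separate HTTP client and then passed to BeautifulSoup.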

* BeautifulSoup Official Documentation (URL)

Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation

beautiful-soup-4.readthedocs.io

Installation

* Installation Command (URL)

# for Debian or Ubuntu Linux
apt-get install python-bs4	# for Python 2
apt-get install python3-bs4	# for Python 3

# for pip
pip install beautifulsoup4	# for Python 2 or Python 3

# for easy_install
easy_install beautifulsoup4	# for Python 2 or Python 3

* Import

from bs4 import BeautifulSoup

BeautifulSoup Constructors

Constructor	Description
BeautifulSoup(markup, "html.parser")	- Python's built-in html.parser - Batteries included - Decent speed (but slower than lxml) - Lenient (but stricter than html5lib)
BeautifulSoup(markup, "lxml")	- lxml's HTML parser - Very fast - Lenient - External C dependency
BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	- lxml's XML parser - Very fast - The only currently supported XML parser - External C dependency
BeautifulSoup(markup, "html5lib")	- html5lib Parser - Extremely lenient - Parses pages the same way a web browser does - Creates valid HTML5 - Very slow - External Python dependency

* lxml (URL)

- A Python library that provides HTML and XML parsers.
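
Since lxml and html5lib are external dependencies, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea, not an official recipe: make_soup and its preference order are hypothetical names chosen for illustration, while FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try parsers in order and use the first one available."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello<br>world")
print(used)                   # name of the parser that was actually used
print(soup.p.get_text(" "))   # Hello world
```

"html.parser" ships with Python, so the loop always ends with a usable parser.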

BeautifulSoup Attributes

soup = BeautifulSoup('HTML_Source', 'html.parser')

Attribute	Description
soup.<tag-name>	- The first <tag-name> tag in the document, as a Tag object (the tag together with its contents).
soup.<tag-name>.name	- The name of the <tag-name> tag.
soup.<tag-name>.string	- The text content of the <tag-name> tag. It is None when the tag has more than one child.
soup.<tag-name>.parent	- The parent tag of the <tag-name> tag, together with its contents.
soup.<tag-name>.attrs	- The attributes of the <tag-name> tag, as a dictionary.
soup.<tag-name>[<attr-name>]	- The value of the <attr-name> attribute (an HTML attribute) of the <tag-name> tag.
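
The attributes in the table can be tried on a small document. This is an illustrative sketch; the HTML snippet and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><a href="https://example.com" class="ext link">Example</a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                # first <a> tag in the document
print(link.name)             # a
print(link.string)           # Example
print(link.parent.name)      # div
print(link.attrs)            # {'href': 'https://example.com', 'class': ['ext', 'link']}
print(link["href"])          # https://example.com
```

Note that class is a multi-valued attribute, so its value is a list rather than a single string.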

BeautifulSoup Operations

soup = BeautifulSoup('HTML_Source', 'html.parser')

Operator	Description
del soup.<tag-name>[<attr-name>]	- Removes the <attr-name> attribute from the <tag-name> tag.
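
A short sketch of the del operation, using a made-up one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com" id="link1">Example</a>', "html.parser")

del soup.a["id"]             # remove the id attribute from the first <a> tag
print(soup.a)                # <a href="https://example.com">Example</a>
print(soup.a.get("id"))      # None
```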

BeautifulSoup Methods

* Basis

soup = BeautifulSoup('HTML_Source', 'html.parser')

soup.prettify(encoding, formatter)
soup.find(name, attrs, recursive, string, **kwargs)
soup.find_all(name, attrs, recursive, string, limit, **kwargs)
soup.get(key, default)
soup.get_text(separator, strip, types)
soup.select(selector, namespaces, limit, **kwargs)
soup.clear(decompose)

prettify Method

@overload
def prettify(
    self,
    encoding: str,
    formatter: str | Formatter = ...
) -> bytes: ...

@overload
def prettify(
    self,
    encoding: None = ...,
    formatter: str | Formatter = ...
) -> str: ...

- Returns the Beautiful Soup parse tree as a nicely formatted Unicode string.
(Older versions return a bytestring.)

- Each tag and each string is placed on its own line.
- Because prettify() adds whitespace such as newlines and so alters the content of the HTML/XML document, use it only for inspecting a document visually.

Parameter	Description
encoding	- Encoding of the output. When it is given, the result is returned as bytes in that encoding.
formatter	- Controls how strings are processed on output. There are five options, listed below.

- formatter="minimal" (default): processes strings just enough for Beautiful Soup to generate valid HTML/XML.
- formatter="html": converts Unicode characters to HTML entities whenever possible.
- formatter="html5": behaves like formatter="html", but omits the closing slash in HTML void tags and turns attributes whose value is the empty string ("") into boolean attributes.
- formatter=None: Beautiful Soup does not modify strings at all on output. This option exists for output speed and may produce invalid HTML/XML.
- formatter=<Formatter object>: an instance of Beautiful Soup's Formatter class, for finer control such as changing letter case, setting indentation, and sorting attributes.
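
The effect of encoding and of the string-valued formatter options can be compared side by side. This is a small sketch with a made-up document; only the parts of the output that matter are shown in the comments.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café &amp; bar<br></p>", "html.parser")

pretty = soup.prettify()                      # str, formatter="minimal"
as_bytes = soup.prettify("utf-8")             # bytes, because an encoding is given
with_entities = soup.prettify(formatter="html")
html5_style = soup.prettify(formatter="html5")

print(pretty)          # keeps "café", writes the void tag as <br/>
print(type(as_bytes))  # <class 'bytes'>
print(with_entities)   # "é" becomes "&eacute;"
print(html5_style)     # the void tag is written as <br>
```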

find Method

def find(
    self,
    name: _Strainable | None = ...,
    attrs: dict[str, _Strainable] | _Strainable = ...,
    recursive: bool = ...,
    string: _Strainable | None = ...,
    **kwargs: _Strainable
) -> Tag | NavigableString | None: ...

- Returns the first tag in the soup object that matches name.

- Returns None when nothing is found.

Parameter	Description
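name	- Name of the tag to look for.
attrs	- Conditions on the tag's attributes. - Typically a dictionary of the form {attribute name: value}.
recursive	- Whether to search all descendants of the object the method is called on. - Default is True. ex) soup.html.find("title") : searches everything inside the html tag for a title tag. ex) soup.html.find("title", recursive=False) : searches only the direct children of the html tag for a title tag.
string	- Filters on text content instead of on tags. - Accepts a string, a regular expression, a list, a function, or True. - A plain string must match the whole text; use a regular expression for a partial match.
**kwargs	- Keyword arguments. Each one is treated as a filter on the attribute of that name, e.g. id="link1" or class_="sister".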
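
The parameters above can be exercised on a small document. This is an illustrative sketch; the HTML and the variable names are invented for the example.

```python
import re
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
<a id="link1" class="sister" href="/elsie">Elsie</a>
<a id="link2" class="sister" href="/lacie">Lacie</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                          # first <a> tag
by_attrs = soup.find("a", attrs={"id": "link2"})     # filter with an attrs dictionary
by_kwarg = soup.find("a", href=re.compile("lacie"))  # filter with a keyword argument
missing = soup.find("table")                         # no such tag: None
direct = soup.html.find("title", recursive=False)    # title sits inside head: None
text_node = soup.find(string="Lacie")                # matches the text, not the tag

print(first_link["id"])         # link1
print(by_attrs.string)          # Lacie
print(by_kwarg["id"])           # link2
print(missing, direct)          # None None
print(text_node.parent["id"])   # link2
```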

find_all Method

def find_all(
    self,
    name: _Strainable | None = ...,
    attrs: dict[str, _Strainable] | _Strainable = ...,
    recursive: bool = ...,
    string: _Strainable | None = ...,
    limit: int | None = ...,
    **kwargs: _Strainable
) -> ResultSet[Any]: ...

- Returns a list (ResultSet) of all tags in the soup object that match name.

- Returns an empty list when nothing is found.

Parameter	Description
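name	- Name of the tag to look for.
attrs	- Conditions on the tag's attributes. - Typically a dictionary of the form {attribute name: value}.
recursive	- Whether to search all descendants of the object the method is called on. - Default is True. ex) soup.html.find_all("title") : searches everything inside the html tag for title tags. ex) soup.html.find_all("title", recursive=False) : searches only the direct children of the html tag for title tags.
string	- Filters on text content instead of on tags. - Accepts a string, a regular expression, a list, a function, or True. - A plain string must match the whole text; use a regular expression for a partial match.
limit	- Maximum number of results to return. - With limit=1 the search finds the same element as find(), but the result is still a list containing that one element.
**kwargs	- Keyword arguments. Each one is treated as a filter on the attribute of that name, e.g. id="link1" or class_="sister".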
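
A sketch of find_all with the same kinds of filters, again on an invented document, including the difference between limit=1 and find():

```python
import re
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
<a class="sister" href="/elsie">Elsie</a>
<a class="sister" href="/lacie">Lacie</a>
<a class="brother" href="/tom">Tom</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                         # every <a> tag
sisters = soup.find_all("a", class_="sister")          # class_ avoids the Python keyword
first_only = soup.find_all("a", limit=1)               # list with at most one element
texts = soup.find_all(string=re.compile("sie"))        # partial text match via regex
no_tables = soup.find_all("table")                     # empty list, not None
shallow = soup.html.find_all("title", recursive=False) # title is not a direct child of html

print(len(all_links))                        # 3
print([a["href"] for a in sisters])          # ['/elsie', '/lacie']
print(first_only[0] is soup.find("a"))       # True
print(texts)                                 # ['Elsie']
print(no_tables, shallow)                    # [] []
```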

get Method

def get(
    self,
    key: str,
    default: str | list[str] | None = ...
) -> str | list[str] | None: ...

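- Returns the value of an attribute of the tag.

Parameter	Description
key	- Name of the attribute to look up.
default	- Value returned when the tag has no such attribute. Defaults to None.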
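
A brief sketch of get() next to plain indexing, on a made-up tag. The practical difference is that get() returns a default for a missing attribute where indexing raises KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/about" class="nav main">About</a>', "html.parser")
link = soup.a

print(link.get("href"))               # /about
print(link.get("class"))              # ['nav', 'main']
print(link.get("title"))              # None
print(link.get("title", "untitled"))  # untitled

try:
    link["title"]                     # indexing a missing attribute raises KeyError
except KeyError:
    print("no title attribute")
```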

get_text Method

def get_text(
    self,
    separator: str = ...,
    strip: bool = ...,
    types: tuple[type[NavigableString], ...] = ...
) -> str: ...

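- Extracts only the text parts inside the tag and returns them as a single Unicode string.

Parameter	Description
separator	- String inserted between the pieces of text extracted from each tag.
strip	- Whether to strip whitespace from the beginning and end of each piece of text. - Default is False (whitespace is not stripped).
types	- Tuple of NavigableString subclasses to include; strings of any other type are skipped.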
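
The effect of separator and strip is easiest to see on text that carries extra whitespace. A small sketch with an invented snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> Hello </p><p> world <b>!</b></p></div>", "html.parser")

raw = soup.get_text()                    # pieces concatenated as they are
joined = soup.get_text("|", strip=True)  # each piece stripped, then joined with "|"

print(repr(raw))     # ' Hello  world !'
print(joined)        # Hello|world|!
```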

select Method

def select(
    self,
    selector: str,
    namespaces: Any | None = ...,
    limit: int | None = ...,
    **kwargs
) -> ResultSet[Tag]:...

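- Returns a list (ResultSet) of all tags that match a CSS selector. Returns an empty list when nothing matches.

Parameter	Description
selector	- CSS selector string.
namespaces	- Dictionary mapping the namespace prefixes used in the selector to namespace URIs.
limit	- Maximum number of results to return.
**kwargs	- Additional keyword arguments passed on to the underlying selector engine (Soup Sieve).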
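
A minimal sketch of select() with a few common CSS selectors. The document is invented for the example, and the sketch assumes the Soup Sieve package is available, which recent beautifulsoup4 releases install as a dependency.

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
  <ul>
    <li class="item active"><a href="/home">Home</a></li>
    <li class="item"><a href="/docs">Docs</a></li>
    <li class="item"><a href="/blog">Blog</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("#menu li.item > a")    # id, class, and child combinator
first_two = soup.select("li.item", limit=2) # stop after two matches
active = soup.select("li.active a")         # descendant combinator
nothing = soup.select("table")              # no match: empty list

print([a["href"] for a in links])   # ['/home', '/docs', '/blog']
print(len(first_two))               # 2
print(active[0].get_text())         # Home
print(nothing)                      # []
```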

clear Method

def clear(
    self,
    decompose: bool = ...
) -> None:...

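- Removes all contents (children) of the tag, leaving the tag itself in place.

Parameter	Description
decompose	- If True, the children are destroyed with decompose() instead of being detached with extract(). - Default is False.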
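
A short sketch of clear() on an invented document. With the default decompose=False the removed children are only detached, so a reference taken beforehand can still be read.

```python
from bs4 import BeautifulSoup

source = '<div id="box"><p>one</p><p>two</p></div>'

soup = BeautifulSoup(source, "html.parser")
box = soup.div
first_p = box.p              # keep a reference to a child before clearing
box.clear()                  # children are detached, the tag itself stays
print(box)                   # <div id="box"></div>
print(first_p.get_text())    # one
print(first_p.parent)        # None

soup2 = BeautifulSoup(source, "html.parser")
soup2.div.clear(decompose=True)   # children are destroyed instead of detached
print(soup2.div)                  # <div id="box"></div>
```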

Reference: Beautiful Soup Documentation, Read the Docs, retrieved 2022.08.08 (URL)

Reference: Beautiful Soup Documentation, Crummy, retrieved 2022.08.08 (URL)
