Processing XML files in Python

10/26/2019 09:33:00 PM

In this section, we parse XML content into data that we can work with. To process XML, we use the Beautiful Soup 4 library, which makes parsing HTML and XML quick and convenient.

1. Installing Beautiful Soup

Installation instructions are available on the website:

http://www.crummy.com/software/BeautifulSoup/bs4/doc

beautiful-soup.

On macOS, you can install it with pip as follows:

$ sudo pip install beautifulsoup4

2. Installing the lxml parser

To parse XML with Beautiful Soup, we use an XML parser called lxml. See the installation instructions at:

http://www.crummy.com/software/BeautifulSoup/bs4/doc

a-parser

On macOS, you can install it with pip as follows:

$ sudo pip install lxml
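After both installs, one quick way to confirm that Python can find the packages is to look them up on the import path. The sketch below is not part of the original post; it uses only the standard library, and the helper name is_installed is an assumption introduced for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the module on its import path."""
    return importlib.util.find_spec(module_name) is not None


# The beautifulsoup4 package is imported as "bs4"; lxml is imported as "lxml".
for name in ("bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(name + ": " + status)
```

If either line reports "missing", the pip command for that package probably ran against a different Python installation than the one executing the script.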

3. Example of parsing XML

Consider the following example:

from bs4 import BeautifulSoup as Soup

# Sample XML document to parse
note = '''<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>Light Belgian waffles covered with strawberries and whipped cream</description>
  </food>
</breakfast_menu>
'''

# 'xml' selects the lxml XML parser installed in step 2
soup = Soup(note, 'xml')

# find_all is the current spelling of findAll
foods = soup.find_all('food')

# Print the name and price of each <food> element
for x in foods:
    print(x.find('name').string, ': ', x.price.string)

Running it prints the following to the screen:

Belgian Waffles :  $5.95

Strawberry Belgian Waffles :  $7.95

An object of the Soup (BeautifulSoup) class makes it quick and convenient to access the parts of an XML file. The example uses several ways of reaching elements:

findAll() (spelled find_all() in current Beautiful Soup 4): returns a list of all tags with the given name

find(): returns the first element with the given name

Direct access through the tag name, as in x.price.string
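The three access styles above can be compared side by side. The following is a minimal sketch, assuming beautifulsoup4 and lxml are installed; the second food item (with no price) and the helper price_text are hypothetical, added only to show what happens when a tag is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: the second <food> deliberately has no <price>.
xml_doc = """<menu>
  <food><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>Plain Toast</name></food>
</menu>"""

soup = BeautifulSoup(xml_doc, "xml")

# find_all() returns a list-like collection of every matching tag.
foods = soup.find_all("food")

# find() returns only the first match, or None if there is none.
first_name = soup.find("name")

# Attribute access (food.price) behaves like find("price") on that tag.
# Note that food.name is the tag's own name ("food"), not the <name> child,
# which is why the child must be fetched with find("name").
prices = [food.price for food in foods]


def price_text(food, default="n/a"):
    """Return the price text of a <food> tag, or a default when absent."""
    tag = food.find("price")
    return tag.string if tag is not None else default


for food in foods:
    print(food.find("name").string, ":", price_text(food))
```

Checking for None before reading .string avoids the AttributeError that x.price.string raises when an element has no price tag.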

4. Parsing HTML

As with XML, BeautifulSoup can parse HTML content: call the constructor and pass html as the second argument.

soup = Soup(websitehtml, 'html')
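As a small runnable illustration of the HTML case, the sketch below parses an inline HTML string rather than a downloaded page. The markup and variable names are assumptions made up for the example, and it names Python's built-in html.parser explicitly so that it does not depend on which other parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
html_doc = (
    "<html><body>"
    "<h1>Menu</h1>"
    "<a href='/waffles'>Waffles</a>"
    "<a href='/toast'>Toast</a>"
    "</body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# Tag-name access and find_all() work the same way as for XML.
title = soup.h1.string
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```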
