Tuesday, July 28, 2015

How to extract URLs from sitemap.xml using a Python script

I often need to extract URLs from an XML sitemap. Dozens of Unix tools could probably do the job; here the task is solved with Python 3 and Beautiful Soup 4.

#!/usr/bin/python3

from bs4 import BeautifulSoup

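# Open the sitemap and parse it. 'html.parser' ships with Python,
# so no extra parser library is needed, and the file is closed automatically.
with open('sitemap.xml', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Each URL in a sitemap sits inside a <loc> element.
for url in soup.find_all('loc'):
    print(url.get_text(strip=True))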
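
If installing Beautiful Soup is not an option, roughly the same result can be had with the standard library alone. The following is only a sketch: it assumes the sitemap declares the usual sitemaps.org namespace, and the function name extract_urls and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by <urlset> documents that follow the sitemaps.org protocol.
# Assumption: the sitemap being read declares exactly this namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(xml_text):
    """Return the text of every <loc> element in a sitemap string."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sample data, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

for url in extract_urls(sample):
    print(url)
```

To use it on a real file, read sitemap.xml into a string and pass that to extract_urls. A sitemap without the namespace declaration would return an empty list with this sketch, which is one reason the more forgiving Beautiful Soup approach can be handier.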