
Week 3 - Python Crawler Mini-Project Assignments

These are the Week 3 Python mini-project assignments from the Nanjing University MOOC 用python玩转数据 (Playing with Data in Python).

Assignment 1:

Task: scrape the first 50 "hot" comments of any book on Douban and compute the average of the ratings given (note: some comments carry no rating).

import time

import requests
from bs4 import BeautifulSoup

def crawler():
    html = []
    url = "https://book.douban.com/subject/2567698/comments/hot?p="
    for i in range(3):  # 3 pages x 20 comments per page covers the first 50
        r = requests.get(url + str(i + 1))
        html.append(r.text)
        time.sleep(5)  # pause between requests to avoid hammering the server
    return ' '.join(html)

def parse(html):
    comments = []
    grades = []
    soup = BeautifulSoup(html, 'lxml')
    for comment in soup.select(".short"):
        comments.append(comment.string)
    for star in soup.select(".user-stars"):
        # Extract the rating: it is encoded in a class name such as
        # "allstar40", so drop the 7-character "allstar" prefix.
        grade = int(star['class'][1][7:])
        grades.append(grade)
    mean = sum(grades) / len(grades)
    return comments, mean

if __name__ == "__main__":
    html = crawler()
    allcomments, mean = parse(html)
    print(mean)
    for comment in allcomments:
        print(comment)
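
The least obvious line above is the rating extraction. BeautifulSoup exposes a tag's class attribute as a list of class names, and Douban encodes the star rating in a name such as allstar40. A minimal sketch of just that parsing step, run against an invented HTML snippet rather than the live page:

from bs4 import BeautifulSoup

# Invented snippet mimicking Douban's rating markup (assumption: the
# real page puts a class of the form "allstarNN" on .user-stars tags).
snippet = '<span class="user-stars allstar40 rating"></span>'
tag = BeautifulSoup(snippet, 'html.parser').select_one('.user-stars')
print(tag['class'])              # ['user-stars', 'allstar40', 'rating']
print(int(tag['class'][1][7:]))  # 40 -- the "allstar" prefix is 7 chars long

Note that under this encoding a four-star rating comes back as 40, so the mean printed by the script is on that 10-50 scale (divide by 10 for stars).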

Assignment 2:

Task: scrape the Dow 30 component-stock data from http://money.cnn.com/data/dow30/ and output a single list holding each of the 30 companies' ticker symbol, company name, and latest trade price.

import requests
import re

def retrieve_dji_list():
    r = requests.get('http://money.cnn.com/data/dow30/')
    # put the re expression on one line and pay attention to the '\n'
    search_pattern = re.compile(r'class="wsod_symbol">(.*?)</a>.*?<span.*?">(.*?)</span>.*?\n.*?class="wsod_stream">(.*?)</span>')
    dji_list_in_text = re.findall(search_pattern, r.text)
    return dji_list_in_text

dji_list = retrieve_dji_list()
print(dji_list)

Note: this code is the reference answer given by the teacher.
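
The comment about the '\n' matters because '.' in a Python regular expression does not match a newline by default, and on the CNN page the latest price sits on a different line from the company name. A small self-contained illustration (the text below is invented for demonstration, not taken from the real page):

import re

text = 'class="wsod_symbol">AAPL</a> Apple Inc\n<span class="wsod_stream">150.00</span>'
# '.' stops at the newline, so this pattern finds nothing:
print(re.search(r'symbol">(.*?)</a>.*stream">(.*?)</span>', text))  # None
# Match the newline explicitly, as the reference answer does:
print(re.search(r'symbol">(.*?)</a>.*\n.*stream">(.*?)</span>', text).groups())
# Or pass re.DOTALL so that '.' matches newlines as well:
print(re.search(r'symbol">(.*?)</a>.*stream">(.*?)</span>', text, re.DOTALL).groups())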

Assignment 3:

Task: scrape the data (the TEAMS column and the TOTAL, WON, and LOST columns of MATCHES) from the page http://www.volleyball.world/en/vnl/2018/women/results-and-ranking/round1.

import re

import requests

def crawler(url):
    r = requests.get(url)
    # Each ranking row holds the team name inside its link, followed by
    # the TOTAL, WON and LOST match counts in the next three table cells.
    pattern = re.compile(r'href="/en/vnl/2018/women/teams/.*?>(.*?)</a>\s*</figcaption>\s*</figure>\s*</td>\s*.*?(\d+)</td>\s*.*?(\d+)</td>\s*.*?(\d+)</td>')
    lis = re.findall(pattern, r.text)
    return lis

if __name__ == '__main__':
    url = "http://www.volleyball.world/en/vnl/2018/women/results-and-ranking/round1"
    lis = crawler(url)
    print(lis, len(lis))
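
crawler returns a flat list of 4-tuples, one per team. A small sketch of how that result could be printed as the table the task describes, using invented sample tuples in place of a live scrape:

# Hypothetical sample of what crawler(url) returns: (team, total, won, lost).
lis = [('China', '15', '14', '1'), ('United States', '15', '12', '3')]
print('{:<16}{:>6}{:>6}{:>6}'.format('TEAM', 'TOTAL', 'WON', 'LOST'))
for team, total, won, lost in lis:
    print('{:<16}{:>6}{:>6}{:>6}'.format(team, total, won, lost))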