My First Web Scraper

It has been about two hours since the 403, and I can access the site again!!

With the idea of building an API that serves random background images for my blog, I set out on the first web scrape of my life. I only started learning Python scraping last week and haven't even finished, so my technique is honestly not up to par. I ran one test too many, and my IP got banned!!


Saving the Bing daily images as files

After looking at plenty of Bing image APIs online, I found a very generous webmaster's image site and began my scraping journey there.

import requests
from bs4 import BeautifulSoup
import os

def get_html(url):
    # Send a browser User-Agent so the site serves the page normally
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36'}
    r = requests.get(url, headers=headers)
    return r.content

def download(text):
    soup = BeautifulSoup(text, 'html.parser')
    items = soup.find_all("img")
    path = "E:/bing"
    if not os.path.exists(path):
        os.makedirs(path)
    for item in items:
        src = item.get('src')
        if src:
            html = requests.get(src)
            # Use a fixed slice of the image URL as the file name
            img_name = os.path.join(path, src[24:41] + '.png')
            # The with block closes the file automatically
            with open(img_name, 'wb') as file:
                file.write(html.content)
                

def main():
    # The ranking list spans 123 pages
    for i in range(1, 124):
        url = 'https://bing.ioliu.cn/ranking?p={}'.format(i)
        text = get_html(url)
        download(text)
        print("Page {} downloaded".format(i))

if __name__ == "__main__":
    main()
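As an aside, the hard-coded slice src[24:41] only works while every image URL keeps exactly the same prefix length. A more robust way to derive a file name, just a sketch assuming the src values are ordinary URLs (filename_from_url is my own helper, not part of the script above), is to take the last path segment:

import os
from urllib.parse import urlparse

def filename_from_url(src):
    # Keep only the final path segment; urlparse drops the query string,
    # e.g. '.../bing/Sample_640x480.jpg?imageslim' -> 'Sample_640x480.jpg'
    return os.path.basename(urlparse(src).path)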

That first script downloads every picture to disk, but then it occurred to me that I would still have to upload them all to my server and fetch each image's URL one by one. Far too cumbersome, so instead I...

Scraping each image's URL

import requests
from bs4 import BeautifulSoup
import os

def get_html(url):
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36'}
    r = requests.get(url, headers=headers)
    return r.content

def download(text):
    soup = BeautifulSoup(text, 'html.parser')
    items = soup.find_all("img")
    for item in items:
        src = item.get('src')
        if src:
            # Swap the 640x480 thumbnail suffix for the full 1920x1080 version
            img_url = src.replace('640x480', '1920x1080')
            # Append each URL to the list file, one per line
            with open('imgurl.txt', 'a') as f:
                f.write(img_url + '\n')



def main():
    for i in range(1, 124):
        url = 'https://bing.ioliu.cn/ranking?p={}'.format(i)
        text = get_html(url)
        download(text)
        print("Page {} done".format(i))

if __name__ == "__main__":
    main()
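To make the rewrite concrete, this is what it does to a single thumbnail link (the file name below is made up for illustration):

src = 'https://h2.ioliu.cn/bing/SampleImage_ZH-CN_640x480.jpg?imageslim'
print(src.replace('640x480', '1920x1080'))
# https://h2.ioliu.cn/bing/SampleImage_ZH-CN_1920x1080.jpg?imageslim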

With that, I had every image's URL saved. Yes, the header image of this post is one of the scraped results. Then, just as I was savoring the success, my IP got banned. Still, the images themselves are all accessible, so the work wasn't wasted. My technique isn't there yet and needs improving!
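In hindsight, the ban most likely came from requesting 123 pages back to back. A simple courtesy that might have helped, only a sketch on my part since the site doesn't publish its limits, is to pause between pages:

import time

def main():
    for i in range(1, 124):
        url = 'https://bing.ioliu.cn/ranking?p={}'.format(i)
        text = get_html(url)
        download(text)
        print("Page {} done".format(i))
        time.sleep(2)  # breathe between pages to stay under the rate limit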

But in the end I still got the random API I wanted.

1474 images, plenty to play with.

http://www.wu555.ink/random
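I won't go into the server side here, but an endpoint like the /random above only takes a few lines. A rough sketch with Flask (just one way to do it, not necessarily what runs behind that URL), reading the imgurl.txt produced earlier and redirecting to a random line:

import random
from flask import Flask, redirect

app = Flask(__name__)

# Load every scraped image URL once at startup
with open('imgurl.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

@app.route('/random')
def random_image():
    # 302-redirect the visitor to a randomly chosen image
    return redirect(random.choice(urls))

if __name__ == '__main__':
    app.run()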