原创 

关于HTML文件目录树形结构的爬取算法(python3 实现)

分类:python    763人阅读    IT小君  2021-07-18 11:21

目录结构:


目录html代码如下:

<div class="treeview">
  <ul id="code_tree" class="filetree treeview">
    <li class="collapsable">
      <div class="hitarea collapsable-hitarea"></div>
      <span class="folder">fontawesome</span>
      <ul style="">
        <li class="expandable">
          <div class="hitarea expandable-hitarea"></div>
          <span class="folder">css</span>
          <ul style="display: none">
            <li class="last"><span class="css">all.min.css</span></li>
          </ul>
        </li>
        <li class="expandable lastExpandable">
          <div class="hitarea expandable-hitarea lastExpandable-hitarea"></div>
          <span class="folder">webfonts</span>
          <ul style="display: none">
            <li><span class="file">fa-brands-400.eot</span></li>
            <li><span class="file">fa-brands-400.svg</span></li>
            <li><span class="file">fa-brands-400.ttf</span></li>
          </ul>
        </li>
      </ul>
    </li>
    <li class="expandable">
      <div class="hitarea expandable-hitarea"></div>
      <span class="folder">js</span>
      <ul style="display: none">
        <li class="expandable">
          <div class="hitarea expandable-hitarea"></div>
          <span class="folder">vendor</span>
          <ul style="display: none">
            <li><span class="js">jquery-3.6.0.min.js</span></li>
            <li class="last"><span class="js">modernizr-3.5.0.min.js</span></li>
          </ul>
        </li>
        <li><span class="js">ajax-form.js</span></li>
        <li class="last"><span class="js">wow.min.js</span></li>
      </ul>
    </li>
    <li><span class="html">blog-details.html</span></li>
    <li class="last"><span class="html">team.html</span></li>
  </ul>
</div>



思路:

A、递归处理

B、关键问题是文件的文件目录生成

C、解析分两种场景处理
        1、文件直接添加到文件列表 

        2、目录则将目录名传给递归方法 继续 寻找文件。


python3代码示例:

import os
import re
import threading
import time

import requests
from bs4 import BeautifulSoup

class treeSpider(threading.Thread):
	def __init__(self, threadID, name):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
		
	def parseFilePath(self):
        ......
        r2 = requests.post(url, data)
        soupPath = BeautifulSoup(r2.text, 'html.parser') #r2.text 为文件树html
        lis = soupPath.select('#code_tree>li')
        list = self.parseLis(lis,"")
        print(list)


    def parseLis(self,lis,path):
        fileList = []
        for li in lis:
            name = li.find_next('span')
            if name.get('class')[0] == 'folder':
                ls = li.select('ul>li')
                if not os.path.exists('bootstrapmb/'+path+'/'+name.get_text()):
                    os.makedirs('bootstrapmb/'+path+'/'+name.get_text())
                fileList.extend(self.parseLis(ls,path+'/'+name.get_text()))
            else:
                fileList.append(path+'/'+name.get_text())
        return fileList;

    def run(self):
        if self.name == 'url_getter':
            self.getUrl()
        if self.name == 'detail_getter':
            self.getDetail()
        print(self.name + '跑完了')


if __name__ == '__main__':
    pget = treeSpider(1, 'url_getter')
    pget.parseFilePath()




支付宝打赏 微信打赏

如果文章对你有帮助,欢迎点击上方按钮打赏作者

 工具推荐 更多»