目录结构:
目录html代码如下:
<div class="treeview">
<ul id="code_tree" class="filetree treeview">
<li class="collapsable">
<div class="hitarea collapsable-hitarea"></div>
<span class="folder">fontawesome</span>
<ul style="">
<li class="expandable">
<div class="hitarea expandable-hitarea"></div>
<span class="folder">css</span>
<ul style="display: none">
<li class="last"><span class="css">all.min.css</span></li>
</ul>
</li>
<li class="expandable lastExpandable">
<div class="hitarea expandable-hitarea lastExpandable-hitarea"></div>
<span class="folder">webfonts</span>
<ul style="display: none">
<li><span class="file">fa-brands-400.eot</span></li>
<li><span class="file">fa-brands-400.svg</span></li>
<li><span class="file">fa-brands-400.ttf</span></li>
</ul>
</li>
</ul>
</li>
<li class="expandable">
<div class="hitarea expandable-hitarea"></div>
<span class="folder">js</span>
<ul style="display: none">
<li class="expandable">
<div class="hitarea expandable-hitarea"></div>
<span class="folder">vendor</span>
<ul style="display: none">
<li><span class="js">jquery-3.6.0.min.js</span></li>
<li class="last"><span class="js">modernizr-3.5.0.min.js</span></li>
</ul>
</li>
<li><span class="js">ajax-form.js</span></li>
<li class="last"><span class="js">wow.min.js</span></li>
</ul>
</li>
<li><span class="html">blog-details.html</span></li>
<li class="last"><span class="html">team.html</span></li>
</ul>
</div>
思路:
A、递归处理
B、关键问题是文件的文件目录生成
C、解析分两种场景处理
1、文件直接添加到文件列表
2、目录则将目录名传给递归方法 继续 寻找文件。
python3代码示例:
import os
import re
import threading
import time
import requests
from bs4 import BeautifulSoup
class treeSpider(threading.Thread):
def __init__(self, threadID, name):
threading.Thread.__init__(self)
self.threadID = threadID
self.name = name
def parseFilePath(self):
......
r2 = requests.post(url, data)
soupPath = BeautifulSoup(r2.text, 'html.parser') #r2.text 为文件树html
lis = soupPath.select('#code_tree>li')
list = self.parseLis(lis,"")
print(list)
def parseLis(self,lis,path):
fileList = []
for li in lis:
name = li.find_next('span')
if name.get('class')[0] == 'folder':
ls = li.select('ul>li')
if not os.path.exists('bootstrapmb/'+path+'/'+name.get_text()):
os.makedirs('bootstrapmb/'+path+'/'+name.get_text())
fileList.extend(self.parseLis(ls,path+'/'+name.get_text()))
else:
fileList.append(path+'/'+name.get_text())
return fileList;
def run(self):
if self.name == 'url_getter':
self.getUrl()
if self.name == 'detail_getter':
self.getDetail()
print(self.name + '跑完了')
if __name__ == '__main__':
pget = treeSpider(1, 'url_getter')
pget.parseFilePath()