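While reading the ChatGPT documentation recently, I came across an article, How to build an AI that can answer questions about your website. It explains how to build an AI bot on top of your existing website that answers questions about that site. This is a common scenario: your company may keep a lot of useful documents on some site, you may run a blog dedicated to one topic, or a product may have detailed online usage documentation. With GPT tools available, we can send the site's content to GPT as context and have GPT answer users' questions based on that context.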

Below, using my personal website as the example and OpenAI's ChatGPT API as the tool, we build such a question-answering program.

Overview of the steps

Overall, we need to take the following steps:

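Download the whole website.
Extract the core content from the useful web pages.
Compute text embeddings for the extracted text and store them in a vector database.
When a question comes in, first search the vector database with the question text, then send the search results as context, together with the question, to ChatGPT to get the answer.
The concrete code and steps follow.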
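
Before the real implementation, the retrieve-then-ask flow of the last two steps can be sketched with toy stand-ins. This is only an illustration under loud assumptions: the word-overlap score replaces real embeddings, the three sample documents are made up, and the names tokenize, retrieve and make_prompt are hypothetical helpers, not part of any library used later.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, k=2):
    """Rank docs by Jaccard word overlap with the question; return the top k.

    Toy stand-in for embedding similarity search (assumption).
    """
    q = tokenize(question)

    def score(doc):
        d = tokenize(doc)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(docs, key=score, reverse=True)[:k]


def make_prompt(question, contexts):
    """Put the retrieved contexts in front of the question."""
    joined = "\n---\n".join(contexts)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Made-up sample documents standing in for extracted articles
docs = [
    "Use a heap dump to analyze a Java OOM error.",
    "wget -r downloads a whole website recursively.",
    "FAISS stores vectors for similarity search.",
]
question = "How do I analyze a Java OOM error?"
top = retrieve(question, docs, k=1)
prompt = make_prompt(question, top)
print(prompt)
```

The prompt string is what would be sent to the chat model; the real version below swaps the overlap score for OpenAI embeddings and FAISS.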

Download the whole website

A website is easy to download with the wget command.

$ wget -r https://www.tianxiaohui.com 
        ...omitted...
FINISHED --2024-01-06 19:42:12--
Total wall clock time: 11m 28s
Downloaded: 3611 files, 133M in 3m 37s (625 KB/s)

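With the -r option, wget downloads the whole site and lays the files out by URL path. Here it downloaded 3611 files, 133M in total. My site clearly does not have that many articles, though: the download also includes many image links and some category pages.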
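
To see what a download like this actually contains, one option is to count the files by extension. The sketch below is a hypothetical helper (count_extensions is not from the original post), and it runs against a small temporary directory rather than the real download so that it is self-contained.

```python
import os
import tempfile
from collections import Counter


def count_extensions(root):
    """Walk a directory tree and count files by lowercase extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts


# Demo on a throwaway tree; point count_extensions at the wget output folder instead
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "index.php", "archives"))
    for rel in ["index.html", "logo.PNG", os.path.join("index.php", "archives", "1.html")]:
        with open(os.path.join(root, rel), "w") as f:
            f.write("")
    demo_counts = count_extensions(root)

print(demo_counts.most_common())
```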

Extract the core content from the useful web pages

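Browsing these pages by hand shows that we only need the HTML pages that contain a full article; category pages (for example, the collection of articles from March 2023) are not needed. Once an article's HTML is loaded, we only need the article part, not the menu links or the category links in the sidebar. That gives the following code: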

import os

from bs4 import BeautifulSoup


def fetch_docs(folder: str):
    """Return the article text of every .html file directly inside folder."""
    # List the folder and keep only the files ending in .html
    html_docs = [f for f in os.listdir(folder) if f.endswith('.html')]

    txts = []
    for html_doc in html_docs:
        # encoding='utf-8' is an assumption about how the pages were saved
        with open(os.path.join(folder, html_doc), 'r', encoding='utf-8') as file:
            # Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(file, 'html.parser')
            # Keep only the article body; category pages have none and are skipped
            post = soup.find('div', class_='post-content')
            if post:
                # Drop newlines and carriage returns, turn tabs into spaces
                txt = post.get_text().replace("\n", "").replace("\t", " ").replace("\r", "")
                # print(txt)  # inspect the text to decide on further cleanup
                txts.append(txt)
            else:
                print("no article found in " + html_doc)
    print(len(txts))
    return txts


fetch_docs("/Users/eric/Downloads/blogs/www.tianxiaohui.com/index.php/Troubleshooting")
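
To make the selection logic visible on a tiny input, here is a sketch that keeps only the text inside the post-content div and returns None for pages without one. It deliberately uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages; the class PostContentExtractor, the function extract_post_text and the sample HTML are all hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class PostContentExtractor(HTMLParser):
    """Collect text inside the first <div class="post-content"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth of <div> tags inside the target div
        self.found = False   # True once the target div has been entered
        self.done = False    # True once the target div has been closed
        self.parts = []      # text fragments collected from the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the article body
        elif "post-content" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_post_text(html):
    """Return the whitespace-normalized article text, or None if there is no article."""
    parser = PostContentExtractor()
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join(" ".join(parser.parts).split())


sample = (
    '<html><body><div class="menu">Home</div>'
    '<div class="post-content"><p>Heap dumps help.</p>'
    '<div class="note">Use MAT.</div></div>'
    '<div class="sidebar">2023-03</div></body></html>'
)
article_text = extract_post_text(sample)
print(article_text)
```

The menu and sidebar text are dropped, and a category page with no post-content div yields None, which mirrors the skip branch in fetch_docs.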

Embed the extracted text and store it in a vector database

We use OpenAI embeddings, with FAISS as the vector store for similarity search.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS


embeddings_model = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(
    fetch_docs("/Users/eric/Downloads/blogs/www.tianxiaohui.com/index.php/Troubleshooting"), embedding=embeddings_model
)
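
Conceptually, a similarity search compares the question's vector with every stored vector and keeps the closest ones. The pure-Python sketch below shows that idea with squared Euclidean distance on made-up 2-dimensional vectors; the function names and the optional score_threshold parameter are illustrative, and whether the FAISS wrapper used above defaults to this exact metric is an assumption, not something the code here verifies.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, k=3, score_threshold=None):
    """Return up to k (distance, index) pairs, closest first.

    With a distance metric, a lower score means a closer match, so the
    optional threshold keeps only pairs with distance <= score_threshold.
    """
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    if score_threshold is not None:
        scored = [(d, i) for d, i in scored if d <= score_threshold]
    return scored[:k]


# Made-up vectors; real embeddings have hundreds or thousands of dimensions
vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
hits = nearest([1.0, 0.0], vectors, k=2)
print(hits)
```

A vector library does the same comparison far more efficiently, but the ranking idea is the same.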

Retrieve relevant content from the vector database and call GPT to generate the answer

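First we turn the FAISS vector store into a retriever and retrieve the relevant documents, then send the question together with the context built from those documents to ChatGPT to get the answer.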

from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI


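question = "如何分析Java应用OOM问题?"  # "How do I analyze an OOM problem in a Java application?"
retriever = vectorstore.as_retriever(search_kwargs={"k": 3, "score_threshold": .5})
docs = retriever.get_relevant_documents(question)
# Join the retrieved texts into one string so the prompt does not contain a Python list repr
context = "\n\n".join(doc.page_content for doc in docs)

prompt = PromptTemplate.from_template("here is the question: {query}, this is the contexts: {context}")
chain = prompt | ChatOpenAI()
result = chain.invoke({"query": question, "context": context})
# The chat model returns a message object; its text is in .content
print(result.content)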
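
Whole articles can be long, so it may help to cap how much retrieved text goes into the prompt. The sketch below is one possible way to do that with a simple character budget; build_context, build_rag_prompt, the 2000-character default and the prompt wording are all assumptions made for illustration, and a real limit would be counted in model tokens rather than characters.

```python
def build_context(chunks, max_chars=2000, separator="\n\n---\n\n"):
    """Join chunks in order, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        # The separator only costs characters from the second chunk on
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return separator.join(selected)


def build_rag_prompt(question, chunks, max_chars=2000):
    """Assemble a prompt that asks the model to answer from the context only."""
    context = build_context(chunks, max_chars)
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up chunks standing in for retrieved article texts
chunks = ["a" * 10, "b" * 10, "c" * 10]
limited = build_context(chunks, max_chars=25, separator="|")
print(limited)
print(build_rag_prompt("What is in the context?", chunks, max_chars=25))
```

Because the retriever returns the most relevant documents first, cutting from the end drops the least relevant text.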

Summary

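With the few dozen lines of code above, we turned an existing knowledge website into a first, basic version of an AI that can answer questions about it.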
