🔥🔥探索人工智能的世界:构建智能问答系统之实战篇 引言 环境

引言

前面我们已经做好了必要的准备工作,包括对相关知识点的了解以及环境的安装。今天我们将重点关注代码方面的内容。如果你已经具备了Java编程基础,那么理解Python语法应该不会成为问题,毕竟只是语法的差异而已。随着时间的推移,你自然会逐渐熟悉和掌握这门语言。现在让我们开始吧!

环境安装命令

在使用之前,我们需要先进行一些必要的准备工作,其中包括执行一些命令。如果你已经仔细阅读了Milvus的官方文档,你应该已经了解到了这一点。下面是需要执行的一些命令示例:

1
2
3
4
5
6
7
8
9
10
11
ini复制代码pip3 install langchain

pip3 install openai

pip3 install protobuf==3.20.0

pip3 install grpcio-tools

python3 -m pip install pymilvus==2.3.2

python3 -c "from pymilvus import Collection"

快速入门

现在,我们来尝试使用官方示例,看看在没有集成LangChain的情况下,我们需要编写多少代码才能完成插入、查询等操作。官方示例已经在前面的注释中详细讲解了所有的流程。总体流程如下:

  1. 连接到数据库
  2. 创建集合(这里还有分区的概念,我们不深入讨论)
  3. 插入向量数据(我看官方文档就简单插入了一些数字…)
  4. 创建索引(根据官方文档的说法,通常在一定数据量下是不会经常创建索引的)
  5. 查询数据
  6. 删除数据
  7. 断开与数据库的连接

通过以上步骤,你会发现与连接MySQL数据库的操作非常相似。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
python复制代码# hello_milvus.py demonstrates the basic operations of PyMilvus, a Python SDK of Milvus.
# 1. connect to Milvus
# 2. create collection
# 3. insert data
# 4. create index
# 5. search, query, and hybrid search on entities
# 6. delete entities by PK
# 7. drop collection
import time

import numpy as np
from pymilvus import (
connections,
utility,
FieldSchema, CollectionSchema, DataType,
Collection,
)

fmt = "\n=== {:30} ===\n"
search_latency_fmt = "search latency = {:.4f}s"
num_entities, dim = 3000, 8

#################################################################################
# 1. connect to Milvus
# Add a new connection alias `default` for Milvus server in `localhost:19530`
# Actually the "default" alias is a buildin in PyMilvus.
# If the address of Milvus is the same as `localhost:19530`, you can omit all
# parameters and call the method as: `connections.connect()`.
#
# Note: the `using` parameter of the following methods is default to "default".
print(fmt.format("start connecting to Milvus"))
connections.connect("default", host="localhost", port="19530")

has = utility.has_collection("hello_milvus")
print(f"Does collection hello_milvus exist in Milvus: {has}")

#################################################################################
# 2. create collection
# We're going to create a collection with 3 fields.
# +-+------------+------------+------------------+------------------------------+
# | | field name | field type | other attributes | field description |
# +-+------------+------------+------------------+------------------------------+
# |1| "pk" | VarChar | is_primary=True | "primary field" |
# | | | | auto_id=False | |
# +-+------------+------------+------------------+------------------------------+
# |2| "random" | Double | | "a double field" |
# +-+------------+------------+------------------+------------------------------+
# |3|"embeddings"| FloatVector| dim=8 | "float vector with dim 8" |
# +-+------------+------------+------------------+------------------------------+
fields = [
FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
FieldSchema(name="random", dtype=DataType.DOUBLE),
FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]

schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")

print(fmt.format("Create collection `hello_milvus`"))
hello_milvus = Collection("hello_milvus", schema, consistency_level="Strong")

################################################################################
# 3. insert data
# We are going to insert 3000 rows of data into `hello_milvus`
# Data to be inserted must be organized in fields.
#
# The insert() method returns:
# - either automatically generated primary keys by Milvus if auto_id=True in the schema;
# - or the existing primary key field from the entities if auto_id=False in the schema.

print(fmt.format("Start inserting entities"))
rng = np.random.default_rng(seed=19530)
entities = [
# provide the pk field because `auto_id` is set to False
[str(i) for i in range(num_entities)],
rng.random(num_entities).tolist(), # field random, only supports list
rng.random((num_entities, dim)), # field embeddings, supports numpy.ndarray and list
]

insert_result = hello_milvus.insert(entities)

hello_milvus.flush()
print(f"Number of entities in Milvus: {hello_milvus.num_entities}") # check the num_entities

################################################################################
# 4. create index
# We are going to create an IVF_FLAT index for hello_milvus collection.
# create_index() can only be applied to `FloatVector` and `BinaryVector` fields.
print(fmt.format("Start Creating index IVF_FLAT"))
index = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128},
}

hello_milvus.create_index("embeddings", index)

################################################################################
# 5. search, query, and hybrid search
# After data were inserted into Milvus and indexed, you can perform:
# - search based on vector similarity
# - query based on scalar filtering(boolean, int, etc.)
# - hybrid search based on vector similarity and scalar filtering.
#

# Before conducting a search or a query, you need to load the data in `hello_milvus` into memory.
print(fmt.format("Start loading"))
hello_milvus.load()

# -----------------------------------------------------------------------------
# search based on vector similarity
print(fmt.format("Start searching based on vector similarity"))
vectors_to_search = entities[-1][-2:]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 10},
}

start_time = time.time()
result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, output_fields=["random"])
end_time = time.time()

for hits in result:
for hit in hits:
print(f"hit: {hit}, random field: {hit.entity.get('random')}")
print(search_latency_fmt.format(end_time - start_time))

# -----------------------------------------------------------------------------
# query based on scalar filtering(boolean, int, etc.)
print(fmt.format("Start querying with `random > 0.5`"))

start_time = time.time()
result = hello_milvus.query(expr="random > 0.5", output_fields=["random", "embeddings"])
end_time = time.time()

print(f"query result:\n-{result[0]}")
print(search_latency_fmt.format(end_time - start_time))

# -----------------------------------------------------------------------------
# pagination
r1 = hello_milvus.query(expr="random > 0.5", limit=4, output_fields=["random"])
r2 = hello_milvus.query(expr="random > 0.5", offset=1, limit=3, output_fields=["random"])
print(f"query pagination(limit=4):\n\t{r1}")
print(f"query pagination(offset=1, limit=3):\n\t{r2}")

# -----------------------------------------------------------------------------
# hybrid search
print(fmt.format("Start hybrid searching with `random > 0.5`"))

start_time = time.time()
result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, expr="random > 0.5",
output_fields=["random"])
end_time = time.time()

for hits in result:
for hit in hits:
print(f"hit: {hit}, random field: {hit.entity.get('random')}")
print(search_latency_fmt.format(end_time - start_time))

###############################################################################
# 6. delete entities by PK
# You can delete entities by their PK values using boolean expressions.
ids = insert_result.primary_keys

expr = f'pk in ["{ids[0]}" , "{ids[1]}"]'
print(fmt.format(f"Start deleting with expr `{expr}`"))

result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"])
print(f"query before delete by expr=`{expr}` -> result: \n-{result[0]}\n-{result[1]}\n")

hello_milvus.delete(expr)

result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"])
print(f"query after delete by expr=`{expr}` -> result: {result}\n")

###############################################################################
# 7. drop collection
# Finally, drop the hello_milvus collection
print(fmt.format("Drop collection `hello_milvus`"))
utility.drop_collection("hello_milvus")

升级版

现在,让我们来看一下使用LangChain版本的代码。由于我们使用的是封装好的Milvus,所以我们需要一个嵌入模型。在这里,我们选择了HuggingFaceEmbeddings中的sensenova/piccolo-base-zh模型作为示例,当然你也可以选择其他模型,这里没有限制。只要能将其作为一个变量传递给LangChain定义的函数调用即可。

下面是一个简单的示例,包括数据库连接、插入数据、查询以及得分情况的定义:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
python复制代码from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus


model_name = "sensenova/piccolo-base-zh"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

print("链接数据库")
vector_db = Milvus(
embeddings,
connection_args={"host": "localhost", "port": "19530"},
collection_name="hello_milvus",
)
print("简单传入几个值")
vector_db.add_texts(["12345678","789","努力的小雨是一个知名博主,其名下有公众号【灵墨AI探索室】,博客:稀土掘金、博客园、51CTO及腾讯云等","你好啊","我不好"])

print("查询前3个最相似的结果")
docs = vector_db.similarity_search_with_score("你好啊",3)

print("查看其得分情况,分值越低越接近")
for text in docs:
print('文本:%s,得分:%s'%(text[0].page_content,text[1]))

image

注意,以上代码只是一个简单示例,具体的实现可能会根据你的具体需求进行调整和优化。

在langchain版本的代码中,如果你想要执行除了自己需要开启docker中的milvus容器之外的操作,还需要确保你拥有网络代理。这里不多赘述,因为HuggingFace社区并不在国内。

个人定制版

接下来,我们将详细了解如何调用openai模型来回答问题!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
python复制代码from dotenv import load_dotenv
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate;
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import BaseOutputParser

# 加载env环境变量里的key值
load_dotenv()
# 格式化输出
class CommaSeparatedListOutputParser(BaseOutputParser):
"""Parse the output of an LLM call to a comma-separated list."""

def parse(self, text: str):
"""Parse the output of an LLM call."""
return text.strip().split(", ")
# 先从数据库查询问题解
docs = vector_db.similarity_search("努力的小雨是谁?")
doc = docs[0].page_content

chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
template = "请根据我提供的资料回答问题,资料: {input_docs}"
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

# chat_prompt.format_messages(input_docs=doc, text="努力的小雨是谁?")
chain = LLMChain(
llm=chat,
prompt=chat_prompt,
output_parser=CommaSeparatedListOutputParser()
)
chain.run(input_docs=doc, text="努力的小雨是谁?")

当你成功运行完代码后,你将会得到你所期望的答案。如下图所示,这些答案将会展示在你的屏幕上。不然,如果系统不知道这些问题的答案,那它又如何能够给出正确的回答呢?

image

总结

通过本系列文章的学习,我们已经对个人或企业知识库有了一定的了解。尽管OpenAI已经提供了私有知识库的部署选项,但是其高昂的成本对于一些企业来说可能是难以承受的。无论将来国内企业是否会提供个人或企业知识库的解决方案,我们都需要对其原理有一些了解。无论我们的预算多少,都可以找到适合自己的玩法,因为不同预算的玩法也会有所不同。

本文转载自: 掘金

开发者博客 – 和开发相关的 这里全都有

0%