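A wealth of knowledge is scattered across the platforms we interact with every day: Confluence wiki pages at work, Slack groups, company knowledge bases, Reddit, Stack Overflow, books, newsletters, and Google Docs shared by colleagues. Keeping up with all of these sources is a full-time job in itself.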


1. Providing data via prompt engineering

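Before we discuss how to extend ChatGPT, let's look at how it can be extended manually and what problems that approach has. The traditional way to extend ChatGPT is through prompt engineering.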


I will ask you questions based on the following content:
- Start of Content-
Your very long text to give ChatGPT context
- End of Content-
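
As a minimal sketch, the template above can be filled in programmatically. The function name `build_prompt` and the sample strings are illustrative assumptions and not part of any library.

```python
def build_prompt(content: str, question: str) -> str:
    """Wrap `content` in the Start/End markers and append the question."""
    return (
        "I will ask you questions based on the following content:\n"
        "- Start of Content-\n"
        f"{content}\n"
        "- End of Content-\n"
        f"\n{question}"
    )


# Hypothetical sample data, used only for illustration.
prompt = build_prompt(
    "LlamaIndex connects LLMs to external data.",
    "What does LlamaIndex do?",
)
print(prompt)
```

The weakness of this approach is visible in the signature: the whole `content` string travels with every request, so long documents quickly run into the model's prompt limit.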





Custom data sources feeding into ChatGPT


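LlamaIndex makes it easier to work within prompt limits when the context is too large (4,096 tokens for GPT-3 Davinci and about 8,000 tokens for GPT-4), and it addresses text splitting by giving users a way to interact with an index. LlamaIndex also abstracts away the process of extracting the relevant parts of your documents and feeding them to the prompt.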
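
To make the text-splitting idea concrete, here is a rough sketch of splitting a long text into overlapping chunks. It counts whitespace-separated words as a stand-in for tokens, which is an assumption: real tokenizers count differently, and this is not how LlamaIndex itself splits text. The name `split_into_chunks` is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk can then be sized to fit comfortably inside the prompt limit, leaving room for the question and the model's answer.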





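  • Python ≥ 3.7 installed on your machine
  • An OpenAI API key, which you can get from the OpenAI website. You can sign in with your Gmail account via single sign-on.
  • A few Word documents uploaded to your Google Docs. LlamaIndex supports many different data sources; in this tutorial we demonstrate Google Docs.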


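  1. Create an index of your document data with LlamaIndex.

  2. Search the index using natural language.

  3. LlamaIndex retrieves the relevant passages and passes them to the GPT prompt. LlamaIndex converts your raw document data into a vectorized index that is efficient to query. It uses this index to find the most relevant parts based on how closely the query matches the data. That information is then loaded into the prompt, which is sent to GPT so that GPT has the background it needs to answer your question.

  4. After that, you can ask ChatGPT questions based on the context that was fed in.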
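
The retrieval step can be illustrated with a toy sketch. It replaces real embeddings with simple word counts and ranks chunks by cosine similarity; this is a deliberately simplified stand-in, not LlamaIndex's actual implementation, and the names `embed`, `cosine`, and `retrieve` as well as the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], top_k: int = 1) -> List[str]:
    """Return the `top_k` chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical document chunks, used only for illustration.
chunks = [
    "The office is closed on public holidays.",
    "Expense reports are due on the first Monday of each month.",
    "The cafeteria serves lunch from noon to two.",
]
question = "When are expense reports due?"
context = retrieve(question, chunks, top_k=1)

# Only the retrieved chunk is placed in the prompt, not the whole corpus.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

Only the best-matching chunk reaches the prompt, which is what keeps the request within the token limit no matter how large the underlying document set grows.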



pip install openai
pip install llama-index
pip install google-auth-oauthlib

Next, we import the libraries in Python and set the OpenAI API key in a new main.py file.

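# Import necessary packages
import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from llama_index import GPTSimpleVectorIndex, download_loader

# Set your OpenAI API key (placeholder value; replace it with your own key)
os.environ["OPENAI_API_KEY"] = "YOUR-OPENAI-API-KEY"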



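def authorize_gdocs():
    # Read-only access to Google Docs is enough for indexing
    google_oauth2_scopes = [
        "https://www.googleapis.com/auth/documents.readonly"
    ]
    cred = None
    # Reuse cached credentials from a previous run if they exist
    if os.path.exists("token.pickle"):
        with open("token.pickle", 'rb') as token:
            cred = pickle.load(token)
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            # Expired but refreshable: refresh without opening a browser
            cred.refresh(Request())
        else:
            # No usable credentials: run the OAuth flow in the browser
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", google_oauth2_scopes)
            cred = flow.run_local_server(port=0)
        # Cache the credentials for the next run
        with open("token.pickle", 'wb') as token:
            pickle.dump(cred, token)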

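To enable the Google Docs API and obtain credentials in the Google console, follow these steps:

  1. Go to the Google Cloud Console website (console.cloud.google.com).
  2. Create a new project if you haven't already. You can do this by clicking the "Select a project" drop-down menu in the top navigation bar and choosing "New Project". Follow the prompts to name the project and select the organization to associate it with.
  3. Once the project is created, select it from the drop-down menu in the top navigation bar.
  4. From the left-hand menu, go to the "APIs & Services" section, then click the "+ENABLE APIs and Services" button at the top of the page.
  5. Search for "Google Docs API" in the search bar and select it from the list of results.
  6. Click the "Enable" button to enable the API for your project.
  7. Click the OAuth consent screen menu, create and name your app (for example, "mychatbot"), then enter a support email, save, and add scopes.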




Example folder structure with google credentials in root

Once the credentials are set up, you can access the Google Docs API from your Python project.


Gdoc ID

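Copy the gdoc IDs and paste them into the code below. You can index any number of gdocs, so that ChatGPT has contextual access to your custom knowledge base. We will use the GoogleDocsReader plugin from the LlamaIndex library to load your documents.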

# function to authorize or download latest credentials
authorize_gdocs()

# initialize LlamaIndex google doc reader 
GoogleDocsReader = download_loader('GoogleDocsReader')

# list of google docs we want to index 
gdoc_ids = ['1ofZ96nWEZYCJsteRfqik_xNQTGFHtnc-7cYrf0dMPKQ']

loader = GoogleDocsReader()

# load gdocs and index them 
documents = loader.load_data(document_ids=gdoc_ids)
index = GPTSimpleVectorIndex(documents)
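
The entries in `gdoc_ids` come from each document's URL. The sketch below pulls the ID out of a pasted link, assuming the usual `https://docs.google.com/document/d/<ID>/edit` URL shape; the helper name `extract_gdoc_id` is hypothetical and not part of LlamaIndex.

```python
import re
from typing import Optional

# Assumption: the ID is the path segment right after "/document/d/".
_DOC_ID_PATTERN = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_gdoc_id(url: str) -> Optional[str]:
    """Return the document ID embedded in a Google Docs URL, or None."""
    match = _DOC_ID_PATTERN.search(url)
    return match.group(1) if match else None


url = "https://docs.google.com/document/d/1ofZ96nWEZYCJsteRfqik_xNQTGFHtnc-7cYrf0dMPKQ/edit"
gdoc_id = extract_gdoc_id(url)
print(gdoc_id)
```

With a helper like this, a list of pasted links can be converted into the ID list that the loader expects.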



# Save your index to an index.json file
index.save_to_disk('index.json')
# Load the index from your saved index.json file
index = GPTSimpleVectorIndex.load_from_disk('index.json')
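
Saving the index matters because rebuilding it spends embedding tokens on every run. One possible pattern, sketched here under assumptions, is to load the saved file when it exists and build it only otherwise. The helper `load_or_build` is hypothetical, and the demo uses a plain JSON dictionary as a stand-in for the index so it runs without an API key.

```python
import json
import os
import tempfile
from typing import Any, Callable


def load_or_build(
    path: str,
    build: Callable[[], Any],
    save: Callable[[Any, str], None],
    load: Callable[[str], Any],
) -> Any:
    """Return the object saved at `path` if present; otherwise build and save it."""
    if os.path.exists(path):
        return load(path)
    obj = build()
    save(obj, path)
    return obj


# Stand-in build/save/load functions so the sketch runs without llama_index.
calls = []


def build_fake_index():
    calls.append("build")
    return {"docs": 1}


def save_json(obj, path):
    with open(path, "w") as fh:
        json.dump(obj, fh)


def load_json(path):
    with open(path) as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.json")
    first = load_or_build(index_path, build_fake_index, save_json, load_json)
    second = load_or_build(index_path, build_fake_index, save_json, load_json)

print(first, second, calls)
```

With the tutorial's objects, `build` would wrap `GPTSimpleVectorIndex(documents)` and `load` would be `GPTSimpleVectorIndex.load_from_disk`; treat that wiring as an assumption to check against the LlamaIndex version you have installed.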

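You can query the index and get a response by running the code below. The code can easily be extended into a REST API that connects to a UI, where you can interact with your custom data sources through a GPT interface.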

# Querying the index
while True:
    prompt = input("Type prompt...")
    response = index.query(prompt)
    print(response)
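
As a rough sketch of the REST extension, the query loop above could be exposed over HTTP using only the standard library. Everything here is an assumption for illustration: the JSON shape `{"prompt": "..."}`, the names `answer_request`, `make_handler`, and `serve`, and the port. A lambda stands in for `index.query` so the sketch runs without an index.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def answer_request(body: bytes, query_fn: Callable[[str], object]) -> Tuple[int, bytes]:
    """Turn a JSON body like {"prompt": "..."} into (HTTP status, JSON response)."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        payload = None
    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        error = {"error": "expected JSON with a non-empty 'prompt'"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"response": str(query_fn(prompt))}).encode()


def make_handler(query_fn: Callable[[str], object]):
    """Build a request handler class bound to the given query function."""

    class QueryHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, data = answer_request(self.rfile.read(length), query_fn)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return QueryHandler


def serve(query_fn: Callable[[str], object], port: int = 8000) -> None:
    """Blocking call; for example serve(index.query) once the index exists."""
    HTTPServer(("127.0.0.1", port), make_handler(query_fn)).serve_forever()


# Stand-in for index.query, so no index or API key is needed here.
status, data = answer_request(b'{"prompt": "hello"}', lambda p: "echo: " + p)
print(status, data.decode())
```

Keeping the request handling in a plain function makes it testable without starting a server; a production service would more likely use a web framework and add authentication.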


We will first interact directly with vanilla ChatGPT to see what output it generates without any custom data source injected.


INFO:google_auth_oauthlib.flow:"GET /?state=oz9XY8CE3LaLLsTxIz4sDgrHha4fEJ&code=4/0AWtgzh4LlIfmCMEa0t36dse_xoS0fXFeEWKHFiouzTvz4Qwr7T2Pj6anb-GiZ__Wg-hBBg&scope=https://www.googleapis.com/auth/documents.readonly HTTP/1.1" 200 65
INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 175 tokens
Type prompt...who is timothy mugayi hint he is a writer on medium

INFO:root:> [query] Total LLM token usage: 300 tokens
INFO:root:> [query] Total embedding token usage: 14 tokens
Timothy Mugayi is an Engineering Manager at OVO (PT Visionet Internasional), a subsidiary of GRAB. He is also an avid writer on medium.com who writes on technical topics covering python and freelancing side hustling for programmers. Timothy has been coding for over 15 years, building enterprise solutions for large cooperations. During his free time, he enjoys mentoring and coaching.
Type prompt...

Type prompt...Given you know who timothy mugayi is write an interesting introduction about him

Timothy Mugayi is an experienced and accomplished professional with a wealth of knowledge in engineering, coding, and mentoring. He is currently an Engineering Manager at OVO, a subsidiary of GRAB, and has been coding for over 15 years, building enterprise solutions for large cooperations. In his free time, Timothy enjoys writing on technical topics such as Python and freelancing side hustling for programmers on medium.com, as well as mentoring and coaching. With his impressive background and expertise, Timothy is a valuable asset to any organization.



Type prompt...Write a cover letter for timothy mugayi for an upwork python project to build a custom ChatGPT bot with access to external data sources
INFO:root:> [query] Total LLM token usage: 436 tokens
INFO:root:> [query] Total embedding token usage: 30 tokens

Dear [Hiring Manager],

I am writing to apply for the Python project to build a custom ChatGPT bot with access to external data sources. With over 15 years of experience in coding and building enterprise solutions for large corporations, I am confident that I am the ideal candidate for this position.

I am currently an Engineering Manager at OVO (PT Visionet Internasional), a subsidiary of GRAB. I have extensive experience in Python and have been writing on technical topics covering Python and freelancing side hustling for programmers on medium.com. I am also an avid mentor and coach, and I believe that my experience and skillset make me the perfect candidate for this project.

I am confident that I can deliver a high-quality product that meets the requirements of the project. I am also available to discuss the project further and answer any questions you may have.

Thank you for your time and consideration.

Timothy Mugayi
Type prompt...



from langchain import OpenAI
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper


# define another LLM explicitly
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# define prompt configuration
# set maximum input size
max_input_size = 4096
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
)


last_token_usage = index.llm_predictor.last_token_usage
print(f"last_token_usage={last_token_usage}")
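
Token usage also appears in the log lines printed during indexing and querying. As a small sketch, those lines can be summed to track spend across a session. The regular expression assumes the exact log format shown in the console output earlier in this article, and the function name `total_token_usage` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumption: log lines look like
# "INFO:root:> [query] Total LLM token usage: 300 tokens"
_USAGE_PATTERN = re.compile(r"Total (LLM|embedding) token usage: (\d+) tokens")


def total_token_usage(log_lines: Iterable[str]) -> Dict[str, int]:
    """Sum LLM and embedding token counts found in the given log lines."""
    totals = defaultdict(int)
    for line in log_lines:
        match = _USAGE_PATTERN.search(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)


# Log lines copied from the console output shown earlier.
log_lines = [
    "INFO:root:> [query] Total LLM token usage: 300 tokens",
    "INFO:root:> [query] Total embedding token usage: 14 tokens",
    "INFO:root:> [query] Total LLM token usage: 436 tokens",
    "INFO:root:> [query] Total embedding token usage: 30 tokens",
]
usage = total_token_usage(log_lines)
print(usage)
```

Multiplying these totals by the current per-token prices for your model gives a rough cost estimate for a session.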





