Chatbot Note


Flow

predict

The PDF is converted to a dataframe.

The dataframe is expanded.

The expanded dataframe is sent to the best-score calculator, which computes the similarity between the query and each paragraph. It includes a vectorizer (which calls the TF-IDF or BM25 transformer) that generates the score matrix. The score matrix is used to pick the best n matching contexts for the query.
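The scoring step above can be illustrated with a small, dependency-free sketch. It is not the pipeline's actual vectorizer: the whitespace tokenizer, the smoothed IDF formula, and the names `tfidf_vectors`, `cosine` and `best_n` are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real vectorizers do more).
    return text.lower().split()


def tfidf_vectors(paragraphs):
    """Return one sparse TF-IDF vector (a dict) per paragraph, plus the IDF table."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    # Document frequency: number of paragraphs containing each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF, so terms present in every paragraph keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_n(query, paragraphs, n=2):
    """Return (index, score) pairs for the n paragraphs most similar to the query."""
    vectors, idf = tfidf_vectors(paragraphs)
    q_counts = Counter(tokenize(query))
    # Query terms that never occur in the corpus are dropped.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scores = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


paragraphs = [
    "Group total revenue decreased by 16%",
    "Board recommends a final dividend",
    "Mobile service revenue was stable",
]
top = best_n("What is the revenue", paragraphs, n=2)
for index, score in top:
    print(index, round(score, 3), paragraphs[index])
```

Only the two paragraphs that mention "revenue" receive a non-zero score here; the shorter one ranks first because cosine similarity normalizes by vector length.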

generate_squad_examples - [{'title': '2018_2019_annual', 'paragraphs': [{'context': 'Group total revenue decreased by 16% to $8,415 million (201718: $9,988 million)', 'qas': [{'answers': [], 'question': 'What is the revenue of smartone', 'id': '107ffb72-042a-4918-882c-6ee7ae601292', 'retriever_score': array([9.10097739])}]}]},
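A structure like the one above could be assembled roughly as follows. This is a sketch rather than the real generate_squad_examples: the function name `build_squad_examples`, its tuple-based input, and the use of a plain float for `retriever_score` (the output above shows a NumPy array) are assumptions for illustration.

```python
import uuid


def build_squad_examples(question, best_paragraphs):
    """Group retrieved paragraphs by title into SQuAD-style prediction examples.

    best_paragraphs is a list of (title, context, retriever_score) tuples.
    """
    by_title = {}
    for title, context, score in best_paragraphs:
        qa = {
            "answers": [],            # empty at prediction time
            "question": question,
            "id": str(uuid.uuid4()),  # unique id per question/paragraph pair
            "retriever_score": score,
        }
        by_title.setdefault(title, []).append({"context": context, "qas": [qa]})
    return [{"title": t, "paragraphs": p} for t, p in by_title.items()]


examples = build_squad_examples(
    "What is the revenue of smartone",
    [(
        "2018_2019_annual",
        "Group total revenue decreased by 16% to $8,415 million (201718: $9,988 million)",
        9.10097739,
    )],
)
print(examples[0]["paragraphs"][0]["qas"][0]["question"])
```

Each retrieved paragraph gets its own copy of the question with a fresh id, so the reader can later map every predicted answer back to the paragraph it came from.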

predict_processor = BertProcessor(is_training=False)

examples, features = predict_processor.fit_transform(X=squad_examples)

reader = joblib.load('./models1/bert_qa.joblib')

prediction = reader.predict(X=(examples, features))

train

Download the SQuAD dataset.

train_processor = BertProcessor(bert_model="bert-base-uncased")

train_examples, train_features = train_processor.fit_transform(X='./data/SQuAD_1.1/train-v1.1.json')

reader = BertQA(train_batch_size=12, bert_model="bert-base-uncased", learning_rate=3e-5, num_train_epochs=2, do_lower_case=True, output_dir='models', verbose_logging=True)

reader.fit(X=(train_examples, train_features))

reader.model.to('cuda')

reader.device = torch.device('cuda')

joblib.dump(reader, '/content/gdrive/My Drive/bert_qa_cuda.joblib')

Components

  1. Scraper
    • collects financial annual reports
  2. Converter
    • converts PDFs to a DataFrame with the columns title and paragraphs
  3. text_transformer (BM25)
  4. BM25Vectorizer
    • converts a collection of raw documents to a matrix of BM25 features and computes the score of each document for a query
  5. Retriever (BM25)
    • fits a matrix of BM25 statistics from a corpus of documents, then finds the N documents most similar to an input query by computing the BM25 score of each document against the query (the higher the BM25 score, the more relevant the document)
  6. Squader/Transformer (BertProcessor)
    • converts SQuAD examples to the BertQA input format
  7. QAer (BertQA)
    • trains a BertForQuestionAnswering model
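The BM25 scoring used by the vectorizer and retriever (items 4 and 5) can be sketched in plain Python. This is a minimal Okapi BM25 illustration, not the BM25Vectorizer itself: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the names `bm25_scores` and `top_n` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    docs = [doc.lower().split() for doc in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def top_n(query, documents, n=1):
    """Return (index, score) pairs of the n highest-scoring documents."""
    scores = bm25_scores(query, documents)
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


documents = [
    "revenue decreased this year",
    "dividend per share increased",
    "revenue revenue growth outlook",
]
best = top_n("revenue", documents, n=2)
print(best)
```

Because all three documents have the same length, the ranking here is driven only by term frequency: the document that mentions "revenue" twice scores highest, and the one that never mentions it scores zero.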

Functions

  1. Scraper
  2. Converter
    • pdf_converter: converts PDFs to a DataFrame with the columns title and paragraphs
    • df2squad: converts a pandas DataFrame with the columns ['title', 'paragraphs'] to a JSON file in SQuAD format
    • generate_squad_examples: builds SQuAD-format examples from the question and the best-matched paragraphs
  3. text_transformer (BM25)
  4. BM25Vectorizer
    • converts a collection of raw documents to a matrix of BM25 features and computes the score of each document for a query
  5. Retriever
    • get_best_idx_scores_bm25:
    • expand_paragraphs:
  6. Squader
    • BertProcessor (class): converts SQuAD examples to the BertQA input format
  7. QAer (BertQA)
    • trains a BertForQuestionAnswering model
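The df2squad conversion listed under Converter could look roughly like the sketch below. To stay dependency-free it takes a list of dicts as a stand-in for the DataFrame (with pandas, something like `df.to_dict("records")` would produce such rows); the function name `rows_to_squad` and the `version` field value are assumptions, not the original implementation.

```python
import json


def rows_to_squad(rows, version="v1.1"):
    """Convert rows with 'title' and 'paragraphs' keys to a SQuAD-style dict."""
    data = []
    for row in rows:
        # Each paragraph becomes a context with an empty question list,
        # ready to be annotated or filled in at prediction time.
        paragraphs = [{"context": text, "qas": []} for text in row["paragraphs"]]
        data.append({"title": row["title"], "paragraphs": paragraphs})
    return {"version": version, "data": data}


rows = [
    {"title": "2018_2019_annual", "paragraphs": ["First paragraph.", "Second paragraph."]},
]
squad = rows_to_squad(rows)
squad_json = json.dumps(squad, indent=2)  # text that could be written to a .json file
print(squad_json)
```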