Chatbot Note
Flow
predict
The PDF is converted to a DataFrame.
The DataFrame is expanded.
The expanded DataFrame is sent to the best-score calculator, which computes the similarity between the query and each paragraph. It includes a vectorizer (which calls the TF-IDF or BM25 transformer) that generates the score matrix. The best n matching contexts for the query are taken from that matrix.
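As a rough illustration of the scoring step, the sketch below ranks paragraphs against a query with a plain TF-IDF sum in pure Python. It is not the vectorizer used in these notes: `tokenize`, `tfidf_scores`, and `best_n` are hypothetical helpers, and the second and third sample paragraphs are made up for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()

def tfidf_scores(query, paragraphs):
    """Score each paragraph as the sum of tf * idf over the query terms."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    # Document frequency: number of paragraphs containing each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            if df[term]:
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
                score += doc[term] * idf
        scores.append(score)
    return scores

def best_n(query, paragraphs, n=2):
    """Return (index, score) pairs for the n highest-scoring paragraphs."""
    scores = tfidf_scores(query, paragraphs)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

paragraphs = [
    "Group total revenue decreased by 16% to $8,415 million (2017/18: $9,988 million)",
    "Board recommends a final dividend",      # made-up sample
    "Mobile service revenue was stable",      # made-up sample
]
scores = tfidf_scores("What is the revenue of smartone", paragraphs)
top = best_n("What is the revenue of smartone", paragraphs, n=2)
```

Only the paragraphs that mention "revenue" get a non-zero score here, so they are the two returned by `best_n`. A BM25 transformer would add term-frequency saturation and document-length normalisation on top of this idea.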
generate_squad_examples output:
[{'title': '2018_2019_annual',
  'paragraphs': [{'context': 'Group total revenue decreased by 16% to $8,415 million (2017/18: $9,988 million)',
                  'qas': [{'answers': [],
                           'question': 'What is the revenue of smartone',
                           'id': '107ffb72-042a-4918-882c-6ee7ae601292',
                           'retriever_score': array([9.10097739])}]}]},
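A structure of this shape could be assembled roughly as follows. This is a minimal sketch, not the actual generate_squad_examples code: `build_squad_examples` and its `(title, context, score)` input tuples are assumptions, and the score is kept as a plain float rather than a NumPy array.

```python
import uuid

def build_squad_examples(question, best_paragraphs):
    """Wrap a question and retrieved paragraphs in a SQuAD-like structure.

    best_paragraphs: iterable of (title, context, retriever_score) tuples.
    """
    by_title = {}
    for title, context, score in best_paragraphs:
        qa = {
            "answers": [],              # no gold answers at prediction time
            "question": question,
            "id": str(uuid.uuid4()),    # fresh id per question/context pair
            "retriever_score": score,
        }
        # Group paragraphs that come from the same document title.
        by_title.setdefault(title, []).append({"context": context, "qas": [qa]})
    return [{"title": t, "paragraphs": ps} for t, ps in by_title.items()]

examples = build_squad_examples(
    "What is the revenue of smartone",
    [("2018_2019_annual", "Group total revenue decreased by 16%", 9.1)],
)
```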
processor = BertProcessor(is_training=False)
examples, features = processor.fit_transform(X=squad_examples)
reader = joblib.load('./models1/bert_qa.joblib')
prediction = reader.predict(X=(examples, features))
train
Download the SQuAD dataset.
import joblib
import torch
train_processor = BertProcessor(bert_model="bert-base-uncased")
train_examples, train_features = train_processor.fit_transform(X='./data/SQuAD_1.1/train-v1.1.json')
reader = BertQA(train_batch_size=12, bert_model="bert-base-uncased", learning_rate=3e-5, num_train_epochs=2, do_lower_case=True, output_dir='models', verbose_logging=True)
reader.fit(X=(train_examples, train_features))
reader.model.to('cuda')
reader.device = torch.device('cuda')
joblib.dump(reader, '/content/gdrive/My Drive/bert_qa_cuda.joblib')
- Scraper
  - financial annual reports
- Converter
  - Converts PDFs to a DataFrame with the columns title and paragraphs.
- text_transformer (BM25)
  - BM25Vectorizer
    - Converts a collection of raw documents to a matrix of BM25 features and computes the documents' scores for a query.
- Retriever (BM25)
  - Fits a matrix of BM25 statistics on a corpus of documents, then finds the N documents most similar to an input query by computing each document's BM25 score for that query. The higher the BM25 score, the more relevant the document.
- Squader/Transformer (BertProcessor)
  - Converts SQuAD examples to the BertQA input format.
- QAer (BertQA)
  - Trains a BertForQuestionAnswering model.
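To make the Retriever (BM25) description concrete, here is a small pure-Python sketch of Okapi BM25 with a fit/predict shape. It is an illustration under assumptions, not the project's BM25Vectorizer or Retriever: the class name `TinyBM25Retriever`, the whitespace tokenisation, the defaults `k1=1.5` and `b=0.75`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

class TinyBM25Retriever:
    """Hypothetical minimal BM25 retriever (not the project's class)."""

    def __init__(self, k1=1.5, b=0.75, top_n=2):
        self.k1, self.b, self.top_n = k1, b, top_n

    def fit(self, documents):
        # Term counts and lengths per document (assumes a non-empty corpus).
        self.docs = [Counter(d.lower().split()) for d in documents]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avgdl = sum(self.lengths) / len(self.lengths)
        n = len(self.docs)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for d in self.docs for term in d)
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1) for t, c in df.items()}
        return self

    def score(self, query):
        terms = query.lower().split()
        scores = []
        for doc, length in zip(self.docs, self.lengths):
            # Length normalisation: longer-than-average documents are penalised.
            norm = self.k1 * (1 - self.b + self.b * length / self.avgdl)
            s = 0.0
            for t in terms:
                tf = doc.get(t, 0)
                if tf:
                    s += self.idf[t] * tf * (self.k1 + 1) / (tf + norm)
            scores.append(s)
        return scores

    def predict(self, query):
        scores = self.score(query)
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [(i, scores[i]) for i in ranked[: self.top_n]]

corpus = ["revenue revenue growth", "dividend policy", "revenue outlook"]  # made-up
retriever = TinyBM25Retriever(top_n=2).fit(corpus)
all_scores = retriever.score("revenue")
best = retriever.predict("revenue")
```

The document that repeats the query term ranks first, and the one without it scores zero, which matches the rule that a higher BM25 score means a more relevant document.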
Functions
- Scraper
- Converter
  - pdf_converter: converts PDFs to a DataFrame with the columns title and paragraphs.
  - df2squad: converts a pandas DataFrame with columns ['title', 'paragraphs'] to a JSON file in SQuAD format.
  - generate_squad_examples:
- text_transformer (BM25)
  - BM25Vectorizer
    - Converts a collection of raw documents to a matrix of BM25 features and computes the documents' scores for a query.
- Retriever
  - get_best_idx_scores_bm25:
  - expand_paragraphs:
- Squader
  - BertProcessor (class): converts SQuAD examples to the BertQA input format.
- QAer (BertQA)
  - Trains a BertForQuestionAnswering model.
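The df2squad conversion listed above can be pictured with the sketch below. It is an assumption-laden stand-in, not the real function: `rows_to_squad` is a hypothetical name, a list of dicts replaces the pandas DataFrame so that the example needs only the standard library, and the second sample paragraph is made up.

```python
import json

def rows_to_squad(rows, version="v1.1"):
    """Turn rows with 'title' and 'paragraphs' into a SQuAD-style dict."""
    data = []
    for row in rows:
        # Each paragraph becomes a context with an empty question list.
        paragraphs = [{"context": p, "qas": []} for p in row["paragraphs"]]
        data.append({"title": row["title"], "paragraphs": paragraphs})
    return {"version": version, "data": data}

rows = [
    {
        "title": "2018_2019_annual",
        "paragraphs": [
            "Group total revenue decreased by 16%",
            "Mobile service revenue was stable",  # made-up sample
        ],
    }
]
squad = rows_to_squad(rows)
squad_json = json.dumps(squad, indent=2)  # text that could be written to a .json file
```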