Chatbot Note

2020-02-26 2 minutes read

NLP

#chatbot

Flow

predict

The PDF is converted to a DataFrame.

The DataFrame is expanded.
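A minimal sketch of what the expansion step might look like, assuming it means "one row per document becomes one row per paragraph". Plain lists of dicts stand in for the DataFrame, and `expand_to_paragraph_rows` is a hypothetical name, not the function used in the pipeline.

```python
def expand_to_paragraph_rows(rows):
    """Turn one row per document into one row per paragraph.

    `rows` is a list of dicts with 'title' and 'paragraphs' keys, standing in
    for DataFrame rows (an assumption, to keep the sketch dependency-free).
    """
    expanded = []
    for row in rows:
        for paragraph in row["paragraphs"]:
            expanded.append({"title": row["title"], "paragraph": paragraph})
    return expanded


docs = [
    {"title": "report_a", "paragraphs": ["First paragraph.", "Second paragraph."]},
    {"title": "report_b", "paragraphs": ["Only paragraph."]},
]
flat = expand_to_paragraph_rows(docs)
```

After this step every paragraph can be scored against the query on its own.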

The expanded DataFrame is sent to the best-score calculator, which computes the similarity between the query and each paragraph. It includes a vectorizer (which calls the TF-IDF or BM25 transformer) that generates a score matrix. From that matrix, the best n matching contexts for the query are selected.
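The scoring step can be illustrated with a small TF-IDF ranking written in plain Python. This is only a sketch of the idea (vectorize, score, keep the best n), not the vectorizer the pipeline actually calls. The helper names, the whitespace tokenizer and the smoothed IDF formula are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption: no stemming or stop words).
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per text, plus the IDF table."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so every known term gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_n(query, paragraphs, n=2):
    """Return (index, score) pairs for the n paragraphs most similar to the query."""
    vectors, idf = tfidf_vectors(paragraphs)
    # Query terms never seen in the corpus are ignored.
    query_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scores = [cosine(query_vec, vec) for vec in vectors]
    ranked = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


paragraphs = [
    "group total revenue decreased by 16%",
    "the board recommends a final dividend",
    "mobile service revenue was stable",
]
top = best_n("total revenue", paragraphs, n=2)
```

Here `top` holds the indices of the two best paragraphs with their scores, which is the information the later steps need to build the query and context pairs.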

generate_squad_examples - [{'title': '2018_2019_annual', 'paragraphs': [{'context': 'Group total revenue decreased by 16% to $8,415 million (2017/18: $9,988 million)', 'qas': [{'answers': [], 'question': 'What is the revenue of smartone', 'id': '107ffb72-042a-4918-882c-6ee7ae601292', 'retriever_score': array([9.10097739])}]}]},
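To make the shape of that structure easier to read, here is a rough sketch that builds a similar one from a question and a list of retrieved contexts. `build_squad_examples` is a hypothetical helper, not the real `generate_squad_examples`. It groups everything under a single title and stores the score as a plain float instead of an array, both of which are simplifications.

```python
import uuid


def build_squad_examples(question, retrieved, title):
    """Wrap retrieved contexts and one question in a SQuAD-like structure.

    `retrieved` is a list of (context, retriever_score) pairs. Hypothetical
    helper that only mirrors the shape shown in the note.
    """
    paragraphs = []
    for context, score in retrieved:
        qa = {
            "answers": [],  # empty at prediction time
            "question": question,
            "id": str(uuid.uuid4()),  # random unique id per question/context pair
            "retriever_score": score,
        }
        paragraphs.append({"context": context, "qas": [qa]})
    return [{"title": title, "paragraphs": paragraphs}]


examples = build_squad_examples(
    question="What is the revenue of smartone",
    retrieved=[("Group total revenue decreased by 16% to $8,415 million", 9.1)],
    title="2018_2019_annual",
)
```

Each retrieved context carries its own copy of the question, so the reader can try to answer the same question against every candidate paragraph.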

examples, features = bertProcessor(X=squad_examples, is_training=False)

import joblib
reader = joblib.load('./models1/bert_qa.joblib')

prediction = reader.predict(X=(examples, features))

train

Download the SQuAD dataset.

import joblib
import torch

train_processor = BertProcessor(bert_model='bert-base-uncased')
train_examples, train_features = train_processor.fit_transform(X='./data/SQuAD_1.1/train-v1.1.json')

reader = BertQA(train_batch_size=12, bert_model='bert-base-uncased', learning_rate=3e-5, num_train_epochs=2, do_lower_case=True, output_dir='models', verbose_logging=True)

reader.fit(X=(train_examples, train_features))

# move the trained model to the GPU before saving it
reader.model.to('cuda')
reader.device = torch.device('cuda')

joblib.dump(reader, '/content/gdrive/My Drive/bert_qa_cuda.joblib')

Scraper
- financial annual reports
Converter
- convert PDFs to a DataFrame with the columns title and paragraphs
text_transformer (BM25)
BM25Vectorizer
- convert a collection of raw documents to a matrix of BM25 features and compute the score of each document for a given query
Retriever (BM25)
- build a matrix of BM25 statistics from a corpus of documents, then find the N documents most similar to a given input query by computing the BM25 score of each document against the query (the higher the BM25 score, the more relevant the document)
Squader/Transformer (BertProcessor)
- convert SQuAD examples to BertQA input format
QAer (BertQA)
- train a BertForQuestionAnswering model
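The BM25 scoring used by the retriever above can be sketched in a few lines of plain Python. This follows the common Okapi BM25 formula with a non-negative IDF variant. The defaults `k1=1.5` and `b=0.75`, the whitespace tokenizer and the function name `bm25_scores` are assumptions for illustration and may differ from what BM25Vectorizer actually does.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # IDF variant with +1 inside the log, which keeps it non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Longer-than-average documents are penalised through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "group total revenue decreased by 16%",
    "the board recommends a final dividend",
    "mobile service revenue was stable",
]
scores = bm25_scores("revenue decreased", documents)
```

The first document matches both query terms and gets the highest score, the third matches only "revenue", and the second matches nothing and scores zero. Sorting by score and keeping the top N gives the retrieved contexts.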

Functions

Scraper
Converter
- pdf_converter: converts PDFs to a DataFrame with the columns title and paragraphs
- df2squad: converts a pandas DataFrame with the columns ['title', 'paragraphs'] to a JSON file in SQuAD format
- generate_squad_examples:
text_transformer (BM25)
BM25Vectorizer
- convert a collection of raw documents to a matrix of BM25 features and compute the score of each document for a given query
Retriever
- get_best_idx_scores_bm25:
- expand_paragraphs:
Squader
- BertProcessor (class): converts SQuAD examples to the BertQA input format
QAer (BertQA)
- train a BertForQuestionAnswering model
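As an illustration of the converter step listed above, this is roughly what turning title and paragraphs rows into a SQuAD-format JSON document could look like. It is a sketch under assumptions: `rows_to_squad` is a hypothetical stand-in for df2squad, plain dicts replace the DataFrame, and the `version` value is a placeholder.

```python
import json


def rows_to_squad(rows, version="v1.1"):
    """Build a SQuAD-style dict from rows with 'title' and 'paragraphs' keys."""
    data = []
    for row in rows:
        # Each paragraph becomes a context with no questions attached yet.
        paragraphs = [{"context": p, "qas": []} for p in row["paragraphs"]]
        data.append({"title": row["title"], "paragraphs": paragraphs})
    return {"version": version, "data": data}


rows = [
    {"title": "2018_2019_annual", "paragraphs": ["First paragraph.", "Second paragraph."]},
]
squad_json = json.dumps(rows_to_squad(rows), indent=2)
```

The `qas` lists start empty and can be filled in later with questions and answers when annotating the documents.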