Developing an AI-powered search engine in Python involves multiple steps, from collecting and preprocessing data to building a similarity model and creating a search interface. Below is a high-level overview with example code snippets for each step.
1. Data Collection
You need a dataset of documents to search through. For simplicity, let’s assume you have a collection of text documents.
2. Preprocessing
Preprocessing includes cleaning the text, tokenizing, removing stop words, and stemming/lemmatizing.
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize and lowercase
    tokens = word_tokenize(text.lower())
    # Keep alphanumeric tokens that are not stopwords
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    tokens = [ps.stem(word) for word in tokens]
    return ' '.join(tokens)

# Example usage
documents = ["This is the first document.", "This document is the second document.", "And this is the third one."]
preprocessed_documents = [preprocess_text(doc) for doc in documents]
print(preprocessed_documents)
```
3. Vectorization
Convert text data into numerical vectors. TF-IDF (Term Frequency-Inverse Document Frequency) is commonly used for this purpose.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)
print(tfidf_matrix)
```
4. Model Building
Rank documents by how similar they are to the query. Cosine similarity between TF-IDF vectors is a popular choice for this.
```python
from sklearn.metrics.pairwise import cosine_similarity

def search(query, tfidf_matrix, vectorizer):
    query_vector = vectorizer.transform([preprocess_text(query)])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    # Indices of the top 5 most similar documents, best match first
    related_docs_indices = cosine_similarities.argsort()[:-6:-1]
    return related_docs_indices

# Example usage
query = "first document"
related_docs_indices = search(query, tfidf_matrix, vectorizer)
print(related_docs_indices)
```
5. User Interface
Create a simple interface to enter search queries and display results. For simplicity, let’s use a command-line interface.
```python
def main():
    while True:
        query = input("Enter your search query (or 'exit' to quit): ")
        if query.lower() == 'exit':
            break
        related_docs_indices = search(query, tfidf_matrix, vectorizer)
        print("Top documents:")
        for idx in related_docs_indices:
            print(f"Document {idx + 1}: {documents[idx]}")

if __name__ == "__main__":
    main()
```
Putting It All Together
Here’s the complete code combining all the steps above:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize and lowercase
    tokens = word_tokenize(text.lower())
    # Keep alphanumeric tokens that are not stopwords
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    tokens = [ps.stem(word) for word in tokens]
    return ' '.join(tokens)

def search(query, tfidf_matrix, vectorizer):
    query_vector = vectorizer.transform([preprocess_text(query)])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    # Indices of the top 5 most similar documents, best match first
    related_docs_indices = cosine_similarities.argsort()[:-6:-1]
    return related_docs_indices

def main():
    documents = [
        "This is the first document.",
        "This document is the second document.",
        "And this is the third one."
    ]
    preprocessed_documents = [preprocess_text(doc) for doc in documents]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)

    while True:
        query = input("Enter your search query (or 'exit' to quit): ")
        if query.lower() == 'exit':
            break
        related_docs_indices = search(query, tfidf_matrix, vectorizer)
        print("Top documents:")
        for idx in related_docs_indices:
            print(f"Document {idx + 1}: {documents[idx]}")

if __name__ == "__main__":
    main()
```
Advanced Considerations
- Scalability: For larger datasets, consider using more advanced vectorization techniques like word embeddings (e.g., Word2Vec, GloVe) or sentence embeddings (e.g., BERT, Sentence-BERT).
- Search Engine Features: Implement features such as autocomplete, spell check, and query expansion to improve the user experience.
- Web Interface: Develop a web-based interface using frameworks like Flask or Django for a more user-friendly experience.
- Ranking Algorithms: Explore advanced ranking algorithms to improve the relevance of search results.
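As one sketch of the "Search Engine Features" point above, a basic prefix-based autocomplete can be built over the indexed vocabulary with nothing but the standard library. The `autocomplete` helper here is illustrative, not from any library; the commented line shows how it might plug into the fitted vectorizer via scikit-learn's `get_feature_names_out()`.

```python
from bisect import bisect_left

def autocomplete(prefix, vocabulary, limit=5):
    """Return up to `limit` vocabulary terms starting with `prefix` (case-insensitive)."""
    prefix = prefix.lower()
    terms = sorted(vocabulary)
    # Binary-search for the first term >= prefix, then collect consecutive matches
    start = bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches

# Example: suggest completions from the TF-IDF vocabulary
# suggestions = autocomplete("doc", vectorizer.get_feature_names_out())
```

Because the terms are sorted, all matches for a prefix are contiguous, so the lookup stays fast even for large vocabularies; a production system would precompute the sorted list or use a trie instead of re-sorting per call.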
This example provides a foundational understanding of creating a simple AI-powered search engine in Python. You can expand and customize it based on your specific requirements and dataset.