Demystifying Social Media for Data Scientists
For the content creator yearning to translate their expertise into captivating social media content, the landscape can feel daunting. Add the complex and quickly evolving field of data science and Artificial Intelligence to the mix, and data scientists striving to build their own personal brands often struggle to be heard. Countless hours spent crafting insightful blog posts often vanish into the echo chamber of online algorithms. But, being data scientists, we have some of our own tricks that we can leverage to help tackle this challenging content creation landscape! With recent advances in AI, a powerful new tool emerges – the RAG system.
RAGs, Not Rags
RAGs are the hot new thing in generative AI, and like cleaning rags, they help tidy up generative AI's understanding of information it did not have access to when it was trained. RAG stands for retrieval-augmented generation, and the idea is to use current or proprietary information to add more relevant context to generative AI prompts.
In this way, RAGs not only clean up what GenAI knows but also refine and polish the content creation cycle, ensuring the output stays relevant.
But the analogy stops there. Unlike actual cleaning rags, RAG systems do more than sanitize GenAI output: they guide GenAI toward more accurate responses. And because they can be updated as new information becomes available, RAG systems let GenAI continuously learn and improve. Thus, a RAG system is more a collaborator with GenAI than a way of scrubbing dirty content. Imagine a system that acts as your digital curator and creative collaborator.
To demonstrate the power of these systems, I developed a RAG project that uses my existing blog posts and a curated RSS feed of trending Data Science topics to create social media posts that both bolster my own online work and add relevance to current trends in data science. Herein, my RAG system becomes my genie in a digital lamp, conjuring up engaging social media content tailored to resonate with my audience.
The magic lies in the interplay of two intelligent models:
- The Retrieval Model: Think of it as a digital Dewey Decimal System. This model meticulously combs through my blog posts, extracting keywords and identifying the essence of my insights.
- The Generative Model: A powerful wordsmith, trained on the art of social media engagement. This model, armed with the retrieved knowledge and its understanding of current trends, weaves my expertise into captivating narratives for various platforms.
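The interplay of the two models can be sketched in a few lines of Python. This is a conceptual outline rather than my actual implementation: `retrieve_similar` and `generate_post` are hypothetical stand-ins for a real vector-database search and a real generative AI call.

```python
from typing import Callable, List


def rag_social_post(
    trend_title: str,
    retrieve_similar: Callable[[str], List[str]],
    generate_post: Callable[[str], str],
) -> str:
    """Sketch of the retrieve-then-generate loop behind a social media RAG.

    retrieve_similar: stand-in for a vector-database similarity search
    generate_post:    stand-in for a generative AI completion call
    """
    # 1. Retrieval model: find my own writing related to the trending topic
    snippets = retrieve_similar(trend_title)

    # 2. Generative model: weave the retrieved snippets into a post
    prompt = (
        f"Trending topic: {trend_title}\n"
        f"My related writing: {' '.join(snippets)}\n"
        "Write an engaging social media post connecting the two."
    )
    return generate_post(prompt)
```

The key design point is that the generative model never sees my whole corpus; it only sees the few snippets the retrieval model judged relevant, which keeps the prompt small and on-topic.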
My Social Media RAG Project
Here is the general project outline used to develop my social media RAG system:
1. Read my online article text files into Python strings
2. Generate embeddings for chunks of content and store them in a vector database
3. Pull current data from a data science RSS feed
4. Leverage generative AI to summarize the RSS feed into current trends in data science
5. Grab an article title from the RSS feed and match it against a similar snippet of my own writing using a vector database search
6. Combine the outputs of steps 4 and 5 into a social media prompt for generative AI
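To give a flavor of the RSS step in the outline above, here is a minimal sketch of pulling article titles out of a feed using only the standard library. In the real pipeline the XML would come from an HTTP request to a data science feed; here it is passed in as a string, and the function name is my own illustrative choice.

```python
import xml.etree.ElementTree as ET
from typing import List


def rss_titles(rss_xml: str) -> List[str]:
    """Extract item titles from an RSS 2.0 feed document.

    RSS 2.0 nests <item> elements under <channel>; each item's <title>
    is the article headline we want to match against my own writing.
    """
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]
```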
Environment Details
- Windows 11
- Python==3.11
Step 1: Read Data
When I was first starting out in the content creation space, someone sang a useful jingle to me that went something like this:
"Everything is content…" (sang to the tune of "Everything is Awesome")
And this tune stuck with me. As a result, I made sure to develop an organized repository of every piece of online content I ever created. For this project, that meant I had a directory full of text documents containing all the content I have ever published on Medium (roughly 64 articles to date and counting).
Thus, the first step in my RAG system required reading these text files into memory and performing a very simple cleaning step. Here's the code:
from glob import glob
from typing import List, Tuple

def load_user_files(directory: str) -> Tuple[List[str], List[str]]:
    """Read every .txt file one level below directory; return texts and paths."""
    files = glob(directory + "/*/*.txt")
    texts = []
    for file in files:
        with open(file, "r", errors="ignore") as f:
            texts.append(f.read())
    return texts, files

def clean_texts(texts: List[str]) -> List[str]:
    processed_texts = []
    for text in texts:
        # Use appropriate cleaning steps here (e.g., punctuation removal, lowercase)
        processed_text = text.lower()
        processed_texts.append(processed_text)
    return processed_texts

# Read text and return text with file paths:
user_corpus, files = load_user_files("./medium/Published")
user_corpus = clean_texts(user_corpus)
Step 2: Chunk Vectors
Once the files were in memory, the next step was to chunk the content, generate chunk embeddings, and store them in a vector database. Embedding is the process of converting text into numerical representations that capture the meaning of the text in a machine-readable format.
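To make the chunk-and-embed idea concrete, here is a toy sketch. A real system would use a trained embedding model to produce dense semantic vectors; the bag-of-words vector below is only an illustrative stand-in, and both function names are hypothetical.

```python
from collections import Counter
from typing import List


def chunk_text(text: str, chunk_size: int = 50) -> List[str]:
    """Split text into chunks of roughly chunk_size words each."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def toy_embedding(chunk: str, vocabulary: List[str]) -> List[float]:
    """Bag-of-words counts over a fixed vocabulary.

    A stand-in for a real embedding model, which would instead map the
    chunk to a dense vector capturing its semantic meaning.
    """
    counts = Counter(chunk.lower().split())
    return [float(counts[word]) for word in vocabulary]
```

Each chunk becomes one vector, and it is these per-chunk vectors (not whole articles) that get stored in the vector database, so that a search can land on the most relevant passage inside an article.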
With the rise and success of RAG systems in advancing the capabilities of generative AI, several vector database solutions have come to market. These databases are optimized to store vector embeddings in a way that allows highly efficient search over the documents stored as embeddings. The general idea is that these databases use vector similarity metrics such as cosine similarity, often accelerated by libraries like FAISS, to efficiently match one text with another.
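Cosine similarity itself is straightforward to compute; a minimal version is shown below. Vector databases implement far more efficient approximate versions of this comparison at scale, but the underlying idea is just this:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors.

    Returns 1.0 for vectors pointing the same direction, 0.0 for
    orthogonal (unrelated) vectors; zero vectors are treated as 0.0.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```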
In this case, we want to match the titles of currently trending data science articles with similar content that I have written in the past. To accomplish this, I first needed a way to generate chunks and their embeddings and store them in my own search-optimized vector database.
Because the scope of my project is small and ideal for my own personal use, I needed a lightweight vector database that was as easy to implement as possible from my local setup. Enter, VectorDB (original, I know