This repo is a collection of various PoCs (Proof-of-Concepts) to interface custom data using LLMs.


yogeshhk/Sarvadnya

Sarvadnya (सर्वज्ञ), an All-Knowing Chatbot!!

Chatbots can be a real WoW!! The most recent evidence: ChatGPT. With the latest LLMs (Large Language Models), they are now far more human-like. But these LLMs are pretrained on their own (HUGE) data, and mere mortals don't have the means ($$, time, expertise) to train their own LLMs. RAG and/or fine-tuning is the way out for domain adaptation, i.e. getting LLMs to answer from your own corpus. This repo is a collection of various PoCs (Proof-of-Concepts) to interface custom data using LLMs.

A few other topics that are (or can be) part of this repo:

  • Indic-languages models, some notes here
  • 3D World Simulations, Agents, some notes here
  • Knowledge Graphs Generation, some notes here
  • Agents, some notes here
  • Drones, UAV Image Processing, Shynakshi here
  • Floor Plan Segmentation here

What?

PoCs Projects

  • Prep chatbots for various modalities, use cases, domains and datasets
  • Prep videos, write Medium posts (GDE/TH) and LinkedIn posts, run a YouTube channel

Modes

  • Retrieval Augmented Generation (RAG) on own data
  • Fine-tuning LLMs with own data using LoRA etc
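One reason LoRA suits a small budget is that it trains only two thin matrices per adapted weight instead of the full weight matrix. The sketch below compares trainable parameter counts for a single weight matrix. The layer sizes and rank are illustrative assumptions, not values taken from this repo or from any specific model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for a rank-`rank` LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not from any specific model).
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

With these assumed sizes the LoRA update trains well under 1% of the weights of the full matrix, which is where the savings in compute and memory come from.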

RAG

  • When?: {less, streaming, private} data and less {compute, money, expertise}
  • What?:
    • on knowledge graphs, more grounding
    • tabular financial data, representation and similarity
    • midcurveNN Geometric serialization and retrieval
    • active loop idea of fine-tuning your data
    • LangChain and LlamaIndex with any new LLM
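The core RAG loop (retrieve relevant chunks from your own data, then put them in the prompt) can be sketched without any framework. The example below is a minimal illustration using bag-of-words cosine similarity from the standard library. It stands in for the embedding model and vector store that LangChain or LlamaIndex would normally provide. The corpus, the function names and the prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Assemble a grounded prompt; sending it to an LLM is left out."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical private corpus, already split into chunks.
corpus = [
    "Invoices are due within 30 days of the billing date.",
    "The office cafeteria is open from 9 am to 5 pm.",
    "Late invoices incur a 2 percent penalty per month.",
]
prompt = build_prompt("When are invoices due?", corpus, k=2)
print(prompt)
```

In a real PoC the word-count vectors would likely be replaced by dense embeddings, but the retrieve-then-prompt shape stays the same.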

Fine-Tuning

  • When?: Sufficient curated data is available (not a whole lot, though), in a batch (not running) state

  • What?: Instead of unstructured text (input prompt) to unstructured text (output response), more value lies in prompt to structured output, such as:

    • text2json: useful for many enterprises, such as financial companies.
    • text2cypher: for graph databases such as Neo4j, like the LangChain implementation by Tomaz Bratanic
    • text2SQL: the classical case; many pro solutions are available, so study them and follow them for other query languages
    • text2Manim: Maths animation; a dataset is available; see if the generated video can be shown in the same Streamlit page
    • text23DJS: Good for 3D+LLM+Agents like Metamorph from Nvidia; geometry or shape representation as text is the key
    • textGraph2textGraph: MidcurveNN, if we can get the graph representation as text right.
  • Here, the key would be robust post-processing and evaluation, as the response needs to be near perfect, with no room for relaxation even in syntax or format.
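As a rough illustration of what strict post-processing could look like, the sketch below checks two kinds of structured output: JSON (parse it and require certain keys) and SQL (ask SQLite to plan the query against an empty copy of the schema, without running it). The table, the key names and both helper functions are hypothetical. A real evaluation would also compare query results or field values against ground truth.

```python
import json
import sqlite3


def parse_json_output(raw: str, required_keys: set[str]) -> dict:
    """Parse model output as a JSON object and check required keys."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad syntax
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def sql_is_valid(sql: str, schema_ddl: str) -> bool:
    """Return True if SQLite can plan `sql` against the given schema.

    EXPLAIN QUERY PLAN compiles the statement without executing it, so
    syntax errors and unknown tables or columns are caught cheaply.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


# Hypothetical schema and model outputs, for illustration only.
schema = "CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER);"
print(sql_is_valid("SELECT symbol, qty FROM trades WHERE qty > 5", schema))
print(sql_is_valid("SELEC symbol FROM trades", schema))
print(parse_json_output('{"symbol": "ABC", "qty": 10}', {"symbol", "qty"}))
```

The same pattern (parse or compile first, reject on any error) would apply to Cypher or Manim output, using whatever parser or dry-run mode the target tool offers.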

Tech Stacks

  • Enterprise: Google Doc AI, Vertex AI, Microsoft Azure Language AI Services
  • Open Source: Langchain (Serve/Smith/Graph), HuggingFace, Streamlit for UI

Bottom-line

  • Not looking for Success, but Wonder!!
  • तमसो मा ज्योतिर्गमय : From Dark (hidden in text data) to Light (insights)

Folks to Follow

Publications so far

References

Disclaimer:

The author (yogeshkulkarni@yahoo.com) gives no guarantee of the results of the program. It is just a fun script. A lot of improvements are still to be made. So, don't depend on it at all.
