Update README.md

Saifeddine ALOUI 2024-12-03 08:49:27 +01:00 committed by GitHub
parent fee9adfcfb
commit 004fdc8941

@@ -1,157 +1,42 @@
# LoLLMs Anything to Markdown Library
## Overview
JavaScript library to convert various file types to Markdown.
The LoLLMs Anything to Markdown Library is a versatile JavaScript tool designed to convert various file types into Markdown format. This library simplifies the process of extracting text content from different file formats and presenting it in a universally readable Markdown structure.
## Key Features
- Supports: txt, docx, pdf, pptx, and more
- Asynchronous processing
- Object-oriented design
## Features
- Supports multiple file formats including plain text, DOCX, PDF, and PPTX
- Asynchronous file processing
- Object-oriented design for easy extensibility
- Basic Markdown conversion with room for customization
- Error handling for unsupported file types and processing errors
## Installation
To use the LoLLMs Anything to Markdown Library, include the following script in your HTML file:
## Import
```html
<!-- Previous README: placeholder path -->
<script src="path/to/lollms-anything-to-markdown"></script>
<!-- Updated README: dependencies first, then the library -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mammoth/1.6.0/mammoth.browser.min.js"></script><!-- Required for docx use -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js"></script><!-- Required for pdf use -->
<script src="/lollms_assets/js/lollms_anything_to_markdown"></script>
```
Make sure to also include the necessary dependencies for processing specific file types:
- For DOCX: [Mammoth.js](https://github.com/mwilliamson/mammoth.js)
- For PDF: [PDF.js](https://mozilla.github.io/pdf.js/)
- For PPTX: A custom PptxTextExtractor library (not provided in this documentation)
## Core Class: LollmsFileLoader
### Methods
- `loadFile(file)`: Main method to process files
- `readTextFile(file)`, `readDocxFile(file)`, `readPdfFile(file)`, `readPptxFile(file)`: Type-specific readers
- `convertToMarkdown(content, fileExtension)`: Converts content to Markdown
## Usage
### Basic Usage
```javascript
const lollmsFileLoader = new LollmsFileLoader();

// Short form: `file` is a File object, e.g. taken from a file input's `files` list.
const markdown = await lollmsFileLoader.loadFile(file);

// Full handler, e.g. for a file input's `change` event.
async function handleFileUpload(event) {
  const file = event.target.files[0];
  if (!file) return;
  try {
    const markdown = await lollmsFileLoader.loadFile(file);
    console.log(markdown);
    // Use the markdown content as needed
  } catch (error) {
    console.error('Error processing file:', error);
    alert('Error processing file: ' + error.message);
  }
}
```
### Supported File Types
The library supports the following file extensions:
- Plain text: txt, md, markdown, rtf, log, csv, json, xml, html, htm, css, js, py, java, c, cpp
- Microsoft Word: docx
- PDF: pdf
- Microsoft PowerPoint: pptx
## Extensibility
- Add new file types by creating reader methods
- Enhance Markdown conversion logic
- Implement caching or post-processing
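One way to picture how the supported extensions map onto the type-specific readers is a lookup table keyed by extension. The sketch below is a hypothetical illustration, not the library's actual implementation: `TEXT_EXTENSIONS`, `READERS_BY_EXTENSION`, `getExtension`, and `pickReader` are names invented here, and only the extensions and reader method names come from this README.

```javascript
// Hypothetical lookup table: extension -> name of the reader method to call.
// The extensions and method names are taken from this README; the table
// itself is an illustrative assumption, not the library's internals.
const TEXT_EXTENSIONS = [
  'txt', 'md', 'markdown', 'rtf', 'log', 'csv', 'json', 'xml',
  'html', 'htm', 'css', 'js', 'py', 'java', 'c', 'cpp',
];

const READERS_BY_EXTENSION = new Map([
  ...TEXT_EXTENSIONS.map((ext) => [ext, 'readTextFile']),
  ['docx', 'readDocxFile'],
  ['pdf', 'readPdfFile'],
  ['pptx', 'readPptxFile'],
]);

// Returns the lower-cased extension of a file name, or '' if there is none.
function getExtension(fileName) {
  const dot = fileName.lastIndexOf('.');
  return dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
}

// Returns the reader method name for a file name, or throws for unknown types.
function pickReader(fileName) {
  const ext = getExtension(fileName);
  const reader = READERS_BY_EXTENSION.get(ext);
  if (!reader) {
    throw new Error(`Unsupported file type: ${ext || '(none)'}`);
  }
  return reader;
}
```

Under this layout, supporting a new format would come down to writing one reader and adding one table entry, for example `READERS_BY_EXTENSION.set('odt', 'readOdtFile')` with a hypothetical `readOdtFile` method.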
## API Reference
### LollmsFileLoader Class
The main class of the library, responsible for loading and converting files.
#### Methods
##### `constructor()`
Initializes a new LollmsFileLoader instance.
##### `async loadFile(file)`
Loads and converts a file to Markdown.
- Parameters:
  - `file`: File object to be processed
- Returns: A Promise that resolves with the Markdown content of the file
- Throws: Error if the file type is unsupported or if processing fails
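Because `loadFile` returns a Promise that can fail, callers that process many files may prefer failures as data rather than exceptions. The helper below is a minimal sketch under that assumption: `loadFileSafely` is a hypothetical name, not part of the library, and it only relies on the loader exposing `loadFile(file)` as described above.

```javascript
// Wraps any object exposing `loadFile(file)` (such as a LollmsFileLoader
// instance) so that failures come back as data instead of exceptions.
// `loadFileSafely` is a hypothetical helper, not part of the library.
async function loadFileSafely(loader, file) {
  try {
    const markdown = await loader.loadFile(file);
    return { ok: true, markdown, error: null };
  } catch (error) {
    // Covers unsupported types, read/processing failures, and missing dependencies.
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, markdown: null, error: message };
  }
}
```

A caller could then branch on `result.ok` and show `result.error` to the user without its own try-catch.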
##### `readTextFile(file)`
Reads the content of a text-based file.
- Parameters:
  - `file`: File object to be read
- Returns: A Promise that resolves with the text content of the file
##### `readDocxFile(file)`
Extracts text content from a DOCX file.
- Parameters:
  - `file`: DOCX File object to be processed
- Returns: A Promise that resolves with the extracted text content
##### `readPdfFile(file)`
Extracts text content from a PDF file.
- Parameters:
  - `file`: PDF File object to be processed
- Returns: A Promise that resolves with the extracted text content
##### `readPptxFile(file)`
Extracts text content from a PPTX file.
- Parameters:
  - `file`: PPTX File object to be processed
- Returns: A Promise that resolves with the extracted text content
##### `convertToMarkdown(content, fileExtension)`
Converts the extracted content to Markdown format.
- Parameters:
  - `content`: String containing the extracted text content
  - `fileExtension`: String representing the original file's extension
- Returns: A string containing the Markdown-formatted content
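To make the `(content, fileExtension) -> string` contract concrete, here is a stand-alone sketch of what such a conversion step could look like. The rule it applies (fence source and data files, pass prose through) and the names `FENCED_EXTENSIONS` and `convertToMarkdownSketch` are assumptions made for illustration; the library's own conversion may behave differently.

```javascript
// Hypothetical stand-alone version of a `convertToMarkdown(content, fileExtension)`
// step. The rule used here (fence source/data files, pass prose through) is an
// assumption for illustration; the library's own conversion may differ.
const FENCED_EXTENSIONS = new Set([
  'csv', 'json', 'xml', 'html', 'htm', 'css', 'js', 'py', 'java', 'c', 'cpp', 'log',
]);

function convertToMarkdownSketch(content, fileExtension) {
  const ext = fileExtension.toLowerCase();
  // Normalize Windows line endings and trim surrounding whitespace.
  const text = content.replace(/\r\n/g, '\n').trim();
  if (!FENCED_EXTENSIONS.has(ext)) {
    return text; // prose-like content is returned without a fence
  }
  // Use a fence longer than any backtick run inside the content.
  const runs = text.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  const fence = '`'.repeat(Math.max(3, longest + 1));
  return `${fence}${ext}\n${text}\n${fence}`;
}
```

Choosing the fence length from the content keeps a file that itself contains backtick fences from closing the block early.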
## Extending the Library
The LoLLMs Anything to Markdown Library is designed to be easily extensible. Here are some ways you can extend its functionality:
1. Add support for new file types by creating new read methods and adding the file extension to the `supportedExtensions` array.
2. Enhance the Markdown conversion logic in the `convertToMarkdown` method to handle more complex document structures.
3. Implement additional post-processing steps for specific file types.
4. Add a caching mechanism to store processed files for quicker access.
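The caching idea in the list above can be sketched as a thin wrapper around any object that exposes `loadFile(file)`. This is a hypothetical illustration: `CachingFileLoader` and its cache key (name, size, last-modified time) are assumptions, not part of the library.

```javascript
// Hypothetical caching wrapper around any object with `loadFile(file)`.
// The cache key (name, size, lastModified) is an assumption: it treats two
// File objects with identical metadata as the same file.
class CachingFileLoader {
  constructor(loader) {
    this.loader = loader;
    this.cache = new Map(); // key -> Promise resolving to Markdown
  }

  static keyFor(file) {
    return `${file.name}:${file.size}:${file.lastModified}`;
  }

  loadFile(file) {
    const key = CachingFileLoader.keyFor(file);
    if (!this.cache.has(key)) {
      const pending = Promise.resolve()
        .then(() => this.loader.loadFile(file))
        .catch((error) => {
          this.cache.delete(key); // do not cache failures
          throw error;
        });
      this.cache.set(key, pending);
    }
    return this.cache.get(key);
  }
}
```

Storing the Promise rather than the resolved string means two concurrent requests for the same file share one conversion. Usage would look like `new CachingFileLoader(new LollmsFileLoader())`, assuming the library script is already loaded.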
## Error Handling
Uses a Promise-based approach. Wrap `loadFile` in try-catch.
The library uses a Promise-based approach for error handling. Errors are thrown in the following scenarios:
- Unsupported file type
- Failure to read or process a file
- Missing dependencies (e.g., PptxTextExtractor for PPTX files)
It's recommended to wrap the `loadFile` method call in a try-catch block to handle these errors gracefully in your application.
## Dependencies
Requires external libraries for DOCX, PDF, and PPTX processing.
This concise documentation provides the essential information for an LLM-based developer to understand and work with the library, while saving context tokens.
## Limitations
- The current implementation provides basic Markdown conversion and may not capture all formatting details from complex documents.
- Processing large files, especially PDFs with many pages, may be time-consuming.
- The library relies on external dependencies for processing DOCX, PDF, and PPTX files, which need to be included separately.
## Contributing
Contributions to the LoLLMs Anything to Markdown Library are welcome. Please ensure that your code adheres to the existing style and includes appropriate test coverage for new features or bug fixes.
## License
[Specify the license under which the library is released, e.g., MIT, Apache 2.0, etc.]
---
This documentation provides a comprehensive overview of the LoLLMs Anything to Markdown Library, including its features, usage instructions, API reference, and guidelines for extension. You can further customize this documentation to fit the specific needs and policies of your project.