
Parsing Parquet stored in S3 with Go


I know it's 2023, but you can't get away from processing files. In a world of Events, APIs and Sockets, files still exist as a medium for moving data around. And a very common one at that. Lately I've found myself dealing with Apache Parquet format files. More specifically, I often end up dealing with them coming out of AWS S3. If you use the AWS DMS product at all for replication, you will find that the Parquet format is a great way to deal with your data, as it's designed for efficient storage and retrieval. There aren't too many options for parsing a Parquet file with Golang, but I've found a library I really enjoy, and the article below describes how to make the best use of it.

As always, here is the link to the Github Repository if you want to skip ahead



What is Apache Parquet

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc…

https://parquet.apache.org/



Downloading the Parquet File

For working with S3, I really like the Golang library called s3manager. Here is the SDK documentation. What I like about it is that it is a higher level abstraction on top of the normal S3 library. For instance, to download a file from a bucket, you simply do something like this

downloader := s3manager.NewDownloader(sess)
_, err = downloader.DownloadWithContext(ctx, file,
    &s3.GetObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
    })

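The downloader writes the object into whatever you pass as the "file" parameter of the DownloadWithContext method. Note that this parameter is an io.WriterAt, such as an *os.File opened at the path where you want the file to land, rather than a plain string.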



Parsing File with Golang

Parsing an Apache Parquet file with Golang will seem super familiar if you've used other interface based unmarshalling like DynamoDB as well as JSON. For similarities with DDB, you can see how to do this in the referenced article

The parse function looks like this

func ParseFile(fileName string) ([]ParquetUser, error) {
    fr, err := floor.NewFileReader(fileName)
    if err != nil {
        return nil, err
    }
    // release the underlying file handle when parsing is done
    defer fr.Close()

    var fileContent []ParquetUser

    for fr.Next() {
        rec := &ParquetUser{}
        if err := fr.Scan(rec); err != nil {
            // continue along if it's just a malformed row
            if errors.Is(err, ErrIllegalRow) {
                continue
            }
            return nil, err
        }

        fileContent = append(fileContent, *rec)
    }

    return fileContent, nil
}

First off, notice that I open a FileReader from the parquet-go library.

From there, I create a slice for holding the output of what's being unmarshalled.

Then we loop and scan. For each call to Scan, the unmarshal method that implements the parquet-go interface is called. That method looks like this

func (r *ParquetUser) UnmarshalParquet(obj interfaces.UnmarshalObject) error {
    id, err := obj.GetField("id").Int32()
    if err != nil {
        return fmt.Errorf("error unmarshalling row on field (id): %w", err)
    }

    firstName, err := obj.GetField("firstName").ByteArray()
    if err != nil {
        return fmt.Errorf("error unmarshalling row on field (firstName): %w", err)
    }

    lastName, err := obj.GetField("lastName").ByteArray()
    if err != nil {
        return fmt.Errorf("error unmarshalling row on field (lastName): %w", err)
    }

    role, err := obj.GetField("role").ByteArray()
    if err != nil {
        return fmt.Errorf("error unmarshalling row on field (role): %w", err)
    }

    // note this is a time.Time but comes across as an Int64
    lastUpdated, err := obj.GetField("lastUpdated").Int64()
    if err != nil {
        return fmt.Errorf("error unmarshalling row on field (lastUpdated): %w", err)
    }

    // time.UnixMicro cannot fail, so no error check is needed here
    parsed := time.UnixMicro(lastUpdated)

    r.Id = int(id)
    r.FirstName = string(firstName)
    r.LastName = string(lastName)
    r.Role = string(role)
    r.LastUpdated = parsed
    return nil
}

Really not too much going on up there outside of fetching fields and then putting them into the struct's fields. The one main thing to point out that is a "gotcha" is that the LastUpdated field is a time.Time. The parquet-go library treats time as an Int64. Note this line for converting what comes out of the library into a time.Time

parsed := time.UnixMicro(lastUpdated)



Running the Program

From there, it's just a matter of putting it all together. Here's the body of main

func main() {
    file, err := DownloadFile(context.TODO(), sess, bucket, key)
    if err != nil {
        log.WithFields(log.Fields{
            "err": err,
        }).Error("error downloading the file")
        return // nothing to parse or delete
    }

    contents, err := ParseFile(file)
    if err != nil {
        log.WithFields(log.Fields{
            "err": err,
        }).Error("error parsing the file")
    }

    // clean up the local copy even if parsing failed
    err = DeleteFile(file)
    if err != nil {
        log.WithFields(log.Fields{
            "err": err,
        }).Error("error deleting the file")
    }

    for _, c := range contents {
        log.WithFields(log.Fields{
            "record": c,
        }).Debug("printing the record")
    }
}

In a nutshell …

  • Download the file
  • Parse the file
  • Delete the file
  • Loop and print output



Helpful Tips

  1. I'm using VSCode a lot more these days and I'm sort of weaning myself off of Goland. So you'll find a launch.json file in the .vscode directory. There you can set the environment variables you need to run the program
  2. Viewing Parquet files is really a pain, I've found. There are few tools that I've liked. Online viewers get in the way of my workflow. BUT I found this VSCode plugin to be FANTASTIC. Here is the link to the marketplace



Wrapping Up

Hopefully you found this helpful. Like I mentioned in the beginning, files aren't going away as a data medium. Apache Parquet is a great one when you deal with larger datasets, and it'll be one of the output options you can choose when replicating with DMS.

I continue to just love Golang's simplicity and performance as well as the development experience. The parquet-go library has a few quirks but overall, 5-star rating for me.

