ANTLR HTML Structure Parser - Project Report

This report documents the design, grammar definition, custom visitor implementation, and execution verification of the ANTLR HTML Structure Parser project, modeled after a responsive Student Management System interface.

1. Project Overview & Structure Analysis

The target application is Stellar, a modern, responsive Student Management System dashboard.

HTML Interface Architecture (`student_app.html`)

The page uses a blurred glassmorphism theme (CSS backdrop-filter) and is laid out in three regions:

- Header (#dashboard-header): Contains the brand identifier (h1) and a real-time filter search bar.
- Stats Grid (#statsGrid): Displays key performance indicators using three glass-styled cards:
  - Total Students (#cardTotal)
  - Average GPA (#cardGpa)
  - Honors Students (#cardHonors)
- Content Grid (.content-layout): A responsive two-column grid layout containing:
  - Add Student Form Panel (#addStudentSection): A form with standard text, email, and numeric inputs to register students.
  - Student Table Panel (#studentListSection): A tabular representation of registered student profiles, including avatar circles, GPA color badges, and action buttons (Edit/Delete).

2. ANTLR Grammar Design (`HTMLStructure.g4`)

To capture this hierarchical structure, we defined a custom domain-specific language (DSL) that represents HTML elements in a clean, selector-like syntax.

Below is the complete ANTLR v4 grammar:

grammar HTMLStructure;

// Parser Rules
htmlDoc: element* EOF;

element: tagElement
       | textElement
       ;

tagElement: IDENTIFIER 
            ( '#' id=IDENTIFIER )? 
            ( '.' className+=IDENTIFIER )* 
            ( '[' attributeList ']' )? 
            ( '{' element* '}' )? ;

attributeList: attribute ( ',' attribute )* ;

attribute: name=IDENTIFIER '=' value=STRING ;

textElement: value=STRING ;

// Lexer Rules
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9\-_]* ;

STRING: '"' ( '\\"' | ~[\\"\r\n] )* '"'
      | '\'' ( '\\\'' | ~[\\'\r\n] )* '\''
      ;

WS: [ \t\r\n]+ -> skip ;

LINE_COMMENT: '//' ~[\r\n]* -> skip ;
BLOCK_COMMENT: '/*' .*? '*/' -> skip ;

Key Grammar Design Choices

CSS-Style Selectors: The tagElement rule matches selectors like div#container.card[attr="val"] { ... }.
Labeled Identifiers: We labeled the tokens in the parser rules (id=IDENTIFIER, className+=IDENTIFIER, name=IDENTIFIER, value=STRING). ANTLR exposes each label as a public field on the generated context class (e.g. ctx.id as a Token and ctx.className as a List<Token>), which keeps the id and class names distinct from the tag name.
Optional Sub-blocks: Both the attributes list [...] and the children block {...} are optional, allowing leaf tags like i.fa-solid to be cleanly declared without brackets or braces.
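
Hyphenated class names such as fa-solid only work as single tokens because the IDENTIFIER lexer rule allows a hyphen after the first character. The following is a minimal sketch of that rule transcribed into a java.util.regex pattern; it is not part of the project, and the class name IdentifierRuleSketch and the method isIdentifier are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class IdentifierRuleSketch {
    // Regex transcription of the lexer rule IDENTIFIER: [a-zA-Z_][a-zA-Z0-9\-_]*
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z0-9\\-_]*");

    // True when the whole string has the shape of one IDENTIFIER token.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"fa-graduation-cap", "searchInput", "_private", "3col", "-lead"};
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

The first three samples are accepted; 3col and -lead are rejected because an identifier must start with a letter or an underscore.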

3. Custom Input Representation (`sample_input.txt`)

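We converted the structural model of the student_app.html mockup into a DSL document matching the grammar. The excerpt below is abridged: each ... line stands for omitted elements and is not part of the DSL.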

div#dashboard-container.dashboard-container {
    header#dashboard-header {
        div.brand-section {
            h1 {
                i.fa-solid.fa-graduation-cap
                "Stellar"
            }
            p { "Student Performance Management Dashboard" }
        }
        div.search-bar {
            i.fa-solid.fa-magnifying-glass
            input#searchInput[placeholder="Search students by name or email...", type="text"]
        }
    }
    
    section#statsGrid.stats-grid {
        ...
    }
    
    div.content-layout {
        section#addStudentSection.panel {
            form#studentForm {
                div.form-group {
                    label[for="studentName"] { "Full Name" }
                    input#studentName.form-control[placeholder="e.g. Alice Smith", required="true"]
                }
                ...
            }
        }
        ...
    }
}

4. Java Implementation

The Java implementation parses the DSL input, constructs a parse tree, walks it using the Visitor pattern, prints the hierarchical tree to the console, and collects node stats.

A. Driver Class (`Main.java`)

The driver configures the ANTLR input stream, initializes the lexer and parser, registers custom error handling, and kicks off the tree walk.

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        String inputFile = "sample_input.txt";
        if (args.length > 0) {
            inputFile = args[0];
        }

        System.out.println("Processing input file: " + inputFile);

        try {
            CharStream input = CharStreams.fromFileName(inputFile);
            HTMLStructureLexer lexer = new HTMLStructureLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            HTMLStructureParser parser = new HTMLStructureParser(tokens);

            parser.removeErrorListeners();
            parser.addErrorListener(new BaseErrorListener() {
                @Override
                public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                        int line, int charPositionInLine, String msg,
                                        RecognitionException e) {
                    System.err.println("Syntax Error at line " + line + ":" + charPositionInLine + " - " + msg);
                }
            });

            ParseTree tree = parser.htmlDoc();

            if (parser.getNumberOfSyntaxErrors() > 0) {
                System.err.println("Parsing finished with errors. Exiting.");
                System.exit(1);
            }

            HTMLStructureBaseVisitorExtended visitor = new HTMLStructureBaseVisitorExtended();
            visitor.visit(tree);

        } catch (IOException e) {
            System.err.println("Error reading input file: " + e.getMessage());
            // Exit non-zero so callers can detect the failure, as with syntax errors.
            System.exit(1);
        }
    }
}

B. Custom Visitor (`HTMLStructureBaseVisitorExtended.java`)

This visitor extends the generated HTMLStructureBaseVisitor<Void> class to perform visual formatting and track counts.

import java.util.List;
import java.util.ArrayList;
import java.util.stream.Collectors;
import org.antlr.v4.runtime.Token;

public class HTMLStructureBaseVisitorExtended extends HTMLStructureBaseVisitor<Void> {
    private int indentLevel = 0;
    private int totalNodes = 0;
    private int tagElementCount = 0;
    private int textElementCount = 0;

    private String getIndent() {
        // Two spaces per nesting level (String.repeat needs Java 11+).
        return "  ".repeat(indentLevel);
    }

    @Override
    public Void visitHtmlDoc(HTMLStructureParser.HtmlDocContext ctx) {
        System.out.println("--- Starting Parse Tree Walk ---");
        Void result = super.visitHtmlDoc(ctx);
        System.out.println("--- Parse Tree Walk Completed ---");
        System.out.println("\nStatistics Summary:");
        System.out.println("  Total Nodes Processed: " + totalNodes);
        System.out.println("  Tag Elements Count: " + tagElementCount);
        System.out.println("  Text Elements Count: " + textElementCount);
        return result;
    }

    @Override
    public Void visitTagElement(HTMLStructureParser.TagElementContext ctx) {
        totalNodes++;
        tagElementCount++;

        String tagName = ctx.IDENTIFIER(0).getText();
        String idName = ctx.id != null ? "#" + ctx.id.getText() : "";

        // Collect class names
        String classStr = "";
        if (ctx.className != null && !ctx.className.isEmpty()) {
            classStr = ctx.className.stream()
                    .map(token -> "." + token.getText())
                    .collect(Collectors.joining());
        }

        // Collect attributes
        String attrStr = "";
        if (ctx.attributeList() != null && ctx.attributeList().attribute() != null) {
            List<String> attrs = new ArrayList<>();
            for (HTMLStructureParser.AttributeContext attrCtx : ctx.attributeList().attribute()) {
                attrs.add(attrCtx.name.getText() + "=" + attrCtx.value.getText());
            }
            attrStr = "[" + String.join(", ", attrs) + "]";
        }

        System.out.println(getIndent() + "[Tag] " + tagName + idName + classStr + (attrStr.isEmpty() ? "" : " " + attrStr));

        indentLevel++;
        // Visit children
        if (ctx.element() != null) {
            for (HTMLStructureParser.ElementContext child : ctx.element()) {
                visit(child);
            }
        }
        indentLevel--;

        return null;
    }

    @Override
    public Void visitTextElement(HTMLStructureParser.TextElementContext ctx) {
        totalNodes++;
        textElementCount++;
        String text = ctx.value.getText();
        // Remove surrounding quotes
        if (text.startsWith("\"") && text.endsWith("\"") && text.length() >= 2) {
            text = text.substring(1, text.length() - 1);
        } else if (text.startsWith("'") && text.endsWith("'") && text.length() >= 2) {
            text = text.substring(1, text.length() - 1);
        }
        System.out.println(getIndent() + "[Text] \"" + text + "\"");
        return null;
    }
}
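
visitTextElement strips the surrounding quotes but leaves any escaped quote (\" or \') in the text untouched. If resolving those escapes were wanted, a helper along the following lines could be used. This is only a sketch under that assumption: StringLiteralSketch and unquote are hypothetical names, and the code handles only the two escapes the STRING rule permits.

```java
class StringLiteralSketch {
    // Strips the surrounding quotes from a STRING token's text and resolves
    // the escaped quote of the same style (\" inside "..." or \' inside '...').
    static String unquote(String literal) {
        if (literal.length() < 2) {
            return literal;
        }
        char quote = literal.charAt(0);
        boolean quoted = (quote == '"' || quote == '\'')
                && literal.charAt(literal.length() - 1) == quote;
        if (!quoted) {
            return literal; // not a quoted literal, return unchanged
        }
        String body = literal.substring(1, literal.length() - 1);
        // Replace backslash+quote with the bare quote character.
        return body.replace("\\" + quote, String.valueOf(quote));
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"Stellar\""));
        System.out.println(unquote("'e.g. Alice Smith'"));
        System.out.println(unquote("\"say \\\"hi\\\"\""));
    }
}
```

In the visitor, the call would replace the two substring branches, for example unquote(ctx.value.getText()).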

5. Build, Execution, and Verification

Follow these steps to run the parser:

Step 1: Generate Parser Files

Use the downloaded ANTLR jar to generate the Java classes. The -visitor flag is required because ANTLR generates only listener classes by default:

java -jar antlr-4.13.2-complete.jar -visitor HTMLStructure.g4

Step 2: Compile All Sources

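Compile the driver, custom visitor, and all ANTLR-generated source files. The commands in this step and the next use ; as the classpath separator, which is the Windows form; on Linux or macOS use : instead: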

javac -cp ".;antlr-4.13.2-complete.jar" *.java

Step 3: Run the Application

Run the parser program passing the sample input document:

java -cp ".;antlr-4.13.2-complete.jar" Main sample_input.txt
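
The classpath separator in the two commands above depends on the operating system. As a small illustration, Java itself exposes the correct separator as File.pathSeparator, so the platform-specific commands can be printed as follows. The class ClasspathSketch and its classpath method are hypothetical helpers, not part of the project.

```java
import java.io.File;

class ClasspathSketch {
    // Joins classpath entries with the separator of the platform running
    // this code: ';' on Windows, ':' on Linux and macOS.
    static String classpath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        String cp = classpath(".", "antlr-4.13.2-complete.jar");
        System.out.println("javac -cp \"" + cp + "\" *.java");
        System.out.println("java -cp \"" + cp + "\" Main sample_input.txt");
    }
}
```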

Verification Output

The parser runs and walks the parse tree successfully, producing the following output (abridged at the ... line):

Processing input file: sample_input.txt
--- Starting Parse Tree Walk ---
[Tag] div#dashboard-container.dashboard-container
  [Tag] header#dashboard-header
    [Tag] div.brand-section
      [Tag] h1
        [Tag] i.fa-solid.fa-graduation-cap
        [Text] "Stellar"
      [Tag] p
        [Text] "Student Performance Management Dashboard"
    [Tag] div.search-bar
      [Tag] i.fa-solid.fa-magnifying-glass
      [Tag] input#searchInput [placeholder="Search students by name or email...", type="text"]
  [Tag] section#statsGrid.stats-grid
    [Tag] div#cardTotal.stat-card
      [Tag] div.stat-info
        [Tag] h3
          [Text] "Total Students"
        [Tag] div#statTotalCount.value
          [Text] "3"
        [Tag] div.trend.up
          [Tag] i.fa-solid.fa-arrow-up
          [Text] "+1 New class"
...
--- Parse Tree Walk Completed ---

Statistics Summary:
  Total Nodes Processed: 121
  Tag Elements Count: 92
  Text Elements Count: 29
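
One way to cross-check the summary is to tally the [Tag] and [Text] markers in the printed tree and compare the totals with the reported counts. The sketch below assumes the output has been captured as a list of lines; OutputTallySketch and count are hypothetical names, and the sample lines are copied from the output above.

```java
import java.util.List;

class OutputTallySketch {
    // Counts printed lines that, ignoring indentation, start with the marker.
    static long count(List<String> lines, String marker) {
        return lines.stream()
                .filter(line -> line.trim().startsWith(marker))
                .count();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "      [Tag] h1",
                "        [Tag] i.fa-solid.fa-graduation-cap",
                "        [Text] \"Stellar\"",
                "      [Tag] p",
                "        [Text] \"Student Performance Management Dashboard\"");
        long tags = count(lines, "[Tag]");
        long texts = count(lines, "[Text]");
        System.out.println("tags=" + tags + " texts=" + texts + " total=" + (tags + texts));
    }
}
```

For these five lines the program prints tags=3 texts=2 total=5. Matching on the start of the trimmed line avoids miscounting a text node whose content happens to contain a marker.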

6. Lessons Learned & Best Practices

Indexing Repeated Tokens: IDENTIFIER occurs several times in the tagElement rule, so the generated context offers ctx.IDENTIFIER(), which returns every matched identifier as a list, and ctx.IDENTIFIER(i), which returns a single one. The tag name is always the first identifier, so it must be read with ctx.IDENTIFIER(0). The += label is a separate mechanism: className+=IDENTIFIER makes ANTLR generate a List<Token> field that holds only the class-name tokens.
Context Labels (id=IDENTIFIER): Explicitly labeling optional elements (e.g. id=IDENTIFIER) distinguishes them from list labels (className+=IDENTIFIER). An unmatched optional label is simply null, so the visitor can test ctx.id != null instead of counting identifiers, which keeps custom visitors readable.
Visitor vs Listener Pattern: The Visitor pattern suits indented printing because the visitor decides when child nodes are visited: the indentation counter is incremented before visiting children and decremented afterwards. With the Listener pattern, a tree walker drives the traversal and invokes enter and exit callbacks, so the same bookkeeping has to be split across two methods and passing layout details between levels is less direct.
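
The indentation technique can be shown without ANTLR on a stand-in tree. The sketch below is a simplified illustration, not the project's visitor: Node is a hypothetical replacement for a parse-tree context, and IndentWalkSketch, visit, and render are names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class IndentWalkSketch {
    // Hypothetical stand-in for a parse-tree node: a label plus child nodes.
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    private int indentLevel = 0;
    private final StringBuilder out = new StringBuilder();

    // Visitor-style walk: this method decides when children are visited,
    // so the counter is raised before the loop and lowered after it.
    void visit(Node node) {
        out.append("  ".repeat(indentLevel)).append(node.label).append('\n');
        indentLevel++;
        for (Node child : node.children) {
            visit(child);
        }
        indentLevel--;
    }

    // Walks the tree with a fresh walker and returns the indented text.
    static String render(Node root) {
        IndentWalkSketch walker = new IndentWalkSketch();
        walker.visit(root);
        return walker.out.toString();
    }

    public static void main(String[] args) {
        Node tree = new Node("[Tag] h1",
                new Node("[Tag] i.fa-solid.fa-graduation-cap"),
                new Node("[Text] \"Stellar\""));
        System.out.print(render(tree));
    }
}
```

Because the increment and decrement bracket the child loop in one method, the counter always returns to its previous value when a subtree is finished.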
