OpenCompass/tools/tools_needleinahaystack.py

import argparse

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap


class CDMEDataset():

    @staticmethod
    def visualize(path: list, dataset_length: str):
        """Plot one heatmap per model for each CSV file listed in ``path``."""
        for file_path in path:
            df = pd.read_csv(file_path)

            df['Context Length'] = df['dataset'].apply(
                lambda x: int(x.split('Length')[1].split('Depth')[0]))
            df['Document Depth'] = df['dataset'].apply(
                lambda x: float(x.split('Depth')[1].split('_')[0]))

            # Exclude 'Context Length' and 'Document Depth' columns
            model_columns = [
                col for col in df.columns
                if col not in ['Context Length', 'Document Depth']
            ]

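            # Skip the leading non-model columns (assumed here to be the four
            # metadata columns of the summary CSV, 'dataset' among them).
            for model_name in model_columns[4:]: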
                model_df = df[['Document Depth', 'Context Length',
                               model_name]].copy()
                model_df.rename(columns={model_name: 'Score'}, inplace=True)

                # Create pivot table
                pivot_table = pd.pivot_table(model_df,
                                             values='Score',
                                             index=['Document Depth'],
                                             columns=['Context Length'],
                                             aggfunc='mean')

                # Mean score per context length, averaged over all depths
                mean_scores = pivot_table.mean().values

                # Calculate overall score
                overall_score = mean_scores.mean()

                # Create heatmap and line plot
                plt.figure(figsize=(15.5, 8))
                ax = plt.gca()
                cmap = LinearSegmentedColormap.from_list(
                    'custom_cmap', ['#F0496E', '#EBB839', '#0CD79F'])

                # Draw heatmap
                sns.heatmap(pivot_table,
                            cmap=cmap,
                            ax=ax,
                            cbar_kws={'label': 'Score'},
                            vmin=0,
                            vmax=100)

                # Line plot data: center each point on its heatmap column
                x_data = [i + 0.5 for i in range(len(mean_scores))]
                y_data = mean_scores

                # Create twin axis for line plot
                ax2 = ax.twinx()
                # Draw line plot
                ax2.plot(x_data,
                         y_data,
                         color='white',
                         marker='o',
                         linestyle='-',
                         linewidth=2,
                         markersize=8,
                         label='Average Depth Score')
                # Set y-axis range
                ax2.set_ylim(0, 100)

                # Hide the twin axis's y-axis ticks and labels
                ax2.set_yticklabels([])
                ax2.set_yticks([])

                # Add legend
                ax2.legend(loc='upper left')

                # Set chart title and labels
                ax.set_title(f'{model_name} {dataset_length} Context '
                             'Performance\nFact Retrieval Across '
                             'Context Lengths ("Needle In A Haystack")')
                ax.set_xlabel('Token Limit')
                ax.set_ylabel('Depth Percent')
                ax.set_xticklabels(pivot_table.columns.values, rotation=45)
                ax.set_yticklabels(pivot_table.index.values, rotation=0)
                # Add overall score as a subtitle
                plt.text(0.5,
                         -0.13, f'Overall Score for {model_name}: '
                         f'{overall_score:.2f}',
                         ha='center',
                         va='center',
                         transform=ax.transAxes,
                         fontsize=13)

                # Save heatmap as PNG
                png_file_path = file_path.replace('.csv', f'_{model_name}.png')
                plt.tight_layout()
                plt.subplots_adjust(right=1)
                plt.draw()
                plt.savefig(png_file_path)
                plt.show()

                plt.close()  # Close figure to prevent memory leaks

                # Print saved PNG file path
                print(f'Heatmap for {model_name} saved as: {png_file_path}')


def main():
    parser = argparse.ArgumentParser(description='Generate NeedleInAHaystack '
                                     'Test Plots')

    parser.add_argument('--path',
                        nargs='*',
                        default=['path/to/your/result.csv'],
                        help='Paths to CSV files for visualization')
    parser.add_argument('--dataset_length',
                        default='8K',
                        type=str,
                        help='Dataset length label shown in the plot title')
    args = parser.parse_args()

    if not args.path:
        print("Error: '--path' is required for visualization.")
        # Use SystemExit directly; the exit() helper is not always available
        raise SystemExit(1)
    CDMEDataset.visualize(args.path, args.dataset_length)


if __name__ == '__main__':
    main()