Mastering JSON Parsing in Python Pandas: 3 Key Scenarios
Written on
Introduction to JSON Normalization in Pandas
In the realm of data science, it's common to encounter semi-structured data formats like JSON. As a leading data processing library in Python, Pandas offers a built-in feature for normalizing JSON data called json_normalize(). However, this article goes beyond merely demonstrating its use. Real-world data is often imperfect, frequently containing missing values in JSON objects that can lead to errors during normalization.
Here, I will outline three scenarios where we attempt to normalize JSON documents that include missing data in Pandas.
Understanding the json_normalize() Method
Before diving into the special cases, let's first examine how to effectively use the json_normalize() method under typical circumstances. Consider the following JSON document:
import pandas as pd
df = pd.DataFrame({
'student_id': [1, 2, 3],
'student_name': ['Alice', 'Bob', 'Chris'],
'student_info': [
{'gender': 'F', 'age': 20},
{'gender': 'M', 'age': 22},
{'gender': 'M', 'age': 33}
]
})
In this example, the student_info field is nested within JSON objects. We can decode it using the json_normalize() method as follows:
pd.json_normalize(df['student_info'])
This method is quite efficient. It automatically identifies the attributes within the JSON objects and transforms them into column names, subsequently extracting their respective values into a tabular format. To reintegrate this data back into the original DataFrame, we can employ the concat() method:
df = pd.concat([
df.drop('student_info', axis=1),
pd.json_normalize(df['student_info'])],
axis=1
])
Case 1 — Handling JSON as a Dictionary
Now, let's explore scenarios that involve missing data in JSON objects. In our first case, assume the JSON data is already structured as a Python dictionary:
df = pd.DataFrame({
'student_id': [1, 2, 3],
'student_name': ['Alice', 'Bob', 'Chris'],
'student_info': [
{'gender': 'F', 'age': 20},
{'gender': 'M', 'age': 22},
None
]
})
Here, we simulate missing data for the student "Chris," who lacks student_info. Attempting to use the json_normalize() method here would result in an error. Instead of dropping these rows with dropna(), we want to retain other useful fields. Therefore, we should replace the None entry in student_info with an empty dictionary.
df['student_info'].fillna({})
This approach, however, does not work because the fillna() method expects a scalar or a collection type that can specify which cells to fill. Thus, we need to construct a dictionary based on the DataFrame's indices:
df['student_info'].fillna({i: {} for i in df.index})
Once this adjustment is made, we can successfully apply the json_normalize() method.
The first video demonstrates how to wrangle JSON columns in Pandas DataFrames efficiently.
Important Note on Pandas Version
This section clarifies that the issue discussed here has been addressed in Pandas version 1.3.0. The error mentioned will only occur in versions 1.2.5 or earlier.
Case 2 — Dealing with JSON as a String
In the second case, let's assume our JSON documents are in string format, a common scenario when data is read from CSV files originating from NoSQL databases:
df = pd.DataFrame({
'student_id': [1, 2, 3],
'student_name': ['Alice', 'Bob', 'Chris'],
'student_info': [
'{"gender": "F", "age": 20}',
'{"gender": "M", "age": 22}',
None
]
})
To convert these JSON strings into Python dictionaries, we can use the built-in JSON library. While the json.loads() method could be applied within pandas.apply(), a simpler approach is to utilize the literal_eval() method from the Abstract Syntax Trees (AST) library. This method is safer than eval() as it only evaluates specific data types.
from ast import literal_eval
my_dict = literal_eval('{"gender": "F", "age": 20}')
Next, we apply literal_eval() to convert the JSON strings, noting that we still have None values:
df['student_info'].apply(literal_eval)
To resolve the None values, we replace them with an empty string formatted as '{}':
df['student_info'] = df['student_info'].fillna('{}')
df = pd.concat([
df.drop('student_info', axis=1),
pd.json_normalize(df['student_info'].apply(literal_eval))], axis=1)
The second video illustrates how to read single and multiple JSON files into Pandas DataFrames effectively.
Case 3 — JSON Arrays as Strings
Lastly, it's not unusual for JSON array objects to be represented as strings. If we can convert them into Python lists, the process becomes much simpler. Consider the following DataFrame:
df = pd.DataFrame({
'student_id': [1, 2, 3],
'student_name': ['Alice', 'Bob', 'Chris'],
'student_exam': [
'[{"math": 90, "english": 85}]',
'[{"math": 85}]',
None
]
})
In this scenario, we need to transform the string representations of the arrays. Again, we will use the literal_eval() method and fill any None values with an empty array string '[]':
df['student_exam'] = df['student_exam'].fillna('[]').apply(literal_eval)
Next, we can extract values from the lists using the explode() method, which will duplicate rows for items in lists, ensuring each item appears in its own row. To manage the DataFrame index correctly, we follow it up with reset_index():
df = pd.concat([
df.drop('student_exam', axis=1),
pd.json_normalize(df['student_exam'])],
axis=1
)
Conclusion
This article has explored the powerful json_normalize() method in the Pandas library, highlighting its ease of use and efficiency. However, we also discussed its limitations in handling special cases involving NULL values in JSON documents. By addressing these three scenarios, we can effectively manage and normalize JSON data in Pandas.
If you find my articles helpful, consider supporting me and other writers through Medium Membership! (Click the link above)